wiki:DevelopmentActivities/Branches/ORCHIDEE-MICT-IMBALANCE-P/SimulationTimes

Version 134 (modified by ajornet, 4 years ago)

Performance

Basic Performance Report

Overview

This document investigates the computing time of Orchidee MICT. The latest version, 6.5, is slow: around 8h for a 1-year simulation at 0.5 degrees. The goal is to understand why this happens, so that targeted solutions can be applied once the issues are identified.

To that end the code is profiled with several tools (vtune, vampir, gprof, ...), which provide an easy way to identify the main hotspots in the code.

Report

attachment:performance_mict_albert_jornet_150616.pdf

MICT V6 (3344 + PFT interpolation) Module computing time

Starting from a basic configuration, each test activates one additional module, so the number of active modules grows from test to test. The purpose is to show the impact of each module when it is used.

Perf_MICT_options

Trunk vs MICT Comparison 11/04/2016

  • Date 11/04/2016
  • ADA Machine
  • IOIPSL production mode
  • Orchidee production mode
  • 1Y
  • 16 cores
  • Forcing:
    • 1 Degree
    • 3H

Considerations:

  • MICT is at the same level of modifications as Trunk revision 3346
  • MICT uses parallel interpolation in the aggregate_2D subroutine

Overview

Orchidee vs trunk profiling

Subroutines are placed in 4 different groups described below:

  • ioipsl: all subroutines related to IOIPSL library
  • Top orchidee: subroutines >1% of computing time
  • Interpolation: interpolation time by aggregate_2D subroutine
  • other orchidee: remaining subroutines from orchidee
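The grouping above can be sketched as a small classifier over gprof entries. This is only an illustration: the prefix list, the order of the checks and the function name `classify` are assumptions, not taken from the report.

```python
# Name prefixes treated as IOIPSL. This is an assumption based on the
# mathelp and histcom routines listed in the profiles on this page.
IOIPSL_PREFIXES = ("mathelp_mp_", "histcom_mp_")
# Mangled name of the interpolation routine as it appears in the gprof output.
INTERPOLATION_NAME = "interpol_help_mp_aggregate_2d_"


def classify(name, percent):
    """Return the report group for one flat-profile entry.

    name    -- mangled subroutine name from the gprof flat profile
    percent -- its share of the computing time, in percent
    """
    if name.startswith(IOIPSL_PREFIXES):
        return "ioipsl"
    if name == INTERPOLATION_NAME:
        return "interpolation"
    if percent > 1.0:
        return "top orchidee"
    return "other orchidee"


# A few (name, percent) pairs copied from the MICT R3359 profile.
entries = [
    ("mathelp_mp_ma_fuscat_r21_", 25.66),
    ("thermosoil_mp_thermosoil_cond_pft_", 5.96),
    ("interpol_help_mp_aggregate_2d_", 1.57),
    ("thermosoil_mp_thermosoil_readjust_", 0.77),
]
for name, percent in entries:
    print(f"{classify(name, percent):15s} {percent:6.2f}  {name}")
```

The IOIPSL and interpolation checks come before the 1% threshold so that those routines are never counted as "top orchidee", whatever their share.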

Mict R3359 (gprof)

This is a profiling test done with the gprof tool:

Flat profile:

Each sample counts as 0.01 seconds.
  %   cumulative   self              self     total           
 time   seconds   seconds    calls  Ks/call  Ks/call  name    
 25.66   1383.92  1383.92  2245127     0.00     0.00  mathelp_mp_ma_fuscat_r21_
  9.62   1902.84   518.92  3835809     0.00     0.00  mathelp_mp_moycum_index_
  9.18   2398.02   495.18  3835826     0.00     0.00  histcom_mp_histwrite_real_
  5.96   2719.41   321.39    17524     0.00     0.00  thermosoil_mp_thermosoil_cond_pft_
  3.81   2924.90   205.49    17520     0.00     0.00  hydrol_mp_hydrol_soil_
  3.62   3119.87   194.97   420480     0.00     0.00  hydrol_mp_hydrol_soil_coef_
  3.59   3313.39   193.52    17524     0.00     0.00  thermosoil_mp_thermosoil_getdiff_
  3.11   3481.04   167.65      365     0.00     0.00  stomate_wet_ch4_pt_ter_wet2_mp_ch4_wet_flux_density_wet2_
  3.05   3645.33   164.29      365     0.00     0.00  stomate_wet_ch4_pt_ter_wet1_mp_ch4_wet_flux_density_wet1_
  2.92   3803.03   157.70      365     0.00     0.00  stomate_wet_ch4_pt_ter_wet3_mp_ch4_wet_flux_density_wet3_
  2.86   3957.34   154.31      365     0.00     0.00  stomate_wet_ch4_pt_ter_0_mp_ch4_wet_flux_density_0_
  2.74   4105.24   147.90      365     0.00     0.00  stomate_wet_ch4_pt_ter_wet4_mp_ch4_wet_flux_density_wet4_
  2.67   4249.50   144.26    17522     0.00     0.00  thermosoil_mp_thermosoil_coef_
  1.63   4337.37    87.87    17520     0.00     0.00  hydrol_mp_hydrol_diag_soil_
  1.59   4423.39    86.02  2666157     0.00     0.00  mod_orchidee_omp_transfert_mp_gather_omp_r1_
  1.57   4507.82    84.43       55     0.00     0.00  interpol_help_mp_aggregate_2d_
  1.37   4581.90    74.08    17520     0.00     0.00  diffuco_mp_diffuco_trans_co2_
  1.36   4655.06    73.16    17520     0.00     0.00  stomate_mp_stomate_main_
  1.22   4720.59    65.53    17520     0.00     0.00  stomate_permafrost_soilcarbon_mp_microactem_
  1.06   4777.86    57.27    17520     0.00     0.00  hydrol_mp_hydrol_main_
  0.96   4829.85    51.99  1602027     0.00     0.00  mathelp_mp_ma_fuscat_r11_
  0.77   4871.20    41.35    17522     0.00     0.00  thermosoil_mp_thermosoil_readjust_
  0.74   4911.35    40.15  2664512     0.00     0.00  mod_orchidee_omp_transfert_mp_gather_omp_i1_

Total Simulation time: 5358 seconds

IO: mathelp + histcom = 25.66 + 9.62 + 9.18 = ~45%
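The sum above can be reproduced from the flat profile with a short script. The sketch below is an illustration under two assumptions: each data row has exactly seven whitespace-separated fields, and the module name is whatever precedes `_mp_` in the mangled name. The function name `percent_by_module` is hypothetical.

```python
from collections import defaultdict

# Rows copied from the MICT R3359 flat profile above (subset).
FLAT_PROFILE = """\
 25.66   1383.92  1383.92  2245127     0.00     0.00  mathelp_mp_ma_fuscat_r21_
  9.62   1902.84   518.92  3835809     0.00     0.00  mathelp_mp_moycum_index_
  9.18   2398.02   495.18  3835826     0.00     0.00  histcom_mp_histwrite_real_
  5.96   2719.41   321.39    17524     0.00     0.00  thermosoil_mp_thermosoil_cond_pft_
  0.96   4829.85    51.99  1602027     0.00     0.00  mathelp_mp_ma_fuscat_r11_
"""


def percent_by_module(profile):
    """Sum the '% time' column of a gprof flat profile per module."""
    totals = defaultdict(float)
    for line in profile.splitlines():
        fields = line.split()
        if len(fields) != 7:
            continue  # skip headers, blank lines and anything unexpected
        try:
            percent = float(fields[0])
        except ValueError:
            continue  # header row such as ' time   seconds ...'
        # Assumption: names look like '<module>_mp_<procedure>_'.
        module = fields[6].split("_mp_")[0]
        totals[module] += percent
    return dict(totals)


totals = percent_by_module(FLAT_PROFILE)
io_share = totals["mathelp"] + totals["histcom"]
print(f"mathelp + histcom = {io_share:.2f}%")
```

Because this subset also contains ma_fuscat_r11 (0.96%), the script reports slightly more than the three-term sum written above.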

Trunk R3346

This is a profiling test done with the gprof tool:

Flat profile:

Each sample counts as 0.01 seconds.
  %   cumulative   self              self     total           
 time   seconds   seconds    calls  Ks/call  Ks/call  name    
 22.26    441.54   441.54        7     0.06     0.06  interpol_help_mp_aggregate_2d_
 14.52    729.66   288.12  2171415     0.00     0.00  histcom_mp_histwrite_real_
 13.26    992.66   263.00    17520     0.00     0.00  hydrol_mp_hydrol_soil_
 10.28   1196.56   203.90   773813     0.00     0.00  mathelp_mp_ma_fuscat_r21_
  5.07   1297.17   100.61  2171397     0.00     0.00  mathelp_mp_moycum_index_
  4.16   1379.77    82.60    17520     0.00     0.00  diffuco_mp_diffuco_trans_co2_
  3.81   1455.34    75.57    17520     0.00     0.00  hydrol_mp_hydrol_diag_soil_
  3.67   1528.21    72.87   157680     0.00     0.00  hydrol_mp_hydrol_soil_coef_
  2.29   1573.69    45.48  1400412     0.00     0.00  mathelp_mp_ma_fuscat_r11_
  2.27   1618.76    45.07    17520     0.00     0.00  hydrol_mp_hydrol_main_
  1.86   1655.66    36.90    17521     0.00     0.00  thermosoil_mp_thermosoil_getdiff_
  1.46   1684.63    28.97    17521     0.00     0.00  thermosoil_mp_thermosoil_humlev_
  0.99   1704.17    19.54   157680     0.00     0.00  hydrol_mp_hydrol_soil_tridiag_
  0.94   1722.82    18.65    17520     0.00     0.00  stomate_litter_mp_littercalc_
  0.92   1740.99    18.17    17520     0.00     0.00  hydrol_mp_hydrol_split_soil_
  0.86   1758.10    17.11    17520     0.00     0.00  stomate_mp_stomate_main_
  0.81   1774.07    15.98  1133588     0.00     0.00  mod_orchidee_omp_transfert_mp_gather_omp_r1_

Total Simulation time: 1956 seconds

IO: mathelp + histcom = 14.52 + 10.28 + 5.07 = ~30%

Trunk vs MICT Comparison 18/02/2016

18/02/2016: revisions trunk 2916 and MICT 3161 were considered to be equivalent.

The same run.def file is used to compare both developments.

The simulations were carried out under the following conditions:

  • 1 Year
  • Global
  • CRU-NCEP v5.3.2 (6 hourly)
  • CURIE
  • IO library: IOIPSL/XIOS
    • Yearly output
  • Compilation mode IOIPSL: production
  • Compilation mode Orchidee: production
  • Compilation mode XIOS: production
    • e.g: 64 cores = 64 ORC + 1 XIOS

trunk_vs_mict_performance

Configurations

  • S0: no freeze + no explicitsnow + no ok_pc + no hydrol_cwrr
    • Used by default
  • S1: S0 + freeze + explicitsnow + ok_pc + hydrol_cwrr
  • S2: S1 + ch4_calcul
  • S3: S2 + dgvm

Orchidee-CN-P R4758

N procs 8 16 32 64 128 256 512 1024
0.5 deg
1 deg
2 deg

MICT R4755 + IOIPSL align (S1)

N procs 8 16 32 64 128 256 512 1024
0.5 deg 2h45 1h22 43m52 25m25 19m05 19m49
1 deg 1h23 39m57 20m59 13m28 10m29 9m07 -
2 deg 43m33 21m11 11m24 6m53 4m46 4m03 4m20 -

MICT R4414 + IOIPSL align (S0)

N procs 8 16 32 64 128 256 512 1024
0.5 deg 11h50 5h29 2h33 1h14 37m02 25m 12m52 14m43
1 deg 2h41 1h17 38m41 18m44 10m32 9m
2 deg 40m 19m03 10m17 5m32 4m33 2m42

MICT R4385 + IOIPSL align (S2)

  • Standard
N procs 8 16 32 64 128 256 512 1024
0.5 deg 42m40
  • SOA ch4
N procs 8 16 32 64 128 256 512 1024
0.5 deg 41m34

MICT R4385 + IOIPSL align (S1)

  • No restarts (unlimited IOIPSL):
N procs 8 16 32 64 128 256 512 1024
0.5 deg 12h17 5h42 2h41 1h17 38m12 21m39 13m52 14m58
1.0 deg 2h47 1h20 38m16 19m25 10m35 6m58 - -
2.0 deg 42m 20m 10m29 6m12 4m32 3m03 - -
  • No restarts (limited IOIPSL): All running (_latest)
N procs 8 16 32 64 128 256 512 1024
0.5 deg c c 2h39 1h17 37m58 21m02 13m49 14m59
  • With restarts (unlimited IOIPSL): All running (_latest_next)
N procs 8 16 32 64 128 256 512 1024
0.5 deg c c 2h46 1h24 43m51 31m 23m11 22m32
  • With restarts (limited IOIPSL): (_latest_limitedio)
N procs 8 16 32 64 128 256 512 1024
0.5 deg 36m26 20m21 12m52 14m45
  • No restarts (limited IOIPSL): thermosoil_cond_pft + no precise (_latest_refactor)

thermosoil_cond_pft is refactored again, because the performance modifications were lost in previous commits. Refactoring plus "no precise" allows the vectorization of the pow and exp calls.

N procs 8 16 32 64 128 256 512 1024
0.5 deg 32m23

MICT R4289 + IOIPSL align (S1)

N procs 8 16 32 64 128 256 512
0.5 deg 13h 6h11 2h50 1h23 42m06 23m47 15m42
1.0 deg 2h57 1h26 42m13 20m18 10m51 7m19 -
2.0 deg 44m58 21m21 10m54 6m17 4m 2m56 -
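To read the scaling in a table like the one above, the durations can be converted to seconds and turned into speedup and parallel efficiency relative to the smallest run. This is a minimal sketch: the helper names `to_seconds` and `scaling` are hypothetical, and the parser only accepts the `XhYY` and `XmYY` forms (bare numbers are rejected because their unit is not stated).

```python
import re


def to_seconds(text):
    """Convert '13h', '6h11', '42m06' or '4m' to seconds."""
    match = re.fullmatch(r"(\d+)h(\d+)?", text)
    if match:  # hours, optionally followed by minutes
        return int(match.group(1)) * 3600 + int(match.group(2) or 0) * 60
    match = re.fullmatch(r"(\d+)m(\d+)?", text)
    if match:  # minutes, optionally followed by seconds
        return int(match.group(1)) * 60 + int(match.group(2) or 0)
    raise ValueError(f"unrecognised duration: {text!r}")


def scaling(procs, times):
    """Return (procs, seconds, speedup, efficiency) relative to the first entry."""
    base_procs = procs[0]
    base_seconds = to_seconds(times[0])
    rows = []
    for n, text in zip(procs, times):
        seconds = to_seconds(text)
        speedup = base_seconds / seconds
        efficiency = speedup / (n / base_procs)
        rows.append((n, seconds, speedup, efficiency))
    return rows


# 0.5 deg row of the MICT R4289 table above.
procs = [8, 16, 32, 64, 128, 256, 512]
times = "13h 6h11 2h50 1h23 42m06 23m47 15m42".split()
for n, seconds, speedup, efficiency in scaling(procs, times):
    print(f"{n:4d} procs  {seconds:6d} s  speedup {speedup:6.2f}  efficiency {efficiency:5.2f}")
```

Efficiency here is relative to the 8-process run, not to a sequential run, so values slightly above 1 are possible.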

MICT R4277 + interpolation (S1)

N procs 8 16 32 64 128 256 512
0.5 deg 16h39 8h05 3h38 1h42 50m12 26m13 16m47
1.0 deg 3h45 1h44 50m16 22m29 13m11 7m29
2.0 deg 55m12 25m09 12m35 7m13 4m27 3m12

Committed in [4289/branches/ORCHIDEE-MICT/ORCHIDEE]

MICT R4277 + thermosoil refactor + IOIPSL alignment (S1)

N procs 8 16 32 64 128 256 512
0.5 deg 13h42 6h28 2h59 1h28 44m14 24m48 16m35
1.0 deg 3h25 1h29 42m57 21m20 11m24 7m13
2.0 deg 45m 22m48 11m42 7m23 4m15 3m05

MICT R4277 + IOIPSL alignment (S1)

N procs 8 16 32 64 128 256 512
0.5 deg 16h39 8h46 3h 1h46(2nd try) 52m 27m14 17m19
1.0 deg 4h10 1h48(2nd try) 52m51 24m21 12m37 7m40 -
2.0 deg 55m 26m20 13m04 6m46 4m28 3m11 -

MICT R4277 + thermosoil refactor (S1)

N procs 8 16 32 64 128 256 512
0.5 deg 13h38 6h28 3h 1h29 48m16 28m52 20m20
1.0 deg 3h08 1h31 44m56 23m28 14m37 10m18
2.0 deg 49m46 25m35 14m55 10m11 7m59 6m57

Committed in [4280/branches/ORCHIDEE-MICT/ORCHIDEE]

MICT R4277 (S1)

N procs 8 16 32 64 128 256 512
0.5 deg 16h39 8h21 3h43 1h47 55m34 31m18 21m28
1.0 deg 3h27 1h48 53m37 26m21 15m38 10m46 -
2.0 deg 59m17 28m31 16m05 10m42 8m13 7m03 -

MICT R4277 + thermosoil refactor (S0)

N procs 8 16 32 64 128 256 512
0.5 deg 7h42 3h18 1h33 47m07 24m48 15m53
1.0 deg 3h27 1h36 47m 21m38 13m33 9m14
2.0 deg 51m38 24m19 14m08 8m56 5m16 4m18

Committed in [4280/branches/ORCHIDEE-MICT/ORCHIDEE]

MICT R4277 (S0)

N procs 8 16 32 64 128 256 512
0.5 deg 7h33 3h15 1h32 49m49 24m43 15m47
1.0 deg 3h26 1h36 46m05 21m18 13m25 9m05
2.0 deg 51m07 24m06 14m06 8m56 5m12 4m09

MICT R4274 (S0)

N procs 8 16 32 64 128 256 512
0.5 deg - 7h33 3h15 1h32 46m49 24m43 15m47
1.0 deg 3h26 1h36 46m05 21m18 13m25 9m05
2.0 deg 51m07 24m06 14m04 8m56 5m12 4m09

Trunk R3934 (XIOS 2 + S0)

N procs 8 16 32 64 128 256
0.5 deg 3h23 2h01 1h19 1h01 53m18 48m40
1.0 deg 52m07 32m10 21m38 17m18 14m52 13m56
2.0 deg 14m50 9m11 6m38 5m39 5m15 5m13

Notes:

  • Interpolation is sequential
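One way to gauge how much a sequential part weighs in the table above is to fit the two-parameter model t(p) = a + b / p to two measurements, where a is the time that does not shrink with more processes. This is a rough sketch, not the author's analysis: the model is an assumption, and a may also contain I/O or other fixed costs besides the interpolation. The function name `fit_fixed_cost` is hypothetical.

```python
def fit_fixed_cost(p1, t1, p2, t2):
    """Solve t = a + b / p for (a, b) from two (procs, time) measurements."""
    b = (t1 - t2) / (1.0 / p1 - 1.0 / p2)
    a = t1 - b / p1
    return a, b


# Trunk R3934, 0.5 deg: 3h23 on 8 procs and 48m40 on 256 procs, in minutes.
t_8 = 3 * 60 + 23
t_256 = 48 + 40 / 60
a, b = fit_fixed_cost(8, t_8, 256, t_256)
print(f"fixed part: {a:.1f} min, scalable part: {b:.1f} proc-min")

# Compare the model with an intermediate measurement (64 procs: 1h01).
predicted_64 = a + b / 64
print(f"predicted at 64 procs: {predicted_64:.1f} min (measured: 61 min)")
```

With these two points the fixed part comes out at a bit under 44 minutes, which is close to the plateau visible at 128 and 256 processes.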

Mict R3932 (XIOS 2) + CROP + IOIPSL restarts

This is an early test with IOIPSL + restarts extended to 3, 4 and 5 dimensions. This revision is still in a personal directory. It includes the remaining revisions from TRUNK. It will be merged into the main MICT branch soon.

Its purpose is to provide a first draft of this modification.

N procs 8 16 32 64 128 256
0.5 deg 15h54 7h50 3h42 2h06 55m01 34m35
  • 0.5 deg: The number of XIOS outputs is reduced so the simulation can finish.

Mict R3932 (XIOS 2) + IOIPSL restarts

This is an early test with IOIPSL + restarts extended to 3, 4 and 5 dimensions. This revision is still in a personal directory. It includes the remaining revisions from TRUNK. It will be merged into the main MICT branch soon.

Its purpose is to provide a first draft of this modification.

N procs   8            16     32     64     128    256
0.5 deg   Out of mem.  5h31   2h40   1h24   44m32  23m15
1 deg     2h44         1h21   39m24  19m15  10m44  7m21
2 deg     43m41        20m37  10m58  6m23   4m15   3m24
  • 0.5 deg: The number of XIOS outputs is reduced so the simulation can finish.

Mict R3811 (XIOS 2)

N procs   8              16             32             64             128    256
0.5 deg   out of memory  out of memory  out of memory  out of memory  1h33   1h27
1 deg     2h48           1h27           44m24          25m21          19m52  13m47
2 deg     43m48          21m31          11m13          7m40           5m16   4m30

Output Netcdf files:

  • 0.5 Degree
Filename              Size   # vars
stomate_rest_out.nc   14G    379 (double)
sechiba_rest_out.nc   8.5G   234 (double)
driver_rest_out.nc    28M    13 (double)
sechiba_history.nc    3.0G   114 (float)
stomate_history.nc    3.3G   297 (float)

Changes

  • CROP restart variables are now only active when CROP is enabled.
  • XIOS history outputs now include 4D/5D dimensions, which reduces the number of variables in the outputs.

Issues

  • 0.5 deg Out of memory is due to XIOS

Conclusion

  • 0.5deg - 64 procs: it might be due to 4D/5D variables.
  • 0.5deg computing time: having fewer restart variables to write decreased the total time. The CROP module still has this problem.

Mict R3791 (XIOS 2)

  • Date: 30/09/16
  • Add all XIOS output fields

Time table:

N procs   8              16             32             64     128    256
0.5 deg   out of memory  out of memory  out of memory  14h56  15h42  11h06
1 deg     3h19           2h28           1h52           2h22   1h32   1h23
2 deg     54m38          38m53          23m58          24m43  24m37  16m11

Output Netcdf files:

  • 0.5 Degree
Filename              Size   # vars
stomate_rest_out.nc   20G    611 (double)
sechiba_rest_out.nc   8.6G   234 (double)
driver_rest_out.nc    28M    13 (double)
sechiba_history.nc    2.0G   388 (float)
stomate_history.nc    3.8G   1179 (float)

Changes

  • Add CROP module.

Issues

  • 0.5 deg Memory requirements are high
  • 0.5 deg Simulation time is far too high, even when the CROP module is disabled.

Conclusion

  • 0.5deg Memory: the introduction of XIOS increases the memory usage
  • 0.5deg simulation time: many more restart variables to write

Mict R3587 (XIOS 2!)

  • Small fixes
  • Trunk update

Time table:

N procs 8 16 32 64 128 256
0.5 deg out of memory 4h43 2h21 1h18 58 47
1 deg running 1h05 35 23 16 18
2 deg - - - - - -

Mict R3587 (XIOS 2 + thermosoil_cond_pft)

This branch focuses on the subroutine thermosoil_cond_pft, which some profiling reports show to be very time-consuming. The following tests are an effort to improve its performance.

All tests are done at 0.5 degrees. All other parameters are the same as specified in this section.

N procs                       32     64     128    256
Avx + align 32 + vecalign32   2h08   1h16   49     40
Align 32 + vecalign32         2h13   1h30   54     43
Avx + align 32                2h12   1h18   1h02   47
Align 16                      2h23   1h13   1h04   50

Description:

  • avx: 256 bit register
  • align 32: -align array32byte compilation flag
  • align 16: -align array16byte compilation flag
  • vecalign: source code lines to help the compiler improve the performance

Mict R3567

  • New driver

Time table:

N procs 8 16 32 64 128 256
0.5 deg timeout 11h22 7h21 4h51 3h38 2h49
1 deg 4h22 2h21 1h24 54 39 30
2 deg 58 31 19 13 9 7

Mict R3527

  • PFT parallel interpolation

Time table:

N procs   8                16     32     64     128    256
0.5 deg   >16h39 322 days  11h09  7h06   4h50   3h31   2h47
1 deg     4h10             2h14   1h20   52     37     30
2 deg     55m05            30m01  18m05  12     9      7

Mict R3161

Time table:

N procs 4 8 16 32 64 128
0.5 deg timeout timeout 13h00 8h46 6h35 5h38
1 deg 6h37 4h20 2h36 1h45 1h21 1h08
2 deg 1h40 56 35 24 19 16

Note: 0.5 deg on 4 procs did not start due to memory requirements. 0.5 deg on 8 procs could not finish within the maximum time allowed by the HPC; it stopped at simulation day 322. Both values can be extrapolated.
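A simple way to extrapolate an interrupted run is to assume a constant cost per simulated day and scale the elapsed time linearly. The sketch below is only an illustration: the note does not state the elapsed time at interruption, so the 16h39 used here is an assumption borrowed from the ">16h39 322 days" entry of the R3527 table, and the function name `extrapolate_full_run` is hypothetical.

```python
def extrapolate_full_run(elapsed_seconds, days_done, days_total=365):
    """Linearly extrapolate the wall time of a full run from a partial one."""
    if not 0 < days_done <= days_total:
        raise ValueError("days_done must be in the range 1..days_total")
    return elapsed_seconds * days_total / days_done


# Assumed elapsed time at interruption: 16h39 (see the lead-in).
elapsed = 16 * 3600 + 39 * 60
estimate = extrapolate_full_run(elapsed, days_done=322)
hours, remainder = divmod(round(estimate), 3600)
print(f"estimated full-year wall time: {hours}h{remainder // 60:02d}")
```

Under these assumptions the full year would take roughly 18h52. The estimate ignores one-off costs such as initialisation and the final restart writing, which do not scale with the number of simulated days.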

Trunk R2916

The same simulations with the same options were carried out, with the following results:

N procs 4 8 16 32 64 128
0.5 deg 8h38 5h31 3h26 2h23 1h48 1h31
1 deg 2h07 1h17 47 32 25 21
2 deg 38 19 11 8 6 5

IOIPSL

Restart File Creation :
