{{{
#!html
<h1>Working on the curie machine</h1>
}}}
----
[[PageOutline(1-3,Chapter index,,numbered)]]


# Online users manual #
 * The command `curie.info` returns all useful information on the curie machine. Keep it in mind and use it often.
 * The TGCC's storage spaces are visible from the curie machine: `$CCCWORKDIR` and `$CCCSTOREDIR`.
 * The `$SCRATCHDIR` space only exists on the curie machine. Be careful: this space is cleaned regularly, and only files less than 40 days old are kept.
You will find the user manual provided by TGCC [https://www-tgcc.ccc.cea.fr here]: provide your TGCC/CCRT login and password in the TGCC tab.

# Job manager commands #
 * {{{ccc_msub mon_job}}} -> submit a job
 * {{{ccc_mdel ID}}} -> kill the job with the specified ID number
 * {{{ccc_mstat -u login}}} -> display all jobs submitted by login
 * {{{ccc_mpp}}} -> display all jobs submitted on the machine. {{{ccc_mpp -n}}} to avoid colors.
 * {{{ccc_mpp -u $(whoami)}}} -> display your jobs.
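For example, a minimal job script and its life cycle might look like this (a sketch: the job name, task count and executable are placeholders; the project set with -A is explained in the next section):
{{{
#!/bin/bash
#MSUB -r my_job        # job name (placeholder)
#MSUB -n 32            # number of tasks (placeholder)
#MSUB -T 3600          # time limit in seconds
#MSUB -A genxxx        # project to charge (see below)
ccc_mprun ./my_exe     # hypothetical executable

# then, from the command line:
#   ccc_msub my_job            -> submit
#   ccc_mpp -u $(whoami)       -> monitor
#   ccc_mdel ID                -> kill if needed
}}}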

# Before starting a job #

## Specify the project name ##

Since January 2013, you must specify in the job header which project's computing time you will use:
{{{
#MSUB -A genxxx
}}}

## QoS test ##
QoS (Quality of Service) is a test queue. You can have a maximum of 2 jobs in the test queue; each of them is limited to 30 min and 8 nodes (= 256 tasks). In the job header you must add:
{{{
#MSUB -Q test
}}}
and change the CPU time limit accordingly:
{{{
#MSUB -T 1800
}}}

# Other job manager commands #
 * {{{ccc_mpeek ID}}} -> display the output listing of a job. Note that the job outputs are visible while the job is running.
 * {{{ccc_mpinfo}}} -> display the status of the partitions and the characteristics of the associated processors. For example (11/26/2012):
{{{
/usr/bin/ccc_mpinfo
                      --------------CPUS------------  -------------NODES------------
PARTITION    STATUS   TOTAL   DOWN    USED    FREE    TOTAL   DOWN    USED    FREE     MpC  CpN SpN CpS TpC
---------    ------   ------  ------  ------  ------  ------  ------  ------  ------   ---- --- --- --- ---
standard     up        80368      32   77083    3253    5023       1    4824     198   4000  16   2   8   1
xlarge       up        10112     128    1546    8438      79       1      14      64   4000  128  16   8   1
hybrid       up         1144       0     264     880     143       0      33     110   2900   8   2   4   1
}}}
 * {{{ccc_mstat -H ID}}} -> display the detail of a running job, one line per {{{ccc_mprun}}} command:
{{{
ccc_mstat -H 375309
  JobID    JobName Partitio ReqCPU            Account               Start  Timelimit    Elapsed      State ExitCode
------- ---------- -------- ------ ------------------ ------------------- ---------- ---------- ---------- --------
 375309 v3.histor+ standard      0   gen0826@standard 2012-05-11T16:27:53 1-00:00:00   01:49:03    RUNNING      0:0
375309+ p86maf_ru+              32   gen0826@standard 2012-05-11T16:28:16              00:14:19  COMPLETED      0:0
375309+ p86maf_ru+              32   gen0826@standard 2012-05-11T16:42:47              00:12:54  COMPLETED      0:0
375309+ p86maf_ru+              32   gen0826@standard 2012-05-11T16:55:59              00:13:30  COMPLETED      0:0
375309+ p86maf_ru+              32   gen0826@standard 2012-05-11T17:09:31              00:13:22  COMPLETED      0:0
375309+ p86maf_ru+              32   gen0826@standard 2012-05-11T17:24:06              00:13:36  COMPLETED      0:0
375309+ p86maf_ru+              32   gen0826@standard 2012-05-11T17:37:54              00:13:31  COMPLETED      0:0
375309+ p86maf_ru+              32   gen0826@standard 2012-05-11T17:51:28              00:14:19  COMPLETED      0:0
375309+ p86maf_ru+              32   gen0826@standard 2012-05-11T18:05:57              00:10:59    RUNNING      0:0
}}}
 * information about the error code of jobs: `ccc_macct jobid`
   * this job ran successfully:
{{{
> ccc_macct 698214
Jobid     : 698214
Jobname   : v5.historicalCMR4.452
User      : p86maf
Account   : gen2211@s+
Limits    : time = 1-00:00:00 , memory/task = Unknown
Date      : submit=06/09/2012 17:51:56, start=06/09/2012 17:51:57 , end= 07/09/2012 02:20:28
Execution : partition = standard , QoS = normal
Resources : ncpus = 53 , nnodes = 4
   Nodes=curie[2166,5964,6002,6176]

Memory /step
------
                        Resident (Mo)                          Virtual (Go)
JobID            Max (Node:Task)         AveTask       Max (Node:Task)          AveTask
-----------    ------------------------  -------    --------------------------  -------
698214             0(            :   0)       0    0.00(            :   0)    0.00
698214.batch      25(curie2166   :   0)       0    0.00(curie2166   :   0)    0.00
698214.0         952(curie2166   :   0)       0    3.00(curie2166   :   1)    0.00
...
698214.23        952(curie2166   :   0)       0    3.00(curie2166   :   2)    0.00

Accounting / step
------------------

       JobID      JobName   Ncpus Nnodes  Ntasks      Elapsed        State ExitCode
------------ ------------  ------ ------ ------- ------------   ---------- -------
698214       v5.historic+      53      4             08:28:31    COMPLETED    0:0
698214.batch        batch       1      1       1     08:28:31    COMPLETED
698214.0     p86maf_run_+      53      4      53     00:20:53    COMPLETED
698214.1     p86maf_run_+      53      4      53     00:20:20    COMPLETED
...
698214.23    p86maf_run_+      53      4      53     00:21:06    COMPLETED
}}}
   * this job failed with an error code:
{{{
> ccc_macct 680580
Jobid     : 680580
Jobname   : v5.historicalCMR4
User      : p86maf
Account   : gen2211@s+
Limits    : time = 1-00:00:00 , memory/task = Unknown
Date      : submit=30/08/2012 17:10:06, start=01/09/2012 04:11:30 , end= 01/09/2012 04:42:48
Execution : partition = standard , QoS = normal
Resources : ncpus = 53 , nnodes = 5
   Nodes=curie[2097,2107,4970,5413,5855]

Memory /step
------
                        Resident (Mo)                          Virtual (Go)
JobID            Max (Node:Task)         AveTask       Max (Node:Task)          AveTask
-----------    ------------------------  -------    --------------------------  -------
680580             0(            :   0)       0    0.00(            :   0)    0.00
680580.batch      28(curie2097   :   0)       0    0.00(curie2097   :   0)    0.00
680580.0         952(curie2097   :   0)       0    3.00(curie2097   :   1)    0.00
680580.1         316(curie2097   :   8)       0    2.00(curie2097   :   8)    0.00

Accounting / step
------------------

       JobID      JobName   Ncpus Nnodes  Ntasks      Elapsed        State ExitCode
------------ ------------  ------ ------ ------- ------------   ---------- -------
680580       v5.historic+      53      5             00:31:18    COMPLETED    0:9
680580.batch        batch       1      1       1     00:31:18    COMPLETED
680580.0     p86maf_run_+      53      5      53     00:19:48    COMPLETED
680580.1     p86maf_run_+      53      5      53     00:10:06 CANCELLED b+
}}}

# Fat nodes / Thin nodes #

For the IPSLCM5A-LR coupled model, fat nodes are slower than titane (130%). Thin nodes are twice as fast as fat nodes for computations, and as fast as fat nodes for post processing.

We therefore decided to use thin nodes for computations and fat nodes for post processing. Be careful: since November 21st 2012, you must use at least libIGCM_v2.0_rc1 to perform post processing on fat nodes.

The job header must include {{{#MSUB -q standard}}} to use thin nodes.

The job header must include {{{#MSUB -q xlarge}}} to use fat nodes.
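For example (a sketch of the relevant header line in each kind of job):
{{{
# computing job -> thin nodes:
#MSUB -q standard

# post-processing job -> fat nodes:
#MSUB -q xlarge
}}}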

# Tricks #
 * `export LANG=C` to correctly display `curie.info` (the default for new logins)
 * use [SHIFT]+[CTRL]+C to copy part of a text displayed by `curie.info`
 * use curie to manage your CCCWORKDIR/CCCSTOREDIR directories. Be careful: there can be a delay between the display on curie and the one on titane. For example, a file deleted on curie may only appear as deleted on titane after a delay (cache synchronization).

# How to use the ddt debugger for the coupled model (or any other MPMD run) #

 * compile the model you wish to debug with the -g option (necessary to have access to the sources from the ddt interface)
 * create a debug directory containing the model executables and the input files required by the model
 * create a simplified debug job which allows you to start a run in the debug directory
 * add the command "module load ddt/3.2" to your job
 * add the creation of the run_file configuration file
 * add a ddt start command to your job
 * delete the SLURM_SPANK_AUKS environment variable: unset SLURM_SPANK_AUKS
{{{
...
module load ddt/3.2
unset SLURM_SPANK_AUKS

echo "-np 1 ${DDTPATH}/bin/ddt-client ${TMPDIR_DEBUG}/oasis" > run_file
echo "-np 26 ${DDTPATH}/bin/ddt-client ${TMPDIR_DEBUG}/lmdz.x" >> run_file
echo "-np 5 ${DDTPATH}/bin/ddt-client ${TMPDIR_DEBUG}/opa.xx" >> run_file

ddt
}}}
 * connect to curie via SSH with X forwarding (option -X) and enter your password (if you have SSH keys on the front-end machine, move the ~/.ssh/authorized_keys* files out of the directory, then disconnect and reconnect)
 * start the job with graphic export: ccc_msub -X Job
 * when the ddt window appears:
  * click on "Run and Debug a Program"
  * in Application, select one of the 3 model executables (which one does not matter)
  * in MPI Implementation, choose the "OpenMPI (Compatibility)" mode
  * in mpirun arguments, put "--app ${TMPDIR_DEBUG}/run_file" where TMPDIR_DEBUG is the debug directory
  * click on "Run", then on the "play" button in the upper left corner

# Errors on curie when running simulations #

## Job error: KILLED ... WITH SIGNAL 15 ##
{{{
slurmd[curie1006]: error: *** STEP 639264.5 KILLED AT 2012-08-01T17:00:29 WITH SIGNAL 15 ***
}}}

This error message means that the time limit was exceeded. You can recognize this case because a restart file such as restartphy.nc is missing. To solve the problem, type clean_month, then increase the time limit (or decrease !PeriodNb) and restart.
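For example, in the job header (a sketch; 86400 s = 24 h is just an illustration, and the right !PeriodNb value depends on your configuration):
{{{
#MSUB -T 86400    # raise the wall-time limit (in seconds)
}}}
and/or, in the job itself:
{{{
PeriodNb=5        # run fewer periods per job so it fits within the limit
}}}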

## Are there no restart files for LMDZ? ##

Problem:
 * If the coupled model does not run successfully, the whole chain of commands stops because there is no restart file for LMDZ. Read the out_execution file carefully.

Solution:
 * check whether a file matching *error exists in the Debug subdirectory; it contains clear error messages (see the sketch after this list)
 * in the executable directory $SCRATCHDIR/RUN_DIR/xxxx/IPSLCM5A/xxxx, look for the out_execution file. If it contains:
{{{
srun: First task exited 600s ago
srun: tasks 0-40,42-45: running
srun: task 41: exited abnormally
srun: Terminating job step 438782.1
slurmd[curie1150]: *** STEP 438782.1 KILLED AT 2012-06-10T18:45:41 WITH SIGNAL 9 ***
slurmd[curie1151]: *** STEP 438782.1 KILLED AT 2012-06-10T18:45:41 WITH SIGNAL 9 ***
srun: Job step aborted: Waiting up to 2 seconds for job step to finish.
slurmd[curie1150]: *** STEP 438782.1 KILLED AT 2012-06-10T18:45:41 WITH SIGNAL 9 ***
slurmd[curie1151]: *** STEP 438782.1 KILLED AT 2012-06-10T18:45:41 WITH SIGNAL 9 ***
}}}
don't ask questions: type clean_month and restart the simulation.
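A quick way to inspect the Debug files (a sketch; run it from the experiment directory):
{{{
ls Debug/*error       # list the error files, if any
cat Debug/*error      # read the error messages they contain
}}}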

## Errors when creating or transferring files ##

The $CCCWORKDIR, $CCCSTOREDIR and $SCRATCHDIR file systems are fragile. The error messages look like:
{{{
Input/output error
Cannot send after transport endpoint shutdown
}}}

Don't ask questions, just resubmit the job.

## Job error: Segmentation fault ##
{{{
/var/spool/slurmd/job637061/slurm_script: line 534:   458 Segmentation fault      /bin/ksh -x ${TEMPO_SCRIPT}
}}}

If you get this kind of message, don't ask questions, just resubmit the job.

## Error when submitting jobs ##
This message:
{{{
error: Batch job submission failed: Job violates accounting policy (job submit limit, user's size and/or time limits)
}}}
means that you have submitted too many jobs (wait for some jobs to end, then resubmit), that your headers are not properly written, or that you did not specify which GENCI project the computing time must be charged to.
The ccc_mqinfo command returns the maximum number of jobs (as of today: 300 for 24h-max jobs, 8 for 72h-max jobs and 2 for test jobs (30 min and max 8 nodes)):
{{{
ccc_mqinfo
Name    Priority  MaxCPUs  MaxNodes  MaxRun  MaxSub     MaxTime
------  --------  -------  --------  ------  ------  ----------
long          18     1024                 2       8  3-00:00:00
normal        20                                300  1-00:00:00
test          40                  8               2    00:30:00
}}}

## Long waiting time before a job executes ##
Users' priority is computed from three cumulative criteria:
 * the selected QoS (test or not)
 * the fair-share value of the account (computed from the project's and/or partner's computing share and previous usage)
 * the job's age
If your job is far down the waiting list and you are working on several projects, submit under the project with the least computing time used.

This scheme is not fully satisfactory because we would prefer to encourage long simulations. We are looking for real examples of abnormal waiting situations; please take the time to give us your feedback.

## Disk quota exceeded ##

Be careful with the quotas on /scratch! Monitor them with the ccc_quota command. Delete the temporary directories created by jobs that ended too early and did not clean up the $SCRATCHDIR/TMPDIR_IGCM and $SCRATCHDIR/RUN_DIR directories. You should have a 20 TB quota on curie.
{{{
> ccc_quota
Disk quotas for user xxxx:

             ------------------ VOLUME --------------------  ------------------- INODE --------------------
 Filesystem       usage        soft        hard       grace       files        soft        hard       grace
 ----------       -----        ----        ----       -----       -----        ----        ----       -----
    scratch       3.53T         20T         20T           -      42.61k          2M          2M           -
      store           -           -           -           -      93.76k        100k        101k           -
       work     232.53G          1T        1.1T           -      844.8k        1.5M        1.5M           -
}}}
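A cleanup sketch (the directory names follow the layout above; check each path before deleting anything, since running jobs also use these spaces):
{{{
ccc_quota                                      # check current usage
ls $SCRATCHDIR/TMPDIR_IGCM $SCRATCHDIR/RUN_DIR # spot leftover run directories
rm -rf $SCRATCHDIR/RUN_DIR/<old_run_directory> # placeholder: an old run to remove
}}}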

# REDO #

Simulations with the IPSLCM5A coupled model are reproducible if you use the same Bands file for LMDZ. See trusting TGCC/curie on this webpage: http://webservices.ipsl.jussieu.fr/trusting/

# Feedback #

## On November 20th 2012 ##

The maintenance team has identified and corrected the last two problems.

## In June 2012 ##

For the 100-year piControl simulation run in June 2012:
 * 9% of the jobs were resubmitted manually (3/34).
 * 0.6% of the jobs and post-processing tasks were resubmitted manually (4/694).

## Error to watch for in post processing: WARNING Intra-file non-monotonicity. Record coordinate "time_counter" does not monotonically increase ##
 * To identify it quickly: {{{grep -i monoton create_ts* | awk -F: '{print $1}' | sort -u}}}
   * (command to be typed in ${SCRATCHDIR}/IGCM_OUT/${!TagName}/${!SpaceName}/${!ExperimentName}/${!JobName}/Out,
   * for example /ccc/scratch/cont003/dsm/p25mart/IGCM_OUT/IPSLCM5A/DEVT/lgm/LGMBR03/Out)
 * Example:
{{{
+ IGCM_sys_ncrcat --hst -v lon,lat,plev,time_counter,time_counter_bnds,zg v3.rcp45GHG1_20160101_20161231_HF_histhfNMC.nc v3.rcp45GHG1_20170101_20171231_HF_histhfNMC.nc v3.rcp45GHG1_20180101_20181231_HF_histhfNMC.nc v3.rcp45GHG1_20190101_20191231_HF_histhfNMC.nc v3.rcp45GHG1_20200101_20201231_HF_histhfNMC.nc v3.rcp45GHG1_20210101_20211231_HF_histhfNMC.nc v3.rcp45GHG1_20220101_20221231_HF_histhfNMC.nc v3.rcp45GHG1_20230101_20231231_HF_histhfNMC.nc v3.rcp45GHG1_20240101_20241231_HF_histhfNMC.nc v3.rcp45GHG1_20250101_20251231_HF_histhfNMC.nc v3.rcp45GHG1_20160101_20251231_HF_zg.nc
ncrcat: WARNING Intra-file non-monotonicity. Record coordinate "time_counter" does not monotonically increase between (input file v3.rcp45GHG1_20190101_20191231_HF_histhfNMC.nc record indices: 418, 419) (output file v3.rcp45GHG1_20160101_20251231_HF_zg.nc record indices 4798, 4799) record coordinate values 419007600.000000, 0.000000
}}}
 * Check the non-monotonic time axis:
{{{
/ccc/cont003/home/dsm/p86broc/dev_python/is_monotone.py OCE/Analyse/TS_DA/v5.historicalMR3_19800101_19891231_1D_vosaline.nc_KO_monoton time_counter
False
}}}
 * Solution: remove (rm) the TS files that were created and restart with !TimeSeries_Checker
     301