Changes between Initial Version and Version 1 of Doc/CheckDebug


Timestamp:
03/24/14 16:17:48 (10 years ago)
Author:
trac
Comment:

--

  • Doc/CheckDebug

    v1 v1  
     1{{{ 
     2#!html 
     3<h1>Monitor, debug and relaunching</h1> 
     4}}} 
     5---- 
     6[[NoteBox(note,This section describes the monitoring tools\, the tools to identify and solve problems\, and the tools to monitor and restart the post processing jobs if needed., 600px)]] 
     7 
     8[[TOC(heading=Table of contents,depth=1,inline)]] 
     9[[PageOutline(1,Table of contents,pullout)]] 
     10---- 
     11 
     12# Check status of your simulations # 
     13 
     14## How to verify the status of your simulation ## 
     15 
     16{{{ 
     17#!comment 
     18Plot generated with graphviz, script available here :  
     19https://forge.ipsl.jussieu.fr/igcmg/wiki/DocYgraphvizLibigcmprod 
     20[[Image(libigcm_prod.jpg, 50%)]]        ===> normal image 
     21[[Image(libigcm_prod_rotate.jpg, 50%)]] ===> to print a pdf 
     22}}} 
     23[[Image(libigcm_prod.jpg, 50%)]] 
     24 
     25[[NoteBox(note, We strongly encourage you to check your simulation frequently during run time., 600px)]] 
     26 
     27 
     28### System tools ### 
     29 
     30The batch manager at each computing center provides tools to check the status of your jobs, for example whether a job is queued, running or suspended.  
     31 
     32 
     33#### TGCC #### 
     34 
     35You can use `ccc_mstat` on Curie. To see the available options and useful scripts, see [wiki:DocBenvBtgccAcurie#Jobmanagercommands Working on Curie]. 
     36 
     37#### IDRIS #### 
     38 
     39You can use `llq` on Ada. To see the available options and useful scripts, see  [wiki:DocBenvAidrisAada#Commandstomanagejobsonada Working on Ada]. 
     40 
     41 
     42 
     43### run.card ### 
     44 
     45When the simulation has started, the file run.card is created by libIGCM using the template run.card.init. run.card contains information about the current run period and the periods already completed. libIGCM updates this file at each run period. It also records the time consumption of each period. The status of the job is set to !OnQueue, Running, Completed or Fatal.  
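
As a quick command-line check, the status can be read directly from `run.card`. The sketch below assumes that the status is stored on a line of the form `PeriodState= Running`; the function name `get_period_state` is hypothetical and not part of libIGCM.

```shell
# Print the PeriodState value found in a run.card-like file.
# Assumption: the status is stored on a line such as "PeriodState= Running".
get_period_state() {
    card="${1:-run.card}"
    awk -F= '
        $1 ~ /^[ \t]*PeriodState[ \t]*$/ {
            value = $2
            gsub(/^[ \t]+|[ \t]+$/, "", value)   # trim surrounding blanks
            print value
            exit
        }' "$card"
}
```

Calling `get_period_state run.card` in the experiment directory would then print one of the four states listed above, provided the file follows the assumed layout.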
     46 
     47### !RunChecker ### 
     48 
     49This tool provided with libIGCM allows you to find out your simulations' status. 
     50 
     51#### Example of a !RunChecker output for a successful simulation #### 
     52 
     53[[Image(RunChecker-OK.jpg, 50%)]] 
     54 
     55#### General description #### 
     56 
     57Below is a diagram of a long simulation that failed. 
     58 
     59[[Image(RunChecker_extrait.jpg, 50%)]] 
     60 
     61  1. In this block, you will find information about the simulation directories. 
     62  1. Here is information about the main job; the information comes from the `config.card` and the `run.card` : 
     63    * The first line returns the job name and the date, recorded in the `run.card`, when output was last written to disk. 
     64    * !DateBegin - !DateEnd : start and end dates of the simulation as defined in `config.card`. 
     65    * !PeriodState : variable coming from the `run.card` giving the run's status : 
     66      * !OnQueue, Waiting : the run is queued ; 
     67      * Running : the job is running ; 
     68      * Completed : the run was completed successfully ; 
     69      * Fatal : the run failed. 
     70    * Current Period : this variable from `run.card` shows which integration step (most often which month) is being computed 
     71    * !CumulPeriod : variable from `run.card`. Number of the period being computed 
     72    * Pending Rebuilds, Nb | From | To : number of files waiting to be rebuilt, and the dates of the oldest and the latest files. 
     73  1. The third block contains the status of the latest post processing jobs, the Rebuilds, the Pack, the Monitoring and the Atlas. Only the computed periods are returned for the Monitoring and the Atlas. For the other processing jobs, the computed periods and the number of successfully transferred files are returned. 
     74  1. Lastly, the current date. 
     75 
     76#### Usage and options #### 
     77 
     78The script can be started from any machine : 
     79 
     80{{{ 
     81#!sh 
     82path/to/libIGCM/RunChecker.job [-u user] [-q] [-j n] [-s] [-p path] job_name 
     83}}} 
     84 
     85  * `-u user` : starts the Checker for the simulation of another user 
     86  * `-q` : silent mode 
     87  * `-j n` : displays n post processing jobs (10 by default) 
     88  * `-s` : looks for a simulation in $WORKDIR and adds it to its catalog of simulations before displaying the information 
     89  * `-p path` : !!!absolute!!! path of the directory containing the `config.card` instead of the job_name. 
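
Because `-p` requires an absolute path, a relative directory can be resolved first. This is a minimal sketch; the helper name `abs_dir` is hypothetical and not part of libIGCM.

```shell
# Print the absolute path of an existing directory.
# Prints nothing and returns a non-zero status if the directory does not exist.
abs_dir() {
    (cd "$1" 2>/dev/null && pwd)
}
```

From an experiment directory, the checker could then be started with `path/to/libIGCM/RunChecker.job -p "$(abs_dir .)"`.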
     90 
     91#### Use #### 
     92 
     93This listing allows you to detect a few known errors : 
     94  * Job running but the date of the last time output was written to disk in the `run.card` is much older than the current date :  
     95  * Date is in red in the post processing listing : this means that errors during file transfers occurred in the post processing job. 
     96  * For a given post processing job, the number of successfully transferred files varies according to the date : this might mean that errors occurred.  
     97[[NoteBox(warn, In some cases (such as for historical simulations where the COSP outputs are activated starting from 1979 ...) this behavior is normal!, 600px)]] 
     98  * A `PeriodState` set to Fatal indicates that an error occurred either in the main job or in one of the post processing jobs. 
     99  * If the number of rebuilds waiting is above... 
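
The first check in the list above (a running job whose `run.card` has not been updated for a long time) can be approximated from the file's modification time. This is only a sketch: it assumes that the modification time of `run.card` reflects the last period update, and it relies on `find -mmin`, which is available in GNU and BSD `find` but is not part of POSIX. The function name and the default threshold are hypothetical.

```shell
# Succeed (exit status 0) if FILE was last modified more than MINUTES minutes ago.
# Assumption: the modification time of run.card reflects the last period update.
is_stale() {
    file="$1"
    minutes="${2:-180}"   # hypothetical threshold; adjust to your period length
    [ -n "$(find "$file" -mmin +"$minutes" 2>/dev/null)" ]
}
```

For example, `is_stale run.card 120 && echo "run.card not updated for 2 hours"`.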
     100 
     101#### Good things to know #### 
     102 
     103During the first integration of a simulation using IPSLCM5, an additional rebuild file is transferred. This extra file is the NEMO "mesh_mask.nc" file. It is created and transferred only during the first step. It is then used for each "rebuild" of the NEMO output files to mask the variables.  
     104 
     105## End of simulation ## 
     106Once your simulation is finished, you will receive an email saying that the simulation was "Completed" or that it "Failed", and two files will be created in the working directory of your experiment:  
     107  * [wiki:DocFsimu#run.cardattheendofasimulation run.card] 
     108  * `Script_Output_JobName` 
     109 
     110A `Debug/` directory is created if the simulation failed. This directory contains diagnostic text files for each model component. 
     111 
     112If the simulation was successfully completed, the output files will be stored in the following directory:  
     113  * `$CCCSTORE/IGCM_OUT/TagName/[SpaceName]/[ExperimentName]/JobName` at TGCC 
     114  * `gaya:IGCM_OUT/TagName/[SpaceName]/[ExperimentName]/JobName` at IDRIS 
     115 
     116with the following subdirectories:  
     117  * `RESTART` = tar of the restart files for all model components and with the pack frequency 
     118  * `DEBUG` = tar of the debug text files for all model components 
     119  * `ATM` 
     120  * `CPL` 
     121  * `ICE` 
     122  * `OCE` 
     123  * `SRF` 
     124  * `SBG` 
     125  * `Out` = run log files 
     126  * `Exe` = executables used for the run 
     127 
     128  * `ATM/Output`, `CPL/Output`, etc... = NetCDF output of the model components 
     129 
     130[[NoteBox(note, If !SpaceName was set to TEST the output files will remain in the work directories,600px)]] 
     131  * `$SCRATCHDIR/IGCM_OUT/TagName/[SpaceName]/[ExperimentName]/JobName` at TGCC 
     132  * `$WORKDIR/IGCM_OUT/TagName/[SpaceName]/[ExperimentName]/JobName` at ada/IDRIS 
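
The directory layout above can be sketched as a small path builder. The function name `igcm_out_path` is hypothetical; the optional components (`SpaceName`, `ExperimentName`) are skipped when empty, which is an assumption about how the bracketed parts behave.

```shell
# Build an IGCM_OUT path from its components, skipping empty optional ones.
# Arguments: root directory, TagName, SpaceName, ExperimentName, JobName.
igcm_out_path() {
    root="$1"; tag="$2"; space="$3"; experiment="$4"; job="$5"
    path="$root/IGCM_OUT/$tag"
    [ -n "$space" ] && path="$path/$space"
    [ -n "$experiment" ] && path="$path/$experiment"
    printf '%s\n' "$path/$job"
}
```

For example, `ls "$(igcm_out_path "$CCCSTORE" IPSLCM5A DEVT pdControl MYEXP)"` would list the subdirectories of a simulation stored at TGCC.
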
     133## Diagnostic tools : Checker ## 
     134### !TimeSeries_Checker ### 
     135 
     136The `TimeSeries_Checker.job` can be used in diagnostic mode to check if time series have been created.  
     137You must edit `TimeSeries_Checker.job` before starting it in interactive mode (see [#TimeSeries_checker.job-Recommendedmethod TimeSeries_Checker.job]) and answer `n` to the following question: 
     138{{{ 
     139#!sh 
     140"Run for real (y/n)" 
     141}}} 
     142 
     143### SE_Checker ### 
     144 
     145See [#SE_Checker.jobrecommendedmethod SE_Checker.job]. 
     146 
     147 
     148---- 
     149 
     150# Analyzing the Job output : Script_Output # 
     151Reminder --> This file contains three parts:  
     152 * copying the input files 
     153 * running the model  
     154 * post processing 
     155These three parts are defined as follows:  
     156{{{ 
     157####################################### 
     158#       ANOTHER GREAT SIMULATION      # 
     159####################################### 
     160 
     161 1st part (copying the input files) 
     162 
     163####################################### 
     164#      DIR BEFORE RUN EXECUTION       # 
     165####################################### 
     166 
     167 2nd part (running the model) 
     168 
     169####################################### 
     170#       DIR AFTER RUN EXECUTION       # 
     171####################################### 
     172 
     173 3rd part (post processing) 
     174 
     175}}} 
     176 
     177A few common bugs are listed below:  
     178 
     179 * if the file ends before the second part, possible reasons can be:  
     180   * you didn't delete the existing run.card file in case you wanted to overwrite the simulation;  
     181   * you didn't specify !OnQueue in the run.card file in case you wanted to continue the simulation;  
     182   * one of the input files was missing (e.g. it doesn't exist, the machine has a problem,...);  
     183   * the frequencies (!RebuildFrequency, !PackFrequency ...) do not match !PeriodLength. 
     184 
     185 * if the file ends in the middle of the second part, it's most likely because you didn't request enough memory or CPU time. 
     186 
     187 * if the file ends in the third part, it could be caused by:  
     188   * an error during the execution; 
     189   * a problem while copying the output;  
     190   * a problem when starting the post processing jobs. 
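
To see quickly in which part a `Script_Output` file stops, one can look for the two banners that separate the parts. This is a minimal sketch based only on the banner texts shown above; the function name `script_output_stage` is hypothetical.

```shell
# Report in which part a Script_Output file stops, based on the banners
# "DIR BEFORE RUN EXECUTION" and "DIR AFTER RUN EXECUTION".
script_output_stage() {
    file="$1"
    if grep -q 'DIR AFTER RUN EXECUTION' "$file"; then
        echo "part 3 (post processing)"
    elif grep -q 'DIR BEFORE RUN EXECUTION' "$file"; then
        echo "part 2 (running the model)"
    else
        echo "part 1 (copying the input files)"
    fi
}
```

For example, `script_output_stage Script_Output_JobName` prints the last part that was reached.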
     191 
     192If the following message is displayed in the second part of the file, it's because there was a problem during the execution: 
     193{{{ 
     194======================================================================== 
     195EXECUTION of : mpirun -f ./run_file > out_run_file 2>&1 
     196Return code of executable : 1 
     197IGCM_debug_Exit :  EXECUTABLE 
     198 
     199!!!!!!!!!!!!!!!!!!!!!!!!!! 
     200!! IGCM_debug_CallStack !! 
     201!------------------------! 
     202 
     203!------------------------! 
     204IGCM_sys_Cp : out_run_file xxxxxxxxxxxx_out_run_file_error 
     205======================================================================== 
     206}}} 
     207If the following message is displayed : 
     208{{{ 
     209======================================================================== 
     210EXECUTION of : mpirun -f ./run_file > out_run_file 2>&1 
     211======================================================================== 
     212}}} 
     213 
     214If there is a message indicating that the "restartphy.nc" file doesn't exist it means that the model simulation was completed but before the end date of your simulation. If this happens and if your model creates an output log other than the simulation output log, you must refer to this log. 
     215For example, the output file of the ocean model is stored on the file server under this name: 
     216{{{ 
     217IGCM_sys_Put_Out : ocean.output xxxxxxxx/OCE/Debug/xxxxxxxx_ocean.output 
     218}}} 
     219For LMDZ your output log is the same as the simulation output log and it has not been copied to the storage space. If your simulation has been performed on $SCRATCHDIR (TGCC) you can retrieve it there. Otherwise, you must restart your simulation using $WORKDIR (IDRIS) as the working directory keeping all needed files. You must also change the RUN_DIR_PATH variable. See [#run_dir_path here] before restarting it. 
     220 
     221 
     222[[NoteBox(tip,In general\, if your simulation stops you can look for the keyword "IGCM_debug_CallStack" in this file. This keyword will come after a line explaining the error you are experiencing., 600px)]] 
     223{{{ 
     224Example :  
     225 
     226--Debug1--> IGCM_comp_Update 
     227 
     228IGCM_debug_Exit :  IGCM_comp_Update missing executable create_etat0_limit.e 
     229 
     230!!!!!!!!!!!!!!!!!!!!!!!!!! 
     231!! IGCM_debug_CallStack !! 
     232!------------------------! 
     233}}} 
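
The keyword search described in the tip can be done with `grep`, printing a few lines before each match so that the error message is visible. This is a sketch: the function name is hypothetical, and the `-B` option is supported by GNU and BSD `grep` but is not part of POSIX.

```shell
# Show the lines preceding each IGCM_debug_CallStack marker in a Script_Output file.
show_callstack_context() {
    file="$1"
    before="${2:-5}"   # number of context lines; 5 is an arbitrary default
    grep -n -B "$before" 'IGCM_debug_CallStack' "$file"
}
```

For example, `show_callstack_context Script_Output_JobName 3` prints each marker with the three lines that precede it.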
     234 
     235# Debug # 
     236## Where does the problem come from ? ## 
     237 
     238Your problem could come from a programming error. To find it, use the text output of the model components located in the Debug subdirectories. Your problem could also be caused by the computing environment, which is not always easy to identify. It is therefore important to run benchmark simulations to learn the usual behavior of a successfully completed simulation. 
     239 
     240### The Debug directory ### 
     241 
     242If the simulation failed due to an abnormal exit from the executable, a Debug/ directory is created in the working directory. It contains output text files of all model components for your configuration. You should read them to look for errors. For example : 
     243 
     244 * xxx_out_gcm.e_error --> lmdz  text output 
     245 * xxx_out_orchidee --> orchidee text output  
     246 * xxx_ocean.output --> nemo text output 
     247 * xxx_inca.out --> inca text output 
     248 * xxx_run.def --> lmdz parameter files 
     249 * xxx_gcm.def --> lmdz  parameter files 
     250 * xxx_traceur.def --> lmdz  parameter files  
     251 * xxx_physiq.def --> lmdz  parameter files 
     252 * xxx_orchidee.def --> orchidee parameter files 
     253 
     254### Programming error ### 
     255Please take the time to read and analyze the modifications you have made to the code. Nobody codes perfectly. 
     256 
     257### Unknown error ### 
     258In this case, it's possible to relaunch the main job to rerun the last period. 
     259 
     260If the simulation stopped before the end because of an error, it is possible to relaunch the latest period after making any necessary modifications. The simulation will then read run.card to know where to start, and it will continue until the end (if the problem was solved). 
     261 
     262To relaunch manually you first need to be sure that no files have been stored for the same period. libIGCM provides 2 scripts that help you with this cleanup : 
     263 
     264  * The error occurred before the packs have been created: 
     265{{{ 
     266#!sh 
     267path/to/libIGCM/clean_month.job 
     268}}} 
     269  * The error occurred after the packs were created: 
     270{{{ 
     271#!sh 
     272path/to/libIGCM/clean_year.job [SSAA] 
     273# SSAA = year up to which you are deleting everything (this year included). By default, it's the current year in run.card 
     274}}} 
     275 
     276## Start or restart post processing jobs ## 
     277Please look at the next paragraph. 
     278 
     279# Start or restart post processing jobs # 
     280 
     281You can run post processing jobs once the main job is finished (for example if the post processing job was deactivated in [#Lespost-traitementsdansconfig.card config.card] or if you encountered a [#Debug bug]). 
     282 
     283On TGCC, the machine used for post processing is the same as the computing machine. On IDRIS, the machine to use for post processing is adapp (since July 2013), which has access to the same file systems ($WORKDIR, $HOME, ...). You can :  
     284  1. work directly in the experiment directory (which looks like `PATH_MODIPSL/config/IPSLCM5A/ST11/`) ; 
     285  1. work in a dedicated directory located in the experiment directory (e.g. `PATH_MODIPSL/config/IPSLCM5A/ST11/POST_REDO`) ; 
     286  1. work in a dedicated directory which is independent of the experiment directory (e.g. `$WORKDIR/POST_REDO`). 
     287 
     288For the last two options you must first: 
     289  * If your post processing directory is a subdirectory located in the experiment directory, copy (or link with `ln -s`) `config.card`, `POST` and `COMP` into it, and also `run.card` if you want to post process only part of the simulation ; 
     290  * If your post processing directory is an independent directory : 
     291    * create a dedicated directory (for all simulations) ; 
     292    * transfer libIGCM and run ins_job (for all simulations) ; 
     293    * create a directory for the simulation to analyze (i.e. for each simulation) : 
     294{{{ 
     295#!sh 
     296cd $PATH_MODIPSL/config/IPSLCM5A/ST11 
     297mkdir -p POST_REDO 
     298cd POST_REDO/ 
     299cp -pr ../COMP  .  
     300cp -pr ../POST  . 
     301cp -pr ../config.card  . 
     302cp -pr ../run.card . 
     303}}} 
     304 
     305 
     305[[NoteBox(warn, Before submitting a post processing job at TGCC (`rebuild_fromWorkdir.job`\, `pack_debug.job`\, `pack_output.job`\, `pack_restart.job`\, `monitoring.job`\, `create_ts.job`\, `create_se.job`) you must make sure that the submission group is present in the job header (#MSUB -A genxxxx). If it isn't\, add it., 600px)]] 
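
The header check can be automated before submission. This is a minimal sketch that only looks for a line starting with `#MSUB -A` followed by a value; the function name `has_submission_group` is hypothetical.

```shell
# Succeed if the job header already contains a "#MSUB -A <group>" line.
has_submission_group() {
    grep -q '^#MSUB[ ]*-A[ ]*[^ ]' "$1"
}
```

For example, `has_submission_group pack_output.job || echo "add #MSUB -A genxxxx to the header"`.
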
     307## Restart REBUILD ## 
     308 
     309 * Copy the `rebuild_fromWorkdir.job` file to the experiment directory or to the dedicated directory; 
     310 
     311 * Edit it: 
     312{{{ 
     313#!sh 
     314StandAlone=true 
     315 
     316libIGCM=                    # Points to the libIGCM directory of the experiment  
     317 
     318PeriodDateBegin=            # beginning date of the last series to be rebuilt  
     319 
     320NbRebuildDir=               # Number of directories in the series to be rebuilt  
     321                            # until the PeriodDateBegin  
     322 
     323 
     324REBUILD_DIR=                # Path for the backup of files waiting to be reconstructed  
     325                            # (looking like $SCRATCHDIR/IGCM_OUT/.../JobName/REBUILD or $SCRATCHDIR/TagName/JobName/REBUILD for versions older than libIGCM_v2.0 
     326                            # if RebuildFromArchive=NONE)  
     327 
     328MASTER=${MASTER:=curie|ada} # Select the computing machine : MASTER=curie for example 
     329}}} 
     330 
     331 
     332 * Submit the job: 
     333{{{ 
     334#!sh 
     335ccc_msub rebuild_fromWorkdir.job                # TGCC 
     336 
     337llsubmit rebuild_fromWorkdir.job                # IDRIS 
     338}}} 
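
Since the submission command differs between the two centers, a small helper can select it from the machine name. This is a sketch using only the commands and machine names quoted on this page; the function name `submit_command` is hypothetical.

```shell
# Print the batch submission command for a given machine.
submit_command() {
    case "$1" in
        curie) echo "ccc_msub" ;;   # TGCC
        ada)   echo "llsubmit" ;;   # IDRIS
        *)     echo "unknown machine: $1" >&2; return 1 ;;
    esac
}
```

For example, `"$(submit_command curie)" rebuild_fromWorkdir.job` runs `ccc_msub rebuild_fromWorkdir.job`. The same helper applies to the other post processing jobs described on this page.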
     339 
     340 
     341[[NoteBox(note, The rebuild job submits `pack_output.job` automatically., 600px)]] 
     342## Restart Pack_output ## 
     343 
     344To restart pack_output (e.g. in case it was not submitted by the rebuild job): 
     345 
     346  * Copy the `libIGCM/pack_output.job` file to the experiment directory or to the dedicated directory; 
     347  * Edit it :  
     348{{{ 
     349#!sh 
     350libIGCM=${libIGCM:=::modipsl::/libIGCM}         # path of the libIGCM library 
     351 
     352MASTER=${MASTER:=curie|ada}                     # machine on which you work 
     353 
     354DateBegin=${DateBegin:=20000101}                # start date of the period to be packed 
     355   
     356DateEnd=${DateEnd:=20691231}                    # end date of the period to be packed 
     357   
     358PeriodPack=${PeriodPack:=10Y}                   # pack frequency 
     359}}} 
     360  * Submit the job: 
     361{{{ 
     362#!sh 
     363ccc_msub pack_output.job                        # TGCC 
     364 
     365llsubmit pack_output.job                        # IDRIS 
     366}}} 
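
The `${VAR:=default}` form used in the job header is standard shell parameter expansion: the default is assigned only when the variable is unset or empty. The sketch below illustrates this behavior in isolation; the function `show_pack_period` is hypothetical and is not part of `pack_output.job`.

```shell
# ${VAR:=default} assigns the default only when VAR is unset or empty.
show_pack_period() (
    # The function body is a subshell, so the assignments stay local to it.
    : "${DateBegin:=20000101}"
    : "${DateEnd:=20691231}"
    echo "$DateBegin-$DateEnd"
)
```

With both variables unset, the function prints the defaults; if `DateBegin` is already defined when the function is called, that value is kept.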
     367 
     368[[NoteBox(note, `create_ts.job` and `create_se.job` are submitted automatically., 600px)]] 
     369 
     370## Restart Pack_restart or Pack_debug ## 
     371 
     372   * Copy the libIGCM/pack_debug.job and libIGCM/pack_restart.job files to the experiment directory or to the dedicated directory; 
     373   * Edit them :  
     374{{{ 
     375#!sh 
     376libIGCM=${libIGCM:=::modipsl::/libIGCM}         # path of the libIGCM library 
     377 
     378MASTER=${MASTER:=curie|ada}                     # machine on which you work 
     379 
     380DateBegin=${DateBegin:=20000101}                # start date of the period to be packed 
     381   
     382DateEnd=${DateEnd:=20691231}                    # end date of the period to be packed 
     383   
     384PeriodPack=${PeriodPack:=10Y}                   # pack frequency 
     385}}} 
     386   * Submit the two jobs:  
     387{{{ 
     388#!sh 
     389ccc_msub pack_debug.job ; ccc_msub pack_restart.job      # TGCC 
     390 
     391llsubmit pack_debug.job ; llsubmit pack_restart.job      # IDRIS 
     392}}} 
     393 
     394 
     395 
     396## Restart the Time series ## 
     397 
     398[[NoteBox(tip, If you haven't done so yet\, copy `config.card`\, `COMP`\, `POST` and\, if you want to post process only part of the simulation\, `run.card` to the `POST_REDO/` directory., 600px)]] 
     399 
     400There are two ways: 
     401 
     402### !TimeSeries_checker.job - Recommended method ### 
     403 
     404   * Copy the `libIGCM/TimeSeries_Checker.job` file to the experiment directory or to the dedicated directory; 
     405   * Edit it: 
     406{{{ 
     407#!sh 
     408libIGCM=${libIGCM:=...MYEXP/modipsl/libIGCM}    # Path of the libIGCM library 
     409 
     410SpaceName=${SpaceName:=DEVT}                  
     411 
     412ExperimentName=${ExperimentName:=pdControl} 
     413 
     414JobName=${JobName:=MYEXP} 
     415 
     416CARD_DIR=${CARD_DIR:=${CURRENT_DIR}}            # Path of the experiment directory  
     417                                                # (including CURRENT_DIR if you copied  
     418                                                # TimeSeries_Checker.job properly)  
     419 
     420export BRIDGE_MSUB_PROJECT=gen2211              # number of your genci project 
     421 
     422}}} 
     423   * Run the !TimeSeries_Checker.job in interactive mode. It will call the missing create_ts jobs : 
     424{{{ 
     425#!sh 
     426./TimeSeries_Checker.job 
     427}}} 
     428     or alternatively, in ksh : 
     429{{{ 
     430#!sh 
     431./TimeSeries_Checker.job 2>&1 | tee TSC_OUT     # Create log file 
     432 
     433grep Batch TSC_OUT                              # find all the submitted jobs 
     434}}} 
     435 
     436 
     437### Restart create_ts.job ### 
     438 
     439   * Copy the `libIGCM/create_ts.job` file to the experiment directory or to the dedicated directory; 
     440   * Edit it: 
     441{{{ 
     442#!sh 
     443StandAlone=true 
     444 
     445libIGCM=                                        # Path of the libIGCM library 
     446 
     447PeriodDateEnd=                                  # end date of the time series to be created 
     448 
     449CompletedFlag=                                  # end date of the existing time series  
     450 
     451TsTask=2D                                       # select 2D or 3D 
     452 
     453RebuildFrequency=true 
     454}}} 
     455   * Run the job:  
     456{{{ 
     457#!sh 
     458ccc_msub create_ts.job                          # TGCC 
     459 
     460llsubmit create_ts.job                          # IDRIS 
     461}}} 
     462 
     463[[NoteBox(note, If your time series (TS) are 2D and 3D you must run the create_ts jobs twice and change the !TsTask variable accordingly., 600px)]] 
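
The two runs can be prepared from a single edited template. This is a sketch: it assumes the template contains a line starting with `TsTask=`, as in the example above, and the function name and output file names are hypothetical.

```shell
# Write a copy of a create_ts.job template with TsTask set to the given value,
# then print the name of the generated file.
make_ts_job() {
    template="$1"
    task="$2"                      # 2D or 3D
    out="create_ts_${task}.job"    # hypothetical output file name
    sed "s/^TsTask=[^ ]*/TsTask=${task}/" "$template" > "$out"
    echo "$out"
}
```

At TGCC, for example: `for task in 2D 3D; do ccc_msub "$(make_ts_job create_ts.job "$task")"; done`.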
     464 
     465## Restarting the seasonal mean calculations ## 
     466 
     467[[NoteBox(tip, Copy `config.card`\, `COMP`\, `POST`\, and `run.card` (to post process only part of the simulation) to the `POST_REDO/` directory if you have not done so yet., 600px)]] 
     468 
     469There are two methods:  
     470 
     471### SE_Checker.job (recommended method) ### 
     472 
     473 * Copy the `libIGCM/SE_Checker.job` file to the experiment directory or to the dedicated directory; 
     474 * Edit it; 
     475 * Run `SE_Checker.job` in interactive mode. This will call the create_se jobs:  
     476{{{ 
     477#!sh 
     478./SE_Checker.job 
     479}}}  
     480   or alternatively, in ksh :  
     481{{{ 
     482#!sh 
     483# Create logfile: 
     484./SE_Checker.job 2>&1 | tee SE_OUT 
     485# Find all started jobs : 
     486grep Batch SE_OUT 
     487}}} 
     488  
     489### Restart create_se.job ### 
     490 
     491 * Copy the `libIGCM/create_se.job` file to the experiment directory or to the dedicated directory; 
     492 * Edit it: 
     493{{{ 
     494#!sh 
     495StandAlone=true 
     496 
     497libIGCM=                                        # path of the libIGCM library 
     498 
     499PeriodDateEnd=                                  # end date of the decade to be processed 
     500}}} 
     501 * Submit the job:  
     502{{{ 
     503#!sh 
     504ccc_msub create_se.job                          # TGCC 
     505 
     506llsubmit create_se.job                          # IDRIS 
     507}}} 
     508