Changes between Version 53 and Version 54 of Doc/CheckDebug


Timestamp:
11/08/19 10:07:23 (4 years ago)
Author:
acosce
Comment:

--

  • Doc/CheckDebug

    v53 v54  
    1616[[Image(hermes.jpg, 50%)]] 
    1717 
    18 ## How to verify the status of your simulation ## 
    19  
    20 {{{ 
    21 #!comment 
    22 Plot generated with graphviz, script available here :  
    23 https://forge.ipsl.jussieu.fr/igcmg_doc/wiki/DocYgraphvizLibigcmprod 
    24 [[Image(libigcm_prod.jpg, 50%)]]        ===> normal image 
    25 [[Image(libigcm_prod_rotate.jpg, 50%)]] ===> to print a pdf 
    26 }}} 
    27 [[Image(libigcm_prod.jpg, 50%)]] 
    2818 
    2919[[NoteBox(note, We strongly encourage you to check your simulation frequently during run time., 600px)]] 
     
    4939When the simulation starts, libIGCM creates the file `run.card` from the template `run.card.init`. `run.card` contains information about the current run period and the periods already completed, and libIGCM updates it at each run period. It also records the time consumption of each period. The status of the job is set to `OnQueue`, `Running`, `Completed` or `Fatal`.  
    5040 
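As a quick manual check, the status can be read directly from `run.card`. The sketch below assumes the status is stored on a line of the form `PeriodState= <value>`; the function name `period_state` and the sample file content are hypothetical and only serve the illustration.

```shell
# Print the PeriodState value found in a run.card file.
# Assumption: the status is stored on a line of the form "PeriodState= <value>".
period_state() {
    sed -n 's/^[[:space:]]*PeriodState[[:space:]]*=[[:space:]]*//p' "$1" | tail -n 1
}

# Demonstration on a made-up run.card fragment (not real libIGCM output).
demo_card=$(mktemp)
printf '%s\n' '[Configuration]' 'CumulPeriod= 3' 'PeriodState= Running' > "$demo_card"
period_state "$demo_card"
rm -f "$demo_card"
```

In a real experiment directory you would call `period_state run.card` instead of using the temporary file.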
    51 ### !RunChecker ### 
    52  
    53 This tool, provided with libIGCM, reports the status of your simulations. 
    54  
    55 #### Example of a !RunChecker output for a successful simulation #### 
    56  
    57 [[Image(RunChecker-OK.jpg, 50%)]] 
    58  
    59 #### General description #### 
    60  
    61 Below is a diagram of a long simulation that failed. 
    62  
    63 [[Image(RunChecker_extrait.jpg, 50%)]] 
    64  
    65   1. In this block, you will find information about the simulation directories. 
    66   1. Here is information about the main job; the information comes from the `config.card` and the `run.card` : 
    67     * The first line returns the job name and the date, recorded in the `run.card`, at which data were last saved to disk. 
    68     * `DateBegin` - `DateEnd` : start and end dates of the simulation as defined in `config.card`. 
    69     * `PeriodState` : variable coming from the `run.card` giving the run's status : 
    70       * `OnQueue`, `Waiting` : the run is queued ; 
    71       * `Running` : the job is running ; 
    72       * `Completed` : the run was completed successfully ; 
    73       * `Fatal` : the run failed. 
    74     * `Current Period` : this variable from `run.card` shows which integration step (most often one month or one year) is being computed. 
    75     * `CumulPeriod` : variable from `run.card`. Number of the period being computed. 
    76     * `Pending Rebuilds, Nb | From | To` : number of files waiting to be "rebuilt", and dates of the oldest and the latest of these files. Most configurations use parallel I/O and no longer have rebuild steps. 
    77   1. The third block contains the status of the latest post processing jobs, the Rebuilds, the Pack, the Monitoring and the Atlas. Only the computed periods are returned for the Monitoring and the Atlas. For the other processing jobs, the computed periods and the number of successfully transferred files are returned. 
    78   1. Lastly, the current date. 
    79  
    80 #### Usage and options #### 
    81  
    82 The script can be started from any machine : 
    83  
    84 {{{ 
    85 #!sh 
    86 path/to/libIGCM/RunChecker.job [-u user] [-q] [-j n] [-s] [-p path] job_name 
    87 }}} 
    88  
    89   * `-u user` : starts the Checker for the simulation of another user 
    90   * `-q` : silent mode 
    91   * `-j n` : displays `n` post processing jobs (10 by default) 
    92   * `-s` : looks for a simulation in $WORKDIR and adds it to its catalog of simulations before displaying the information 
    93   * `-p path` : !!!absolute!!! path of the directory containing the `config.card` instead of the job_name. 
    94  
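Because `-p` expects an absolute path, it can help to resolve the directory before calling the script. The following is a minimal sketch: `abs_dir` is a hypothetical helper, `path/to/libIGCM` is the placeholder used above, and the command is only echoed rather than executed.

```shell
# Turn a (possibly relative) directory into an absolute path.
# abs_dir is a hypothetical helper, not part of libIGCM.
abs_dir() {
    (cd "$1" && pwd)
}

# Directory containing config.card; "." stands for the current directory.
submit_dir=$(abs_dir .)

# Show the command that would be run; remove "echo" to actually launch it.
echo path/to/libIGCM/RunChecker.job -p "$submit_dir"
```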
    95 #### Use #### 
    96  
    97 This listing allows you to detect a few known errors : 
    98   * Job running but the date of the last time output was written to disk in the `run.card` is much older than the current date :  
    99   * Date is in red in the post processing listing : this means that errors during file transfers occurred in the post processing job. 
    100   * For a given post processing job, the number of successfully-transferred files varies according to the date : this might mean that errors occurred.  
    101 [[NoteBox(warn, In some cases (such as for historical simulations where the COSP outputs are activated starting from 1979 ...) this behavior is normal!, 600px)]] 
    102   * A `PeriodState` set to `Fatal` indicates that an error occurred either in the main job or in one of the post processing jobs. 
    103   * If the number of rebuilds waiting is above... 
    104  
    105 #### Good things to know #### 
    106  
    107 During the first integration of a simulation using IPSLCM5, an additional rebuild file is transferred. This extra file is the NEMO `mesh_mask.nc` file. It is created and transferred only during the first step. It is then used for each "rebuild" of the NEMO output files to mask the variables.  
    10841 
    10942## End of simulation ## 
    11043Once your simulation is finished, you will receive an email saying that the simulation was `Completed` or that it `Failed`, and two files will be created in the working directory of your experiment:  
    111   * [wiki:DocFsimu#run.cardattheendofasimulation run.card] 
    112   * `Script_Output_JobName` 
    113  
    114 A `Debug` directory is created if the simulation failed in a way that is correctly diagnosed by libIGCM. This directory contains diagnostic text files for each model component. It won't be created if the job reaches the time limit and is stopped by the batch scheduler. 
    115  
    116 If the crash is not properly handled by libIGCM, you will find a lot of files in `$RUN_DIR`. In `Script_Output_JobName`, find the line starting with `IGCM_sys_Cd : ` and get the location of the RUN_DIR. 
     44  * [wiki:Doc/Running#run.cardattheendofasimulation run.card] 
     45  * [wiki:Doc/Running#Script_Output_JobName Script_Output_JobName] 
     46 
     47A `Debug` directory is created if the job failed during the simulation part (in other words, while the job was running the model computations). This directory contains diagnostic text files for each component of the configuration. It won't be created if the job reaches the time limit and is stopped by the batch scheduler. 
     48 
     49If the crash is not properly handled by libIGCM, you will find a lot of files in `$RUN_DIR`. You can find the path of this directory by searching for RUN_DIR in `Script_Output_JobName`:  
     50{{{ 
     51IGCM_sys_MkdirWork : /scratch_of_computer/login/RUN_DIR/Job_number/***/ 
     52IGCM_sys_Cd : /scratch_of_computer/login/RUN_DIR/Job_number/***/ 
     53}}} 
    11754 
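The search described above can be scripted. This is a minimal sketch, assuming the script output contains lines of the form `IGCM_sys_Cd : <path>`; the function name `run_dir_from_output` and the sample paths are made up for the illustration.

```shell
# Print the path from every "IGCM_sys_Cd : <path>" line of a script output file.
# Assumption: such lines have the form shown in the documentation.
run_dir_from_output() {
    sed -n 's/^[[:space:]]*IGCM_sys_Cd[[:space:]]*:[[:space:]]*//p' "$1"
}

# Demonstration on made-up content (not real libIGCM output).
demo_out=$(mktemp)
printf '%s\n' \
    'IGCM_sys_MkdirWork : /scratch/login/RUN_DIR/12345/demo/' \
    'IGCM_sys_Cd : /scratch/login/RUN_DIR/12345/demo/' > "$demo_out"
run_dir_from_output "$demo_out"
rm -f "$demo_out"
```

If the output file contains several `IGCM_sys_Cd` lines, all of their paths are printed and you pick the one that contains `RUN_DIR`.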
    11855If the simulation was successfully completed, the output files will be stored in the following directory:  
    119   * `$CCCSTORE/IGCM_OUT/TagName/[SpaceName]/[ExperimentName]/JobName` at TGCC 
    120   * `ergon:IGCM_OUT/TagName/[SpaceName]/[ExperimentName]/JobName` at IDRIS 
    121  
    122 with the following subdirectories:  
     56  * `STORE/IGCM_OUT/TagName/[SpaceName]/[ExperimentName]/JobName` at TGCC and IDRIS if you are running a DEVT or PROD simulation  
     57  * `SCRATCH/IGCM_OUT/TagName/[SpaceName]/[ExperimentName]/JobName` at TGCC and IDRIS if you are running a TEST simulation  
     58 
     59 
     60In the case of a DEVT or PROD simulation, you will find the following subdirectories:  
    12361  * `RESTART` = tar of the restart files for all model components, created at the pack frequency 
    12462  * `DEBUG` = tar of the debug text files for all model components 
     
    13472  * `ATM/Output`, `CPL/Output`, etc... = NetCDF output of the model components 
    13573 
    136 [[NoteBox(note, If !SpaceName was set to TEST the output files will remain in the work directories,600px)]] 
    137   * `$SCRATCHDIR/IGCM_OUT/TagName/[SpaceName]/[ExperimentName]/JobName` at TGCC 
    138   * `$WORKDIR/IGCM_OUT/TagName/[SpaceName]/[ExperimentName]/JobName` at ada/IDRIS 
     74In the case of a TEST simulation, you will find the following subdirectories:  
     75  * `ATM` 
     76  * `CPL` 
     77  * `ICE` 
     78  * `OCE` 
     79  * `SRF` 
     80  * `SBG` 
     81  * `Out` = run log files 
     82  * `Exe` = executables used for the run 
     83 
     84  * `ATM/Output`, `CPL/Output`, etc... = NetCDF output files of the model components 
     85  * `ATM/Restart`, `CPL/Restart`, etc... = NetCDF restart files of the model components  
     86  * `ATM/Debug`, `CPL/Debug`, etc... =  text output files of the model components  
     87   
    13988## Diagnostic tools : Checker ## 
    14089### TimeSeries_Checker ###