51 | | ### !RunChecker ### |
52 | | |
53 | | This tool, provided with libIGCM, reports the status of your simulations. |
54 | | |
55 | | #### Example of a !RunChecker output for a successful simulation #### |
56 | | |
57 | | [[Image(RunChecker-OK.jpg, 50%)]] |
58 | | |
59 | | #### General description #### |
60 | | |
61 | | Below is a diagram of a long simulation that failed. |
62 | | |
63 | | [[Image(RunChecker_extrait.jpg, 50%)]] |
64 | | |
65 | | 1. In this block, you will find information about the simulation directories. |
66 | | 1. Here is information about the main job; the information comes from the `config.card` and the `run.card` : |
67 | | * The first line returns the job name and the date on which data was last saved to disk, as recorded in the `run.card`. |
68 | | * `DateBegin` - `DateEnd` : start and end dates of the simulation as defined in `config.card`. |
69 | | * `PeriodState` : variable coming from the `run.card` giving the run's status : |
70 | | * `OnQueue`, `Waiting` : the run is queued ; |
71 | | * `Running` : the job is running ; |
72 | | * `Completed` : the run was completed successfully ; |
73 | | * `Fatal` : the run failed. |
74 | | * `Current Period` : this variable from `run.card` shows which integration step (most often one month or one year) is being computed. |
75 | | * `CumulPeriod` : variable from `run.card` giving the number of the period being computed. |
76 | | * `Pending Rebuilds, Nb | From | To` : number of files waiting to be rebuilt, followed by the dates of the oldest and the latest files. Most configurations use parallel I/O and no longer have rebuild steps. |
77 | | 1. The third block contains the status of the latest post processing jobs, the Rebuilds, the Pack, the Monitoring and the Atlas. Only the computed periods are returned for the Monitoring and the Atlas. For the other processing jobs, the computed periods and the number of successfully transferred files are returned. |
78 | | 1. Lastly, the current date. |
79 | | |
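The fields described above come from `run.card`, so they can also be read directly when the graphical listing is not at hand. The sketch below is only an illustration: it assumes that `run.card` stores each field on its own line as `Key= value` and that comment lines start with `#`, which this page does not state. The helper names `card_value` and `describe_state` are hypothetical and are not part of libIGCM.

```shell
# Print the value of KEY from a card file whose lines look like "KEY= value".
# ASSUMPTION: the "KEY= value" layout is a guess about run.card, not taken from this page.
card_value() {
    # $1 = card file, $2 = key name
    awk -F'=' -v key="$2" '
        $1 ~ /^[ \t]*#/ { next }                 # skip comment lines
        {
            name = $1
            gsub(/[ \t]/, "", name)              # strip blanks around the key
            if (name == key) {
                value = $0
                sub(/^[^=]*=[ \t]*/, "", value)  # drop everything up to the first "="
                sub(/[ \t]+$/, "", value)        # trim trailing blanks
                print value
                exit
            }
        }
    ' "$1"
}

# Translate a PeriodState value into the meaning listed in the description above.
describe_state() {
    case "$1" in
        OnQueue|Waiting) echo "queued" ;;
        Running)         echo "running" ;;
        Completed)       echo "completed successfully" ;;
        Fatal)           echo "failed" ;;
        *)               echo "unknown state: $1" ;;
    esac
}
```

A possible call, run from the directory holding the `run.card`, would be `describe_state "$(card_value run.card PeriodState)"`.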
80 | | #### Usage and options #### |
81 | | |
82 | | The script can be started from any machine : |
83 | | |
84 | | {{{ |
85 | | #!sh |
86 | | path/to/libIGCM/RunChecker.job [-u user] [-q] [-j n] [-s] [-p path] job_name |
87 | | }}} |
88 | | |
89 | | * `-u user` : runs the checker on a simulation belonging to another user |
90 | | * `-q` : silent mode |
91 | | * `-j n` : displays `n` post processing jobs (10 by default) |
92 | | * `-s` : looks for a simulation in `$WORKDIR` and adds it to its catalog of simulations before displaying the information |
93 | | * `-p path` : absolute path of the directory containing the `config.card`, given instead of `job_name` |
94 | | |
95 | | #### Use #### |
96 | | |
97 | | This listing allows you to detect a few known errors : |
98 | | * Job running but the date of the last time output was written to disk in the `run.card` is much older than the current date : |
99 | | * A date shown in red in the post processing listing means that file transfer errors occurred in that post processing job. |
100 | | * If, for a given post processing job, the number of successfully transferred files varies from one date to another, errors might have occurred. |
101 | | [[NoteBox(warn, In some cases (such as for historical simulations where the COSP outputs are activated starting from 1979 ...) this behavior is normal!, 600px)]] |
102 | | * A `PeriodState` set to `Fatal` indicates that an error occurred either in the main job or in one of the post processing jobs. |
103 | | * If the number of rebuilds waiting is above... |
104 | | |
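The first symptom in the list (a running job whose `run.card` has not been updated for a long time) can be approximated with a file age test. This is a rough sketch, not what !RunChecker itself does: it uses the modification time of the file as a stand-in for the date recorded inside `run.card`, the function name `is_stale` is hypothetical, and the threshold is yours to choose. It relies on the `-mmin` test of `find`, which GNU and BSD `find` provide but POSIX does not require.

```shell
# Succeed (exit status 0) when FILE was last modified more than MINUTES ago.
# ASSUMPTION: the file modification time is used as a proxy for the date
# written inside run.card; the two are not guaranteed to match.
is_stale() {
    # $1 = file to inspect, $2 = age threshold in minutes
    [ -n "$(find "$1" -prune -mmin +"$2" 2>/dev/null)" ]
}
```

For example, `is_stale run.card 360 && echo "run.card not updated for more than 6 hours"` prints a warning only when the file is older than the chosen threshold.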
105 | | #### Good things to know #### |
106 | | |
107 | | During the first integration of a simulation using IPSLCM5, an additional rebuild file is transferred. This extra file is the NEMO `mesh_mask.nc` file. It is created and transferred only during the first step. It is then used for each "rebuild" of the NEMO output files to mask the variables. |
111 | | * [wiki:DocFsimu#run.cardattheendofasimulation run.card] |
112 | | * `Script_Output_JobName` |
113 | | |
114 | | A `Debug` directory is created if the simulation failed in a way that is correctly diagnosed by libIGCM. This directory contains diagnostic text files for each model component. It won't be created if the job reaches the time limit and is stopped by the batch scheduler. |
115 | | |
116 | | If the crash is not properly handled by libIGCM, you will find a lot of files in `$RUN_DIR`. In `Script_Output_JobName`, find the line starting with `IGCM_sys_Cd : ` and get the location of the RUN_DIR. |
| 44 | * [wiki:Doc/Running#run.cardattheendofasimulation run.card] |
| 45 | * [wiki:Doc/Running#Script_Output_JobName Script_Output_JobName] |
| 46 | |
| 47 | A `Debug` directory is created if the job failed during the simulation part (in other words, while the job was running the model computations). This directory contains diagnostic text files for each component of the configuration. It won't be created if the job reaches the time limit and is stopped by the batch scheduler. |
| 48 | |
| 49 | If the crash is not properly handled by libIGCM, you will find a lot of files in `$RUN_DIR`. To find the path of this directory, search for RUN_DIR in `Script_Output_JobName`: |
| 50 | {{{ |
| 51 | IGCM_sys_MkdirWork : /scratch_of_computer/login/RUN_DIR/Job_number/***/ |
| 52 | IGCM_sys_Cd : /scratch_of_computer/login/RUN_DIR/Job_number/***/ |
| 53 | }}} |
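The search described above can be scripted. The sketch below assumes that the relevant line has the form shown in the excerpt (`IGCM_sys_Cd : ` followed by a path containing `RUN_DIR`) and keeps the first such line; whether the first or a later occurrence is the right one for your job is an assumption to check against your own output file. The function name `run_dir_from_log` is hypothetical and is not part of libIGCM.

```shell
# Print the path found on the first "IGCM_sys_Cd : <path>" line that mentions RUN_DIR.
# ASSUMPTION: the line layout matches the excerpt shown in this section.
run_dir_from_log() {
    # $1 = path to the Script_Output_JobName file
    grep '^[[:space:]]*IGCM_sys_Cd[[:space:]]*:.*RUN_DIR' "$1" \
        | head -n 1 \
        | sed 's/^[^:]*:[[:space:]]*//; s/[[:space:]]*$//'
}
```

A possible call is `cd "$(run_dir_from_log Script_Output_JobName)"`, where `Script_Output_JobName` stands for the actual output file of your job.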
136 | | [[NoteBox(note, If !SpaceName was set to TEST the output files will remain in the work directories,600px)]] |
137 | | * `$SCRATCHDIR/IGCM_OUT/TagName/[SpaceName]/[ExperimentName]/JobName` at TGCC |
138 | | * `$WORKDIR/IGCM_OUT/TagName/[SpaceName]/[ExperimentName]/JobName` at ada/IDRIS |
| 74 | In the case of a TEST simulation, you will find the following subdirectories: |
| 75 | * `ATM` |
| 76 | * `CPL` |
| 77 | * `ICE` |
| 78 | * `OCE` |
| 79 | * `SRF` |
| 80 | * `SBG` |
| 81 | * `Out` = run log files |
| 82 | * `Exe` = executables used for the run |
| 83 | |
| 84 | * `ATM/Output`, `CPL/Output`, etc. = NetCDF output files of the model components |
| 85 | * `ATM/Restart`, `CPL/Restart`, etc. = NetCDF restart files of the model components |
| 86 | * `ATM/Debug`, `CPL/Debug`, etc. = text output files of the model components |
| 87 | |