This section describes the monitoring tools, the tools to identify and solve problems, and the tools to monitor and restart the post processing jobs if needed.
We strongly encourage you to check your simulation frequently during run time.
The batch manager at each computing center provides tools to check the status of your jobs, for example whether a job is queued, running or suspended.
You can use ccc_mstat on Curie. To see the available options and useful scripts, see Working on Curie.
You can use llq on Ada. To see the available options and useful scripts, see Working on Ada.
When the simulation has started, libIGCM creates the file run.card from the template run.card.init. run.card contains information about the current run period and the periods already completed, and libIGCM updates it at each run period. It also records the time consumption of each period. The status of the job is set to OnQueue, Running, Completed or Fatal.
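As a minimal sketch, the status can be read from the command line with a small helper. The function name run_card_status is hypothetical, and the sketch assumes that run.card stores the status on a line of the form "PeriodState= <value>"; check this key name against your own run.card before relying on it.

```shell
# Print the job status stored in a run.card file.
# ASSUMPTION: the status is on a line starting with "PeriodState=".
# $1: path to run.card (defaults to ./run.card)
run_card_status() {
    card="${1:-run.card}"
    [ -r "$card" ] || { echo "cannot read $card" >&2; return 1; }
    # Keep only the first alphabetic word after the "=" sign.
    sed -n 's/^[[:space:]]*PeriodState[[:space:]]*=[[:space:]]*\([A-Za-z]*\).*/\1/p' "$card" | head -n 1
}
```

Called from the experiment directory as run_card_status run.card, it should print one of the four status values listed above.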
This tool provided with libIGCM allows you to find out your simulations' status.
Below is a diagram of a long simulation that failed.
The script can be started from any machine:
path/to/libIGCM/RunChecker.job [-u user] [-q] [-j n] [-s] [-p path] job_name
This listing allows you to detect a few known errors :
In some cases (such as for historical simulations where the COSP outputs are activated starting from 1979 ...) this behavior is normal!
During the first integration of a simulation using IPSLCM5, an additional rebuild file is transferred. This extra file is the NEMO "mesh_mask.nc" file. It is created and transferred only during the first step. It is then used for each "rebuild" of the NEMO output files to mask the variables.
Once your simulation is finished you will receive an email saying that the simulation was "Completed" or that it "Failed" and two files will be created in the working directory of your experiment:
A Debug/ directory is created if the simulation failed. This directory contains diagnostic text files for each model component.
If the simulation was successfully completed output files will be stored in the following directory:
with the following subdirectories:
If SpaceName was set to TEST, the output files will remain in the work directories.
TimeSeries_Checker.job can be used in diagnostic mode to check whether the time series have been created. You must edit TimeSeries_Checker.job before starting it in interactive mode (see TimeSeries_Checker.job) and answer n to the following question:
"Run for real (y/n)"
See SE_Checker.job.
Reminder: this file contains three parts:
These three parts are defined as follows:
#######################################
# ANOTHER GREAT SIMULATION #
#######################################
1st part (copying the input files)
#######################################
# DIR BEFORE RUN EXECUTION #
#######################################
2nd part (running the model)
#######################################
# DIR AFTER RUN EXECUTION #
#######################################
3rd part (post processing)
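To jump quickly to one of the three parts, the banner titles can be located by line number. The sketch below is only an illustration: the function name find_output_parts and the example file name are hypothetical, and it relies solely on the three banner titles shown above.

```shell
# List the line numbers of the three section banners in an output file.
# $1: path to the output file (hypothetical name, e.g. Script_Output_MYEXP)
find_output_parts() {
    grep -n -E 'ANOTHER GREAT SIMULATION|DIR BEFORE RUN EXECUTION|DIR AFTER RUN EXECUTION' "$1"
}
```

Each printed line starts with the line number of a banner, which you can then open directly in your editor or pager.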
A few common bugs are listed below:
If the following message is displayed in the second part of the file, it's because there was a problem during the execution:
========================================================================
EXECUTION of : mpirun -f ./run_file > out_run_file 2>&1
Return code of executable : 1
IGCM_debug_Exit : EXECUTABLE
!!!!!!!!!!!!!!!!!!!!!!!!!!
!! IGCM_debug_CallStack !!
!------------------------!
!------------------------!
IGCM_sys_Cp : out_run_file xxxxxxxxxxxx_out_run_file_error
========================================================================
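A quick way to check for this case is to extract the return codes recorded in the file. This is a sketch under the assumption that the line always reads "Return code of executable : N" as in the message above; the function name executable_return_codes is hypothetical.

```shell
# Print every executable return code recorded in an output file, one per line.
# $1: path to the output file
executable_return_codes() {
    sed -n 's/.*Return code of executable :[[:space:]]*\([0-9][0-9]*\).*/\1/p' "$1"
}
```

Any value other than 0 in the output points to a problem during the execution.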
If the following message is displayed :
========================================================================
EXECUTION of : mpirun -f ./run_file > out_run_file 2>&1
========================================================================
If there is a message indicating that the "restartphy.nc" file doesn't exist, it means that the model run ended before the end date of your simulation. If this happens and your model writes an output log separate from the simulation output log, you must refer to that log. For example, the output file of the ocean model is stored on the file server under this name:
IGCM_sys_Put_Out : ocean.output xxxxxxxx/OCE/Debug/xxxxxxxx_ocean.output
For LMDZ your output log is the same as the simulation output log and it has not been copied to the storage space. If your simulation has been performed on $SCRATCHDIR (TGCC) you can retrieve it there. Otherwise, you must restart your simulation using $WORKDIR (IDRIS) as the working directory keeping all needed files. You must also change the RUN_DIR_PATH variable. See here before restarting it.
In general, if your simulation stops you can look for the keyword "IGCM_debug_CallStack" in this file. This keyword will come after a line explaining the error you are experiencing.
Example:
--Debug1--> IGCM_comp_Update
IGCM_debug_Exit : IGCM_comp_Update missing executable create_etat0_limit.e
!!!!!!!!!!!!!!!!!!!!!!!!!!
!! IGCM_debug_CallStack !!
!------------------------!
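Since the explanation comes just before the keyword, it helps to print a few lines of leading context with each match. The sketch below uses the -B option of grep, which is available in GNU and BSD grep but is not part of POSIX; the function name show_callstack_context is hypothetical.

```shell
# Show each occurrence of the keyword together with the lines preceding it.
# $1: path to the output file
# $2: number of context lines to print before each match (default 3)
show_callstack_context() {
    grep -n -B "${2:-3}" 'IGCM_debug_CallStack' "$1"
}
```

In the example above, two lines of context are enough to reveal the "missing executable" message.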
Your problem could come from a programming error. To find it you can use the text output of the model components located in the Debug subdirectories. Your problem could also be caused by the computing environment. This kind of problem is not always easy to identify. It is therefore important to perform benchmark simulations to learn about the usual behavior of a successfully completed simulation.
If the simulation failed due to an abnormal exit from the executable, a Debug/ directory is created in the working directory. It contains output text files of all model components for your configuration. You should read them to look for errors. For example:
Please take the time to read and analyze the modifications you have made to the code. Nobody codes perfectly.
In this case, it's possible to relaunch the main job to rerun the last period.
If the simulation stopped before the end because of an error, it is possible to relaunch the latest period after making any necessary modifications. The simulation will then read run.card to know where to start and will continue until the end (if the problem was solved).
To relaunch manually you first need to be sure that no files have been stored for the same period. libIGCM provides two scripts that help you with this cleanup:
path/to/libIGCM/clean_month.job
path/to/libIGCM/clean_year.job [SSAA] # SSAA = year up to which you are deleting everything (this year included). By default, it's the current year in run.card
Please look at the next paragraph.
You can run post processing jobs once the main job is finished (for example if the post processing job was deactivated in config.card or if you encountered a bug).
On TGCC, the machine used for post processing is the same as the computing machine. On IDRIS, the machine to use for post processing is adapp (since July 2013), with the same file systems available: $WORKDIR, $HOME, etc. You can:
For the last two options you must first:
cd $PATH_MODIPSL/config/IPSLCM5A/ST11
mkdir -p POST_REDO
cd POST_REDO/
cp -pr ../COMP .
cp -pr ../POST .
cp -pr ../config.card .
cp -pr ../run.card .
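The same preparation can be wrapped in a small helper that stops early if something is missing. This is only a sketch: the function name prepare_post_redo is hypothetical, and treating run.card as optional is an assumption based on the fact that it is only needed when post processing part of a simulation.

```shell
# Create POST_REDO/ inside an experiment directory and copy the needed files.
# $1: path to the experiment directory (e.g. .../config/IPSLCM5A/ST11)
prepare_post_redo() {
    exp_dir="$1"
    # Fail early if a mandatory item is missing.
    for item in COMP POST config.card; do
        [ -e "$exp_dir/$item" ] || { echo "missing $exp_dir/$item" >&2; return 1; }
    done
    mkdir -p "$exp_dir/POST_REDO" || return 1
    cp -pr "$exp_dir/COMP" "$exp_dir/POST" "$exp_dir/config.card" "$exp_dir/POST_REDO/" || return 1
    # ASSUMPTION: run.card is copied only if it exists.
    if [ -e "$exp_dir/run.card" ]; then
        cp -p "$exp_dir/run.card" "$exp_dir/POST_REDO/" || return 1
    fi
}
```

For the example above the call would be prepare_post_redo $PATH_MODIPSL/config/IPSLCM5A/ST11.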
Before submitting a post processing job at TGCC (rebuild_fromWorkdir.job, pack_debug.job, pack_output.job, pack_restart.job, monitoring.job, create_ts.job, create_se.job) you must make sure that the submission group is present in the job header (#MSUB -A genxxxx). If it isn't, add it.
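This check is easy to automate before submission. The sketch below only tests that a "#MSUB -A" line with some group name is present in the header; it cannot tell whether the group is valid. The function name has_msub_project is hypothetical.

```shell
# Return 0 if the job file contains a "#MSUB -A <group>" header line, 1 otherwise.
# $1: path to the job file
has_msub_project() {
    grep -Eq '^#MSUB[[:space:]]+-A[[:space:]]+[^[:space:]]+' "$1"
}

# Example use, from the directory holding the job files:
#   for job in rebuild_fromWorkdir.job pack_output.job create_ts.job; do
#       has_msub_project "$job" || echo "$job: no #MSUB -A line"
#   done
```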
StandAlone=true
libIGCM=                      # Points to the libIGCM directory of the experiment
PeriodDateBegin=              # Beginning date of the last series to be rebuilt
NbRebuildDir=                 # Number of directories in the series to be rebuilt,
                              # up to PeriodDateBegin
REBUILD_DIR=                  # Path for the backup of files waiting to be reconstructed
                              # (looks like $SCRATCHDIR/IGCM_OUT/.../JobName/REBUILD, or
                              # $SCRATCHDIR/TagName/JobName/REBUILD for versions older than libIGCM_v2.0,
                              # if RebuildFromArchive=NONE)
MASTER=${MASTER:=curie|ada}   # Select the computing machine: MASTER=curie for example
ccc_msub rebuild_fromWorkdir.job   # TGCC
llsubmit rebuild_fromWorkdir.job   # IDRIS
The rebuild job submits pack_output.job automatically.
To run pack_output.job manually (for example if it was not submitted by the rebuild job), set the following variables in the job and submit it:
libIGCM=${libIGCM:=::modipsl::/libIGCM}   # Path of the libIGCM library
MASTER=${MASTER:=curie|ada}               # Machine on which you work
DateBegin=${DateBegin:=20000101}          # Start date of the period to be packed
DateEnd=${DateEnd:=20691231}              # End date of the period to be packed
PeriodPack=${PeriodPack:=10Y}             # Pack frequency
ccc_msub pack_output.job   # TGCC
llsubmit pack_output.job   # IDRIS
create_ts.job and create_se.job are submitted automatically.
libIGCM=${libIGCM:=::modipsl::/libIGCM}   # Path of the libIGCM library
MASTER=${MASTER:=curie|ada}               # Machine on which you work
DateBegin=${DateBegin:=20000101}          # Start date of the period to be packed
DateEnd=${DateEnd:=20691231}              # End date of the period to be packed
PeriodPack=${PeriodPack:=10Y}             # Pack frequency
ccc_msub pack_debug.job ; ccc_msub pack_restart.job   # TGCC
llsubmit pack_debug.job ; llsubmit pack_restart.job   # IDRIS
If you haven't done so yet, copy config.card, COMP, POST and, if needed, run.card (to post process only part of the simulation) into the POST_REDO/ directory.
There are two ways:
libIGCM=${libIGCM:=...MYEXP/modipsl/libIGCM}   # Path of the libIGCM library
SpaceName=${SpaceName:=DEVT}
ExperimentName=${ExperimentName:=pdControl}
JobName=${JobName:=MYEXP}
CARD_DIR=${CARD_DIR:=${CURRENT_DIR}}           # Path of the experiment directory
                                               # (including CURRENT_DIR if you copied
                                               # TimeSeries_Checker.job properly)
export BRIDGE_MSUB_PROJECT=gen2211             # Number of your GENCI project
./TimeSeries_Checker.job
./TimeSeries_Checker.job 2>&1 | tee TSC_OUT   # Create log file
grep Batch TSC_OUT                            # Find all the submitted jobs
StandAlone=true
libIGCM=                # Path of the libIGCM library
PeriodDateEnd=          # End date of the time series to be created
CompletedFlag=          # End date of the existing time series
TsTask=2D               # Select 2D or 3D
RebuildFrequency=true
ccc_msub create_ts.job   # TGCC
llsubmit create_ts.job   # IDRIS
If your time series (TS) are 2D and 3D you must run the create_ts jobs twice and change the TsTask variable accordingly.
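One way to avoid editing the job by hand twice is to generate one copy per TsTask value. This is a sketch, not the documented procedure: the function name set_ts_task and the output file names are hypothetical, and it assumes that create_ts.job still works when submitted from a copy with a different file name.

```shell
# Write a copy of a job file with TsTask set to the given value.
# $1: source job file, $2: TsTask value (2D or 3D), $3: output file
set_ts_task() {
    case "$2" in
        2D|3D) ;;
        *) echo "TsTask must be 2D or 3D" >&2; return 1 ;;
    esac
    # Replace only the value, keeping any trailing comment on the line.
    sed "s/^TsTask=[^[:space:]]*/TsTask=$2/" "$1" > "$3"
}

# Example use (TGCC; replace ccc_msub with llsubmit at IDRIS):
#   for task in 2D 3D; do
#       set_ts_task create_ts.job "$task" "create_ts_$task.job"
#       ccc_msub "create_ts_$task.job"
#   done
```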
Copy config.card, COMP, POST and run.card (to post process only part of the simulation) into the POST_REDO/ directory if you have not done so yet.
There are two methods:
./SE_Checker.job
# Create logfile:
./SE_Checker.job 2>&1 | tee SE_OUT
# Find all started jobs:
grep Batch SE_OUT
StandAlone=true
libIGCM=          # Path of the libIGCM library
PeriodDateEnd=    # End date of the decade to be processed
ccc_msub create_se.job   # TGCC
llsubmit create_se.job   # IDRIS