What to do if the model is hanging

Author: S. Luyssaert
Last check: 2020/04/20, D. Goll

Objectives

This page explains how to test whether your model run is hanging (it has crashed without terminating the job) or is still running properly. It also shows how to prevent the model from hanging.

What is a hanging model

You launch the model in a parallel run and you know from previous runs that it should take, say, 600 seconds. After 1200 seconds the model is still running. That looks suspicious! A likely cause is that one processor is hanging, which prevents the model from either finishing or crashing properly.

Check whether the model really hangs (i.e. doesn't produce data)

Check whether the model has stopped producing data by looking at the time stamps of the files in the model's run directory with the following command:

 ls -lrt <path to run dir>
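
To put a number on "when was the last file written", the sketch below prints the age in seconds of the most recently modified entry in a directory. The function name newest_file_age is hypothetical, and the sketch assumes GNU coreutils (stat -c %Y), as found on typical Linux clusters.

```shell
# Hypothetical helper (not part of the model or its tools):
# print the age, in seconds, of the most recently modified entry in a directory.
# Assumes GNU coreutils (stat -c %Y).
newest_file_age() {
    run_dir=$1
    # ls -t sorts by modification time, newest first
    newest=$(ls -t "$run_dir" | head -n 1)
    # An empty directory has no newest entry: report failure
    [ -n "$newest" ] || return 1
    now=$(date +%s)
    mtime=$(stat -c %Y "$run_dir/$newest")
    echo $((now - mtime))
}
```

Called as newest_file_age <path to run dir>, it returns a small number while the model is still writing. An age much larger than the usual interval between outputs suggests that the run is hanging.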

How to get the path to the run directory: open the Script_Output file and search for RUN_DIR. You should find a path that looks like /ccc/scratch/cont003/dsm/p529grat/RUN_DIR/XXX/XXX. This is where the model is actually running. If you are working on irene, jean-zay or ciclad, you can simply go to that folder and check when the most recent changes were made and to which files.
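
The search for RUN_DIR can also be done from the command line. The sketch below is a minimal example, assuming that the run directory appears in the Script_Output file as an absolute path without spaces. The function name find_run_dir is hypothetical, and the exact layout of the Script_Output file is not assumed beyond that.

```shell
# Hypothetical helper: list the distinct absolute paths containing RUN_DIR
# that appear in a Script_Output file.
# Assumption: the path is written as one token without spaces.
find_run_dir() {
    script_output=$1
    # -o prints only the matching part of each line; sort -u removes repeats
    grep -oE '/[^[:space:]]*RUN_DIR[^[:space:]]*' "$script_output" | sort -u
}
```

If several paths are printed, or a path carries trailing punctuation, inspect the matching lines by hand with grep -n RUN_DIR on the same file.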

Special case OBELIX: here the run directory is by default on the /scratch, where the RUN_DIR is not accessible. To change this, open the Job you want to run and search for RUN_DIR_PATH. The line is commented out by default. This is a good place to specify the run directory you want to use, e.g., RUN_DIR_PATH=/scratch01/sluys/RUN_DIR. Delete the job that was hanging, launch it again and have a look in /scratch01/sluys/RUN_DIR. Details can be found at https://forge.ipsl.jussieu.fr/igcmg_doc/wiki/Doc/Setup

Prevent the model from hanging (i.e. avoid unsupported execution stops)

Did you follow the "coding guidelines"? If not, it is time to do so! Check the coding guidelines on the use of CALL ipslerr() instead of STOP, and replace all your STOP statements with a CALL to ipslerr(). Take the time to pass proper information to ipslerr; otherwise ipslerr may do its job but you still won't know where the model crashed.
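
To find the STOP statements that still need replacing, a search such as the following sketch may help. The function name find_stop_statements is hypothetical, and the sketch assumes that the sources use the .f90 or .F90 extension. It lists candidates only: the word "stop" inside a string or a trailing comment is also reported, so review each hit by hand.

```shell
# Hypothetical helper: list lines in Fortran sources that contain the word STOP.
# Assumption: source files end in .f90 or .F90.
find_stop_statements() {
    src_dir=$1
    # Match "stop" as a whole word, case-insensitively, with file name and line number,
    # then drop lines that are entirely a Fortran comment (first non-blank character is !)
    grep -rniE --include='*.f90' --include='*.F90' \
        '(^|[^[:alnum:]_])stop([^[:alnum:]_]|$)' "$src_dir" \
        | grep -vE '^[^:]+:[0-9]+:[[:space:]]*!'
}
```

Each line of output has the form file:line:content, which makes it easy to jump to the statement and replace it with a call to ipslerr.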

Supported execution stop

You can force the model to stop with the following lines of code.

!++++++++TEMP+++++++++
WRITE(numout,*) "This should be the last sentence in all CPUS!"
CALL MPI_BARRIER(MPI_COMM_ORCH,ierr)
CALL ipslerr_p (3,'forestry', 'Seeing if we reach this point...remove!','','')
!+++++++++++++++++++

If this code is pasted before the lines where the model hangs, ALL processors will stop with the error message written in ipslerr. If it is pasted after the lines that make the model hang, all processors will stop with the error message except the one that hangs.

Flush the output buffer

CALL flush(numout)

This writes everything stored in the buffer of numout to the file. If you add some write statements and put this call right after them, then seeing the written text in the output file tells you that the processor reached that write statement. If you don't see the text, the processor did not get there. "flush" is not generally used because it slows down the code a bit, but it is very useful when debugging.
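
Once a write statement and a flush are in place, the per-processor output files can be searched for the written text. The sketch below lists the files that do not contain a given marker, which points to the processors that never reached the write statement. The function name files_missing_marker is hypothetical, and the names of the output files depend on your setup, so they are passed as arguments.

```shell
# Hypothetical helper: print the names of the files that do NOT contain the marker text.
# Usage: files_missing_marker "marker text" file1 file2 ...
files_missing_marker() {
    marker=$1
    shift
    # -L lists files without a match; -F treats the marker as a fixed string
    grep -LF -- "$marker" "$@"
}
```

For example, files_missing_marker "This should be the last sentence in all CPUS!" followed by the list of output files in the run directory prints only the files of the processors that did not write the sentence.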