Changes between Version 10 and Version 11 of Documentation/UserGuide/HangCrash


Timestamp:
2020-04-20T12:42:21+02:00 (4 years ago)
Author:
dgoll
Comment:

--

  • Documentation/UserGuide/HangCrash

    v10 v11  
    1 = How to find where the model is hanging  = 
     1= What to do if the model is hanging  = 
    22 
    33Author: S. Luyssaert[[BR]] 
    4 Last check: 2020/02/28, P. Peylin  
     4Last check: 2020/04/20, D. Goll 
    55 
    66[[PageOutline]] 
     
    88== Objectives == 
    99 
    10 This page provides some information on how to test if your model run is hanging (crashed without terminating the job) or if it is still properly running (given that you have not obtained the final outputs that you expect). This can happen due to the use of unsupported ways to stop the execution of the model  
     10This page provides some information on how to test whether your model run is hanging (crashed without terminating the job) or still running properly. In addition, it shows how to prevent the model from hanging.  
    1111 
    12 '''Context:'''  
     12'''What is a hanging model'''  
    1313 
    14 You launch the model in a parallel run and you know from previous runs that the run should take, say 600 seconds. After 1200 seconds the model is still running. That looks suspicious! A likely cause of this problem is that one processor is hanging and thus preventing the model to properly finish or to properly crash. Here is some advice: 
     14You launch the model in a parallel run and you know from previous runs that the run should take, say, 600 seconds. After 1200 seconds the model is still running. That looks suspicious! A likely cause of this problem is that one processor is hanging and thus preventing the model from finishing or crashing properly.  
    1515 
    16 === Check whether the model really hangs === 
    17 Open the Script_Output file and search for RUN_DIR. You should find a path that looks like /ccc/scratch/cont003/dsm/p529grat/RUN_DIR/XXX/XXX. This is where the model is actually running. If you are working on irene/jean-zay or ciclad you can simply go to that folder and check when the most recent changes were made and to which files. The time of the last changes should give you an indication of whether the model really hangs or whether you are just too impatient. If, however, you are working on OBELIX, the run directory is on the /scratch but the folder where the model is running is not accessible. Open the Job you want to run and search for RUN_DIR_PATH. The instruction will be commented out. This is a good place to specify the run directory you want to use, e.g., RUN_DIR_PATH=/scratch01/sluys/RUN_DIR. Delete the job that was hanging, launch it again and have a look in /scratch01/sluys/RUN_DIR. Details can be found at https://forge.ipsl.jussieu.fr/igcmg_doc/wiki/Doc/Setup  
     16== Check whether the model really hangs (i.e. doesn't produce data) == 
     17Investigate whether the model has stopped producing data by looking at the time stamps of the files in the run directory of the model, using the following command: 
     18{{{ 
     19 ls -lrt <path to run dir> 
     20}}} 
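
The time stamp check can also be turned into a number. The following is a minimal sketch, not part of the original page: the function name `seconds_since_last_write` is hypothetical, and it assumes GNU `stat` (as found on most Linux clusters). It prints how many seconds have passed since the newest file in the run directory was written. A value much larger than the usual interval between two outputs suggests that the model hangs.

```shell
# Hypothetical helper: print the seconds elapsed since the most recently
# modified file in a run directory was last written.
seconds_since_last_write() {
    run_dir="$1"
    # ls -t sorts by modification time, newest first; head keeps that entry.
    newest=$(ls -t "$run_dir" | head -n 1)
    if [ -z "$newest" ]; then
        echo "no files in $run_dir" >&2
        return 1
    fi
    now=$(date +%s)
    # Assumption: GNU stat. On BSD/macOS the equivalent is: stat -f %m
    last=$(stat -c %Y "$run_dir/$newest")
    echo $((now - last))
}
```

Usage: `seconds_since_last_write <path to run dir>`.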
    1821 
    19 === (Avoid the use of) unsupported execution stop === 
     22How to get the '''path to run dir''': Open the Script_Output file and search for RUN_DIR. You should find a path that looks like /ccc/scratch/cont003/dsm/p529grat/RUN_DIR/XXX/XXX. This is where the model is actually running. If you are working on irene/jean-zay or ciclad you can simply go to that folder and check when the most recent changes were made and to which files.  
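
The search for RUN_DIR can be done from the command line. This is a minimal sketch under assumptions: the function name `find_run_dir` is hypothetical, and the Script_Output file is assumed to contain the run directory as an absolute path without spaces. It lists every distinct path in the file that contains RUN_DIR.

```shell
# Hypothetical helper: list the distinct absolute paths containing RUN_DIR
# that appear in a Script_Output file (the file name is passed as argument).
find_run_dir() {
    # -o prints only the matching part of each line: a path that starts
    # with "/" and has RUN_DIR somewhere in it. sort -u removes repeats.
    grep -o '/[^[:space:]]*RUN_DIR[^[:space:]]*' "$1" | sort -u
}
```

Usage: `find_run_dir <your Script_Output file>`. The result can then be passed to `ls -lrt`.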
     23 
     24Special case '''OBELIX''': here the run directory is by default on /scratch, where the RUN_DIR is not accessible. To change this, open the Job you want to run and search for RUN_DIR_PATH. The line is commented out by default.  
     25This is a good place to specify the run directory you want to use, e.g., RUN_DIR_PATH=/scratch01/sluys/RUN_DIR. Delete the job that was hanging, launch it again and have a look in /scratch01/sluys/RUN_DIR. Details can be found at https://forge.ipsl.jussieu.fr/igcmg_doc/wiki/Doc/Setup  
     26 
     27== Prevent the model from hanging (i.e. avoid unsupported execution stops) == 
    2028Did you follow the "coding guidelines"? If not, it is time to do so! Check the coding guidelines on the use of CALL ipslerr() instead of STOP. Replace all your STOP statements with a CALL to ipslerr(). Don't be lazy now: pass proper information to the ipslerr function, otherwise ipslerr may do its job but you still won't know where the model crashes. 
    2129