
!!! OLD MACHINE, DOESN'T EXIST ANYMORE !!!

Working on the curie machine


1. On-line users manual

2. Job manager commands

3. Before starting a job

3.1. Specify the project name

Since January 2013, you must specify in the job header the project whose computing time allocation will be charged:

#MSUB -A genxxx

3.2. QoS test

QoS (Quality of Service) is a test queue. You can have a maximum of 2 jobs in the test queue, each limited to 30 minutes and 70 nodes (= 1120 tasks). In the job header you must add:

#MSUB -Q test

and change the CPU time limit accordingly:

#MSUB -T 1800  
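
Putting the pieces together, a complete test-queue header could look like this (a sketch combining the fragments above with the standard options from section 6; MyTestJob and gen1234 are placeholders):

#MSUB -r MyTestJob
#MSUB -n 32            # number of MPI tasks (up to 1120 in the test queue)
#MSUB -T 1800          # wall clock limit: 30 minutes max in the test queue
#MSUB -Q test          # test QoS
#MSUB -q standard      # thin nodes
#MSUB -A gen1234       # project to charge (placeholder)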

To check the QoS parameters, use:

 > ccc_mqinfo
Name     Partition  Priority  MaxCPUs  SumCPUs  MaxNodes  MaxRun  MaxSub     MaxTime
-------  ---------  --------  -------  -------  --------  ------  ------  ----------
long             *        18     2048     4096                        32  3-00:00:00
normal           *        20                                         300  1-00:00:00
test      standard        40     1260     1260        70               2    00:30:00

4. Other job manager commands

5. Thin nodes

Since April 2016, only thin nodes are available at TGCC. The job header must include #MSUB -q standard to use thin nodes.

5.1. SSD on standard node: how to use it for the rebuild job

Using the SSD can speed up the rebuild job. It is very useful for medium and high resolution configurations such as IPSLCM5A-MR. You only have to change the header and RUN_DIR_PATH in rebuild.job. Be aware that the job will run faster but its cost will be multiplied by a factor of 16, because a full standard node (i.e. 16 CPUs) is dedicated to it. Beware also of the size of /tmp (64 GB per node): for a configuration with very high resolution and very high output frequency, the /tmp of a standard node may be too small; in this case see below.

#MSUB -q standard # thin nodes
#MSUB -x  # exclusive node
RUN_DIR_PATH=/tmp/REBUILD_DIR_MR_$$

6. Job Header for MPI - MPI/OMP with libIGCM

Since October 2015 and libIGCM_v2.7, ins_job (libIGCM/ins_job) fills in the job header automatically. Nevertheless, you can check it against the job header examples provided here.

6.1. Forced model

6.1.1. MPI

To launch a job on XXX MPI tasks:

#MSUB -r MyJob
#MSUB -o Script_Output_MyJob.000001    # standard output
#MSUB -e Script_Output_MyJob.000001    # error output
#MSUB -eo
#MSUB -n XXX                           # number of MPI tasks
#MSUB -T 86400                         # Wall clock limit (seconds)
#MSUB -q standard                      # thin nodes
#MSUB -A gen****
BATCH_NUM_PROC_TOT=$BRIDGE_MSUB_NPROC

6.1.2. hybrid MPI-OMP

The hybrid version is only available with the _v6 configurations.

To launch a job with XXX MPI tasks and YYY OMP threads per MPI task:
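
The corresponding header is not reproduced in this print; as a sketch, assuming the same conventions as the MPI-only header above plus the ccc_msub -c option (cores per MPI task) and an explicit OMP_NUM_THREADS, it could look like:

#MSUB -r MyJob
#MSUB -o Script_Output_MyJob.000001    # standard output
#MSUB -e Script_Output_MyJob.000001    # error output
#MSUB -eo
#MSUB -n XXX                           # number of MPI tasks
#MSUB -c YYY                           # cores (OMP threads) per MPI task
#MSUB -T 86400                         # Wall clock limit (seconds)
#MSUB -q standard                      # thin nodes
#MSUB -A gen****
export OMP_NUM_THREADS=YYY             # assumption: set explicitly to match -c
BATCH_NUM_PROC_TOT=$BRIDGE_MSUB_NPROC  # as in the MPI-only header above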

6.2. Coupled model

6.2.1. MPI

To launch a job on XXX MPI tasks:

#MSUB -r MyCoupledJob
#MSUB -o Script_Output_MyCoupledJob.000001    # standard output
#MSUB -e Script_Output_MyCoupledJob.000001    # error output
#MSUB -eo
#MSUB -n XXX                                  # number of MPI tasks
#MSUB -T 86400                                # Wall clock limit (seconds)
#MSUB -q standard                             # thin nodes
#MSUB -A gen****
BATCH_NUM_PROC_TOT=$BRIDGE_MSUB_NPROC

6.2.2. hybrid MPI-OMP

The hybrid version is only available with the _v6 configurations.

To launch a job with XXX (e.g. 27) MPI tasks and YYY (e.g. 4) OMP threads for LMDZ, ZZZ (e.g. 19) MPI tasks for NEMO and SSS (e.g. 1) XIOS servers:
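
With those example values the job needs 27 x 4 = 108 cores for LMDZ, plus 19 for NEMO and 1 for XIOS, i.e. 128 cores = 8 thin nodes of 16 cores. The header is not reproduced in this print; here is a sketch, assuming the same conventions as the MPI-only coupled header above (the -c value and the task counts are illustrative assumptions, and the mapping of executables to tasks is left to libIGCM):

#MSUB -r MyCoupledJob
#MSUB -o Script_Output_MyCoupledJob.000001    # standard output
#MSUB -e Script_Output_MyCoupledJob.000001    # error output
#MSUB -eo
#MSUB -n 47                                   # total MPI tasks: 27 (LMDZ) + 19 (NEMO) + 1 (XIOS)
#MSUB -c 4                                    # cores per task, sized for LMDZ's 4 OMP threads (assumption)
#MSUB -x                                      # exclusive nodes
#MSUB -T 86400                                # Wall clock limit (seconds)
#MSUB -q standard                             # thin nodes
#MSUB -A gen****
BATCH_NUM_PROC_TOT=128                        # assumption: 27*4 + 19 + 1 cores in total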

7. Tricks

8. How to use the ddt debugger for the coupled model (or any other MPMD mode)

8.1. MPI only

8.2. Hybrid MPI-OpenMP (use of mpirun -rankfile method)

9. Errors on curie when running simulations

9.1. Job error: KILLED ... WITH SIGNAL 15

slurmd[curie1006]: error: *** STEP 639264.5 KILLED AT 2012-08-01T17:00:29 WITH SIGNAL 15 ***

This error message means that the time limit was exceeded. To solve the problem, run clean_PeriodLength.job, increase the time limit (or decrease PeriodNb) and restart.
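
Concretely, the recovery could look like this (a sketch; the values and the job name Job_MyJob are placeholders):

> ./clean_PeriodLength.job    # 1) clean up the interrupted period (run from the experiment directory)
# 2) in Job_MyJob, raise the limit and/or lower PeriodNb:
#MSUB -T 86400                #    e.g. raised from 43200 seconds
PeriodNb=5                    #    e.g. lowered from 10
> ccc_msub Job_MyJob          # 3) resubmit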

9.2. Aren't there restart files for LMDZ?

Problem:

Solution:

Don't ask questions! Run clean_PeriodLength.job and restart the simulation.

9.3. Errors when creating or transferring files

The file systems $CCCWORKDIR, $CCCSTOREDIR and $SCRATCHDIR are fragile. The error messages look like:

 Input/output error
 Cannot send after transport endpoint shutdown

Don't ask questions; just resubmit the job.
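
In practice, resubmitting just means running ccc_msub again on the job script (a sketch; Job_MyJob is a placeholder name):

> ccc_msub Job_MyJob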

9.4. Job error: Segmentation fault

/var/spool/slurmd/job637061/slurm_script: line 534:   458 Segmentation fault      /bin/ksh -x ${TEMPO_SCRIPT}

If you get this kind of message, don't ask questions; just resubmit the job.

9.5. Error when submitting jobs

This message:

error: Batch job submission failed: Job violates accounting policy (job submit limit, user's size and/or time limits)

means that you have submitted too many jobs (wait for some jobs to finish, then resubmit), that your header is not properly written, or that you did not specify the genci project from which the computing time must be deducted. The ccc_mqinfo command returns the maximum number of jobs (to this day: 300 for 24h-max jobs, 8 for 72h-max jobs and 2 for test jobs (30 min, max 8 nodes)):

ccc_mqinfo
Name    Priority  MaxCPUs  MaxNodes  MaxRun  MaxSub     MaxTime
------  --------  -------  --------  ------  ------  ----------
long          18     1024                 2       8  3-00:00:00 
normal        20                                300  1-00:00:00 
test          40                  8               2    00:30:00 

9.6. Long waiting time before a job execution

The computation of user priority is based on three cumulative criteria:

If your job is far down the waiting list and you are working on several projects, charge the project with the least computing time used.

This computation is not satisfactory because we would prefer to encourage long simulations. We are looking for real examples of abnormal waiting situations; please take the time to give us your feedback.

9.7. Disk quota exceeded

Be careful with the quotas on /scratch! Monitor them with the ccc_quota command. Delete the temporary directories created by jobs that ended too early and did not clear the $SCRATCHDIR/TMPDIR_IGCM and $SCRATCHDIR/RUN_DIR directories. You should have a 20 TB quota on curie.

> ccc_quota
Disk quotas for user xxxx:

             ------------------ VOLUME --------------------  ------------------- INODE --------------------
 Filesystem       usage        soft        hard       grace       files        soft        hard       grace
 ----------       -----        ----        ----       -----       -----        ----        ----       -----
    scratch       3.53T         20T         20T           -      42.61k          2M          2M           - 
      store           -           -           -           -      93.76k        100k        101k           - 
       work     232.53G          1T        1.1T           -      844.8k        1.5M        1.5M           - 
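
To track down the leftovers, something like the following can help (a sketch; the paths follow the libIGCM conventions mentioned above, and <old_run_dir> is a placeholder to double-check before deleting):

> du -sh $SCRATCHDIR/TMPDIR_IGCM/* $SCRATCHDIR/RUN_DIR/*    # find the big leftover directories
> rm -rf $SCRATCHDIR/RUN_DIR/<old_run_dir>                  # remove a directory left by a dead job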

9.8. A daemon (pid unknown) died unexpectedly with status 1 while attempting to launch so we are aborting.

This message appears when the time limit is reached (see section 9.1). Increase the requested time in the job header or reduce PeriodNb in your job to reduce the number of loop iterations.

10. REDO

Simulations with the IPSLCM5/IPSLCM6 coupled model are reproducible if you use the same Bands file for LMDZ. See the trusting TGCC/curie results on this web page: http://webservices.ipsl.jussieu.fr/trusting/