Problem when running in parallel with several processes on obelix

Author: B. Guenet
Last revision: 2020/06/16, B. Guenet

Note added later

2024/02/12: Josefine Ghattas
The default behaviour of libIGCM has changed: the run folder (RUN_DIR) is now created by default in /home/scratch01/yourlogin instead of /tmp. This change prevents the error described below.

Question by Laura Sereni on 2020/06/11

I tried to launch a simulation over Europe with 36 processes (nodes=6:ppn=6, i.e. 6 nodes with 6 processes per node) and it failed with the following error messages:

[obelix22:73775] [[50269,0],0] usock_peer_send_blocking: send() to socket 57 failed: Broken pipe (32)

[obelix22:73775] [[50269,0],0] ORTE_ERROR_LOG: Unreachable in file oob_usock_connection.c at line 316

[obelix22:73775] [[50269,0],0]-[[50269,1],3] usock_peer_accept: usock_peer_send_connect_ack failed

Answer by Fabienne Maignan

On obelix, it is not recommended to run on the /tmp disk when requesting several processes. In the job script, change RUN_DIR_PATH from

#RUN_DIR_PATH=/workdir/or/scratchdir/of/this/machine

to

RUN_DIR_PATH=/home/diskname/mylogin/RUNDIR

where diskname must be replaced by the name of the disk you want to run on (e.g. orchidee01, surface7) and mylogin by your login. The RUNDIR directory must also be created before running.
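
Creating the run directory can be done once from the command line before submitting the job. The following is a minimal sketch, assuming a bash shell on obelix; the function name make_rundir is hypothetical (it is not part of libIGCM), and orchidee01 and $USER stand in for your own disk and login.

```shell
#!/bin/bash
# Hypothetical helper (not part of libIGCM): create the run directory
# given as first argument, then print its path.
# mkdir -p also creates missing parent folders and does not fail
# if the directory already exists.
make_rundir() {
  local run_dir_path="$1"
  if [ -z "$run_dir_path" ]; then
    echo "usage: make_rundir /home/diskname/mylogin/RUNDIR" >&2
    return 1
  fi
  mkdir -p -- "$run_dir_path" || return 1
  echo "$run_dir_path"
}

# Example (orchidee01 is one of the disks cited above; replace it with yours):
# make_rundir "/home/orchidee01/$USER/RUNDIR"
```

The printed path is the value to put in the RUN_DIR_PATH line of the job.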

Once the simulation is finished, it is very important to clean RUNDIR to avoid unnecessarily storing forcing files!
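
The cleanup can be sketched as follows, again assuming bash and GNU find; clean_rundir is a hypothetical helper, not a libIGCM command. It empties RUNDIR but keeps the directory itself so that it can be reused by the next job, and it refuses an empty path or "/" to limit the damage of a typo.

```shell
#!/bin/bash
# Hypothetical helper (not part of libIGCM): delete everything inside the
# run directory once the simulation is finished, keeping the directory itself.
clean_rundir() {
  local run_dir_path="$1"
  # Guard against an empty path or "/" so that a typo cannot wipe a whole disk.
  if [ -z "$run_dir_path" ] || [ "$run_dir_path" = "/" ]; then
    echo "clean_rundir: refusing to clean '$run_dir_path'" >&2
    return 1
  fi
  if [ ! -d "$run_dir_path" ]; then
    echo "clean_rundir: '$run_dir_path' is not a directory" >&2
    return 1
  fi
  # -mindepth 1 keeps RUNDIR itself; -delete removes files and subfolders
  # (hidden ones included), processing the tree depth-first.
  find "$run_dir_path" -mindepth 1 -delete
}

# Example: clean_rundir "/home/orchidee01/$USER/RUNDIR"
```

Before calling it, make sure nothing you still need is left inside RUNDIR, since the deletion cannot be undone.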

Last modified on 2024-02-12T11:01:02+01:00