Opened 6 years ago

Closed 8 months ago

#391 closed defect (fixed)

problem with ipslerr_p(MPI_ABORT) at obelix : model stays hanging

Reported by: jgipsl
Owned by: ajornet
Priority: major
Milestone: Not scheduled yet
Component: Model architecture
Version:
Keywords:
Cc:

Description

With the current trunk (rev 4600), when ipslerr_p is called from the model with stop level 3, the model hangs. The error message is printed in out_orchidee in the run directory, but the job stays running in the queue.

For a simple test case, set the following in run.def:

OK_FREEZE=y
DEPTH_MAX_T=10

This activates the coherence test in the code. If you set a RUN_DIR_PATH in the main job, you can see at run time that the model has written the following output messages:

0FATAL ERROR FROM ROUTINE control_initialize
0 --> Too shallow soil chosen for the thermodynamic for soil freezing
0 --> Adapt run.def with at least DEPTH_MAX=11
0 -->
0
0Fatal error from ORCHIDEE. STOP in ipslerr_p with code

but the job is still running in the queue (use qstat).

Note:

  • This is only seen when running with XIOS. When using only IOIPSL, the model stops correctly.
  • When the stop is triggered from IOIPSL, for example because an input file such as soils_param.nc is missing, execution stops correctly even if XIOS is activated.

Tested modifications

Currently in ORCHIDEE/src_parallel/ioipsl_para.f90:

CALL MPI_ABORT(3)

Changing into

CALL MPI_ABORT(MPI_COMM_ORCH,1,ierr) 
or 
CALL MPI_ABORT(MPI_COMM_WORLD,1,ierr)

seems to solve the problem when XIOS runs in attached mode, but not in server mode. A better solution is needed.
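For reference, the Fortran binding of MPI_ABORT takes three arguments: a communicator, an error code, and the ierror return argument, which is why the one-argument call above is not a valid form. The standalone program below is only a minimal sketch of the three-argument call, not the ORCHIDEE code: the program name abort_sketch and the choice of rank 0 as the failing process are assumptions made for illustration. The MPI standard only requires a best attempt to abort the tasks in the group of the given communicator, so this sketch by itself does not explain or fix the XIOS server mode behaviour reported here.

```fortran
! Minimal sketch (hypothetical program, not taken from ORCHIDEE):
! one rank detects a fatal error and asks MPI to abort the whole job.
PROGRAM abort_sketch
  USE mpi
  IMPLICIT NONE
  INTEGER :: ierr, rank

  CALL MPI_INIT(ierr)
  CALL MPI_COMM_RANK(MPI_COMM_WORLD, rank, ierr)

  ! Assumption for illustration: rank 0 is the process that hits the error.
  IF (rank == 0) THEN
     WRITE(*,*) 'Fatal error: aborting all MPI processes'
     ! Arguments: communicator, error code returned to the environment, ierror.
     CALL MPI_ABORT(MPI_COMM_WORLD, 1, ierr)
  END IF

  ! Normally not reached once the abort has taken effect.
  CALL MPI_FINALIZE(ierr)
END PROGRAM abort_sketch
```

Built with an MPI compiler wrapper (for example mpif90) and launched on several processes, a run like this would be expected to end with a non-zero exit status rather than stay in the queue, which is the behaviour the ticket is asking for.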

These problems on obelix seem to be related to the problem on curie described in ticket #236.

Change History (6)

comment:1 Changed 6 years ago by jgipsl

[4683]: Added the arguments as described above, but this does not solve the problem.

comment:2 Changed 5 years ago by aducharne

  • Milestone set to ORCHIDEE 4.0
  • Owner changed from jgipsl to ajornet
  • Status changed from new to assigned

comment:3 Changed 3 years ago by luyssaert

  • Component changed from Anthropogenic processes to Model architecture
  • Milestone changed from ORCHIDEE 4.0 to Not scheduled yet

comment:4 Changed 8 months ago by bguenet

Bertrand Guenet will run a test with the current trunk.

comment:5 Changed 8 months ago by bguenet

Tests done with the trunk [7853] on obelix. The model now crashes instead of hanging, and the error message is clear.

comment:6 Changed 8 months ago by bguenet

  • Resolution set to fixed
  • Status changed from assigned to closed