Opened 6 years ago
Closed 8 months ago
#391 closed defect (fixed)
problem with ipslerr_p(MPI_ABORT) at obelix : model stays hanging
| Reported by: | jgipsl | Owned by: | ajornet |
|---|---|---|---|
| Priority: | major | Milestone: | Not scheduled yet |
| Component: | Model architecture | Version: | |
| Keywords: | | Cc: | |
Description
Using the current trunk (rev 4600), when ipslerr_p is called from the model with stop level 3, the model hangs. The error message is printed in out_orchidee in the run directory, but the job stays running in the queue.
For a simple test case, set the following in run.def:
OK_FREEZE=y
DEPTH_MAX_T=10
This activates the coherence test in the code. If you set a RUN_DIR_PATH in the main job, you can see at run time that the model has written the following output messages:
0FATAL ERROR FROM ROUTINE control_initialize
0 --> Too shallow soil chosen for the thermodynamic for soil freezing
0 --> Adapt run.def with at least DEPTH_MAX=11
0 -->
0Fatal error from ORCHIDEE. STOP in ipslerr_p with code
but the job is still running in the queue (use qstat).
Note:
- This is only seen when running with XIOS. When using only IOIPSL, the model stops correctly.
- When the stop is triggered from IOIPSL, for example because an input file such as soils_param.nc is missing, execution stops correctly, even if XIOS is activated.
Tested modifications
Currently in ORCHIDEE/src_parallel/ioipsl_para.f90:
CALL MPI_ABORT(3)
Changing this into
CALL MPI_ABORT(MPI_COMM_ORCH,1,ierr)
or
CALL MPI_ABORT(MPI_COMM_WORLD,1,ierr)
seems to solve the problem when XIOS is used in attached mode, but not in server mode. A better solution is needed.
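For reference, a minimal sketch of the three-argument form of the call, assuming the standard MPI Fortran binding MPI_ABORT(comm, errorcode, ierror). The program name abort_sketch and the fatal_error flag are hypothetical and not taken from ORCHIDEE, and MPI_COMM_WORLD is used instead of MPI_COMM_ORCH so that the example is self-contained. It only illustrates the call syntax; it is not a fix for the XIOS server mode case.

```fortran
PROGRAM abort_sketch
  ! Hypothetical stand-alone example, not ORCHIDEE code.
  USE mpi
  IMPLICIT NONE
  INTEGER :: ierr
  LOGICAL :: fatal_error

  CALL MPI_INIT(ierr)

  ! Placeholder for a failed coherence test (assumption for illustration).
  fatal_error = .TRUE.

  IF (fatal_error) THEN
     ! Standard binding: communicator, error code, returned status.
     ! MPI makes a best attempt to abort all tasks in the communicator's group.
     CALL MPI_ABORT(MPI_COMM_WORLD, 1, ierr)
  END IF

  ! Only reached when no fatal error was raised.
  CALL MPI_FINALIZE(ierr)
END PROGRAM abort_sketch
```

With a typical MPI installation this could be built with a wrapper such as mpif90 and launched with mpirun; the exact commands depend on the machine.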
These problems at obelix seem to be related to the problem at curie described in ticket #236.
Change History (6)
comment:1 Changed 6 years ago by jgipsl
comment:2 Changed 5 years ago by aducharne
- Milestone set to ORCHIDEE 4.0
- Owner changed from jgipsl to ajornet
- Status changed from new to assigned
comment:3 Changed 3 years ago by luyssaert
- Component changed from Anthropogenic processes to Model architecture
- Milestone changed from ORCHIDEE 4.0 to Not scheduled yet
comment:4 Changed 8 months ago by bguenet
Bertrand Guenet will do a test with the current trunk.
comment:5 Changed 8 months ago by bguenet
Tests done with the trunk [7853] on obelix. The model now crashes instead of hanging, and the error message is clear.
comment:6 Changed 8 months ago by bguenet
- Resolution set to fixed
- Status changed from assigned to closed
[4683]: Added the arguments as described above, but this does not solve the problem.