Opened 8 months ago
Last modified 7 months ago
#190 new defect
XIOS3-trunk with NEMO - Hangs at writing step (legacy) or 'Wrong window flavor' error (one-sided)
Reported by: | acc | Owned by: | ymipsl |
---|---|---|---|
Priority: | major | Component: | XIOS |
Version: | trunk | Keywords: | XIOS3 |
Cc: |
Description
Following on from #189, I'm struggling with a first attempt at using XIOS3 pools and services with a production-style NEMO configuration. The set-up is an eORCA025 with 1019 ocean cores and 16 servers (10 gatherers and 6 writers, in multiple_file mode). Output is just a single 2D field for now, at daily frequency.
With default comms (legacy), the model runs to the end of the first day but then hangs without producing any output. With one-sided comms activated, the model doesn't get past the initialisation (probably the close-context point) and produces these errors:
MPICH ERROR [Rank 874] [job id 4358547.0] [Fri Sep 1 10:11:59 2023] [nid004313] - Abort(210908474) (rank 874 in comm 0): Fatal error in PMPI_Win_attach: Incorrect window flavor, error stack:
PMPI_Win_attach(139)......: MPI_Win_attach(win=0x10, base=0x18b4d000, size=-536870867) failed
MPICH ERROR [Rank 961] [job id 4358547.0] [Fri Sep 1 10:11:59 2023] [nid004314] - Abort(546452794) (rank 961 in comm 0): Fatal error in PMPI_Win_attach: Incorrect window flavor, error stack:
MPID_Win_attach(110)......:
PMPI_Win_attach(139)......: MPI_Win_attach(win=0x10, base=0x1862c400, size=-536870867) failed
MPIDIG_mpi_win_attach(891): Incorrect window flavor
MPID_Win_attach(110)......:
MPIDIG_mpi_win_attach(891): Incorrect window flavor
aborting job:
This is on the same Cray cluster and using the same compilers and environment that were successful in #189. I'll step back and try again with a smaller NEMO example.
Attachments (1)
Change History (29)
comment:1 Changed 8 months ago by ymipsl
comment:2 Changed 8 months ago by acc
A few observations from an ORCA2_ICE_PISCES configuration (32 ocean cores + 4 servers).
- Yes it runs as normal with XIOS3 libraries and XIOS3 xios_server.exe and a XIOS2-style iodef.xml:
<?xml version="1.0"?>
<simulation>

  <!-- ============================================================================================ -->
  <!-- XIOS context                                                                                 -->
  <!-- ============================================================================================ -->
  <!-- XIOS2 -->
  <context id="xios" >
    <variable_definition>
      <variable id="info_level"   type="int">10</variable>
      <variable id="using_server" type="bool">false</variable>
      <variable id="using_oasis"  type="bool">false</variable>
      <!-- <variable id="oasis_codes_id" type="string" >oceanx</variable> -->
    </variable_definition>
  </context>

  <!-- ============================================================================================ -->
  <!-- NEMO CONTEXT add and suppress the components you need                                        -->
  <!-- ============================================================================================ -->
  <context id="nemo" src="./context_nemo.xml"/> <!-- NEMO -->

</simulation>
(with one change: the unused oasis_codes_id variable is commented out, since it is no longer recognised)
- It also runs with a basic XIOS3 style (no pools or services):
<?xml version="1.0"?>
<simulation>

  <!-- ============================================================================================ -->
  <!-- XIOS context                                                                                 -->
  <!-- ============================================================================================ -->
  <!-- XIOS3 -->
  <context id="xios" >
    <variable_definition>
      <variable_group id="buffer">
        <variable id="min_buffer_size"     type="int">250000</variable>
        <variable id="optimal_buffer_size" type="string">memory</variable>
      </variable_group>
      <variable_group id="parameters" >
        <variable id="using_server"     type="bool">true</variable>
        <variable id="info_level"       type="int">10</variable>
        <variable id="print_file"       type="bool">false</variable>
        <variable id="using_server2"    type="bool">false</variable>
        <variable id="ratio_server2"    type="int">50</variable>
        <variable id="pure_one_sided"   type="bool">false</variable>
        <variable id="check_event_sync" type="bool">false</variable>
        <variable id="using_oasis"      type="bool">false</variable>
      </variable_group>
    </variable_definition>
  </context>

  <!-- ============================================================================================ -->
  <!-- NEMO CONTEXT add and suppress the components you need                                        -->
  <!-- ============================================================================================ -->
  <context id="nemo" src="./context_nemo.xml"/> <!-- NEMO -->

</simulation>
but it didn't finish cleanly and was left hanging after the final xios_finalize call. It stayed in this state until the wallclock time expired, but all output files were complete. Adding an MPI_FINALIZE call (actually CALL mppstop) after xios_finalize in nemogcm.F90 fixes this. Have the internals of xios_finalize changed? i.e. did it previously perform an MPI_FINALIZE but no longer does?
- Taking the same (working) setup and assigning a pool with all the servers also works. The only differences in the iodef.xml between this and the previous version are:

diff -u iodef_nopool.xml iodef_pool.xml

         <variable id="using_oasis" type="bool">false</variable>
       </variable_group>
     </variable_definition>
+    <pool_definition>
+      <pool name="pool_ocean" nprocs="4">
+        <service name="owriter" nprocs="4" type="writer"/>
+      </pool>
+    </pool_definition>
   </context>

   <!-- ============================================================================================ -->
   <!-- NEMO CONTEXT add and suppress the components you need                                        -->
   <!-- ============================================================================================ -->

-  <context id="nemo" src="./context_nemo.xml"/> <!-- NEMO -->
+  <context id="nemo" default_pool_writer="pool_ocean" src="./context_nemo.xml"/> <!-- NEMO -->

 </simulation>
and additional attributes in all the <file id= ... > settings, e.g.:

diff -u Original/file_def_nemo-oce.xml file_def_nemo-oce.xml

   <file_definition type="one_file" name="@expname@_@freq@_@startdate@_@enddate@" sync_freq="1mo" min_digits="4">

     <file_group id="5d" output_freq="5d" output_level="10" enabled=".TRUE.">  <!-- 5d files -->
-      <file id="file11" name_suffix="_grid_T" description="ocean T grid variables" >
+      <file id="file11" name_suffix="_grid_T" mode="write" writer="owriter" description="ocean T grid variables" >
         <field field_ref="e3t" />
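Putting the two diffs together, the pool/service wiring amounts to three pieces: declare the pool and its writer service in the xios context, point the nemo context at the pool, and name the writer service on each file. A sketch assembled from the fragments above (fragments only, not a complete iodef.xml):

```xml
<!-- in <context id="xios"> : declare a pool and a writer service -->
<pool_definition>
  <pool name="pool_ocean" nprocs="4">
    <service name="owriter" nprocs="4" type="writer"/>
  </pool>
</pool_definition>

<!-- the nemo context selects the pool... -->
<context id="nemo" default_pool_writer="pool_ocean" src="./context_nemo.xml"/>

<!-- ...and each file selects the writer service within it -->
<file id="file11" name_suffix="_grid_T" mode="write" writer="owriter" description="ocean T grid variables" >
```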
Something to build on.
comment:3 Changed 8 months ago by ymipsl
<context id="nemo" default_pool_writer="owriter" src="./context_nemo.xml"/>
seems incorrect to me, since owriter is a service, not a pool.
Did you mean:
<context id="nemo" default_pool_writer="pool_ocean" src="./context_nemo.xml"/>
comment:4 Changed 8 months ago by acc
Yes, that was a cut and paste error. Already corrected.
comment:5 Changed 8 months ago by ymipsl
Ok, great. So let's test it with gatherers and writers?
For the xios_finalize problem: XIOS calls MPI_Finalize only if MPI was not already initialized before the xios_initialize call (i.e. XIOS calls MPI_Finalize only if it called MPI_Init itself, which it does only when MPI has not been initialized beforehand).
What does your NEMO version currently do?
comment:6 Changed 8 months ago by acc
Final test today: Splitting the pool into 2 writer services (owriter1 and owriter2, each with 2 procs) and assigning different files to different services also works. For example, sending 5day means to one service and monthly means to the other.
The MPI_Finalize issue was my fault. I had made local mods when trying to solve issues with the earlier version of XIOS3, which included adding an MPI_INIT outside of xios_initialize. Neither that nor my added mppstop call is now required.
Next week, I'll check ORCA2 is still happy when run on more than one node and then re-try the eORCA025 with just writers before adding in the complication of gatherers.
comment:7 Changed 8 months ago by acc
Ok, tried the ORCA2_ICE_PISCES config over 2 nodes and it exhibits the same hanging behaviour as the larger config (regardless of transport protocol). If I disable the file_def files in the context_nemo.xml then it runs (but, obviously, doesn't create any output via xios). The same is true of the larger config spread over many more nodes.
I've put some print statements in with flushes and the code (with the file_defs reinstated) never returns from the call to iom_init_closedef at the top of step. Interestingly, it isn't the xios_close_context_definition() call that is hanging (at least, for the reporting core) but rather the xios_update_calendar call that follows:
IOM closing def
Calling xios_close_context_definition()
Returned from xios_close_context_definition()
Calling xios_update_calendar( 0 )
I'll need to verify this for all cores. At least that is manageable for the ORCA2 config.
The print statements currently are in stpmlf.F90:
IF( kstp == nit000 ) THEN                       ! initialize IOM context (must be done after nemo_init for AGRIF+XIOS+OASIS)
   IF(lwm) THEN ; write(numout,*) 'IOM initialise ' ; CALL FLUSH(numout) ; ENDIF
   CALL iom_init( cxios_context, ld_closedef=.FALSE. )   ! for model grid (including possible AGRIF zoom)
   IF(lwm) THEN ; write(numout,*) 'IOM initialised' ; CALL FLUSH(numout) ; ENDIF
   IF( lk_diamlr   )   CALL dia_mlr_iom_init    ! with additional setup for multiple-linear-regression analysis
   IF(lwm) THEN ; write(numout,*) 'IOM closing def ' ; CALL FLUSH(numout) ; ENDIF
   CALL iom_init_closedef
   IF( ln_crs      )   CALL iom_init( TRIM(cxios_context)//"_crs" )   ! for coarse grid
   IF(lwm) THEN ; write(numout,*) 'IOM closed def ' ; CALL FLUSH(numout) ; ENDIF
ENDIF
and iom.F90:
SUBROUTINE iom_init_closedef(cdname)
   !!----------------------------------------------------------------------
   !!            ***  SUBROUTINE iom_init_closedef  ***
   !!----------------------------------------------------------------------
   !!
   !! ** Purpose : Closure of context definition
   !!
   !!----------------------------------------------------------------------
   CHARACTER(len=*), OPTIONAL, INTENT(IN) :: cdname
#if defined key_xios
   LOGICAL :: llrstw

   llrstw = .FALSE.
   IF(PRESENT(cdname)) THEN
      llrstw = (cdname == cw_ocerst_cxt)
      llrstw = llrstw .OR. (cdname == cw_icerst_cxt)
      llrstw = llrstw .OR. (cdname == cw_ablrst_cxt)
      llrstw = llrstw .OR. (cdname == cw_toprst_cxt)
      llrstw = llrstw .OR. (cdname == cw_sedrst_cxt)
   ENDIF

   IF( llrstw ) THEN
      ! set names of the fields in restart file IF using XIOS to write data
      CALL iom_set_rst_context(.FALSE.)
      CALL xios_close_context_definition()
   ELSE
      IF(lwm) THEN ; write(numout,*) 'Calling xios_close_context_definition()' ; CALL FLUSH(numout) ; ENDIF
      CALL xios_close_context_definition()
      IF(lwm) THEN ; write(numout,*) 'Returned from xios_close_context_definition()' ; CALL FLUSH(numout) ; ENDIF
      IF(lwm) THEN ; write(numout,*) 'Calling xios_update_calendar( 0 )' ; CALL FLUSH(numout) ; ENDIF
      CALL xios_update_calendar( 0 )
      IF(lwm) THEN ; write(numout,*) 'Returned from xios_update_calendar( 0 )' ; CALL FLUSH(numout) ; ENDIF
   ENDIF
comment:8 Changed 8 months ago by acc
Ok, there are two rogues among the 32 ocean cores:
for f in ocean.output*; do echo -n $f " : "; tail -1l $f; done
ocean.output      : Calling xios_update_calendar( 0 )
ocean.output_0001 : IOM closed def
ocean.output_0002 : IOM closed def
ocean.output_0003 : IOM closed def
ocean.output_0004 : IOM closed def
ocean.output_0005 : IOM closed def
ocean.output_0006 : IOM closed def
ocean.output_0007 : IOM closed def
ocean.output_0008 : IOM closed def
ocean.output_0009 : IOM closed def
ocean.output_0010 : IOM closed def
ocean.output_0011 : IOM closed def
ocean.output_0012 : IOM closed def
ocean.output_0013 : IOM closed def
ocean.output_0014 : IOM closed def
ocean.output_0015 : IOM closed def
ocean.output_0016 : Calling xios_update_calendar( 0 )
ocean.output_0017 : IOM closed def
ocean.output_0018 : IOM closed def
ocean.output_0019 : IOM closed def
ocean.output_0020 : IOM closed def
ocean.output_0021 : IOM closed def
ocean.output_0022 : IOM closed def
ocean.output_0023 : IOM closed def
ocean.output_0024 : IOM closed def
ocean.output_0025 : IOM closed def
ocean.output_0026 : IOM closed def
ocean.output_0027 : IOM closed def
ocean.output_0028 : IOM closed def
ocean.output_0029 : IOM closed def
ocean.output_0030 : IOM closed def
ocean.output_0031 : IOM closed def
Does this give any clues?
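The one-liner above can be narrowed to print only the outlier ranks (those whose last log line is not 'IOM closed def'). The demo below runs on two synthetic ocean.output files, since the real logs are job-specific; only the loop itself is the point:

```shell
# Create two synthetic logs standing in for a real run directory
mkdir -p demo_logs
printf 'IOM closed def\n' > demo_logs/ocean.output_0001
printf 'Calling xios_update_calendar( 0 )\n' > demo_logs/ocean.output_0016

# Report only the ranks that are NOT at the common 'IOM closed def' state
for f in demo_logs/ocean.output*; do
  last=$(tail -n 1 "$f")
  [ "$last" != "IOM closed def" ] && echo "$f : $last"
done
```

On a real run directory, drop the demo setup and run the loop over ocean.output* as in the original one-liner.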
comment:9 Changed 8 months ago by acc
Looks like there is still some dodgy MPI going on. The 2-node case above can be made to run if I switch network protocols and mpich libraries in the run script:
module swap craype-network-ofi craype-network-ucx
module swap cray-mpich cray-mpich-ucx
module load libfabric
despite these not being in the environment when either NEMO or xios3 were compiled.
Unfortunately the same trick doesn't help the larger eORCA025 test, which hangs in the call to xios_close_context_definition:
IOM initialise
IOM initialised
dia_mlr_iom_init : IOM context setup for multiple-linear-regression
~~~~~~~~~~~~~~~~
diamlr: configuration not found or incomplete (field group 'diamlr_fields' and/or file group 'diamlr_files' and/or field 'diamlr_time' missing); disabling output for multiple-linear-regression analysis.
IOM closing def
Calling xios_close_context_definition()
with all 1019 ocean cores at the same point.
Some tips on where and how to instrument the internals of xios_close_context_definition would be useful.
comment:10 Changed 8 months ago by acc
Found a typo in the file_def xml for the eORCA025 case, the effect of which was that the writer was never named. Fixing this and swapping from OFI to UCX networks in the run-time environment has got the eORCA025 case running (albeit with just a single 2D field for now). It still seems too easy to break this in unfathomable ways, but it is progress.
Successful tests so far:
- 1019 ocean cores ; 16 xios3 servers in 1 pool (1 writer) [multiple file o/p]
- 1019 ocean cores ; 16 xios3 servers in 2 pools (10 in gatherer pool , 6 in writer pool) [multiple file o/p]
- 1019 ocean cores ; 16 xios3 servers in 2 pools (15 in gatherer pool , 1 in writer pool)
comment:11 Changed 8 months ago by ymipsl
Hi Andrew,
So, good news! As I said, we see different behaviour depending on the network layer used. It seems UCX will be the future and is generally more stable, but not always. We also get some problems with MPI_Rget with OpenMPI+UCX using the one_sided protocol. For this reason I am developing a point-to-point transport protocol using more mature technologies, I hope. It now works on one of the main supercomputers we use; I am debugging to find what is wrong on the second one (a message gets corrupted for no evident reason).
For the services, you are right: since this is all a little bit new, we don't yet have enough safeguards to avoid this kind of error (if a service does not exist, the client will wait until it is declared, possibly forever, so we should probably add a timeout).
comment:12 Changed 8 months ago by ymipsl
The new point to point transport protocol seems to be stabilized.
Could you test it by setting the XIOS variable:
transport_protocol = p2p
comment:13 Changed 8 months ago by acc
Yes, the new transport protocol seems to improve stability. With XIOS3 trunk@2558 and:
<variable id="transport_protocol" type="string" >p2p</variable>
my eORCA025 tests will run with both the default OFI network and the UCX framework. There is no measurable difference in performance but I've only done a few short tests and timing on the cluster is always variable anyway.
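For reference, the new variable sits alongside the other server settings, e.g. inside the parameters group of the iodef.xml shown in comment:2 (a fragment only; the exact placement within the variable_definition is my assumption based on this thread):

```xml
<variable_group id="parameters" >
  <variable id="using_server"       type="bool">true</variable>
  <variable id="transport_protocol" type="string" >p2p</variable>
  <!-- ... other parameters as before ... -->
</variable_group>
```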
To summarise we now have:
network/mpich | transport protocol | status |
---|---|---|
craype-network-ofi:cray-mpich/8.1.23 | legacy | hangs in xios_close_context_definition |
craype-network-ofi:cray-mpich/8.1.23 | p2p | runs to completion |
cray-ucx/2.7.0-1:craype-network-ucx:cray-mpich-ucx/8.1.23 | legacy | runs to completion |
cray-ucx/2.7.0-1:craype-network-ucx:cray-mpich-ucx/8.1.23 | p2p | runs to completion |
There are however lots of warnings in the UCX slurm o/p (with p2p) which may have nothing to do with XIOS (but don't encourage trust):
grep rcache.c slurm-4388959.out | wc -l
4140
grep mpool.c slurm-4388959.out | wc -l
1904256
grep mm_xpmem slurm-4388959.out | wc -l
119097
Typical examples of these warning messages:
[1694028823.456409] [nid001511:101564:0] mm_xpmem.c:86 UCX WARN remote segment id 200018d01 apid 4600018cbc is not released, refcount 2
[1694028823.391433] [nid001417:150944:0] mpool.c:43 UCX WARN object 0x34a20c0 was not returned to mpool ucp_rkeys
[1694028823.445153] [nid001511:101564:0] rcache.c:512 UCX WARN mlx5_0: destroying inuse region 0xe047950 [0x14eec3e3e000..0x14eec3e3e010] g- rw ref 1 lkey 0xa98655 rkey 0xa98655 atomic_rkey 0xffffffff
As a consequence of these warnings, the log files are 250MB in size. Legacy protocol runs with UCX do not produce these warnings. The OFI-p2p runs produce much smaller logs (1.5MB) (but still 3x the size of the UCX-legacy logs). The issue here seems to be at the end of the run with lots of:
Server Context destructor
Server Context destructor
Server Context destructor
MPICH ERROR [Rank 282] [job id 4393814.0] [Thu Sep 7 09:54:05 2023] [nid001523] - Abort(806969871) (rank 282 in comm 0): Fatal error in PMPI_Finalize: Other MPI error, error stack:
PMPI_Finalize(214)...............: MPI_Finalize failed
PMPI_Finalize(161)...............:
MPID_Finalize(710)...............:
MPIDI_OFI_mpi_finalize_hook(1016): OFI domain close failed (ofi_init.c:1016:MPIDI_OFI_mpi_finalize_hook:Device or resource busy)
aborting job:
Fatal error in PMPI_Finalize: Other MPI error, error stack:
messages, despite ocean.output seemingly finishing tidily and the output netcdf files all being correct.
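For reference, the three warning counts from the grep commands above can be gathered in one loop. The demo below runs on a tiny synthetic log (a hypothetical demo_slurm.out), since the real slurm-4388959.out is job-specific:

```shell
# Synthetic stand-in for the (250MB) slurm log discussed above
printf '%s\n' \
  'mm_xpmem.c:86 UCX WARN remote segment id is not released' \
  'mpool.c:43 UCX WARN object was not returned to mpool ucp_rkeys' \
  'mpool.c:43 UCX WARN object was not returned to mpool ucp_rkeys' \
  'rcache.c:512 UCX WARN destroying inuse region' > demo_slurm.out

# One count per UCX warning class
for pat in rcache.c mpool.c mm_xpmem; do
  printf '%s %s\n' "$pat" "$(grep -c "$pat" demo_slurm.out)"
done
```

Pointing the loop at the real slurm log reproduces the three counts quoted above in one pass.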
comment:14 Changed 7 months ago by acc
An update:
- UCX warnings on the Cray can be suppressed with:
export UCX_LOG_LEVEL=FATAL
in the run script. Hopefully, ignoring them is the way to go.
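In a bash run script that is just (a sketch; only the export line comes from this ticket):

```shell
# Suppress UCX warnings below the fatal level for this job
export UCX_LOG_LEVEL=FATAL
echo "UCX_LOG_LEVEL=${UCX_LOG_LEVEL}"
# ...launch the job as usual (srun/mpirun line unchanged)...
```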
- The Cray still has issues with longer runs (an eORCA025 full config. runs out of memory after 11 months) but this is just as likely to be a Cray issue with a poor UCX implementation (see: https://docs.archer2.ac.uk/known-issues/#excessive-memory-use-when-using-ucx-communications-protocol-added-2023-07-20). Unfortunately, HPE's current "solution" is to use OFI, which fails even earlier with an undiagnosed segmentation violation.
- The exact same setup runs happily on our Intel cluster. Or rather, it was running fine; changes at revision 2566 seem to have broken it. The generic test-case examples still run, but the eORCA025 config is now back to hanging at start-up. I'll attach a picture of what was running quite successfully.
This is all using the p2p transport protocol
comment:15 Changed 7 months ago by acc
comment:16 Changed 7 months ago by jderouillat
Hi, can you confirm that it was ok with 2565?
comment:17 Changed 7 months ago by acc
It was fine at 2565.
comment:18 Changed 7 months ago by jderouillat
Could you try to revert the part below :
===================================================================
--- XIOS3/trunk/src/transport/p2p_context_server.cpp  2023-09-14 07:06:45 UTC (rev 2565)
+++ XIOS3/trunk/src/transport/p2p_context_server.cpp  2023-09-14 07:32:36 UTC (rev 2566)
@@ -97,17 +97,19 @@
   {
     traceOff();
     MPI_Iprobe(MPI_ANY_SOURCE, 20,interCommMerged_, &flag, &status);
+    traceOn();
     if (flag==true)
     {
       int rank=status.MPI_SOURCE ;
-      requests_[rank].push_back(new CRequest(interCommMerged_, status)) ;
+      auto& rankRequests = requests_[rank];
+      rankRequests.push_back(new CRequest(interCommMerged_, status)) ;
       // Test 1st request of the list, request treatment must be ordered
-      if (requests_[rank].front()->test())
+      if (rankRequests.front()->test())
       {
-        processRequest( *(requests_[rank].front()) );
-        delete requests_[rank].front();
-        requests_[rank].pop_front() ;
+        processRequest( *(rankRequests.front()) );
+        delete rankRequests.front();
+        rankRequests.pop_front() ;
comment:19 Changed 7 months ago by acc
Apologies, this may have been a false alarm. I updated to 2566 ready to try your suggestion and thought it best to check it was still playing up. However, it now runs without having to revert the code. I'll check a few more times with this (and the HEAD) but it looks to have been a local issue with our cluster.
comment:20 Changed 7 months ago by acc
..or maybe not. Revision 2566 appears to be ok, but the current HEAD (rev 2569) is hanging: three attempts, all hang. Probably too late on a Friday to be definitive. Will confirm next week.
comment:21 Changed 7 months ago by jderouillat
Indeed, commit 2569 hangs with many services; this is something I am working on.
To continue your tests, you can comment out the call to:
cleanSplitSchedulers();
comment:22 Changed 7 months ago by jderouillat
line 94 of src/event_scheduler.cpp
comment:23 Changed 7 months ago by acc
Thanks. I'll stick with 2566 for now (or is the:
stack_.clear();
line an important addition?)
comment:24 Changed 7 months ago by jderouillat
You can keep 2566 for now, the stack_.clear() is just a part of the final cleanup.
I've just committed a new implementation of the cleanSplitSchedulers() which was producing the hang. I tested it on my own, but not in a case as detailed as yours (which I'll try to mimic). If you have the opportunity to test it, I'd be grateful.
comment:25 Changed 7 months ago by acc
Yes, can confirm that 2570 no longer hangs. However, my model now blows up after 10 timesteps with some NANs appearing so it looks like messages are getting corrupted. The same code and inputs still works with xios3@2566.
First error in the slurm log is:
==== backtrace (tid: 39343) ====
 0 0x000000000021b780 MPIDIU_comm_rank_to_av()  /build/impi/_buildspace/release/../../src/mpid/ch4/src/ch4r_proc.h:105
 1 0x000000000021b780 MPIDI_OFI_comm_to_phys_vci()  /build/impi/_buildspace/release/../../src/mpid/ch4/netmod/include/../ofi/ofi_impl.h:513
 2 0x000000000021b780 MPIDI_OFI_comm_to_phys()  /build/impi/_buildspace/release/../../src/mpid/ch4/netmod/include/../ofi/ofi_impl.h:519
 3 0x000000000021b780 MPIDI_OFI_am_fetch_incr_send_seqno()  /build/impi/_buildspace/release/../../src/mpid/ch4/netmod/include/../ofi/ofi_am_impl.h:22
 4 0x000000000021b780 MPIDI_OFI_do_inject()  /build/impi/_buildspace/release/../../src/mpid/ch4/netmod/include/../ofi/ofi_am_impl.h:543
 5 0x000000000021b780 MPIDI_NM_am_send_hdr_reply()  /build/impi/_buildspace/release/../../src/mpid/ch4/netmod/include/../ofi/ofi_am.h:158
 6 0x000000000021b780 win_unlock_proc()  /build/impi/_buildspace/release/../../src/mpid/ch4/src/ch4r_rma_target_callbacks.c:534
 7 0x000000000021b780 MPIDIG_win_ctrl_target_msg_cb()  /build/impi/_buildspace/release/../../src/mpid/ch4/src/ch4r_rma_target_callbacks.c:1430
 8 0x00000000005f7891 MPIDI_OFI_handle_short_am_hdr()  /build/impi/_buildspace/release/../../src/mpid/ch4/netmod/include/../ofi/ofi_am_events.h:194
 9 0x00000000005f7891 am_recv_event()  /build/impi/_buildspace/release/../../src/mpid/ch4/netmod/ofi/ofi_events.c:680
10 0x00000000005ee099 MPIDI_OFI_dispatch_function()  /build/impi/_buildspace/release/../../src/mpid/ch4/netmod/ofi/ofi_events.c:822
11 0x00000000005ed160 MPIDI_OFI_handle_cq_entries()  /build/impi/_buildspace/release/../../src/mpid/ch4/netmod/ofi/ofi_events.c:952
12 0x000000000060e9e8 MPIDI_OFI_progress()  /build/impi/_buildspace/release/../../src/mpid/ch4/netmod/ofi/ofi_progress.c:40
13 0x00000000004be475 MPIDI_OFI_do_iprobe..0()  /build/impi/_buildspace/release/../../src/mpid/ch4/netmod/include/../ofi/ofi_probe.h:105
14 0x00000000004bfc09 MPIDI_NM_mpi_iprobe()  /build/impi/_buildspace/release/../../src/mpid/ch4/netmod/include/../ofi/ofi_probe.h:204
15 0x00000000004bfc09 MPIDI_iprobe_handoff()  /build/impi/_buildspace/release/../../src/mpid/ch4/src/ch4_probe.h:59
16 0x00000000004bfc09 MPIDI_iprobe_unsafe()  /build/impi/_buildspace/release/../../src/mpid/ch4/src/ch4_probe.h:88
17 0x00000000004bfc09 MPIDI_iprobe_safe()  /build/impi/_buildspace/release/../../src/mpid/ch4/src/ch4_probe.h:244
18 0x00000000004bfc09 MPID_Iprobe()  /build/impi/_buildspace/release/../../src/mpid/ch4/src/ch4_probe.h:394
19 0x00000000004bfc09 PMPI_Iprobe()  /build/impi/_buildspace/release/../../src/mpi/pt2pt/iprobe.c:110
20 0x00000000011d3663 xios::CRessourcesManager::eventLoop()  ???:0
21 0x000000000099e254 xios::CDaemonsManager::eventLoop()  ???:0
22 0x00000000011e3a83 xios::CServer::initialize()  ???:0
23 0x0000000000997db6 xios::CXios::initServerSide()  ???:0
24 0x0000000000458a5b MAIN__()  ???:0
25 0x0000000001414316 main()  ???:0
26 0x0000000000022555 __libc_start_main()  ???:0
27 0x0000000000458969 _start()  ???:0
=================================
forrtl: severe (174): SIGSEGV, segmentation fault occurred
comment:26 Changed 7 months ago by acc
Oops, sorry false alarm. I set up a test version, to avoid overwriting the working version, and had a different set of compile keys. The correctly compiled version works identically at 2566 and 2570.
comment:27 Changed 7 months ago by acc
...actually, not quite identically. Both complete all the requested 744 timesteps (1 month) and the run.stat files are identical, but the version using 2570 crashes out at the end before all the monthly means are written. The backtrace looks like the one above, which probably just means parts have aborted. No indication in the nemo output of any issues. The grid_W and icemod monthly means are complete but the larger grid_T, grid_U and grid_V are not.
comment:28 Changed 7 months ago by jderouillat
I found some problems revealed when using IntelMPI, so I have temporarily reverted this part.
It would be interesting to get the stack trace, if possible, to have an idea of where it hangs.
Of course a smaller job would be better: fewer clients and fewer servers. Maybe eORCA025 is a little bit big to start with. Could you switch to a smaller config?
Do you already have a working config using XIOS3 without managing pools and services manually?
On our side, we have succeeded in running NEMO on our computers, also in coupled mode. But we hit a lot of hardware trouble, as I said in #189, depending on the internal MPI protocol and hardware network used. It is a real jungle when trying to use one-sided communication based on the new features of the MPI-3 standard (dynamic windows, MPI_Rget, etc.). So I am currently developing an additional transport protocol using only p2p communication.