Opened 8 months ago

Last modified 7 months ago

#190 new defect

XIOS3-trunk with NEMO - Hangs at writing step (legacy) or 'Wrong window flavor' error (one-sided)

Reported by: acc Owned by: ymipsl
Priority: major Component: XIOS
Version: trunk Keywords: XIOS3
Cc:

Description

Following on from #189, I'm struggling with a first attempt at using XIOS3 pools and services with a production-style NEMO configuration. The setup is an eORCA025 with 1019 cores and 16 servers (10 gatherers and 6 writers, in multiple_file mode). Output is just a 2D field for now, at daily frequency.

With default comms (legacy), the model runs to the end of the first day but then hangs without producing any output. With one-sided comms activated the model doesn't get past the initialisation (probably the close context point) and produces these errors:

MPICH ERROR [Rank 874] [job id 4358547.0] [Fri Sep  1 10:11:59 2023] [nid004313] - Abort(210908474) (rank 874 in comm 0): Fatal error in PMPI_Win_attach: Incorrect window flavor, error stack:
PMPI_Win_attach(139)......: MPI_Win_attach(win=0x10, base=0x18b4d000, size=-536870867) failed
MPICH ERROR [Rank 961] [job id 4358547.0] [Fri Sep  1 10:11:59 2023] [nid004314] - Abort(546452794) (rank 961 in comm 0): Fatal error in PMPI_Win_attach: Incorrect window flavor, error stack:
MPID_Win_attach(110)......:
PMPI_Win_attach(139)......: MPI_Win_attach(win=0x10, base=0x1862c400, size=-536870867) failed
MPIDIG_mpi_win_attach(891): Incorrect window flavor
MPID_Win_attach(110)......:

MPIDIG_mpi_win_attach(891): Incorrect window flavor
aborting job:

This is on the same Cray cluster and using the same compilers and environment that were successful in #189. I'll step back and try again with a smaller NEMO example.

Attachments (1)

xios3_o25ex.png (583.8 KB) - added by acc 7 months ago.
eORCA025 1 pool many services example


Change History (29)

comment:1 Changed 8 months ago by ymipsl

It would be interesting to get a stack trace, if possible, to have an idea of where it hangs.
Of course a smaller job would be better: fewer clients and fewer servers. Maybe ORCA025 is a little too big to start with. Could you switch to a smaller config?

Do you already have a working config using XIOS3 without managing pools and services manually?

On our side, we succeeded in running NEMO on our computers, and also in coupled mode. But we hit a lot of hardware trouble, as I said in #189, depending on the internal MPI protocol and hardware network used. It is a real jungle when trying to use one-sided communication based on the new features of the MPI3 standard (dynamic windows, MPI_Rget, etc.). So I am currently developing an additional transport protocol using only p2p communication.

comment:2 Changed 8 months ago by acc

A few observations from an ORCA2_ICE_PISCES configuration (32 ocean cores + 4 servers).

  • Yes, it runs as normal with the XIOS3 libraries, the XIOS3 xios_server.exe and an XIOS2-style iodef.xml:
    <?xml version="1.0"?>
    <simulation>
    
    <!-- ============================================================================================ -->
    <!-- XIOS context                                                                                 -->
    <!-- ============================================================================================ -->
    
    <!-- XIOS2 -->
      <context id="xios" >
    
          <variable_definition>
    
              <variable id="info_level"                type="int">10</variable>
              <variable id="using_server"              type="bool">false</variable>
              <variable id="using_oasis"               type="bool">false</variable>
    <!--
              <variable id="oasis_codes_id"            type="string" >oceanx</variable>
    -->
    
          </variable_definition>
      </context>
    <!-- ============================================================================================ -->
    <!-- NEMO  CONTEXT add and suppress the components you need                                       -->
    <!-- ============================================================================================ -->
    
      <context id="nemo" src="./context_nemo.xml"/>       <!--  NEMO       -->
    
    </simulation>
    

(with one change because the unused oasis_codes_id is not recognised)

  • It also runs with a basic XIOS3 style (no pools or services):
    <?xml version="1.0"?>
    <simulation>
    
    <!-- ============================================================================================ -->
    <!-- XIOS context                                                                                 -->
    <!-- ============================================================================================ -->
    
    <!-- XIOS3 -->
      <context id="xios" >
        <variable_definition>
          <variable_group id="buffer">
            <variable id="min_buffer_size" type="int">250000</variable>
            <variable id="optimal_buffer_size" type="string">memory</variable>
          </variable_group>
    
          <variable_group id="parameters" >
            <variable id="using_server" type="bool">true</variable>
            <variable id="info_level" type="int">10</variable>
            <variable id="print_file" type="bool">false</variable>
            <variable id="using_server2" type="bool">false</variable>
            <variable id="ratio_server2" type="int">50</variable>
    
            <variable id="pure_one_sided" type="bool">false</variable>
            <variable id="check_event_sync" type="bool">false</variable>
            <variable id="using_oasis"      type="bool">false</variable>
          </variable_group>
        </variable_definition>
      </context>
    
    <!-- ============================================================================================ -->
    <!-- NEMO  CONTEXT add and suppress the components you need                                       -->
    <!-- ============================================================================================ -->
    
      <context id="nemo" src="./context_nemo.xml"/>       <!--  NEMO       -->
    
    </simulation>
    

but it didn't finish cleanly and was left hanging after the final xios_finalize call. It stayed in this state until the wallclock time expired, but all output files were complete. Adding an MPI_FINALIZE call (actually a call to mppstop) after xios_finalize in nemogcm.F90 fixes this. Have the internals of xios_finalize changed? i.e. did it previously perform an MPI_FINALIZE but no longer does?

  • Taking the same (working) setup and assigning a pool with all the servers also works. The only differences in the iodef.xml between this and the previous one are:
    • iodef.xml

      diff -u iodef_nopool.xml iodef_pool.xml
             <variable id="using_oasis"      type="bool">false</variable>
           </variable_group>
         </variable_definition>
      +  <pool_definition>
      +   <pool name="pool_ocean" nprocs="4">
      +    <service name="owriter" nprocs="4" type="writer"/>
      +   </pool>
      +  </pool_definition>
         </context>

      +
       <!-- ============================================================================================ -->
       <!-- NEMO  CONTEXT add and suppress the components you need                                       -->
       <!-- ============================================================================================ -->

      -  <context id="nemo" src="./context_nemo.xml"/>       <!--  NEMO       -->
      +  <context id="nemo" default_pool_writer="pool_ocean" src="./context_nemo.xml"/>       <!--  NEMO       -->

       </simulation>

and additional attributes in all the <file id=...> settings, e.g.:

  • file_def_nemo-oce.xml

    diff -u Original/file_def_nemo-oce.xml  file_def_nemo-oce.xml
       <file_definition type="one_file" name="@expname@_@freq@_@startdate@_@enddate@" sync_freq="1mo" min_digits="4">

         <file_group id="5d" output_freq="5d"  output_level="10" enabled=".TRUE.">  <!-- 5d files -->
    -      <file id="file11" name_suffix="_grid_T" description="ocean T grid variables" >
    +      <file id="file11" name_suffix="_grid_T" mode="write" writer="owriter" description="ocean T grid variables" >
             <field field_ref="e3t"      />

Something to build on.

Last edited 8 months ago by acc

comment:3 Changed 8 months ago by ymipsl

<context id="nemo" default_pool_writer="owriter" src="./context_nemo.xml"/>

seems incorrect to me, since owriter is a service, not a pool.

Did you mean:

<context id="nemo" default_pool_writer="pool_ocean" src="./context_nemo.xml"/>

comment:4 Changed 8 months ago by acc

Yes, that was a cut and paste error. Already corrected.

comment:5 Changed 8 months ago by ymipsl

OK, great. So let's test it with gatherers and writers?

For the xios_finalize problem: XIOS calls MPI_Finalize only if MPI was not initialized before the xios_initialize call (i.e. XIOS calls MPI_Finalize only if it called MPI_Init itself, which it does only if MPI has not already been initialized).

Currently, what is done in your NEMO version?

comment:6 Changed 8 months ago by acc

Final test today: splitting the pool into 2 writer services (owriter1 and owriter2, each with 2 procs) and assigning different files to different services also works. For example, sending 5-day means to one service and monthly means to the other.
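
For the record, a minimal sketch of that kind of arrangement, using the same pool/service/writer syntax as the diffs above (the file ids and the exact split of frequencies between the two writers are illustrative, not the precise settings used):

    <pool_definition>
     <pool name="pool_ocean" nprocs="4">
      <service name="owriter1" nprocs="2" type="writer"/>
      <service name="owriter2" nprocs="2" type="writer"/>
     </pool>
    </pool_definition>

with each file in the file_def pointed at one of the writers, e.g.:

    <file id="file11" name_suffix="_grid_T" mode="write" writer="owriter1" description="ocean T grid variables" >  <!-- 5-day means -->
    <file id="file31" name_suffix="_grid_T" mode="write" writer="owriter2" description="ocean T grid variables" >  <!-- monthly means -->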

The MPI_Finalize issue was my fault. I had made local mods when trying to solve issues with the earlier version of XIOS3, and those included adding an MPI_INIT outside of xios_initialize. Neither that nor my added mppstop call is now required.

Next week, I'll check ORCA2 is still happy when run on more than one node and then re-try the eORCA025 with just writers before adding in the complication of gatherers.

comment:7 Changed 8 months ago by acc

Ok, tried the ORCA2_ICE_PISCES config over 2 nodes and it exhibits the same hanging behaviour as the larger config (regardless of transport protocol). If I disable the file_def files in the context_nemo.xml then it runs (but, obviously, doesn't create any output via xios). The same is true of the larger config spread over many more nodes.
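
To be explicit, "disabling the file_def files" means commenting out the file_definition includes in context_nemo.xml, along these lines (a sketch assuming the standard layout of that file; only the oce and ice entries are shown):

    <context id="nemo">
        ...
        <!-- disabled for testing: no output files via XIOS
        <file_definition src="./file_def_nemo-oce.xml"/>
        <file_definition src="./file_def_nemo-ice.xml"/>
        -->
        ...
    </context>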

I've put some print statements in with flushes and the code (with the file_defs reinstated) never returns from the call to iom_init_closedef at the top of step. Interestingly, it isn't the xios_close_context_definition() call that is hanging (at least, for the reporting core) but rather the xios_update_calendar call that follows:

 IOM closing def
 Calling xios_close_context_definition()
 Returned from xios_close_context_definition()
 Calling xios_update_calendar( 0 )

I'll need to verify this for all cores. At least that is manageable for the ORCA2 config.

The print statements currently are in stpmlf.F90:

      IF( kstp == nit000 ) THEN                       ! initialize IOM context (must be done after nemo_init for AGRIF+XIOS+OASIS)
                             IF(lwm) THEN ; write(numout,*) 'IOM initialise ' ; CALL FLUSH(numout) ; ENDIF
                             CALL iom_init( cxios_context, ld_closedef=.FALSE. )   ! for model grid (including possible AGRIF zoom)
                             IF(lwm) THEN ; write(numout,*) 'IOM initialised' ; CALL FLUSH(numout) ; ENDIF
         IF( lk_diamlr   )   CALL dia_mlr_iom_init    ! with additional setup for multiple-linear-regression analysis
                             IF(lwm) THEN ; write(numout,*) 'IOM closing def ' ; CALL FLUSH(numout) ; ENDIF
                             CALL iom_init_closedef
         IF( ln_crs      )   CALL iom_init( TRIM(cxios_context)//"_crs" )  ! for coarse grid
                             IF(lwm) THEN ; write(numout,*) 'IOM closed def ' ; CALL FLUSH(numout) ; ENDIF
      ENDIF

and iom.F90:

   SUBROUTINE iom_init_closedef(cdname)
      !!----------------------------------------------------------------------
      !!            ***  SUBROUTINE iom_init_closedef  ***
      !!----------------------------------------------------------------------
      !!
      !! ** Purpose : Closure of context definition
      !!
      !!----------------------------------------------------------------------
      CHARACTER(len=*), OPTIONAL, INTENT(IN) :: cdname
#if defined key_xios
      LOGICAL :: llrstw

      llrstw = .FALSE.
      IF(PRESENT(cdname)) THEN
         llrstw = (cdname == cw_ocerst_cxt)
         llrstw = llrstw .OR. (cdname == cw_icerst_cxt)
         llrstw = llrstw .OR. (cdname == cw_ablrst_cxt)
         llrstw = llrstw .OR. (cdname == cw_toprst_cxt)
         llrstw = llrstw .OR. (cdname == cw_sedrst_cxt)
      ENDIF

      IF( llrstw ) THEN
!set names of the fields in restart file IF using XIOS to write data
         CALL iom_set_rst_context(.FALSE.)
         CALL xios_close_context_definition()
      ELSE
         IF(lwm) THEN ; write(numout,*) 'Calling xios_close_context_definition()' ; CALL FLUSH(numout) ; ENDIF
         CALL xios_close_context_definition()
         IF(lwm) THEN ; write(numout,*) 'Returned from xios_close_context_definition()' ; CALL FLUSH(numout) ; ENDIF
         IF(lwm) THEN ; write(numout,*) 'Calling xios_update_calendar( 0 )' ; CALL FLUSH(numout) ; ENDIF
         CALL xios_update_calendar( 0 )
         IF(lwm) THEN ; write(numout,*) 'Returned from xios_update_calendar( 0 )' ; CALL FLUSH(numout) ; ENDIF
      ENDIF

comment:8 Changed 8 months ago by acc

OK, there are two rogues out of the 32 ocean cores:

for f in ocean.output*; do echo -n $f "  : "; tail -1l $f; done
ocean.output   :  Calling xios_update_calendar( 0 )
ocean.output_0001   :  IOM closed def
ocean.output_0002   :  IOM closed def
ocean.output_0003   :  IOM closed def
ocean.output_0004   :  IOM closed def
ocean.output_0005   :  IOM closed def
ocean.output_0006   :  IOM closed def
ocean.output_0007   :  IOM closed def
ocean.output_0008   :  IOM closed def
ocean.output_0009   :  IOM closed def
ocean.output_0010   :  IOM closed def
ocean.output_0011   :  IOM closed def
ocean.output_0012   :  IOM closed def
ocean.output_0013   :  IOM closed def
ocean.output_0014   :  IOM closed def
ocean.output_0015   :  IOM closed def
ocean.output_0016   :  Calling xios_update_calendar( 0 )
ocean.output_0017   :  IOM closed def
ocean.output_0018   :  IOM closed def
ocean.output_0019   :  IOM closed def
ocean.output_0020   :  IOM closed def
ocean.output_0021   :  IOM closed def
ocean.output_0022   :  IOM closed def
ocean.output_0023   :  IOM closed def
ocean.output_0024   :  IOM closed def
ocean.output_0025   :  IOM closed def
ocean.output_0026   :  IOM closed def
ocean.output_0027   :  IOM closed def
ocean.output_0028   :  IOM closed def
ocean.output_0029   :  IOM closed def
ocean.output_0030   :  IOM closed def
ocean.output_0031   :  IOM closed def

Does this give any clues?

comment:9 Changed 8 months ago by acc

Looks like there is still some dodgy MPI going on. The 2-node case above can be made to run if I switch network protocols and mpich libraries in the run script:

module swap craype-network-ofi craype-network-ucx
module swap cray-mpich cray-mpich-ucx
module load libfabric

despite these not being in the environment when either NEMO or xios3 were compiled.

Unfortunately, the same trick doesn't help the larger eORCA025 test, which hangs in the call to xios_close_context_definition:

 IOM initialise
 IOM initialised

 dia_mlr_iom_init : IOM context setup for multiple-linear-regression
 ~~~~~~~~~~~~~~~~
 diamlr: configuration not found or incomplete (field group 'diamlr_fields'
         and/or file group 'diamlr_files' and/or field 'diamlr_time' missing);
         disabling output for multiple-linear-regression analysis.
 IOM closing def
 Calling xios_close_context_definition()

with all 1019 ocean cores at the same point.

Some tips on where and how to instrument the internals of xios_close_context_definition would be useful.

comment:10 Changed 8 months ago by acc

Found a typo in the file_def xml for the eORCA025 case, the effect of which was to fail to name the writer. Fixing this, and swapping from OFI to UCX networks in the run-time environment, has got the eORCA025 case running (albeit with just a single 2D field for now). It still seems too easy to break this in unfathomable ways, but it is progress.

Successful tests so far (a sketch of the two-pool arrangement follows the list below):

  • 1019 ocean cores ; 16 xios3 servers in 1 pool (1 writer) [multiple file o/p]
  • 1019 ocean cores ; 16 xios3 servers in 2 pools (10 in gatherer pool , 6 in writer pool) [multiple file o/p]
  • 1019 ocean cores ; 16 xios3 servers in 2 pools (15 in gatherer pool , 1 in writer pool)
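
As a hedged sketch, the two-pool arrangement in the second test looks roughly as follows (names are illustrative; the writer syntax matches the earlier diffs, and type="gatherer" is assumed to be the corresponding service type for the gathering role):

    <pool_definition>
     <pool name="pool_gather" nprocs="10">
      <service name="ogatherer" nprocs="10" type="gatherer"/>
     </pool>
     <pool name="pool_write" nprocs="6">
      <service name="owriter" nprocs="6" type="writer"/>
     </pool>
    </pool_definition>

with files still selecting the writer via mode="write" writer="owriter" as before (the attribute used to route a file through the gatherer service is not shown here).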

comment:11 Changed 8 months ago by ymipsl

Hi Andrew,

So, good news! As I said, we see different behaviour depending on the network layer used. It seems UCX will be the way forward and is generally more stable, but not in every case. We also had some problems with Rget under OpenMPI+UCX when using the one_sided protocol. For this reason I am developing a point-to-point transport protocol using more mature technologies, I hope. It now works on one of the main supercomputers we use; I am debugging to find what is wrong on the second one (a message gets corrupted for no evident reason).

For services, you are right: as this is all fairly new, we don't yet have enough safeguards to avoid this kind of error (if a service does not exist, the client will wait until it is declared, possibly forever, so we should probably add a timeout).

comment:12 Changed 8 months ago by ymipsl

The new point-to-point transport protocol seems to have stabilized.

Could you test it by setting the XIOS variable:

transport_protocol = p2p

comment:13 Changed 8 months ago by acc

Yes, the new transport protocol seems to improve stability. With XIOS3 trunk@2558 and:

        <variable id="transport_protocol" type="string" >p2p</variable>

my eORCA025 tests will run with both the default OFI network and the UCX framework. There is no measurable difference in performance but I've only done a few short tests and timing on the cluster is always variable anyway.

To summarise we now have:

network/mpich                                               transport protocol   status
craype-network-ofi:cray-mpich/8.1.23                        legacy               hangs in xios_close_context_definition
craype-network-ofi:cray-mpich/8.1.23                        p2p                  runs to completion
cray-ucx/2.7.0-1:craype-network-ucx:cray-mpich-ucx/8.1.23   legacy               runs to completion
cray-ucx/2.7.0-1:craype-network-ucx:cray-mpich-ucx/8.1.23   p2p                  runs to completion

There are however lots of warnings in the UCX slurm o/p (with p2p) which may have nothing to do with XIOS (but don't encourage trust):

grep rcache.c slurm-4388959.out | wc -l
4140
grep mpool.c slurm-4388959.out | wc -l
1904256
grep mm_xpmem slurm-4388959.out | wc -l
119097

Typical examples of these warning messages:

[1694028823.456409] [nid001511:101564:0]       mm_xpmem.c:86   UCX  WARN  remote segment id 200018d01 apid 4600018cbc is not released, refcount 2

[1694028823.391433] [nid001417:150944:0]          mpool.c:43   UCX  WARN  object 0x34a20c0 was not returned to mpool ucp_rkeys

[1694028823.445153] [nid001511:101564:0]         rcache.c:512  UCX  WARN  mlx5_0: destroying inuse region 0xe047950 [0x14eec3e3e000..0x14eec3e3e010] g- rw ref 1 lkey 0xa98655 rkey 0xa98655 atomic_rkey 0xffffffff

As a consequence of these warnings, the log files are 250MB in size. Legacy-protocol runs with UCX do not produce these warnings. The OFI-p2p runs produce much smaller logs (1.5MB, but still 3x the size of the UCX-legacy logs); the issue there seems to be at the end of the run, with lots of:

Server Context destructor
Server Context destructor
Server Context destructor
MPICH ERROR [Rank 282] [job id 4393814.0] [Thu Sep  7 09:54:05 2023] [nid001523] - Abort(806969871) (rank 282 in comm 0): Fatal error in PMPI_Finalize: Other MPI error, error stack:
PMPI_Finalize(214)...............: MPI_Finalize failed
PMPI_Finalize(161)...............:
MPID_Finalize(710)...............:
MPIDI_OFI_mpi_finalize_hook(1016): OFI domain close failed (ofi_init.c:1016:MPIDI_OFI_mpi_finalize_hook:Device or resource busy)

aborting job:
Fatal error in PMPI_Finalize: Other MPI error, error stack:

messages, despite ocean.output seemingly having finished tidily and the output netCDF files all being correct.

comment:14 Changed 7 months ago by acc

An update:

  1. UCX warnings on the Cray can be suppressed with:
export UCX_LOG_LEVEL=FATAL

in the run script. Hopefully, ignoring them is the way to go.

  2. The Cray still has issues with longer runs (a full eORCA025 config runs out of memory after 11 months), but this is just as likely to be a Cray issue with a poor UCX implementation (see: https://docs.archer2.ac.uk/known-issues/#excessive-memory-use-when-using-ucx-communications-protocol-added-2023-07-20). Unfortunately, HPE's current "solution" is to use OFI - which fails even earlier with an undiagnosed segmentation violation.
  3. The exact same setup runs happily on our INTEL cluster. Or rather, it was running fine; changes at revision 2566 seem to have broken it. The generic test case examples still run, but the eORCA025 config is now back to hanging at start-up. I'll attach a picture of what was running quite successfully.

This is all using the p2p transport protocol

Changed 7 months ago by acc

eORCA025 1 pool many services example

comment:15 Changed 7 months ago by acc

[attached figure xios3_o25ex.png: eORCA025 1 pool many services example]
A more demanding example: eORCA025 running on 1019 cores with 1 pool of 52 servers divided into 14 services. This arrangement allows robust and efficient single-file output for all 5-day, monthly and annual means.
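
Purely as a schematic of that kind of layout (the real division into 14 services is in the attached figure; the service names and sizes below are placeholders, not the actual ones):

    <pool_definition>
     <pool name="pool_nemo" nprocs="52">
      <service name="ogatherer"  nprocs="10" type="gatherer"/>   <!-- placeholder size -->
      <service name="owriter_5d" nprocs="4"  type="writer"/>     <!-- 5-day means; placeholder size -->
      <service name="owriter_1m" nprocs="4"  type="writer"/>     <!-- monthly means; placeholder size -->
      <service name="owriter_1y" nprocs="2"  type="writer"/>     <!-- annual means; placeholder size -->
      <!-- ... further services, 14 in total ... -->
     </pool>
    </pool_definition>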

comment:16 Changed 7 months ago by jderouillat

Hi, can you confirm that it was OK with 2565?

comment:17 Changed 7 months ago by acc

It was fine at 2565.

Last edited 7 months ago by acc

comment:18 Changed 7 months ago by jderouillat

Could you try to revert the part below:

===================================================================
--- XIOS3/trunk/src/transport/p2p_context_server.cpp	2023-09-14 07:06:45 UTC (rev 2565)
+++ XIOS3/trunk/src/transport/p2p_context_server.cpp	2023-09-14 07:32:36 UTC (rev 2566)
@@ -97,17 +97,19 @@
     {
       traceOff();
       MPI_Iprobe(MPI_ANY_SOURCE, 20,interCommMerged_, &flag, &status);
+
       traceOn();
       if (flag==true)
       {
         int rank=status.MPI_SOURCE ;
-        requests_[rank].push_back(new CRequest(interCommMerged_, status)) ;
+        auto& rankRequests = requests_[rank];
+        rankRequests.push_back(new CRequest(interCommMerged_, status)) ;
         // Test 1st request of the list, request treatment must be ordered 
-        if (requests_[rank].front()->test())
+        if (rankRequests.front()->test())
         {
-          processRequest( *(requests_[rank].front()) );
-          delete requests_[rank].front();
-          requests_[rank].pop_front() ;
+          processRequest( *(rankRequests.front()) );
+          delete rankRequests.front();
+          rankRequests.pop_front() ;

comment:19 Changed 7 months ago by acc

Apologies, this may have been a false alarm. I updated to 2566 ready to try your suggestion and thought it best to check it was still playing up. However, it now runs without having to revert the code. I'll check a few more times with this (and the HEAD) but it looks to have been a local issue with our cluster.

comment:20 Changed 7 months ago by acc

...or maybe not. Revision 2566 appears to be OK, but the current HEAD (rev 2569) is hanging - three attempts, all hang. Probably too late on a Friday to be definitive. Will confirm next week.

comment:21 Changed 7 months ago by jderouillat

Indeed, commit 2569 hangs with many services; this is something I am working on.
To continue your tests, you can comment out the call to:

cleanSplitSchedulers();

comment:22 Changed 7 months ago by jderouillat

line 94 of src/event_scheduler.cpp

comment:23 Changed 7 months ago by acc

Thanks. I'll stick with 2566 for now (or is the:

 stack_.clear(); 

line an important addition?)

comment:24 Changed 7 months ago by jderouillat

You can keep 2566 for now; the stack_.clear() is just part of the final cleanup.

I've just committed a new implementation of cleanSplitSchedulers(), which was producing the hang. I have tested it on my own, but not in a case as detailed as yours (which I'll try to mimic). If you have the opportunity to test it, I'd be grateful.

comment:25 Changed 7 months ago by acc

Yes, I can confirm that 2570 no longer hangs. However, my model now blows up after 10 timesteps with some NaNs appearing, so it looks like messages are getting corrupted. The same code and inputs still work with xios3@2566.

First error in the slurm log is:

==== backtrace (tid:  39343) ====
 0 0x000000000021b780 MPIDIU_comm_rank_to_av()  /build/impi/_buildspace/release/../../src/mpid/ch4/src/ch4r_proc.h:105
 1 0x000000000021b780 MPIDI_OFI_comm_to_phys_vci()  /build/impi/_buildspace/release/../../src/mpid/ch4/netmod/include/../ofi/ofi_impl.h:513
 2 0x000000000021b780 MPIDI_OFI_comm_to_phys()  /build/impi/_buildspace/release/../../src/mpid/ch4/netmod/include/../ofi/ofi_impl.h:519
 3 0x000000000021b780 MPIDI_OFI_am_fetch_incr_send_seqno()  /build/impi/_buildspace/release/../../src/mpid/ch4/netmod/include/../ofi/ofi_am_impl.h:22
 4 0x000000000021b780 MPIDI_OFI_do_inject()  /build/impi/_buildspace/release/../../src/mpid/ch4/netmod/include/../ofi/ofi_am_impl.h:543
 5 0x000000000021b780 MPIDI_NM_am_send_hdr_reply()  /build/impi/_buildspace/release/../../src/mpid/ch4/netmod/include/../ofi/ofi_am.h:158
 6 0x000000000021b780 win_unlock_proc()  /build/impi/_buildspace/release/../../src/mpid/ch4/src/ch4r_rma_target_callbacks.c:534
 7 0x000000000021b780 MPIDIG_win_ctrl_target_msg_cb()  /build/impi/_buildspace/release/../../src/mpid/ch4/src/ch4r_rma_target_callbacks.c:1430
 8 0x00000000005f7891 MPIDI_OFI_handle_short_am_hdr()  /build/impi/_buildspace/release/../../src/mpid/ch4/netmod/include/../ofi/ofi_am_events.h:194
 9 0x00000000005f7891 am_recv_event()  /build/impi/_buildspace/release/../../src/mpid/ch4/netmod/ofi/ofi_events.c:680
10 0x00000000005ee099 MPIDI_OFI_dispatch_function()  /build/impi/_buildspace/release/../../src/mpid/ch4/netmod/ofi/ofi_events.c:822
11 0x00000000005ed160 MPIDI_OFI_handle_cq_entries()  /build/impi/_buildspace/release/../../src/mpid/ch4/netmod/ofi/ofi_events.c:952
12 0x000000000060e9e8 MPIDI_OFI_progress()  /build/impi/_buildspace/release/../../src/mpid/ch4/netmod/ofi/ofi_progress.c:40
13 0x00000000004be475 MPIDI_OFI_do_iprobe..0()  /build/impi/_buildspace/release/../../src/mpid/ch4/netmod/include/../ofi/ofi_probe.h:105
14 0x00000000004bfc09 MPIDI_NM_mpi_iprobe()  /build/impi/_buildspace/release/../../src/mpid/ch4/netmod/include/../ofi/ofi_probe.h:204
15 0x00000000004bfc09 MPIDI_iprobe_handoff()  /build/impi/_buildspace/release/../../src/mpid/ch4/src/ch4_probe.h:59
16 0x00000000004bfc09 MPIDI_iprobe_unsafe()  /build/impi/_buildspace/release/../../src/mpid/ch4/src/ch4_probe.h:88
17 0x00000000004bfc09 MPIDI_iprobe_safe()  /build/impi/_buildspace/release/../../src/mpid/ch4/src/ch4_probe.h:244
18 0x00000000004bfc09 MPID_Iprobe()  /build/impi/_buildspace/release/../../src/mpid/ch4/src/ch4_probe.h:394
19 0x00000000004bfc09 PMPI_Iprobe()  /build/impi/_buildspace/release/../../src/mpi/pt2pt/iprobe.c:110
20 0x00000000011d3663 xios::CRessourcesManager::eventLoop()  ???:0
21 0x000000000099e254 xios::CDaemonsManager::eventLoop()  ???:0
22 0x00000000011e3a83 xios::CServer::initialize()  ???:0
23 0x0000000000997db6 xios::CXios::initServerSide()  ???:0
24 0x0000000000458a5b MAIN__()  ???:0
25 0x0000000001414316 main()  ???:0
26 0x0000000000022555 __libc_start_main()  ???:0
27 0x0000000000458969 _start()  ???:0
=================================
forrtl: severe (174): SIGSEGV, segmentation fault occurred

comment:26 Changed 7 months ago by acc

Oops, sorry, false alarm. I had set up a test version, to avoid overwriting the working version, and it had a different set of compile keys. The correctly compiled version works identically at 2566 and 2570.

comment:27 Changed 7 months ago by acc

...actually, not quite identically. Both complete all the requested 744 timesteps (1 month) and the run.stat files are identical, but the version using 2570 crashes out at the end before all the monthly means are written. The backtrace looks like the one above, which probably just means parts have aborted. There is no indication in the NEMO output of any issues. The grid_W and icemod monthly means are complete, but the larger grid_T, grid_U and grid_V files are not.

comment:28 Changed 7 months ago by jderouillat

I found some problems revealed using IntelMPI; I have temporarily reverted this part.
