[AGRIF] Intel build (with key_agrif) based on ORCA2_LIM seg faults
 unsolved

Hello,

I'm trying to run Intel-compiled NEMO using key_agrif with a configuration based on ORCA2_LIM.

Unfortunately, the run seg faults after around 25 minutes.

An error is also reported in the 1_ocean.output file; it reads

stpctl: the zonal velocity is larger than 20 m/s

and below that there is a line that reads

kt=  2353 max abs(U):   151.5    , i j k:   157   70    1

Do you think this error is causing the segmentation faults?
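For anyone comparing runs, the diagnostic line can be pulled apart mechanically. The following is a minimal Python sketch, assuming the line always has the layout of the single example quoted above; the function name parse_stpctl_line and the pattern are illustrative, not part of NEMO.

```python
import re

# Pattern for the diagnostic line printed below the stpctl error.
# The layout is an assumption based on the one example quoted in this post.
STPCTL_RE = re.compile(
    r"kt=\s*(\d+)\s+max abs\(U\):\s*([0-9.eE+-]+)\s*,\s*i j k:\s*(\d+)\s+(\d+)\s+(\d+)"
)


def parse_stpctl_line(line):
    """Return (kt, umax, i, j, k), or None if the line does not match."""
    match = STPCTL_RE.search(line)
    if match is None:
        return None
    kt, umax, i, j, k = match.groups()
    return int(kt), float(umax), int(i), int(j), int(k)


print(parse_stpctl_line("kt=  2353 max abs(U):   151.5    , i j k:   157   70    1"))
```

The returned tuple gives the time step, the maximum absolute zonal velocity and the grid indices where it occurred, which makes it easy to compare different builds.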

Below is a full description of my setup. I'm running on ARCHER, a Cray XC-30.

First off, I downloaded these versions of XIOS and NEMO.

svn co -r 703 http://forge.ipsl.jussieu.fr/ioserver/svn/XIOS/branchs/xios-1.0
svn --username 'mbareford' co -r 9301 http://forge.ipsl.jussieu.fr/nemo/svn/branches/2015/nemo_v3_6_STABLE/NEMOGCM

Next, I set up the following module environment on ARCHER.

module swap PrgEnv-cray PrgEnv-intel/5.2.82
module load cray-hdf5-parallel/1.10.0.1
module load cray-netcdf-hdf5parallel/4.4.1.1
module load boost/1.60

The PrgEnv-intel module loads intel/17.0.0.098.

I then compiled XIOS, followed by NEMO. The make command for NEMO is below.

./makenemo -r ORCA2_LIM -n ORCA2_AGRIF_INTEL -m XC_ARCHER_INTEL add_key "key_agrif"

The machine configuration specified by XC_ARCHER_INTEL ensures that the build is linked to XIOS v1.0 libraries. The compile flags are as follows.

-integer-size 32 -real-size 64 -g -O0 -fp-model source -zero -fpp -warn all

After building NEMO, I populated the EXP00 folder with the files archived in ORCA2_LIM_nemo_v3.6.tar.

Finally, I'm running NEMO in attached mode (using_server=false), so the aprun command is as follows.

...
#PBS -l select=4
...
aprun -n 24 -N 6 $NEMO_PATH/nemo.exe

The submission script selects 4 ARCHER nodes, each with 24 cores.
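The placement implied by the aprun line can be checked with a little arithmetic: -n is the total number of processes and -N the number placed on each node. The Python sketch below is illustrative only; the helper name aprun_layout is hypothetical.

```python
def aprun_layout(n_procs, per_node, cores_per_node):
    """Nodes needed and idle cores per node for `aprun -n n_procs -N per_node`."""
    nodes_needed = -(-n_procs // per_node)  # ceiling division
    idle_per_node = cores_per_node - per_node
    return nodes_needed, idle_per_node


# Values taken from the job described above: 24 processes, 6 per node, 24 cores per node.
nodes, idle = aprun_layout(n_procs=24, per_node=6, cores_per_node=24)
print(nodes, idle)
```

With these values the job fills all 4 selected nodes and leaves 18 of the 24 cores on each node unused.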

  • Message #56

    Just a side comment: this is not the first time I have edited your messages to improve the web rendering.
    A readable message makes it easier to identify its different parts and the important information in it.

    I suggest you take a look at the following pages to learn a little of the Trac wiki syntax: Wiki Formatting | Wiki Processors.
    Or you can click on 'Edit' to see how I modified your text.

  • Message #57

    Thank you for editing my last post. I will endeavour to make use of the Trac wiki syntax in future.

    I've recompiled NEMO with the -traceback option so as to track the source of the seg faults. The standard output file now contains a stack trace.

    forrtl: severe (174): SIGSEGV, segmentation fault occurred
    Image              PC                Routine            Line        Source
    nemo.exe           00000000020B3DF1  Unknown               Unknown  Unknown
    nemo.exe           00000000020B1F2B  Unknown               Unknown  Unknown
    nemo.exe           00000000020625A4  Unknown               Unknown  Unknown
    nemo.exe           00000000020623B6  Unknown               Unknown  Unknown
    nemo.exe           0000000001FEF404  Unknown               Unknown  Unknown
    nemo.exe           0000000001FF27B0  Unknown               Unknown  Unknown
    nemo.exe           0000000001D33AD0  Unknown               Unknown  Unknown
    nemo.exe           0000000000AED2B5  histcom_mp_histwr        2017  histcom.f90
    nemo.exe           0000000000AEA9E1  histcom_mp_histw_        1872  histcom.f90
    nemo.exe           0000000000AE79C8  histcom_mp_histwr        1676  histcom.f90
    nemo.exe           0000000000C3C790  limwri_2_mp_sub_l         215  limwri_2.f90
    nemo.exe           0000000000C3B24C  limwri_2_mp_lim_w         143  limwri_2.f90
    nemo.exe           000000000082F2F3  diawri_mp_sub_loo         740  diawri.f90
    nemo.exe           000000000082D58A  diawri_mp_dia_wri         591  diawri.f90
    nemo.exe           0000000000473AFB  step_mp_sub_loop_         508  step.f90
    nemo.exe           000000000046D26E  step_mp_stp_               84  step.f90
    nemo.exe           0000000000743FC6  agrif_util_mp_agr         570  modutil.f90
    nemo.exe           00000000004739B3  step_mp_sub_loop_         494  step.f90
    nemo.exe           000000000046D26E  step_mp_stp_               84  step.f90
    nemo.exe           00000000004013BD  nemogcm_mp_sub_lo         190  nemogcm.f90
    nemo.exe           00000000004011B1  nemogcm_mp_nemo_g         112  nemogcm.f90
    nemo.exe           0000000000400D36  MAIN__                     22  nemo.f90
    nemo.exe           000000000206C6D2  Unknown               Unknown  Unknown
    nemo.exe           00000000020CF521  Unknown               Unknown  Unknown
    nemo.exe           0000000000400BC9  Unknown               Unknown  Unknown
    

    At lines 507-508 in step.f90 it looks like the code is preparing to abort, possibly in response to the zonal velocity error detected earlier.

    IF( indic < 0        )   THEN
                                   CALL ctl_stop( 'step: indic < 0' )
                                   CALL dia_wri_state( 'output.abort', kstp )
    ENDIF
    

    It's the writing out of the current state that seems to be causing the seg faults.

    Thanks, Michael
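When reading a trace like the one in this message, the useful entry is the first frame that carries a line number and source file. The Python sketch below picks it out, assuming the five-column layout shown above (image, PC, routine, line, source); the function name first_known_frame is hypothetical.

```python
def first_known_frame(trace_lines):
    """Return (routine, line, source) of the first frame with source info, else None."""
    for raw in trace_lines:
        fields = raw.split()
        # Expected columns: image, PC, routine, line, source.
        if len(fields) != 5:
            continue
        _image, _pc, routine, line, source = fields
        if line.isdigit():  # frames without debug info show "Unknown" here
            return routine, int(line), source
    return None


# A few lines copied from the trace above.
trace = [
    "Image              PC                Routine            Line        Source",
    "nemo.exe           00000000020B3DF1  Unknown               Unknown  Unknown",
    "nemo.exe           0000000000AED2B5  histcom_mp_histwr        2017  histcom.f90",
    "nemo.exe           0000000000AEA9E1  histcom_mp_histw_        1872  histcom.f90",
]
print(first_known_frame(trace))
```

For this trace the first frame with source information is line 2017 of histcom.f90. Routine names appear truncated exactly as the runtime printed them.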

  • Message #58

    I've just attached a zip file that contains four files.

    1. NEMOintel.o69100: the standard output file
    2. ocean.output
    3. 1_ocean.output
    4. EXP00_listing.txt: a listing of all the files in the EXP00 folder.
  • Message #59

    The subroutine in histcom.f90 indicated by the seg fault stack trace is histwrite_real, and the specific line identified (2017) is an array assignment.

    tbf_2(1:nx*ny*nz) = tbf_1(kt+1:kt+nx*ny*nz)
    

    Does this ORCA2_LIM agrif test fail for you when you run it?
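One possible way for an assignment like this to fault is a slice that runs past the end of either buffer. This is only a guess at a failure mode, not a diagnosis; Intel's -check bounds option would test it at run time. The Python sketch below spells out the condition, treating kt simply as the offset that appears in the expression. The function name and the example sizes are hypothetical.

```python
def copy_is_in_bounds(kt, nx, ny, nz, len_tbf_1, len_tbf_2):
    """Check tbf_2(1:n) = tbf_1(kt+1:kt+n) with n = nx*ny*nz, using 1-based bounds."""
    n = nx * ny * nz
    source_ok = kt >= 0 and kt + n <= len_tbf_1  # last element read is tbf_1(kt+n)
    target_ok = n <= len_tbf_2                   # last element written is tbf_2(n)
    return source_ok and target_ok


# Made-up sizes, chosen only to exercise both outcomes.
print(copy_is_in_bounds(kt=0, nx=4, ny=3, nz=2, len_tbf_1=24, len_tbf_2=24))
print(copy_is_in_bounds(kt=24, nx=4, ny=3, nz=2, len_tbf_1=24, len_tbf_2=24))
```

The first call stays inside both buffers; the second reads past the end of the source buffer.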

  • Message #60

    Hello Michael,

    Yes, I got exactly the same error at the same time step.
    We are going to investigate it and we will let you know.

    Regards,
    Nicolas

  • Message #61

    Hi Nicolas,

    Compiling NEMO with the Cray compilers (crayftn v8.5.8 to be precise) does improve the situation.

    The code no longer seg faults, but it still aborts once it finds that the zonal velocity is larger than 20 m/s. However, the velocity is not as high as in the Intel case.

    kt=   265 max abs(U):   29.93    , i j k:    65   77    6
    

    When using the Cray compiler I had to add the key_nosignedzero key (see below).

    ./makenemo -r ORCA2_LIM -n ORCA2_AGRIF_CRAY -m XC_ARCHER add_key "key_nosignedzero key_agrif"
    

    And below are the Cray compiler options I used.

    -em -s integer32 -s real64 -g -O0 -e0 -eZ
    

    Regards,
    Michael

  • Message #62

    I would not call that an improvement, even though the seg fault is gone: the run completed only 265 time steps because the instability appears earlier.

    You should have a snapshot of the variable values for the last time steps (*output.abort*.nc) in your run directory; you can inspect it with NetCDF tools (CDO, ncview, …) to find the origin of the instability.
    If you don't want to rebuild files, you can produce single files instead by changing the type setting in iodef.xml:

    <file_definition type="one_file" name="@expname@_@freq@_@startdate@_@enddate@" sync_freq="10d" min_digits="4">
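Once a velocity field from the abort snapshot has been read in, finding where it blows up is a simple search. The Python sketch below assumes the field is already held as nested lists ordered [k][j][i]; reading the NetCDF file is left out, and the function name locate_max_abs and the tiny field are made up for illustration.

```python
def locate_max_abs(field):
    """Return (max_abs, i, j, k) with 1-based indices for a field stored as field[k][j][i]."""
    best = None
    for k, level in enumerate(field, start=1):
        for j, row in enumerate(level, start=1):
            for i, value in enumerate(row, start=1):
                if best is None or abs(value) > best[0]:
                    best = (abs(value), i, j, k)
    return best


# Tiny made-up field: 2 levels, 2 rows, 3 columns.
u = [
    [[0.1, -0.2, 0.3], [0.0, 0.5, -0.4]],
    [[0.2, -29.93, 0.1], [0.3, 0.0, 0.1]],
]
print(locate_max_abs(u))
```

The 1-based indices are chosen to match the "i j k" convention of the stpctl message, so the result can be compared with ocean.output directly.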
    

    The Agulhas configuration is a demonstrator of the AGRIF nesting possibilities, but it has not been heavily tested because of a lack of scientific interest.
