[ref-cfg][solved] Intel build (with key_agrif) based on ORCA2_LIM seg faults

Hello,

I'm trying to run Intel-compiled NEMO using key_agrif with a configuration based on ORCA2_LIM.

Unfortunately, the run seg faults after around 25 minutes.

An error is also reported in the 1_ocean.output file; it reads

stpctl: the zonal velocity is larger than 20 m/s

and below that there is a line that reads

kt=  2353 max abs(U):   151.5    , i j k:   157   70    1

Do you think this error is causing the segmentation faults?
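For comparing runs, the stpctl diagnostic line quoted above can be pulled out of ocean.output mechanically. The following is a minimal Python sketch, not part of NEMO: the regular expression assumes the line layout shown in this post, and the name parse_stpctl_line is hypothetical.

```python
import re

# Assumed layout, copied from the 1_ocean.output excerpt in this post:
#   kt=  2353 max abs(U):   151.5    , i j k:   157   70    1
STPCTL_RE = re.compile(
    r"kt=\s*(\d+)\s+max abs\(U\):\s*([0-9.eE+-]+)\s*,\s*i j k:\s*(\d+)\s+(\d+)\s+(\d+)"
)

def parse_stpctl_line(line):
    """Return (kt, umax, (i, j, k)) for a stpctl velocity line, or None."""
    m = STPCTL_RE.search(line)
    if m is None:
        return None
    kt = int(m.group(1))          # time step at which the check tripped
    umax = float(m.group(2))      # maximum absolute zonal velocity, m/s
    i, j, k = (int(m.group(n)) for n in (3, 4, 5))  # grid indices of the maximum
    return kt, umax, (i, j, k)

print(parse_stpctl_line("kt=  2353 max abs(U):   151.5    , i j k:   157   70    1"))
```

Lines that do not match the assumed layout return None, so the function can be applied to every line of the file.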

A full description of my setup follows. I'm running on ARCHER, a Cray XC-30.

First off, I downloaded these versions of XIOS and NEMO.

svn co -r 703 http://forge.ipsl.jussieu.fr/ioserver/svn/XIOS/branchs/xios-1.0
svn --username 'mbareford' co -r 9301 http://forge.ipsl.jussieu.fr/nemo/svn/branches/2015/nemo_v3_6_STABLE/NEMOGCM

Next, I set up the following module environment on ARCHER.

module swap PrgEnv-cray PrgEnv-intel/5.2.82
module load cray-hdf5-parallel/1.10.0.1
module load cray-netcdf-hdf5parallel/4.4.1.1
module load boost/1.60

The PrgEnv-intel module loads intel/17.0.0.098.

I then compiled XIOS, followed by NEMO. The build command for NEMO is below.

./makenemo -r ORCA2_LIM -n ORCA2_AGRIF_INTEL -m XC_ARCHER_INTEL add_key "key_agrif"

The machine configuration specified by XC_ARCHER_INTEL ensures that the build is linked to XIOS v1.0 libraries. The compile flags are as follows.

-integer-size 32 -real-size 64 -g -O0 -fp-model source -zero -fpp -warn all

After building NEMO, I populated the EXP00 folder with the files archived in ORCA2_LIM_nemo_v3.6.tar.

Finally, I'm running NEMO in attached mode (using_server=false), so the aprun command is as follows.

...
#PBS -l select=4
...
aprun -n 24 -N 6 $NEMO_PATH/nemo.exe

The submission script selects 4 ARCHER nodes, each with 24 cores, so aprun places 6 of the 24 MPI processes on each node.
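The layout implied by the aprun options can be checked with a little arithmetic. The sketch below is illustrative only; aprun_layout is a hypothetical helper, and it simply reads -n as the total number of MPI processes and -N as the number of processes per node.

```python
import math

def aprun_layout(total_ranks, ranks_per_node, cores_per_node):
    """Return (nodes_needed, fraction of each node's cores hosting an MPI rank)."""
    if ranks_per_node > cores_per_node:
        raise ValueError("more ranks per node than cores per node")
    nodes_needed = math.ceil(total_ranks / ranks_per_node)
    return nodes_needed, ranks_per_node / cores_per_node

# Values from this post: aprun -n 24 -N 6 on nodes with 24 cores each.
nodes, fraction = aprun_layout(total_ranks=24, ranks_per_node=6, cores_per_node=24)
print(nodes, fraction)
```

With these values the job needs the 4 nodes requested by the select line and runs underpopulated, with a quarter of the cores on each node occupied.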

  • Message #56

    Just a side comment: this is not the first time I have edited your messages to improve the web rendering.
    A readable message makes it easier to identify its different parts and the important information.

    I suggest you have a look at the following pages to learn a little of the Trac wiki syntax: Wiki Formatting | Wiki Processors.
    Or you can click on 'Edit' to see how I modified your text.

    • Message #57

      Thank you for editing my last post. I will endeavour to make use of the Trac wiki syntax in future.

      I've recompiled NEMO with the -traceback option so as to track the source of the seg faults. The standard output file now contains a stack trace.

      forrtl: severe (174): SIGSEGV, segmentation fault occurred
      Image              PC                Routine            Line        Source
      nemo.exe           00000000020B3DF1  Unknown               Unknown  Unknown
      nemo.exe           00000000020B1F2B  Unknown               Unknown  Unknown
      nemo.exe           00000000020625A4  Unknown               Unknown  Unknown
      nemo.exe           00000000020623B6  Unknown               Unknown  Unknown
      nemo.exe           0000000001FEF404  Unknown               Unknown  Unknown
      nemo.exe           0000000001FF27B0  Unknown               Unknown  Unknown
      nemo.exe           0000000001D33AD0  Unknown               Unknown  Unknown
      nemo.exe           0000000000AED2B5  histcom_mp_histwr        2017  histcom.f90
      nemo.exe           0000000000AEA9E1  histcom_mp_histw_        1872  histcom.f90
      nemo.exe           0000000000AE79C8  histcom_mp_histwr        1676  histcom.f90
      nemo.exe           0000000000C3C790  limwri_2_mp_sub_l         215  limwri_2.f90
      nemo.exe           0000000000C3B24C  limwri_2_mp_lim_w         143  limwri_2.f90
      nemo.exe           000000000082F2F3  diawri_mp_sub_loo         740  diawri.f90
      nemo.exe           000000000082D58A  diawri_mp_dia_wri         591  diawri.f90
      nemo.exe           0000000000473AFB  step_mp_sub_loop_         508  step.f90
      nemo.exe           000000000046D26E  step_mp_stp_               84  step.f90
      nemo.exe           0000000000743FC6  agrif_util_mp_agr         570  modutil.f90
      nemo.exe           00000000004739B3  step_mp_sub_loop_         494  step.f90
      nemo.exe           000000000046D26E  step_mp_stp_               84  step.f90
      nemo.exe           00000000004013BD  nemogcm_mp_sub_lo         190  nemogcm.f90
      nemo.exe           00000000004011B1  nemogcm_mp_nemo_g         112  nemogcm.f90
      nemo.exe           0000000000400D36  MAIN__                     22  nemo.f90
      nemo.exe           000000000206C6D2  Unknown               Unknown  Unknown
      nemo.exe           00000000020CF521  Unknown               Unknown  Unknown
      nemo.exe           0000000000400BC9  Unknown               Unknown  Unknown
      

      At lines 507-508 in step.f90 it looks like the code is preparing to abort, possibly in response to the zonal velocity error detected earlier.

      IF( indic < 0        )   THEN
                                     CALL ctl_stop( 'step: indic < 0' )
                                     CALL dia_wri_state( 'output.abort', kstp )
      ENDIF
      

      It's the writing out of the current state that seems to be causing the seg faults.

      Thanks, Michael

      • Message #58

        I've just attached a zip file that contains four files.

        1. NEMOintel.o69100: the standard output file
        2. ocean.output
        3. 1_ocean.output
        4. EXP00_listing.txt: a listing of all the files in the EXP00 folder.
        • Message #59

          The subroutine in histcom.f90 that the seg fault stack trace points to is called histwrite_real, and the specific line identified (2017) is an array assignment.

          tbf_2(1:nx*ny*nz) = tbf_1(kt+1:kt+nx*ny*nz)
          

          Does this ORCA2_LIM agrif test fail for you when you run it?

          • Message #60

            Hello Michael,

            Yes, I got exactly the same error and time step.
            We are going to investigate it and we will let you know.

            Regards,
            Nicolas

            • Message #61

              Hi Nicolas,

              Compiling NEMO with the Cray compilers (crayftn v8.5.8 to be precise) does improve the situation.

              The code no longer seg faults, but it does still abort once it finds that the zonal velocity is larger than 20 m/s. However, the velocity is not as high as in the Intel case.

              kt=   265 max abs(U):   29.93    , i j k:    65   77    6
              

              When using the Cray compiler I had to add the key_nosignedzero key, as shown below.

              ./makenemo -r ORCA2_LIM -n ORCA2_AGRIF_CRAY -m XC_ARCHER add_key "key_nosignedzero key_agrif"
              

              And below are the Cray compiler options I used.

              -em -s integer32 -s real64 -g -O0 -e0 -eZ
              

              Regards,
              Michael

              • Message #62

                I would not call that an improvement, even though there is no longer a seg fault: you completed only 265 time steps because the instability appears earlier.

                You should have in your run directory a snapshot of the variable values for the last time steps (*output.abort*.nc), so you can check with NetCDF tools (CDO, ncview, …) where the instability originates.
                If you don't want to rebuild files, you can produce single files by changing the type setting in iodef.xml:

                <file_definition type="one_file" name="@expname@_@freq@_@startdate@_@enddate@" sync_freq="10d" min_digits="4">
                

                The Agulhas configuration is a demonstrator of the AGRIF nesting capabilities, but it has not been heavily tested because of a lack of scientific interest.

                • Message #63

                  Hi Nicolas,

                  Have you made any progress with your investigation?

                  I've used ncview to view the output.abort*.nc files produced by the Cray-compiled NEMO.
                  There are several variables that have extreme values.

                  For example, the sea surface temperature (isstempe) has a maximum value of 117.2 degrees Celsius.
                  And the minimum value for the net downward heat flux (sohefldo) is -18478.7 W/m2.
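A quick way to triage such values is to compare them against rough physical limits. The sketch below is only an illustration: the limits in PLAUSIBLE_RANGE are assumptions chosen for this example, not values taken from NEMO, and out_of_range is a hypothetical helper. The two reported values are the ones quoted above.

```python
# Illustrative plausibility limits: these bounds are assumptions made for
# this sketch, not values taken from NEMO or from this thread.
PLAUSIBLE_RANGE = {
    "isstempe": (-5.0, 40.0),       # sea surface temperature, degrees Celsius
    "sohefldo": (-2000.0, 2000.0),  # net downward heat flux, W/m2
}

def out_of_range(values):
    """Return the names whose reported value falls outside PLAUSIBLE_RANGE."""
    flagged = []
    for name, value in values.items():
        low, high = PLAUSIBLE_RANGE[name]
        if not (low <= value <= high):
            flagged.append(name)
    return flagged

# Extremes read from the output.abort*.nc files with ncview (see above).
reported = {"isstempe": 117.2, "sohefldo": -18478.7}
print(out_of_range(reported))
```

Both variables fall far outside any reasonable choice of limits, which is consistent with a numerical instability rather than a marginal overshoot.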

                  Unfortunately, the Intel version, which continues the simulation for longer (reaching timestep 2353),
                  seg faults while writing the abort netcdf files. This time a set of 1_output.abort*.nc files is created,
                  one for each MPI process. These files are truncated, however: each one is 19136 bytes in size and
                  contains data for just two variables, nav_lon and nav_lat.

                  Regards,
                  Michael

                  PS Setting the file definition type to one_file in iodef.xml didn't work for me.
                  The simulation produced errors like the following example.

                  > Error [CNc4DataOutput::writeField_(CField* field)] : In file '/work/z01/z01/mrb/codes/nemo/xios-1.0/src/output/nc4_data_output.cpp', line 916 -> On writing field : ice_pres
                  In the context : nemo
                  Error in calling function nc_def_var_deflate(ncid, varId, false, (compressionLevel > 0), compressionLevel)
                  NetCDF: Invalid argument
                  Unable to set the compression level of the variable with id: 2 and compression level: 0
                  
                  • Message #71

                    Hello,

                    Just wondering if anyone is looking at this query as there's been no activity for three weeks.

                    Would it be worthwhile to test again with a different release?
                    I've been using XIOS v1.0 (r703) with the stable 2015 NEMO v3.6.0 (r9301) release, see the top of this thread.

                    As you know, I've been compiling NEMO with the Cray and Intel compilers installed on the ARCHER machine.
                    And I've been running NEMO with and without the agrif key, so that makes 4 different combinations in all.

                    Now, as stated previously, using key_agrif causes the simulation to terminate; that is what this query is about.

                    The Intel build without agrif however does run successfully, so I would have thought the Cray build would also
                    run to completion with agrif switched off. Surprisingly it does not: the Cray run aborts due to the zonal velocity
                    being greater than 20 m/s.

                    kt=   854 max abs(U):   20.93    , i j k:    66   76    5
                    

                    This suggests the problem is not being caused by the agrif feature.
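The four combinations and their outcomes, as reported in this thread, can be laid out as a small table in code. This is just a bookkeeping sketch; the dictionary layout and the helper name fails_without_agrif are hypothetical.

```python
# Outcomes as reported in this thread; None means the run completed.
# Each failing entry is (time step kt, max abs(U) in m/s) from the stpctl line.
RESULTS = {
    ("intel", True): (2353, 151.5),   # also seg faults while writing abort files
    ("cray", True): (265, 29.93),
    ("intel", False): None,           # runs successfully
    ("cray", False): (854, 20.93),
}

def fails_without_agrif(results):
    """Return the compilers whose build aborts even with key_agrif switched off."""
    return sorted(
        compiler
        for (compiler, agrif), outcome in results.items()
        if not agrif and outcome is not None
    )

print(fails_without_agrif(RESULTS))
```

Because at least one build aborts with key_agrif switched off, AGRIF alone cannot account for the velocity blow-up.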

                    Thanks, Michael

                    • Message #76

                      Hello again,

                      Just to let you know, I've been able to run a NEMO AGRIF test case on the ARCHER machine without error.
                      It was rather straightforward in the end: I used one of the SETTE tests, test 16.

                      The code was compiled with the Intel compilers (intel v17.0.0.098).

                      I couldn't compile the code with the Cray compilers (cce v8.5.8) however.
                      An error, "ftn-3178 crayftn: LIMIT in command line" was reported when compiling cyclone.f90.

                      But that's a different problem.
                      This query can be closed.

                      Thanks,
                      Michael

                      • Message #86

                        Thanks, Michael, for trying to get things done for yourself.

                        I would say that the issue is probably still there, because the SETTE test is much shorter than the default run.
                        Regarding the Cray compiler, I am not able to test it, but most of the time a single compilation error points to something specific to the compiler, which can be stricter about syntax.

                        Regards,
                        Nicolas
