[AGRIF] Intel build (with key_agrif) based on ORCA2_LIM seg faults
I'm trying to run Intel-compiled NEMO using key_agrif with a configuration based on ORCA2_LIM.
Unfortunately, the run seg faults after around 25 minutes.
An error is also reported in the 1_ocean.output file; it reads
stpctl: the zonal velocity is larger than 20 m/s
and below that there is a line that reads
kt= 2353 max abs(U): 151.5 , i j k: 157 70 1
Do you think this error is causing the segmentation faults?
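For comparing runs, the diagnostic line under the stpctl error can be pulled out of ocean.output mechanically. The following is a minimal sketch, assuming the line has the layout quoted above (spacing in the real file may differ, so the pattern tolerates variable whitespace); `parse_stpctl` is a hypothetical helper name, not part of NEMO.

```python
import re

# Pattern for the diagnostic line printed under the stpctl error, e.g.
#   kt= 2353 max abs(U): 151.5 , i j k: 157 70 1
# Assumption: the layout matches the line quoted in this thread.
STPCTL_LINE = re.compile(
    r"kt=\s*(\d+)\s+max abs\(U\):\s*([-+0-9.eE]+)\s*,"
    r"\s*i j k:\s*(\d+)\s+(\d+)\s+(\d+)"
)


def parse_stpctl(text):
    """Return (kt, max_abs_u, (i, j, k)) for the first matching line, or None."""
    match = STPCTL_LINE.search(text)
    if match is None:
        return None
    kt, umax, i, j, k = match.groups()
    return int(kt), float(umax), (int(i), int(j), int(k))


print(parse_stpctl("kt= 2353 max abs(U): 151.5 , i j k: 157 70 1"))
```

Applied to the whole text of 1_ocean.output, this returns the time step, the velocity magnitude and the grid indices of the first reported blow-up.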
Below is a full description of my setup. I'm running on ARCHER, a Cray XC-30.
First off, I downloaded these versions of XIOS and NEMO.
svn co -r 703 http://forge.ipsl.jussieu.fr/ioserver/svn/XIOS/branchs/xios-1.0
svn --username 'mbareford' co -r 9301 http://forge.ipsl.jussieu.fr/nemo/svn/branches/2015/nemo_v3_6_STABLE/NEMOGCM
Next, I set up the following module environment on ARCHER.
module swap PrgEnv-cray PrgEnv-intel/5.2.82
module load cray-hdf5-parallel/184.108.40.206
module load cray-netcdf-hdf5parallel/220.127.116.11
module load boost/1.60
The PrgEnv-intel module loads intel/17.0.0.098.
I then compiled XIOS, followed by NEMO. The make command for NEMO is below.
./makenemo -r ORCA2_LIM -n ORCA2_AGRIF_INTEL -m XC_ARCHER_INTEL add_key "key_agrif"
The machine configuration specified by XC_ARCHER_INTEL ensures that the build is linked to XIOS v1.0 libraries. The compile flags are as follows.
-integer-size 32 -real-size 64 -g -O0 -fp-model source -zero -fpp -warn all
After building NEMO, I populated the EXP00 folder with the files archived in ORCA2_LIM_nemo_v3.6.tar.
Finally, I'm running NEMO in attached mode (using_server=false), so the aprun command is as follows.
...
#PBS -l select=4
...
aprun -n 24 -N 6 $NEMO_PATH/nemo.exe
The submission script selects 4 ARCHER nodes; there are 24 cores per node.
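The arithmetic behind that job layout can be checked with a short sketch. It assumes the usual aprun meaning of `-n` (total ranks) and `-N` (ranks per node); `aprun_layout` is a hypothetical helper written for illustration only.

```python
import math


def aprun_layout(total_ranks, ranks_per_node, cores_per_node):
    """Summarise how `aprun -n total_ranks -N ranks_per_node` fills the nodes."""
    if ranks_per_node > cores_per_node:
        raise ValueError("more ranks per node than cores per node")
    # Number of nodes the job needs, rounding up for a partially filled node.
    nodes_needed = math.ceil(total_ranks / ranks_per_node)
    # Cores left without an MPI rank on each fully populated node.
    idle_cores_per_node = cores_per_node - ranks_per_node
    return {
        "nodes_needed": nodes_needed,
        "idle_cores_per_node": idle_cores_per_node,
    }


# Values from the submission script above: 24 ranks, 6 per node, 24 cores per node.
print(aprun_layout(total_ranks=24, ranks_per_node=6, cores_per_node=24))
```

With these values the job needs 4 nodes, which matches `select=4`, and runs 6 ranks on each 24-core node.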
Just a side comment: this is not the first time I have edited your messages to improve the web rendering.
A readable message helps others to identify its different parts and the important information it contains.
nicolasmartin 2018-02-14 12:43 CET
Thank you for editing my last post. I will endeavour to make use of the Trac wiki syntax in future.
I've recompiled NEMO with the -traceback option so as to track the source of the seg faults. The standard output file now contains a stack trace.
forrtl: severe (174): SIGSEGV, segmentation fault occurred
Image      PC                Routine            Line     Source
nemo.exe   00000000020B3DF1  Unknown            Unknown  Unknown
nemo.exe   00000000020B1F2B  Unknown            Unknown  Unknown
nemo.exe   00000000020625A4  Unknown            Unknown  Unknown
nemo.exe   00000000020623B6  Unknown            Unknown  Unknown
nemo.exe   0000000001FEF404  Unknown            Unknown  Unknown
nemo.exe   0000000001FF27B0  Unknown            Unknown  Unknown
nemo.exe   0000000001D33AD0  Unknown            Unknown  Unknown
nemo.exe   0000000000AED2B5  histcom_mp_histwr  2017     histcom.f90
nemo.exe   0000000000AEA9E1  histcom_mp_histw_  1872     histcom.f90
nemo.exe   0000000000AE79C8  histcom_mp_histwr  1676     histcom.f90
nemo.exe   0000000000C3C790  limwri_2_mp_sub_l  215      limwri_2.f90
nemo.exe   0000000000C3B24C  limwri_2_mp_lim_w  143      limwri_2.f90
nemo.exe   000000000082F2F3  diawri_mp_sub_loo  740      diawri.f90
nemo.exe   000000000082D58A  diawri_mp_dia_wri  591      diawri.f90
nemo.exe   0000000000473AFB  step_mp_sub_loop_  508      step.f90
nemo.exe   000000000046D26E  step_mp_stp_       84       step.f90
nemo.exe   0000000000743FC6  agrif_util_mp_agr  570      modutil.f90
nemo.exe   00000000004739B3  step_mp_sub_loop_  494      step.f90
nemo.exe   000000000046D26E  step_mp_stp_       84       step.f90
nemo.exe   00000000004013BD  nemogcm_mp_sub_lo  190      nemogcm.f90
nemo.exe   00000000004011B1  nemogcm_mp_nemo_g  112      nemogcm.f90
nemo.exe   0000000000400D36  MAIN__             22       nemo.f90
nemo.exe   000000000206C6D2  Unknown            Unknown  Unknown
nemo.exe   00000000020CF521  Unknown            Unknown  Unknown
nemo.exe   0000000000400BC9  Unknown            Unknown  Unknown
At lines 507-508 in step.f90 it looks like the code is preparing to abort, possibly in response to the zonal velocity error detected earlier.
IF( indic < 0 ) THEN
   CALL ctl_stop( 'step: indic < 0' )
   CALL dia_wri_state( 'output.abort', kstp )
ENDIF
It's the writing out of the current state that seems to be causing the seg faults.
Thanks, Michael
mbareford 2018-02-14 15:49 CET
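A traceback like the one in the message above can be reduced to the frames that carry source information. This is a minimal sketch, assuming each frame consists of five whitespace-separated fields (image, PC, routine, line, source) as in the forrtl output quoted there; `parse_frames` is a hypothetical helper, not an Intel or NEMO tool.

```python
def parse_frames(trace, image="nemo.exe"):
    """Return (routine, line, source) for every frame with a resolved routine.

    Works on whitespace-separated tokens, so it does not matter whether the
    traceback is spread over several lines or flattened onto one.
    """
    tokens = trace.split()
    frames = []
    idx = 0
    while idx + 4 < len(tokens):
        if tokens[idx] == image:
            _pc, routine, line, source = tokens[idx + 1:idx + 5]
            if routine != "Unknown":
                # The line field may itself be "Unknown" for some frames.
                frames.append((routine, int(line) if line.isdigit() else None, source))
            idx += 5
        else:
            idx += 1
    return frames


sample = """
forrtl: severe (174): SIGSEGV, segmentation fault occurred
Image PC Routine Line Source
nemo.exe 00000000020B3DF1 Unknown Unknown Unknown
nemo.exe 0000000000AED2B5 histcom_mp_histwr 2017 histcom.f90
nemo.exe 0000000000473AFB step_mp_sub_loop_ 508 step.f90
"""
for frame in parse_frames(sample):
    print(frame)
```

The first resolved frame is the innermost call with debug information, here line 2017 of histcom.f90.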
I've just attached a zip file that contains four files.
- NEMOintel.o69100: the standard output file
- EXP00_listing.txt: a listing of all the files in the EXP00 folder.
mbareford 2018-02-14 16:17 CET
The subroutine in histcom.f90 indicated by the seg fault stack trace is called histwrite_real, and the specific line identified (2017) is an array assignment.
tbf_2(1:nx*ny*nz) = tbf_1(kt+1:kt+nx*ny*nz)
Does this ORCA2_LIM AGRIF test fail for you when you run it?
mbareford 2018-02-15 15:50 CET
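The assignment at line 2017 copies nx*ny*nz elements from tbf_1, starting just after offset kt, into the start of tbf_2. One thing that could be checked (an assumption, not a confirmed diagnosis of this seg fault) is whether both slices stay within the allocated sizes of the two buffers. The sketch below expresses that check; `slice_fits` and the buffer sizes passed to it are hypothetical.

```python
def slice_fits(kt, nx, ny, nz, size_src, size_dst):
    """Check the bounds of tbf_2(1:n) = tbf_1(kt+1:kt+n), with n = nx*ny*nz.

    Fortran indices are 1-based, so the source slice is valid when
    kt >= 0 and kt + n <= size_src, and the destination when n <= size_dst.
    Returns (source_ok, destination_ok).
    """
    n = nx * ny * nz
    source_ok = kt >= 0 and kt + n <= size_src
    destination_ok = n <= size_dst
    return source_ok, destination_ok


# Hypothetical sizes chosen only to illustrate the check.
print(slice_fits(kt=0, nx=2, ny=3, nz=4, size_src=24, size_dst=24))
print(slice_fits(kt=1, nx=2, ny=3, nz=4, size_src=24, size_dst=24))
```

Compiling with the compiler's run-time bounds checking enabled would perform the same test on the real arrays.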
Yes, I got exactly the same error at the same time step.
We are going to investigate it and we will let you know.
Nicolas
nicolasmartin 2018-02-15 18:11 CET
Compiling NEMO with the Cray compilers (crayftn v8.5.8 to be precise) does improve the situation.
The code no longer seg faults, but it still aborts once it finds that the zonal velocity is larger than 20 m/s. However, the velocity is not as high as in the Intel case.
kt= 265 max abs(U): 29.93 , i j k: 65 77 6
When using the Cray compiler I had to add the key_nosignedzero key, as shown below.
./makenemo -r ORCA2_LIM -n ORCA2_AGRIF_CRAY -m XC_ARCHER add_key "key_nosignedzero key_agrif"
And below are the Cray compiler options I used.
-em -s integer32 -s real64 -g -O0 -e0 -eZ
Michael
mbareford 2018-02-16 13:08 CET
I would not call that an improvement, even though there is no longer a seg fault: the run completed only 265 time steps because the instability appears earlier.
You should have a snapshot of the variable values for the last time steps (*output.abort*.nc) in your run directory. You can then use NetCDF tools (CDO, ncview, …) to find the origin of the instability.
If you don't want to rebuild files, you can choose to produce single files by changing the type setting in iodef.xml:
<file_definition type="one_file" name="@expname@_@freq@_@startdate@_@enddate@" sync_freq="10d" min_digits="4">
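Once a velocity field from the abort snapshot has been loaded with whatever NetCDF reader is available, the location of the largest magnitude can be compared with the i j k indices reported by stpctl. The sketch below works on plain nested lists and assumes the field is ordered (depth, y, x), which should be verified against the file; `locate_max_abs` is a hypothetical helper. If the snapshot is split into per-process files, the indices found this way are local to each file rather than global.

```python
def locate_max_abs(field):
    """Find the largest |value| in a field stored as field[k][j][i].

    Returns (max_abs, (i, j, k)) with 1-based indices, to match the
    Fortran-style i j k reported in ocean.output. NaN values are skipped
    because comparisons with NaN are always false.
    """
    best = (float("-inf"), None)
    for k, level in enumerate(field, start=1):
        for j, row in enumerate(level, start=1):
            for i, value in enumerate(row, start=1):
                if abs(value) > best[0]:
                    best = (abs(value), (i, j, k))
    return best


# Tiny synthetic field (2 levels, 2 rows, 2 columns), not real model output.
u = [
    [[0.1, -0.2], [0.3, 25.0]],
    [[-30.5, 0.0], [1.0, 2.0]],
]
print(locate_max_abs(u))
```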
The Agulhas configuration is a demonstrator of the AGRIF nesting possibilities, but it has not been heavily tested because of a lack of scientific interest.
nicolasmartin 2018-02-16 16:03 CET