[AGRIF] Intel build (with key_agrif) based on ORCA2_LIM seg faults
I'm trying to run Intel-compiled NEMO using key_agrif with a configuration based on ORCA2_LIM.
Unfortunately, the run seg faults after around 25 minutes.
An error is also reported in the 1_ocean.output file; it reads
stpctl: the zonal velocity is larger than 20 m/s
and below that there is a line that reads
kt= 2353 max abs(U): 151.5 , i j k: 157 70 1
Do you think this error is causing the segmentation faults?
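For anyone comparing runs, the report line above can be pulled out of the ocean.output file with a small script. This is only a sketch: the pattern is inferred from the single line quoted in this post, and the names `STPCTL_RE` and `parse_stpctl` are made up for illustration, so other NEMO versions may need a different pattern.

```python
import re

# Pattern inferred from the line quoted in this thread (an assumption,
# not taken from the NEMO source).
STPCTL_RE = re.compile(
    r"kt=\s*(\d+)\s+max abs\(U\):\s*([0-9.Ee+-]+)\s*,\s*i j k:\s*(\d+)\s+(\d+)\s+(\d+)"
)


def parse_stpctl(line):
    """Return (kt, max_u, i, j, k) from a stpctl velocity line, or None."""
    match = STPCTL_RE.search(line)
    if match is None:
        return None
    kt, max_u, i, j, k = match.groups()
    return int(kt), float(max_u), int(i), int(j), int(k)


print(parse_stpctl("kt= 2353 max abs(U): 151.5 , i j k: 157 70 1"))
```

The returned time step and grid indices make it easy to see whether different builds blow up at the same place.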
Below is a full description of my setup. I'm running on ARCHER, a Cray XC-30.
First off, I downloaded these versions of XIOS and NEMO.
svn co -r 703 http://forge.ipsl.jussieu.fr/ioserver/svn/XIOS/branchs/xios-1.0
svn --username 'mbareford' co -r 9301 http://forge.ipsl.jussieu.fr/nemo/svn/branches/2015/nemo_v3_6_STABLE/NEMOGCM
Next, I set up the following module environment on ARCHER.
module swap PrgEnv-cray PrgEnv-intel/5.2.82
module load cray-hdf5-parallel/126.96.36.199
module load cray-netcdf-hdf5parallel/188.8.131.52
module load boost/1.60
The PrgEnv-intel module loads intel/17.0.0.098.
I then compiled XIOS followed by NEMO; the make command for NEMO is below.
./makenemo -r ORCA2_LIM -n ORCA2_AGRIF_INTEL -m XC_ARCHER_INTEL add_key "key_agrif"
The machine configuration specified by XC_ARCHER_INTEL ensures that the build is linked to XIOS v1.0 libraries. The compile flags are as follows.
-integer-size 32 -real-size 64 -g -O0 -fp-model source -zero -fpp -warn all
After building NEMO, I populated the EXP00 folder with the files archived in ORCA2_LIM_nemo_v3.6.tar.
Finally, I'm running NEMO in attached mode (using_server=false), so the aprun command is as follows.
...
#PBS -l select=4
...
aprun -n 24 -N 6 $NEMO_PATH/nemo.exe
The submission script selects 4 ARCHER nodes; there are 24 cores per node.
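As a quick sanity check on that placement, here is a sketch of the arithmetic behind the aprun request, where -n is the total number of processes and -N is the number per node. The helper name `aprun_layout` is hypothetical.

```python
def aprun_layout(total_ranks, ranks_per_node, cores_per_node):
    """Return (nodes_needed, idle_cores_per_node) for an aprun -n/-N request."""
    # Ceiling division: a partly filled node still counts as a whole node.
    nodes_needed = -(-total_ranks // ranks_per_node)
    idle_cores = cores_per_node - ranks_per_node
    return nodes_needed, idle_cores


# aprun -n 24 -N 6 on nodes with 24 cores each
print(aprun_layout(24, 6, 24))  # (4, 18)
```

So the 24 MPI processes fill the 4 selected nodes at 6 per node, leaving 18 cores idle on each node.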
Just a side comment: this is not the first time I have edited your messages in order to improve the web rendering.
A readable message makes it easier to identify its different parts and the important information it contains.
nicolasmartin 2018-02-14 12:43 CET (3 months ago)
Thank you for editing my last post. I will endeavour to make use of the Trac wiki syntax in future.
I've recompiled NEMO with the -traceback option so as to track the source of the seg faults. The standard output file now contains a stack trace.
forrtl: severe (174): SIGSEGV, segmentation fault occurred
Image     PC                Routine            Line     Source
nemo.exe  00000000020B3DF1  Unknown            Unknown  Unknown
nemo.exe  00000000020B1F2B  Unknown            Unknown  Unknown
nemo.exe  00000000020625A4  Unknown            Unknown  Unknown
nemo.exe  00000000020623B6  Unknown            Unknown  Unknown
nemo.exe  0000000001FEF404  Unknown            Unknown  Unknown
nemo.exe  0000000001FF27B0  Unknown            Unknown  Unknown
nemo.exe  0000000001D33AD0  Unknown            Unknown  Unknown
nemo.exe  0000000000AED2B5  histcom_mp_histwr  2017     histcom.f90
nemo.exe  0000000000AEA9E1  histcom_mp_histw_  1872     histcom.f90
nemo.exe  0000000000AE79C8  histcom_mp_histwr  1676     histcom.f90
nemo.exe  0000000000C3C790  limwri_2_mp_sub_l  215      limwri_2.f90
nemo.exe  0000000000C3B24C  limwri_2_mp_lim_w  143      limwri_2.f90
nemo.exe  000000000082F2F3  diawri_mp_sub_loo  740      diawri.f90
nemo.exe  000000000082D58A  diawri_mp_dia_wri  591      diawri.f90
nemo.exe  0000000000473AFB  step_mp_sub_loop_  508      step.f90
nemo.exe  000000000046D26E  step_mp_stp_       84       step.f90
nemo.exe  0000000000743FC6  agrif_util_mp_agr  570      modutil.f90
nemo.exe  00000000004739B3  step_mp_sub_loop_  494      step.f90
nemo.exe  000000000046D26E  step_mp_stp_       84       step.f90
nemo.exe  00000000004013BD  nemogcm_mp_sub_lo  190      nemogcm.f90
nemo.exe  00000000004011B1  nemogcm_mp_nemo_g  112      nemogcm.f90
nemo.exe  0000000000400D36  MAIN__             22       nemo.f90
nemo.exe  000000000206C6D2  Unknown            Unknown  Unknown
nemo.exe  00000000020CF521  Unknown            Unknown  Unknown
nemo.exe  0000000000400BC9  Unknown            Unknown  Unknown
At lines 507-508 in step.f90 it looks like the code is preparing to abort, possibly in response to the zonal velocity error detected earlier.
IF( indic < 0 ) THEN
   CALL ctl_stop( 'step: indic < 0' )
   CALL dia_wri_state( 'output.abort', kstp )
ENDIF
It's the writing out of the current state that seems to be causing the seg faults.
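To pick the innermost frame that carries debug information out of a trace like this, something along the following lines may help. It is a sketch that assumes each frame is a whitespace-separated row of five columns (image, PC, routine, line, source), as in the trace above; the function name `first_known_frame` is made up for illustration.

```python
def first_known_frame(trace_lines):
    """Return (routine, line, source) of the innermost frame with debug info."""
    for text in trace_lines:
        fields = text.split()
        # Expected columns: image, PC, routine, line, source.
        if len(fields) != 5 or fields[2] == "Unknown":
            continue
        # Skip the column header row, whose "Line" entry is not a number.
        if not fields[3].isdigit():
            continue
        return fields[2], int(fields[3]), fields[4]
    return None


trace = [
    "Image PC Routine Line Source",
    "nemo.exe 0000000001D33AD0 Unknown Unknown Unknown",
    "nemo.exe 0000000000AED2B5 histcom_mp_histwr 2017 histcom.f90",
    "nemo.exe 0000000000AEA9E1 histcom_mp_histw_ 1872 histcom.f90",
]
print(first_known_frame(trace))
```

For the trace above this points at line 2017 of histcom.f90, the deepest NEMO frame before the unnamed runtime frames.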
Thanks, Michael
mbareford 2018-02-14 15:49 CET (3 months ago)
I've just attached a zip file that contains four files.
- NEMOintel.o69100: the standard output file
- EXP00_listing.txt: a listing of all the files in the EXP00 folder.
mbareford 2018-02-14 16:17 CET (3 months ago)
The subroutine in histcom.f90 indicated by the seg fault stack trace is called histwrite_real, and the specific line identified (2017) is an array assignment.
tbf_2(1:nx*ny*nz) = tbf_1(kt+1:kt+nx*ny*nz)
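One thing that could be checked for an assignment of this shape is whether both slices stay inside their arrays. Whether that is what goes wrong here has not been established; the sketch below only spells out the condition. The function `slice_fits` and the sizes in the example are hypothetical.

```python
def slice_fits(src_len, dst_len, kt, nx, ny, nz):
    """Check that tbf_2(1:n) = tbf_1(kt+1:kt+n) stays in bounds, n = nx*ny*nz."""
    n = nx * ny * nz
    # Source slice tbf_1(kt+1 : kt+n), using Fortran 1-based indexing.
    src_ok = kt >= 0 and kt + n <= src_len
    # Destination slice tbf_2(1 : n).
    dst_ok = n <= dst_len
    return src_ok and dst_ok


# Hypothetical sizes: a 3 x 4 x 5 block copied from offset 40 of a 100-element buffer.
print(slice_fits(100, 60, 40, 3, 4, 5))  # True
print(slice_fits(100, 60, 41, 3, 4, 5))  # False: the source slice overruns by one
```

Printing the actual array sizes, nx, ny, nz and kt just before line 2017 would show whether this condition holds in the failing run.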
Does this ORCA2_LIM agrif test fail for you when you run it?
mbareford 2018-02-15 15:50 CET (3 months ago)
Yes, I got exactly the same error and time step.
We are going to investigate it and we will let you know.
Nicolas
nicolasmartin 2018-02-15 18:11 CET (3 months ago)
Compiling NEMO with the Cray compilers (crayftn v8.5.8 to be precise) does improve the situation.
The code no longer seg faults, but it still aborts once it finds that the zonal velocity is larger than 20 m/s. However, the velocity is not as high as in the Intel case.
kt= 265 max abs(U): 29.93 , i j k: 65 77 6
When using the Cray compiler I had to add the key_nosignedzero key, see below.
./makenemo -r ORCA2_LIM -n ORCA2_AGRIF_CRAY -m XC_ARCHER add_key "key_nosignedzero key_agrif"
And below are the Cray compiler options I used.
-em -s integer32 -s real64 -g -O0 -e0 -eZ
Michael
mbareford 2018-02-16 13:08 CET (3 months ago)
I would not call that an improvement, even though there is no longer a seg fault: you completed only 265 time steps because the instability appears earlier.
You should have in your computing directory a snapshot of the variable values for the last time steps (*output.abort*.nc), which you can inspect with NetCDF tools (CDO, ncview, …) to find the origin of the instability.
If you don't want to rebuild files, you can choose to produce single files by changing the type setting in iodef.xml:
<file_definition type="one_file" name="@expname@_@freq@_@startdate@_@enddate@" sync_freq="10d" min_digits="4">
The Agulhas configuration is a demonstrator of the AGRIF nesting possibilities, but it has not been heavily tested because of a lack of scientific interest.
nicolasmartin 2018-02-16 16:03 CET (3 months ago)
Have you made any progress with your investigation?
I've used ncview to view the output.abort*.nc files produced by the Cray-compiled NEMO.
There are several variables that have extreme values.
For example, the sea surface temperature (isstempe) has a maximum value of 117.2 degrees Celsius.
And the minimum value for the net downward heat flux (sohefldo) is -18478.7 W/m2.
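A simple range check makes values like these stand out when scanning many variables. The sketch below uses only the two extrema quoted above; the plausibility bounds are my own illustrative assumptions, not NEMO defaults, and `flag_extremes` is a made-up helper name.

```python
# Illustrative plausibility bounds (assumptions, not values taken from NEMO).
PLAUSIBLE = {
    "isstempe": (-5.0, 40.0),       # sea surface temperature, degrees Celsius
    "sohefldo": (-2000.0, 2000.0),  # net downward heat flux, W/m2
}


def flag_extremes(extrema):
    """Given {name: (min_or_None, max_or_None)}, list names outside the bounds."""
    flagged = []
    for name, (vmin, vmax) in extrema.items():
        low, high = PLAUSIBLE[name]
        too_low = vmin is not None and vmin < low
        too_high = vmax is not None and vmax > high
        if too_low or too_high:
            flagged.append(name)
    return flagged


# Only the extrema reported in this post are filled in; the rest are left as None.
print(flag_extremes({"isstempe": (None, 117.2), "sohefldo": (-18478.7, None)}))
```

Both variables are flagged under these bounds, which is consistent with the fields already being badly wrong by the time the abort files are written.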
Unfortunately, the Intel version, which continues the simulation for longer (reaching time step 2353), seg faults while writing the abort NetCDF files. This time a set of 1_output.abort*.nc files is created, one for each MPI process. These files are truncated, however: each one is 19136 bytes in size and contains data for just two variables, nav_lon and nav_lat.
PS Setting the file definition type to one_file in iodef.xml didn't work for me.
The simulation produced errors like the following example.
> Error [CNc4DataOutput::writeField_(CField* field)] : In file '/work/z01/z01/mrb/codes/nemo/xios-1.0/src/output/nc4_data_output.cpp', line 916 -> On writing field : ice_pres
In the context : nemo
Error in calling function nc_def_var_deflate(ncid, varId, false, (compressionLevel > 0), compressionLevel)
NetCDF: Invalid argument
Unable to set the compression level of the variable with id: 2 and compression level: 0
mbareford 2018-03-05 17:15 CET (3 months ago)
Just wondering if anyone is looking at this query as there's been no activity for three weeks.
As you know, I've been compiling NEMO with the Cray and Intel compilers installed on the ARCHER machine.
And I've been running NEMO with and without the agrif key, so that makes 4 different combinations in all.
Now, as stated previously, using key_agrif causes the simulation to terminate; that's what this query is about.
The Intel build without agrif however does run successfully, so I would have thought the Cray build would also
run to completion with agrif switched off. Surprisingly it does not: the Cray run aborts due to the zonal velocity
being greater than 20 m/s.
kt= 854 max abs(U): 20.93 , i j k: 66 76 5
This suggests the problem is not being caused by the agrif feature.
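To keep the four combinations straight, the outcomes reported so far in this thread can be tabulated as below. The numbers are copied from the earlier posts; the dictionary layout and the `summarise` helper are just one hypothetical way of arranging them.

```python
from itertools import product

# Outcomes reported in this thread: (kt at abort, max abs(U) in m/s),
# or None where the run completed.
results = {
    ("intel", True): (2353, 151.5),
    ("cray", True): (265, 29.93),
    ("intel", False): None,
    ("cray", False): (854, 20.93),
}


def summarise(results):
    """Return one summary line per (compiler, agrif) combination."""
    lines = []
    for compiler, agrif in product(("intel", "cray"), (True, False)):
        outcome = results[(compiler, agrif)]
        key = "key_agrif" if agrif else "no agrif"
        if outcome is None:
            lines.append(f"{compiler:5s} {key:9s} completed")
        else:
            kt, umax = outcome
            lines.append(f"{compiler:5s} {key:9s} aborted at kt={kt}, max abs(U)={umax}")
    return lines


for line in summarise(results):
    print(line)
```

Only the Intel build without agrif completes; the Intel build with agrif additionally seg faults while writing the abort state.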
Thanks, Michael
mbareford 2018-03-27 14:17 CEST (2 months ago)
Just to let you know, I've been able to run without error a NEMO AGRIF test case on the ARCHER machine.
It was rather straightforward in the end: I used one of the SETTE tests, test 16.
The code was compiled with the Intel compilers (intel v17.0.0.098).
I couldn't compile the code with the Cray compilers (cce v8.5.8) however.
An error, "ftn-3178 crayftn: LIMIT in command line", was reported when compiling cyclone.f90.
But that's a different problem.
This query can be closed.
Michael
mbareford 2018-04-10 11:38 CEST (7 weeks ago)