Opened 3 years ago
Last modified 3 years ago
#2653 assigned Defect
Namelist quirks with Cray compilers (trunk only)
Reported by: | acc | Owned by: | acc |
---|---|---|---|
Priority: | low | Milestone: | |
Component: | TOP | Version: | trunk |
Severity: | minor | Keywords: | |
Cc: |
Description
Context
The behaviour of internal file based namelists differs in codes compiled with or without MPI using recent Cray compilers. Expected behaviour can be obtained if, and only if, every .true. or .false. stand-alone assignment is followed by a comma. These commas should not be necessary and are not necessary with intel or gfortran compilers (or even Crayftn without MPI??). This has been reported to Cray. Meanwhile, some simple scripts (see below) can be used to add commas. Details follow, including a simple program that reproduces the behaviour.
Analysis
The background
The current development version of NEMO uses internal files for its namelists. This is to avoid filesystem accesses; instead rank zero reads the entire namelist, strips comments and broadcasts the entire character buffer to everyone else. This is working on a variety of systems (including ARCHER).
On Archer2, I've been encountering errors which the model assumes are misspelt variables in the namelist but the input namelists are fine. It turns out I can replicate these errors with a simple namelist; but, strangely, only in an MPI environment.
The example program
The program: errornml_mpi.F90 (listed below) defines a simple namelist (as a character string) and broadcasts this to other ranks. The program can also be compiled without mpi. Two different namelists can be used, decided by a command-line argument: g (for good) or b (for bad). The two namelists differ only by the
addition of commas immediately after .true. or .false. settings. This is my work-around because it works with the commas but not without them. Commas should not be necessary (and aren't necessary for all the real, integer and character variables in the full namelist).
Currently Loaded compiler:
ftn -V
Cray Fortran : Version 10.0.4
In non-mpi mode, I find no difference in behaviour between the two namelists:
ftn -o errornml -O1 errornml_mpi.F90 ./errornml g The whole internal file: &namctl sn_cfctl%l_trcstat=.true., sn_cfctl%l_runstat=.true., / ---- namctl in namelist_cfg: &NAMCTL SN_CFCTL = T, T, F, F, F, F, F, 0, 1000000, 1, 1, LN_TIMING = F, LN_DIACFL = F, NN_ISPLT = 0, NN_JSPLT = 0, NN_ICTLS = 0, NN_ICTLE = 0, NN_JCTLS = 0, NN_JCTLE = 0 / ---- STOP ./errornml b The whole internal file: &namctl sn_cfctl%l_trcstat=.true. sn_cfctl%l_runstat=.true. / ---- namctl in namelist_cfg: &NAMCTL SN_CFCTL = T, T, F, F, F, F, F, 0, 1000000, 1, 1, LN_TIMING = F, LN_DIACFL = F, NN_ISPLT = 0, NN_JSPLT = 0, NN_ICTLS = 0, NN_ICTLE = 0, NN_JCTLS = 0, NN_JCTLE = 0 / ---- STOP
With MPI and srun (even with just 1 rank) the behaviour changes:
ftn -o errornml_mpi -DUSE_MPI -O1 errornml_mpi.F90 srun -n 1 ./errornml_mpi g The whole internal file: &namctl sn_cfctl%l_trcstat=.true., sn_cfctl%l_runstat=.true., / ---- namctl in namelist_cfg: &NAMCTL SN_CFCTL = T, T, F, F, F, F, F, 0, 1000000, 1, 1, LN_TIMING = F, LN_DIACFL = F, NN_ISPLT = 0, NN_JSPLT = 0, NN_ICTLS = 0, NN_ICTLE = 0, NN_JCTLS = 0, NN_JCTLE = 0 / ---- STOP srun -n 1 ./errornml_mpi b The whole internal file: &namctl sn_cfctl%l_trcstat=.true. sn_cfctl%l_runstat=.true. / ---- namctl in namelist_cfg: Not found or bad Not found or bad MPICH ERROR [Rank 0] [job id 42135.24] [Sat Nov 7 13:28:42 2020] [unknown] [nid001114] - Abort(25) (rank 0 in comm 0): application called MPI_Abort(MPI_COMM_WORLD, 25) - process 0
and just to prove the MPI is working:
srun -n 2 ./errornml_mpi g The whole internal file: &namctl sn_cfctl%l_trcstat=.true., sn_cfctl%l_runstat=.true., / ---- namctl in namelist_cfg: &NAMCTL SN_CFCTL = T, T, F, F, F, F, F, 0, 1000000, 1, 1, LN_TIMING = F, LN_DIACFL = F, NN_ISPLT = 0, NN_JSPLT = 0, NN_ICTLS = 0, NN_ICTLE = 0, NN_JCTLS = 0, NN_JCTLE = 0 / ---- STOP ---- namctl in namelist_cfg: 1 The whole internal file: &namctl sn_cfctl%l_trcstat=.true., sn_cfctl%l_runstat=.true., / ---- &NAMCTL SN_CFCTL = T, T, F, F, F, F, F, 0, 1000000, 1, 1, LN_TIMING = F, LN_DIACFL = F, NN_ISPLT = 0, NN_JSPLT = 0, NN_ICTLS = 0, NN_ICTLE = 0, NN_JCTLS = 0, NN_JCTLE = 0 / ---- STOP
I can't explain this so I'm assuming it is a compiler bug but it is odd that it seems impossible to replicate without the mpi environment.
The work-around is to add commas (see scripts below) but it is tedious to have to maintain different namelists for different systems and Cray compilers are alone in this behaviour. The actual namelists are some 38k characters but it is only logical variables that need these commas appended.
Recommendation
Wait for Cray to fix their compiler. Meanwhile, run the edit_nmls script in the directory above tests and cfgs on each new checkout before building any configurations with Cray compilers. reverse_edit_nmls should reverse changes before any check-ins. Permanently adding the commas is an option but it is difficult to ensure they are maintained.
errornml_mpi.F90
PROGRAM errornml #if defined USE_MPI USE MPI #endif CHARACTER(LEN=:), ALLOCATABLE :: cdnambuff CHARACTER(LEN=100) :: this_works='&namctl sn_cfctl%l_trcstat=.true., sn_cfctl%l_runstat=.true., /' CHARACTER(LEN=100) :: this_breaks='&namctl sn_cfctl%l_trcstat=.true. sn_cfctl%l_runstat=.true. /' CHARACTER(LEN=100) :: use_this INTEGER :: kout = 6 LOGICAL :: ldwp =.FALSE. !: .true. only for the root broadcaster INTEGER :: itot, ierr, mpprank, mppsize TYPE :: sn_ctl LOGICAL :: l_runstat = .FALSE. LOGICAL :: l_trcstat = .FALSE. LOGICAL :: l_oceout = .FALSE. LOGICAL :: l_layout = .FALSE. LOGICAL :: l_prtctl = .FALSE. LOGICAL :: l_prttrc = .FALSE. LOGICAL :: l_oasout = .FALSE. INTEGER :: procmin = 0 INTEGER :: procmax = 1000000 INTEGER :: procincr = 1 INTEGER :: ptimincr = 1 END TYPE TYPE(sn_ctl), SAVE :: sn_cfctl LOGICAL :: ln_timing LOGICAL :: ln_diacfl INTEGER :: nn_ictls INTEGER :: nn_ictle INTEGER :: nn_jctls INTEGER :: nn_jctle INTEGER :: nn_isplt INTEGER :: nn_jsplt NAMELIST/namctl/ sn_cfctl, ln_timing, ln_diacfl, & & nn_isplt, nn_jsplt, nn_ictls, nn_ictle, nn_jctls, nn_jctle character(len=32) :: arg integer :: iarg use_this=this_works do iarg = 1, command_argument_count() call get_command_argument(iarg, arg) select case (arg) case ('g', 'good') use_this = this_works case ('b', 'bad') use_this = this_breaks end select end do ! # if defined USE_MPI CALL mpi_init( ierr ) CALL mpi_comm_size( MPI_COMM_WORLD, mppsize, ierr ) CALL mpi_comm_rank( MPI_COMM_WORLD, mpprank, ierr ) if( mpprank .eq. 0 ) ldwp = .true. if(ldwp) THEN #endif ! ! Test the contents of the internal file ! ! Write it all out: ! hand-crafted substitute for the load_nml subroutine which reads whole namelists and ! condenses them into a character string itot=len_trim(use_this) IF ( .NOT. ALLOCATED(cdnambuff) ) ALLOCATE( CHARACTER(LEN=itot) :: cdnambuff ) cdnambuff=TRIM(use_this) ! write(6,'(a)') 'The whole internal file: ' write(6,'(32A)') cdnambuff write(6,'(a)')'----' # if defined USE_MPI call mpp_bcast_nml( cdnambuff , itot ) #endif write(6,'(a)') 'namctl in namelist_cfg: ' read(cdnambuff, namctl, iostat=ios, end=99, err=99) write(6,namctl) write(6,'(a)')'----' goto 101 ! 99 write(6,'(a)') 'Not found or bad ',ios goto 98 ! 101 CONTINUE # if defined USE_MPI ELSE call mpp_bcast_nml( cdnambuff , itot ) write(*,'(a)')'----' write(*,'(a,i5)') 'namctl in namelist_cfg: ',mpprank write(6,'(a)') 'The whole internal file: ' write(6,'(32A)') cdnambuff write(6,'(a)')'----' read(cdnambuff, namctl, iostat=ios, end=98, err=98) write(*,namctl) write(*,'(a)')'----' ENDIF call MPI_BARRIER(MPI_COMM_WORLD, ierr) call MPI_FINALIZE(ierr) #endif STOP 98 write(6,'(a)') 'Not found or bad ',ios # if defined USE_MPI CALL MPI_ABORT(MPI_COMM_WORLD, ios, ierr) CONTAINS SUBROUTINE mpp_bcast_nml( cdnambuff , kleng ) CHARACTER(LEN=:) , ALLOCATABLE, INTENT(INOUT) :: cdnambuff INTEGER , INTENT(INOUT) :: kleng !!---------------------------------------------------------------------- !! *** routine mpp_bcast_nml *** !! !! ** Purpose : broadcast namelist character buffer !! !!---------------------------------------------------------------------- !! INTEGER :: iflag !!---------------------------------------------------------------------- ! call MPI_BCAST(kleng, 1, MPI_INT, 0, MPI_COMM_WORLD, iflag) call MPI_BARRIER(MPI_COMM_WORLD, iflag) IF ( .NOT. ALLOCATED(cdnambuff) ) ALLOCATE( CHARACTER(LEN=kleng) :: cdnambuff ) call MPI_BCAST(cdnambuff, kleng, MPI_CHARACTER, 0, MPI_COMM_WORLD, iflag) call MPI_BARRIER(MPI_COMM_WORLD, iflag) ! END SUBROUTINE mpp_bcast_nml #endif END PROGRAM errornml
edit_nmls (better suggestions are welcome):
#!/bin/bash for d in cfgs tests do cd $d for nml in `find ./ -name '*namelist_*'` do chk=$(grep -c -i -e"= *.false.," -e"= *.true.," $nml) if test $chk -eq 0 ; then echo "Changing : "$nml ed - $nml << EOF %s/=\( *\).false./=\1.false.,/ w q EOF ed - $nml << EOF %s/=\( *\).FALSE./=\1.FALSE.,/ w q EOF ed - $nml << EOF %s/=\( *\).true./=\1.true.,/ w q EOF ed - $nml << EOF %s/=\( *\).TRUE./=\1.TRUE.,/ w q EOF else echo $nml " may have already been processed: "$chk" lines already correct" fi done cd ../ done
reverse_edit_nmls:
cat reverse_edit_nmls #!/bin/bash for d in cfgs tests do cd $d for nml in `find ./ -name '*namelist_*'` do chk=$(grep -c -i -e"= *.false.," -e"= *.true.," $nml) if test $chk -ne 0 ; then echo "Changing : "$nml ed - $nml << EOF %s/=\( *\).false.,/=\1.false./ w q EOF ed - $nml << EOF %s/=\( *\).FALSE.,/=\1.FALSE./ w q EOF ed - $nml << EOF %s/=\( *\).true.,/=\1.true./ w q EOF ed - $nml << EOF %s/=\( *\).TRUE.,/=\1.TRUE./ w q EOF else echo $nml " may have already been processed: "$chk" lines already reverted" fi done cd ../ done
Commit History (0)
(No commits)
Change History (2)
comment:1 Changed 3 years ago by acc
- Owner set to acc
- Status changed from new to assigned
- Version changed from v4.0.* to trunk
The solution works for crayftn 8.7.7. The implementation for MetOffce? configuration will be documented in ticket #2655