
2009 Stream 1 : User Interface




Stream 1, named "User Interface", is a never-ending stream. In 2008, the main work in this stream involved the ocean forcing interface (introduction of the SBC module + on-the-fly interpolation) and the TOP interface. In 2009, the major expected improvement concerns the model output: the introduction of an easy and efficient way to add/remove model output fields (new IOM for output). Note also the start of two years of work on simplifying the settings of new configurations (CFG = Configuration manager). The actions of the stream are:


S1.1 : IOM for outputs


The defined strategy is built on the use of catalogues that are dynamically created and written once and for all at the end of the step. The temporal mean is no longer performed in the IO library but is done within NEMO. The write itself will use the new IOIPSL module, or a dimg module. There are many improvements associated with this strategy. For example, a call to iom_put (i.e. a write into a catalogue) can be made anywhere in the code. This will greatly improve the code readability and will allow easy output of local variables.
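As a rough illustration of the intended calling convention, the sketch below shows how a routine might hand two diagnostics to the output catalogue. This is a minimal sketch only: the subroutine, its arguments and the field names ('sst', 'toce') are placeholders, the par_kind module name is an assumption, and the exact iom_put interface should be checked against the module in the trunk.

   SUBROUTINE dia_sketch( psst, ptem )
      USE par_kind, ONLY : wp              ! working precision (assumed module name)
      USE iom                              ! provides iom_put, the catalogue write
      REAL(wp), INTENT(in) ::   psst(:,:)     ! any local 2D field
      REAL(wp), INTENT(in) ::   ptem(:,:,:)   ! any local 3D field
      !
      CALL iom_put( 'sst' , psst )   ! 2D diagnostic sent to the catalogue
      CALL iom_put( 'toce', ptem )   ! 3D diagnostic; averaging and writing handled elsewhere
   END SUBROUTINE dia_sketch

Because the call only registers the field against its catalogue entry, such calls can be scattered through the physics routines without any file-handling logic around them.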
Work :

(1) beta version expected in mid-February for the ocean (v3.2)

Status: Added to the trunk and validated (NEMO-Paris, contribution from IPSL), see ticket #387.
Ready for incorporation into v3.2

(2) Add a user-friendly interface (namelist or xml file) + documentation (v3.2)

Status: Added the namelist and XML file to the trunk and validated (NEMO-Paris, contribution from IPSL).
Documentation is missing.

(3) introduce IOM in the other components (LIM, TOP) (v3.2)

Status: Added to the trunk and validated for TOP (NEMO-Paris), see ticket #437.
Ongoing work for LIM2 and LIM3.

(4) assessment of IOM behaviour on MPP computers (v3.3)

Preliminary studies using an ORCA025 configuration and various numbers of IO servers have been carried out with v3.2_beta on the UK's HECToR service (Cray XT4). Tests with a regular 16x16 domain decomposition have been successful when employing 64, 32 or 16 IO servers, but significant increases in elapsed times are evident with fewer than 64 IO servers:

Case: jpnij = jpni*jpnj (all subdomains retained); the buffersize column is global_mpi_buffer_size

# io servers  | # ocean procs | buffersize | success? | runtime (s) | comments
----------------------------------------------------------------------------------------------------
      0       |      256      |     -      |    yes   |    2255     | Base run without IO servers
     64       |      256      |    128     |    yes   |    2318     |
              |               |            |          |             |
     32       |      256      |    128     |     no   |     -       | "Plus de requete disponible !!!!" (no more requests available)
     32       |      256      |    512     |    yes   |    2654     |
              |               |            |          |             |
     16       |      256      |    512     |    yes   |    3782     |
     16       |      256      |    640     |    yes   |    3406     | run with corrected colour assignment (null effect in this case)
     16       |      256      |    640     |    yes   |    2819     | ditto but with only 2 IO servers on each node
     16       |      256      |    640     |    yes   |    2532     | ditto but with only 1 IO server on each node  <---best

Manufacturers such as Cray suggest using the square root of the number of processing elements as a guide for the number of dedicated I/O processes on modern MPP architectures; for the 256-process ORCA025 runs above, that guideline points to 16 IO servers. With this in mind, further investigations have been carried out into the degradation in performance experienced with 16 IO servers. Variations in configuration parameters (such as global_mpi_buffer_size), the placement of IO processes within multi-core processors and environment settings (such as MPICH_UNEX_BUFFER_SIZE) were tested in an attempt to reduce overheads. Limiting the placement to one IO server per quad-core processor and using a buffer size of 1024 produced the best performance. Such findings are likely to be machine-architecture specific.

For the higher resolution models, the ability to discard land-only regions is essential. Preliminary tests with a 16x16 ORCA025 decomposition on 221 processors (thereby discarding 35 land-only regions) have been completed, but the resulting IO server domains are irregularly sized and overlap significantly, leading to difficulties in collating global fields from the resultant files. A solution has been tested which assigns ocean regions to IO servers using their equivalent rank in a decomposition which retains land-only regions, rather than their actual rank in the ocean communicator. Early indications suggest this approach works and early adoption of the replacement algorithm is hoped for.
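The corrected assignment can be pictured with the small sketch below. It is illustrative only and not taken from the branch: it simply shows the idea of deriving an IO-server "colour" from a process's position in the full jpni x jpnj decomposition (land-only subdomains included) rather than from its rank in the ocean communicator, so that each IO server still receives a geographically regular block of subdomains. The variable names follow the usual NEMO decomposition conventions but the formula itself is an assumption.

   PROGRAM colour_sketch
      ! Illustrative only: map each subdomain of a full jpni x jpnj decomposition
      ! onto one of nio IO servers using its rank in the FULL decomposition.
      ! An ocean process that survives land suppression would use the same colour,
      ! keeping the grouping regular even when jpnij /= jpni*jpnj.
      IMPLICIT NONE
      INTEGER :: jpni = 16, jpnj = 16   ! full decomposition (land subdomains included)
      INTEGER :: nio  = 16              ! number of IO servers
      INTEGER :: ii, ij, irank_full, icolour
      DO ij = 1, jpnj
         DO ii = 1, jpni
            irank_full = (ij-1)*jpni + (ii-1)              ! rank if no subdomain were removed
            icolour    = irank_full * nio / (jpni*jpnj)    ! IO server for this subdomain
            IF( ii == 1 ) PRINT '(a,i4,a,i3)', ' full rank ', irank_full, ' -> IO server ', icolour
         END DO
      END DO
   END PROGRAM colour_sketch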

Case: jpnij /= jpni*jpnj (land-only subdomains removed); the buffersize column is global_mpi_buffer_size

# io servers  | # ocean procs | buffersize | success? | time (s)|   comments
-----------------------------------------------------------------------------------------------------
     32       |      221      |    256     |     no   |   -     | "Plus de requete disponible !!!!" (no more requests available)
     32       |      221      |    384     |     yes  |  2781   | Completed but output domains irregularly sized and overlap
     32       |      221      |    1024    |     no   |   -     | Out Of Memory error ("OOM killer terminated this process")
     32       |      221      |    512     |     yes  |  2873   | Also increased MPICH_UNEX_BUFFER_SIZE from 60MB to 128MB
     32       |      221      |    512     |     yes  |  2472   | First run with corrected colour assignment
     32       |      221      |    512     |     yes  |  2407   | run with final version of the corrected colour assignment
              |               |            |          |         |
     16       |      221      |    512     |     yes  |  3593   | run with final version of the corrected colour assignment 
     16       |      221      |    512     |     yes  |  2617   | ditto but with only 1 IO server on each node
     16       |      221      |    640     |     yes  |  2579   | ditto but with increased buffer size
     16       |      221      |    768     |     yes  |  2506   | ditto but with increased buffer size
     16       |      221      |    896     |     yes  |  2432   | ditto but with increased buffer size  
     16       |      221      |    960     |     yes  |  2433   | ditto but with increased buffer size
     16       |      221      |    1024    |     yes  |  2376   | ditto but with increased buffer size  <---- best
     16       |      221      |    1152    |     yes  |  2416   | ditto but with increased buffer size
     16       |      221      |    1280    |     yes  |  2434   | ditto but with increased buffer size

Status: Ongoing (NOCS-Southampton)


S1.2 : NetCDF4 IO options


NetCDF4 offers the opportunity to employ dataset chunking and compression algorithms to greatly reduce the volume of data written out by NEMO without any loss of precision. NetCDF 4.0.1 is the latest full release of the library and has been successfully tested for use in NEMO by NOCS. Even with severe tests, involving ORCA025 configurations with biogeochemical tracers, there are no run-time performance issues of any concern. In fact, on some I/O-limited clusters a speed enhancement is achieved because the benefit of writing less data outweighs the computational cost of in-memory compression. The main benefit, though, is the reduction in file sizes, for example:

File name     | Disk usage for netCDF3.X (MB) | Disk usage for netCDF4.0 (MB) | Reduction factor
*grid T*.nc   |             1500              |              586              |       2.56
*grid U*.nc   |              677              |              335              |       2.02
*grid V*.nc   |              677              |              338              |       2.00
*grid W*.nc   |             3300              |              929              |       3.55
*icemod*.nc   |              208              |              145              |       1.43

Table 11: Effect of chunking and compression on the size of the NEMO 3.0 output files.
For each file name the usage is the sum of all 221 individual processor output files.

The table above has been taken from a longer report produced by a UK Distributed Computational Science and Engineering (dCSE) project supervised by the NOCS-based members of the system team. The full report is available in either PDF or HTML format.

The code changes required to use netCDF4 within NEMO are relatively straightforward and involve minor changes to iom_def.F90, iom.F90, iom_nf90.F90 and restart.F90 in the NEMO code, and to histcom.f90 in IOIPSL. Changes are also required in the relevant makefiles to link in the netCDF4 library (-lnetcdff -lnetcdf) and the HDF5 and compression libraries that underlie netCDF4 (-lhdf5_fortran -lhdf5 -lhdf5_hl -lz). The need to make changes to IOIPSL routines makes it difficult to provide a solution which supports both netCDF3 and netCDF4 (preprocessor keys are not used in IOIPSL). This requires some discussion.

The uptake of netCDF4 in NEMO may also be restricted by slow adoption among third-party software providers. Any compilable code can simply be relinked to the new libraries; existing utilities will then be able to read netCDF4 files (and write netCDF3 files). Code changes are only necessary if new netCDF4 files are to be written. However, commercial packages, such as IDL, will need netCDF4 support provided by the vendor (planned for early 2010). Other packages may fare better: NOCS has been able to compile netCDF4-compatible versions of Ferret and a netCDF4-compatible version of the mexnc toolkit for use with Matlab (versions 2008a onwards). Details are available on request. (NEMO-Southampton)
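For reference, the sketch below shows the kind of netCDF4 calls involved when creating a chunked, deflated dataset through the standard nf90 Fortran interface. It is a standalone illustration, not an extract from the NEMO or IOIPSL branches; the file name, variable name, chunk sizes and deflate level are purely illustrative.

   PROGRAM nc4_sketch
      USE netcdf                      ! netCDF Fortran 90 interface, built against netCDF4/HDF5
      IMPLICIT NONE
      INTEGER :: ncid, idx, idy, idv, ierr
      REAL    :: zfield(100,100)
      zfield = 1.0
      ! request an HDF5-based (netCDF4) file rather than a classic netCDF3 one
      ierr = nf90_create( 'example.nc', NF90_NETCDF4, ncid )
      ierr = nf90_def_dim( ncid, 'x', 100, idx )
      ierr = nf90_def_dim( ncid, 'y', 100, idy )
      ierr = nf90_def_var( ncid, 'tn', NF90_FLOAT, (/ idx, idy /), idv )
      ! chunking and lossless deflation: the two facilities netCDF3 cannot offer
      ierr = nf90_def_var_chunking( ncid, idv, NF90_CHUNKED, (/ 50, 50 /) )
      ierr = nf90_def_var_deflate ( ncid, idv, 1, 1, 1 )   ! shuffle, deflate, level 1
      ierr = nf90_enddef( ncid )
      ierr = nf90_put_var( ncid, idv, zfield )
      ierr = nf90_close( ncid )
   END PROGRAM nc4_sketch

A low deflate level is normally sufficient; higher levels cost noticeably more CPU time for only a modest further reduction in file size.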

Status : Proof of concept successfully completed and report published. Code change branch maintained at NOCS and kept compatible with the trunk head revision. Ongoing investigation into knock-on effects on post-processing software.


S1.3 : Configuration manager


Create the configuration tools (CFG-tools): the user-friendly interface, the tools themselves, and the documentation associated with the creation of a new model configuration, especially configurations defined as a zoom of an ORCA configuration. The tools include the generation of a grid, a bathymetry, an initial state, a forcing data set, and open boundary conditions. Manpower to build the tool will be requested within the MyOcean FP7 European project (MCS).
Work :

(1) collect the existing tools (OPABAT, ROMS-tools, etc.)

(2) Define the structure and implement a first version of the CFG-tools. (v3.3)

Status : beginning August 1st, within MyOcean (Brice Lemaire's position)


S1.4 : TOP-interfaces


Interface for other bio-models in TOP

Status : probably postponed for one year (==> 2010).


S1.5 : Reference manuals


LIM and TOP (TRP) paper documentation using LaTeX (similar to the NEMO ocean engine documentation).
Work :

(1) Create the TRP documentation (2 w) (v3.3)

(2) participate in the writing of the LIM3 documentation produced at UCL (1 w) (v3.3)

(3) review of the TRP and LIM3 documentation (NOCS team, 2 w) (v3.3)

Status : expected to start during the second half of 2009.


S1.6 : Reference configurations


Add new standard configurations to be used for NEMO tutorials, developer validation and benchmarking, in order to illustrate the potential of the existing system.
Work :

(1) illustration of the on-the-fly interpolation (v3.3)

(2) illustration of the different types of vertical coordinates (v3.3)

(3) illustration of AGRIF zoom with sea-ice (v3.3)

(4) illustration of off-line tracer computation (v3.3)

Status : expected to start during the second half of 2009.