Opened 5 months ago

Last modified 3 months ago

#196 new defect

XIOS3 not recognising matching dimensions

Reported by: acc Owned by: ymipsl
Priority: major Component: XIOS
Version: 3.0 Keywords:
Cc:

Description

I'm having issues trying to create output containing fields with different source grids that map to the same global grid. In particular, this is a global NEMO grid collated from MPP domains, where some fields supply the full MPP domain data and others supply only the inner (haloes removed) section. These grids are defined with the same name attribute:

  <grid id="grid_U_3D" >
    <domain domain_ref="grid_U" />
    <axis axis_ref="depthu" />
  </grid>
  <grid id="grid_U_3D_inner" >
    <domain domain_ref="grid_U_inner" name="grid_U" /> <!-- use name="grid_U" so we don't duplicate x, y, dimensions -->
    <axis axis_ref="depthu" />
  </grid>

but trying to write a file containing a mix of variables such as:

  <field_group id="grid_U"   grid_ref="grid_U_2D">

    <field id="e3u"    long_name="U-cell thickness"             standard_name="cell_thickness"       unit="m"      grid_ref="grid_U_3D_inner"   />
 
    <field id="uoce"   long_name="ocean current along i-axis"   standard_name="sea_water_x_velocity" unit="m/s"    grid_ref="grid_U_3D"   />
.
.
        <file id="file12" name_suffix="_grid_U" mode="write" gatherer="ugatherer" writer="uwriter" using_server2="true" description="ocean U grid variables" >
          <field field_ref="e3u" />
          <field field_ref="uoce"         name="uo"   />
.
.

results in:

cat xios_server_09.err
In file "nc4_data_output.cpp", function "void xios::CNc4DataOutput::writeDomain_(xios::CDomain *)",  line 476 -> On writing the domain : Opool__ugatherer_0__nemo__domain_undef_id_1
In the context : Opool__uwriter_0__nemo
Error when calling function nc_def_dim(ncid, dimName.c_str(), dimLen, &dimId)
NetCDF: String match to name in use
Unable to create dimension with name: x and with length 180

180 is the correct size for the global grid, but the two-level servers in play (ugatherer and uwriter) do not seem to recognise that the dimensions have already been defined. The contents of the incomplete output file are:

ncdump -h O2L3P_LONG_5d_00010101_00010303_grid_U.nc
netcdf O2L3P_LONG_5d_00010101_00010303_grid_U {
dimensions:
	axis_nbounds = 2 ;
	x = 180 ;
	y = 148 ;
	depthu = 31 ;
variables:
	float nav_lat(y, x) ;
		nav_lat:standard_name = "latitude" ;
		nav_lat:long_name = "Latitude" ;
		nav_lat:units = "degrees_north" ;
	float nav_lon(y, x) ;
		nav_lon:standard_name = "longitude" ;
		nav_lon:long_name = "Longitude" ;
		nav_lon:units = "degrees_east" ;
	float depthu(depthu) ;
		depthu:name = "depthu" ;
		depthu:long_name = "Vertical U levels" ;
		depthu:units = "m" ;
		depthu:positive = "down" ;
		depthu:bounds = "depthu_bounds" ;
	float depthu_bounds(depthu, axis_nbounds) ;
		depthu_bounds:units = "m" ;

// global attributes:
		:name = "O2L3P_LONG_5d_00010101_00010303_grid_U" ;
		:description = "ocean U grid variables" ;
		:title = "ocean U grid variables" ;
		:Conventions = "CF-1.6" ;

This is using rev 2634 of XIOS3 with the following services:

 <context id="xios" >
    <variable_definition>
      <variable_group id="buffer">
        <variable id="min_buffer_size" type="int">400000</variable>
        <variable id="optimal_buffer_size" type="string">performance</variable>
      </variable_group>

      <variable_group id="parameters" >
        <variable id="using_server" type="bool">true</variable>
        <variable id="info_level" type="int">0</variable>
        <variable id="print_file" type="bool">false</variable>
        <variable id="using_server2" type="bool">false</variable>
        <variable id="transport_protocol" type="string" >p2p</variable>
        <variable id="using_oasis"      type="bool">false</variable>
      </variable_group>
    </variable_definition>
    <pool_definition>
     <pool name="Opool" nprocs="12">
      <service name="tgatherer" nprocs="2" type="gatherer"/>
      <service name="igatherer" nprocs="2" type="gatherer"/>
      <service name="ugatherer" nprocs="2" type="gatherer"/>
      <service name="pgatherer" nprocs="2" type="gatherer"/>
      <service name="twriter" nprocs="1" type="writer"/>
      <service name="uwriter" nprocs="1" type="writer"/>
      <service name="iwriter" nprocs="1" type="writer"/>
      <service name="pwriter" nprocs="1" type="writer"/>
     </pool>
    </pool_definition>
  </context>

Is there a trick to making this work correctly? Or any tips to where to look for errors in my setup?

Change History (8)

comment:1 Changed 5 months ago by acc

An additional observation: if a different name is assigned to the inner grid, i.e.:

  <grid id="grid_U_3D" >
    <domain domain_ref="grid_U" />
    <axis axis_ref="depthu" />
  </grid>
  <grid id="grid_U_3D_inner" >
    <domain domain_ref="grid_U_inner" name="grid_UI" /> <!-- use name="grid_UI" so the x, y dimensions get distinct names -->
    <axis axis_ref="depthu" />
  </grid>

then a file is successfully created with two sets of dimensions and coordinates:

ncdump -h O2L3P_LONG_5d_00010101_00010303_grid_U.nc
netcdf O2L3P_LONG_5d_00010101_00010303_grid_U {
dimensions:
	axis_nbounds = 2 ;
	x_grid_UI = 180 ;
	y_grid_UI = 148 ;
	depthu = 31 ;
	x_grid_U = 180 ;
	y_grid_U = 148 ;
	time_counter = UNLIMITED ; // (0 currently)
variables:
	float nav_lat_grid_UI(y_grid_UI, x_grid_UI) ;
		nav_lat_grid_UI:standard_name = "latitude" ;
		nav_lat_grid_UI:long_name = "Latitude" ;
		nav_lat_grid_UI:units = "degrees_north" ;
	float nav_lon_grid_UI(y_grid_UI, x_grid_UI) ;
		nav_lon_grid_UI:standard_name = "longitude" ;
		nav_lon_grid_UI:long_name = "Longitude" ;
		nav_lon_grid_UI:units = "degrees_east" ;
	float depthu(depthu) ;
		depthu:name = "depthu" ;
		depthu:long_name = "Vertical U levels" ;
		depthu:units = "m" ;
		depthu:positive = "down" ;
		depthu:bounds = "depthu_bounds" ;
	float depthu_bounds(depthu, axis_nbounds) ;
		depthu_bounds:units = "m" ;
	float nav_lat_grid_U(y_grid_U, x_grid_U) ;
		nav_lat_grid_U:standard_name = "latitude" ;
		nav_lat_grid_U:long_name = "Latitude" ;
		nav_lat_grid_U:units = "degrees_north" ;
	float nav_lon_grid_U(y_grid_U, x_grid_U) ;
		nav_lon_grid_U:standard_name = "longitude" ;
		nav_lon_grid_U:long_name = "Longitude" ;
		nav_lon_grid_U:units = "degrees_east" ;
	double time_centered(time_counter) ;
		time_centered:standard_name = "time" ;
		time_centered:long_name = "Time axis" ;
		time_centered:calendar = "noleap" ;
		time_centered:units = "seconds since 1900-01-01 00:00:00" ;
		time_centered:time_origin = "1900-01-01 00:00:00" ;
		time_centered:bounds = "time_centered_bounds" ;
	double time_centered_bounds(time_counter, axis_nbounds) ;
	double time_counter(time_counter) ;
		time_counter:axis = "T" ;
		time_counter:standard_name = "time" ;
		time_counter:long_name = "Time axis" ;
		time_counter:calendar = "noleap" ;
		time_counter:units = "seconds since 1900-01-01 00:00:00" ;
		time_counter:time_origin = "1900-01-01 00:00:00" ;
		time_counter:bounds = "time_counter_bounds" ;
	double time_counter_bounds(time_counter, axis_nbounds) ;
	float e3u(time_counter, depthu, y_grid_UI, x_grid_UI) ;
		e3u:standard_name = "cell_thickness" ;
		e3u:long_name = "U-cell thickness" ;
		e3u:units = "m" ;
		e3u:online_operation = "average" ;
		e3u:interval_operation = "5400 s" ;
		e3u:interval_write = "5 d" ;
		e3u:cell_methods = "time: mean (interval: 5400 s)" ;
		e3u:_FillValue = 1.e+20f ;
		e3u:missing_value = 1.e+20f ;
		e3u:coordinates = "time_centered nav_lat_grid_UI nav_lon_grid_UI" ;
	float uos(time_counter, y_grid_U, x_grid_U) ;
		uos:long_name = "ocean surface current along i-axis" ;
		uos:units = "m/s" ;

etc.

comment:2 Changed 5 months ago by acc

Just to confirm: this issue is specific to XIOS3. A test with the standard ORCA2_ICE_PISCES SETTE configuration with NEMO and XIOS2 works as expected (32 ocean cores; 4 one-level XIOS servers). Compiling the same test (with key_xios3) and linking with the XIOS3 libraries produces a case which fails with:

In file "nc4_data_output.cpp", function "void xios::CNc4DataOutput::writeDomain_(xios::CDomain *)",  line 476 -> On writing the domain : nemo__domain_undef_id_72
In the context : default_pool_id__default_writer_id_0__nemo
Error when calling function nc_def_dim(ncid, dimName.c_str(), dimLen, &dimId)
NetCDF: String match to name in use
Unable to create dimension with name: x and with length 180

In this case the only change made to the standard set of inputs is the removal of these variable definitions in iodef.xml:

          <variable id="using_oasis"               type="bool">false</variable>
          <variable id="oasis_codes_id"            type="string" >oceanx</variable>

which are not recognised by XIOS3 (and are not being used anyway).

comment:3 Changed 5 months ago by acc

There is not much hope of my fully understanding the XIOS code, but I can't see how this code can robustly avoid duplicating meshes:

src/node/mesh.cpp:

///---------------------------------------------------------------
/*!
 * \fn bool CMesh::getMesh (StdString meshName)
 * Returns a pointer to a mesh. If a mesh has not been created, creates it and adds its name to the list of meshes meshList.
 * \param [in] meshName  The name of a mesh ("name" attribute of a domain).
 * \param [in] nvertex Number of vertices (1 for nodes, 2 for edges, 3 and up for faces).
 */
  CMesh* CMesh::getMesh (StdString meshName, int nvertex)
  {
    CMesh::domainList[meshName].push_back(nvertex);

    if ( CMesh::meshList.begin() != CMesh::meshList.end() )
    {
      for (std::map<StdString, CMesh>::iterator it=CMesh::meshList.begin(); it!=CMesh::meshList.end(); ++it)
      {
        if (it->first == meshName)
          return &meshList[meshName];
        else
        {
          CMesh newMesh;
          CMesh::meshList.insert( make_pair(meshName, newMesh) );
          return &meshList[meshName];
        }
      }
    }
    else
    {
      CMesh newMesh;
      CMesh::meshList.insert( make_pair(meshName, newMesh) );
      return &meshList[meshName];
    }

    MISSING_RETURN( "CMesh* CMesh::getMesh (StdString meshName, int nvertex)" );
    return nullptr;
  }

should the algorithm be more like:

    if ( CMesh::meshList.begin() != CMesh::meshList.end() )
    {
      for (std::map<StdString, CMesh>::iterator it=CMesh::meshList.begin(); it!=CMesh::meshList.end(); ++it)
      {
        if (it->first == meshName)
          return &meshList[meshName];
      }
      // Found no matches in the existing list; insert a new mesh
      CMesh newMesh;
      CMesh::meshList.insert( make_pair(meshName, newMesh) );
      return &meshList[meshName];
    }
    else  // First entry in the list
    {
      ...

?
This makes no difference to the current issue (and the code is largely unchanged since XIOS2), but looks wrong to me.
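For what it's worth, the look-up-then-insert the loop is trying to express can be written with a single std::map::find, so insertion can only happen after the whole map has been searched. A minimal standalone sketch (simplified stand-in types and names, not the XIOS classes):

```cpp
#include <map>
#include <string>

struct CMeshSketch { };  // stand-in for xios::CMesh

std::map<std::string, CMeshSketch> meshList;

// Return the mesh registered under meshName, creating it on first request.
// find() searches the whole map before any insertion, so a duplicate entry
// can never be created while a match already exists.
CMeshSketch* getMesh(const std::string& meshName)
{
    auto it = meshList.find(meshName);
    if (it != meshList.end())
        return &it->second;                        // match found: reuse it
    // No match anywhere in the list: insert exactly one new mesh.
    return &meshList.emplace(meshName, CMeshSketch{}).first->second;
}
```

The find/emplace pair also handles the empty-map case without a separate branch, which removes the first-entry special case entirely.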

comment:4 Changed 5 months ago by acc

Clearly not a good time of year to expect a response, so I've had a bit of a dig myself. It looks like the new hashing is failing to find a match. Here's my instrumented section of io/nc4_data_output.cpp (I inserted some info lines; look for DOM):

          int globalHash = domain->computeAttributesHash( comm_file ); // Need a MPI_Comm to distribute without redundancy some attributs (value)

          StdString defaultNameKey = domain->getDomainOutputName();
          info(1)<<"DOMdefID : " + defaultNameKey << endl ;
          if ( !relDomains_.count ( defaultNameKey ) )
          {
            info(1)<<"DOMnotin : " + defaultNameKey << endl ;
            // if defaultNameKey not in the map, write the element such as it is defined
            relDomains_.insert( make_pair( defaultNameKey, make_pair(globalHash, domain) ) );
          }
          else // look if a hash associated this key is equal
          {
            bool elementIsInMap(false);
            auto defaultNameKeyElements = relDomains_.equal_range( defaultNameKey );
            for (auto it = defaultNameKeyElements.first; it != defaultNameKeyElements.second; it++)
            {
              info(1)<<"DOMhash : " <<  it->second.first << " ::: " <<  globalHash << endl ;
              if ( it->second.first == globalHash )
              {
                // if yes, associate the same ids to current element
                domain->renameAttributesBeforeWriting( it->second.second );
                StdString domid2 = domain->getDomainOutputName();
                info(1)<<"DOMID2 : " + domid2 << endl ;
                elementIsInMap = true;
              }
            }
            // if no : inheritance has been excessive, define new names and store it (could be used by another grid)
            if (!elementIsInMap)  // ! in MAP
            {
              domain->renameAttributesBeforeWriting();
              StdString domid3 = domain->getDomainOutputName();
              info(1)<<"DOMID3 : " + domid3 << endl ;
              relDomains_.insert( make_pair( defaultNameKey, make_pair(globalHash, domain) ) ) ;
            }
          }
        }
         StdString domid1 = domain->getDomainOutputName();
         info(1)<<"DOMID1 : " + domid1 << endl ;

        if (domain->type == CDomain::type_attr::unstructured)
        {
          if (SuperClassWriter::useCFConvention)
            writeUnstructuredDomain(domain) ;
          else
            writeUnstructuredDomainUgrid(domain) ;
          return ;
        }

         CContext* context = CContext::getCurrent() ;
         if (domain->IsWritten(this->filename)) return;
         domain->checkAttributes();

         if (domain->isEmpty())
           if (SuperClass::type==MULTI_FILE) return;


         std::vector<StdString> dim0, dim1;
         StdString domid = domain->getDomainOutputName();
         info(1)<<"DOMID : " + domid << endl ;

and here is the output that this generates:

cat xios_server_02.out
-> info : Service  default_writer_id created   service size : 4   service rank : 2 on rank pool 2
-> info : Service  default_writer_id created   partition : 0 service size : 4 service rank : 2 on rank pool 2
-> info : Context default_pool_id__default_writer_id_0__nemo created, on local rank 2 and global rank 2
-> info : CServerContext::createIntercomm : No overlap ==> context in server mode
-> info : DOMdefID : grid_T
-> info : DOMnotin : grid_T
-> info : DOMID1 : grid_T
-> info : DOMID : grid_T

-> info : DOMdefID : grid_T
-> info : DOMhash : 18446744073246414776 ::: 735921135
-> info : DOMID3 : nemo__domain_undef_id_72
-> info : DOMID1 : nemo__domain_undef_id_72
-> info : DOMID : nemo__domain_undef_id_72

My guess is that XIOS2 used to match only on name, but XIOS3 tries to be more thorough, which leads to this change in behaviour.
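The matching logic being exercised above boils down to a (name, hash) lookup in a multimap: same name with a matching hash reuses the existing entry's dimensions, same name with a different hash registers a new entry (and triggers the rename). A minimal sketch of that pattern (simplified types and names, not the XIOS classes):

```cpp
#include <map>
#include <string>

std::multimap<std::string, int> relDomains; // output name -> attribute hash

// Returns true if a domain with this name and an equal hash is already
// registered (so its dimensions can be shared); otherwise registers the
// new (name, hash) pair and returns false.
bool findOrRegister(const std::string& name, int hash)
{
    auto range = relDomains.equal_range(name);
    for (auto it = range.first; it != range.second; ++it)
        if (it->second == hash)
            return true;            // same name, same hash: reuse dimensions
    relDomains.emplace(name, hash); // no hash match: register a new entry
    return false;
}
```

In the failing run the two "grid_T" domains land in the second case because their hashes differ, which is why the second one gets a generated nemo__domain_undef_id name.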

comment:5 Changed 5 months ago by acc

Ah, a solution of sorts. The mismatch is coming from the hash of the attributes associated with the domain in computeAttributesHash (node/domain.cpp). By changing:

     return distributedHash + globalHash;

to just:

     return distributedHash ;

the XIOS3 test now behaves like the XIOS2 one. E.g.:

-> info : Service  default_writer_id created   service size : 4   service rank : 2 on rank pool 2
-> info : Service  default_writer_id created   partition : 0 service size : 4 service rank : 2 on rank pool 2
-> info : Context default_pool_id__default_writer_id_0__nemo created, on local rank 2 and global rank 2
-> info : CServerContext::createIntercomm : No overlap ==> context in server mode
-> info : DOMdefID : grid_T
-> info : DOMnotin : grid_T
-> info : DOMID1 : grid_T
-> info : DOMID : grid_T
-> info : DOMdefID : grid_T
-> info : DOMhash : -567768377 ::: -567768377
-> info : DOMID2 : grid_T
-> info : DOMID1 : grid_T
-> info : DOMID : grid_T
-> info : DOMdefID : grid_T
-> info : DOMhash : -567768377 ::: -567768377
-> info : DOMID2 : grid_T
-> info : DOMID1 : grid_T
-> info : DOMID : grid_T

I'm just not sure how safe this "fix" is in general.

Last edited 5 months ago by acc (previous) (diff)

comment:6 Changed 5 months ago by acc

Ok, a better fix: the hash mismatch in the original code was coming from the long_name attribute associated with the domains. One was 'grid T' and one was 'grid T inner'. The purpose of the excludedAttr list is to name those attributes that should not be included in the hash, so the minimal fix is to exclude the long_name attribute from the checks:

  • ../node/domain.cpp

     
    1848  1848      vector<StdString> excludedAttr;
    1849  1849      //excludedAttr.push_back("name");
    1850  1850      // internal attributs
          1851  +   excludedAttr.insert(excludedAttr.end(), { "long_name" });
    1851  1852      excludedAttr.insert(excludedAttr.end(), { "ibegin", "jbegin", "ni", "nj", "i_index", "j_index" });

Longer term, there probably ought to be a way of adding to this exclusion list via the XML.
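The effect of the exclusion list can be sketched in isolation: hash every attribute except the excluded ones, so a cosmetic difference such as long_name can no longer make two otherwise-identical domains hash differently. A standalone illustration (illustrative names and a plain string map, not the XIOS implementation):

```cpp
#include <functional>
#include <map>
#include <set>
#include <string>

// Order-independent hash over attribute (name, value) pairs, skipping any
// attribute whose name appears on the exclusion list.
size_t attributesHash(const std::map<std::string, std::string>& attrs,
                      const std::set<std::string>& excludedAttr)
{
    size_t h = 0;
    std::hash<std::string> hasher;
    for (const auto& [name, value] : attrs)
        if (excludedAttr.count(name) == 0)
            h += hasher(name) + hasher(value);
    return h;
}
```

With "long_name" excluded, a domain carrying long_name='grid T' and one carrying long_name='grid T inner' produce the same hash, which is exactly what lets the writer reuse the x, y dimensions.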

comment:7 Changed 5 months ago by acc

For completeness, here is the additional debug line added temporarily to attribute_map.cpp to help find which attribute was different:

  • attribute_map.cpp

     
    291  291              {
    292  292                if (!el.second->isEmpty())
    293  293                {
         294  +             info(1) << "ATTR: " + el.first + " : " << el.second->dump() << endl;
    294  295                  attrs_hash += el.second->computeHash();
    295  296                }
    296  297              }

comment:8 Changed 3 months ago by jderouillat

The timing was indeed not good, but you found the right thing to do in excluding the long_name.

I'll commit the fix and then close the issue.

Note: See TracTickets for help on using tickets.