Opened 5 months ago
Last modified 3 months ago
#196 new defect
XIOS3 not recognising matching dimensions
| Reported by: | acc | Owned by: | ymipsl |
| --- | --- | --- | --- |
| Priority: | major | Component: | XIOS |
| Version: | 3.0 | Keywords: | |
| Cc: | | | |
Description
I'm having issues trying to create output containing fields with different source grids that map to the same global grid. In particular, this is a global NEMO grid collated from MPP domains, where some fields supply the full MPP domain and others supply only the inner (haloes removed) section. The two grids are defined with the same name attribute:
<grid id="grid_U_3D" >
  <domain domain_ref="grid_U" />
  <axis axis_ref="depthu" />
</grid>

<grid id="grid_U_3D_inner" >
  <domain domain_ref="grid_U_inner" name="grid_U" />   <!-- use name="grid_U" so we don't duplicate x, y, dimensions -->
  <axis axis_ref="depthu" />
</grid>
but trying to write a file containing a mix of variables such as:
<field_group id="grid_U" grid_ref="grid_U_2D">
  <field id="e3u"  long_name="U-cell thickness"           standard_name="cell_thickness"       unit="m"   grid_ref="grid_U_3D_inner" />
  <field id="uoce" long_name="ocean current along i-axis" standard_name="sea_water_x_velocity" unit="m/s" grid_ref="grid_U_3D" />
  .
  .

<file id="file12" name_suffix="_grid_U" mode="write" gatherer="ugatherer" writer="uwriter" using_server2="true" description="ocean U grid variables" >
  <field field_ref="e3u" />
  <field field_ref="uoce" name="uo" />
  .
  .
results in:
cat xios_server_09.err

In file "nc4_data_output.cpp", function "void xios::CNc4DataOutput::writeDomain_(xios::CDomain *)", line 476
-> On writing the domain : Opool__ugatherer_0__nemo__domain_undef_id_1
In the context : Opool__uwriter_0__nemo
Error when calling function nc_def_dim(ncid, dimName.c_str(), dimLen, &dimId)
NetCDF: String match to name in use
Unable to create dimension with name: x and with length 180
180 is the correct size for the global grid, but the two levels of servers in play (ugatherer and uwriter) do not seem to recognise that the dimension has already been defined. The contents of the incomplete output file are:
ncdump -h O2L3P_LONG_5d_00010101_00010303_grid_U.nc
netcdf O2L3P_LONG_5d_00010101_00010303_grid_U {
dimensions:
        axis_nbounds = 2 ;
        x = 180 ;
        y = 148 ;
        depthu = 31 ;
variables:
        float nav_lat(y, x) ;
                nav_lat:standard_name = "latitude" ;
                nav_lat:long_name = "Latitude" ;
                nav_lat:units = "degrees_north" ;
        float nav_lon(y, x) ;
                nav_lon:standard_name = "longitude" ;
                nav_lon:long_name = "Longitude" ;
                nav_lon:units = "degrees_east" ;
        float depthu(depthu) ;
                depthu:name = "depthu" ;
                depthu:long_name = "Vertical U levels" ;
                depthu:units = "m" ;
                depthu:positive = "down" ;
                depthu:bounds = "depthu_bounds" ;
        float depthu_bounds(depthu, axis_nbounds) ;
                depthu_bounds:units = "m" ;

// global attributes:
                :name = "O2L3P_LONG_5d_00010101_00010303_grid_U" ;
                :description = "ocean U grid variables" ;
                :title = "ocean U grid variables" ;
                :Conventions = "CF-1.6" ;
This is using rev 2634 of XIOS3 with the following services:
<context id="xios" >
  <variable_definition>
    <variable_group id="buffer">
      <variable id="min_buffer_size" type="int">400000</variable>
      <variable id="optimal_buffer_size" type="string">performance</variable>
    </variable_group>
    <variable_group id="parameters" >
      <variable id="using_server" type="bool">true</variable>
      <variable id="info_level" type="int">0</variable>
      <variable id="print_file" type="bool">false</variable>
      <variable id="using_server2" type="bool">false</variable>
      <variable id="transport_protocol" type="string" >p2p</variable>
      <variable id="using_oasis" type="bool">false</variable>
    </variable_group>
  </variable_definition>
  <pool_definition>
    <pool name="Opool" nprocs="12">
      <service name="tgatherer" nprocs="2" type="gatherer"/>
      <service name="igatherer" nprocs="2" type="gatherer"/>
      <service name="ugatherer" nprocs="2" type="gatherer"/>
      <service name="pgatherer" nprocs="2" type="gatherer"/>
      <service name="twriter" nprocs="1" type="writer"/>
      <service name="uwriter" nprocs="1" type="writer"/>
      <service name="iwriter" nprocs="1" type="writer"/>
      <service name="pwriter" nprocs="1" type="writer"/>
    </pool>
  </pool_definition>
</context>
Is there a trick to making this work correctly? Or any tips on where to look for errors in my setup?
Change History (8)
comment:1 Changed 5 months ago by acc
comment:2 Changed 5 months ago by acc
Just to confirm that this issue is specific to XIOS3: a test with the standard ORCA2_ICE_PISCES SETTE configuration with NEMO and XIOS2 works as expected (32 ocean cores; 4 one-level XIOS servers). Compiling the same test (with key_xios3) and linking with the XIOS3 libraries produces a case which fails with:
In file "nc4_data_output.cpp", function "void xios::CNc4DataOutput::writeDomain_(xios::CDomain *)", line 476 -> On writing the domain : nemo__domain_undef_id_72 In the context : default_pool_id__default_writer_id_0__nemo Error when calling function nc_def_dim(ncid, dimName.c_str(), dimLen, &dimId) NetCDF: String match to name in use Unable to create dimension with name: x and with length 180
In this case the only change made to the standard set of inputs is the removal of these variable definitions in iodef.xml:
<variable id="using_oasis" type="bool">false</variable> <variable id="oasis_codes_id" type="string" >oceanx</variable>
which are not recognised by XIOS3 (and are not being used anyway).
comment:3 Changed 5 months ago by acc
There is not much hope of me fully understanding the XIOS code, but I can't see how this code can robustly avoid duplicating meshes:
src/node/mesh.cpp:

  ///---------------------------------------------------------------
  /*!
   * \fn bool CMesh::getMesh (StdString meshName)
   * Returns a pointer to a mesh. If a mesh has not been created, creates it and adds its name to the list of meshes meshList.
   * \param [in] meshName The name of a mesh ("name" attribute of a domain).
   * \param [in] nvertex Number of verteces (1 for nodes, 2 for edges, 3 and up for faces).
   */
  CMesh* CMesh::getMesh (StdString meshName, int nvertex)
  {
    CMesh::domainList[meshName].push_back(nvertex);

    if ( CMesh::meshList.begin() != CMesh::meshList.end() )
    {
      for (std::map<StdString, CMesh>::iterator it=CMesh::meshList.begin(); it!=CMesh::meshList.end(); ++it)
      {
        if (it->first == meshName)
          return &meshList[meshName];
        else
        {
          CMesh newMesh;
          CMesh::meshList.insert( make_pair(meshName, newMesh) );
          return &meshList[meshName];
        }
      }
    }
    else
    {
      CMesh newMesh;
      CMesh::meshList.insert( make_pair(meshName, newMesh) );
      return &meshList[meshName];
    }
    MISSING_RETURN( "CMesh* CMesh::getMesh (StdString meshName, int nvertex)" );
    return nullptr;
  }
Should the algorithm be more like:
  if ( CMesh::meshList.begin() != CMesh::meshList.end() )
  {
    for (std::map<StdString, CMesh>::iterator it=CMesh::meshList.begin(); it!=CMesh::meshList.end(); ++it)
    {
      if (it->first == meshName)
        return &meshList[meshName];
    }
    // else
    // {
    // Found no matches in existing list; insert a new mesh
    CMesh newMesh;
    CMesh::meshList.insert( make_pair(meshName, newMesh) );
    return &meshList[meshName];
    // }
    // }
  }
  else    // First entry in list
  {
?
This makes no difference to the current issue (and the code is largely unchanged since XIOS2), but looks wrong to me.
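In other words (and purely as an untested sketch of what I think the function is trying to do, not a proposed patch), the whole lookup could presumably collapse to a simple find-or-insert:

  CMesh* CMesh::getMesh (StdString meshName, int nvertex)
  {
    CMesh::domainList[meshName].push_back(nvertex);

    // Only create a new mesh once the whole list has been searched
    // and no entry with this name has been found.
    if ( CMesh::meshList.find(meshName) == CMesh::meshList.end() )
    {
      CMesh newMesh;
      CMesh::meshList.insert( make_pair(meshName, newMesh) );
    }
    return &meshList[meshName];
  }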
comment:4 Changed 5 months ago by acc
Clearly not a good time of year to expect a response, so I've had a bit of a dig myself. It looks like the new hashing is failing to find a match. Here's my instrumented part of io/nc4_data_output.cpp (I inserted some info lines; look for DOM):
    int globalHash = domain->computeAttributesHash( comm_file ); // Need a MPI_Comm to distribute without redundancy some attributs (value)

    StdString defaultNameKey = domain->getDomainOutputName();
    info(1)<<"DOMdefID : " + defaultNameKey << endl ;
    if ( !relDomains_.count ( defaultNameKey ) )
    {
      info(1)<<"DOMnotin : " + defaultNameKey << endl ;
      // if defaultNameKey not in the map, write the element such as it is defined
      relDomains_.insert( make_pair( defaultNameKey, make_pair(globalHash, domain) ) );
    }
    else // look if a hash associated this key is equal
    {
      bool elementIsInMap(false);
      auto defaultNameKeyElements = relDomains_.equal_range( defaultNameKey );
      for (auto it = defaultNameKeyElements.first; it != defaultNameKeyElements.second; it++)
      {
        info(1)<<"DOMhash : " << it->second.first << " ::: " << globalHash << endl ;
        if ( it->second.first == globalHash )
        {
          // if yes, associate the same ids to current element
          domain->renameAttributesBeforeWriting( it->second.second );
          StdString domid2 = domain->getDomainOutputName();
          info(1)<<"DOMID2 : " + domid2 << endl ;
          elementIsInMap = true;
        }
      }
      // if no : inheritance has been excessive, define new names and store it (could be used by another grid)
      if (!elementIsInMap)  // ! in MAP
      {
        domain->renameAttributesBeforeWriting();
        StdString domid3 = domain->getDomainOutputName();
        info(1)<<"DOMID3 : " + domid3 << endl ;
        relDomains_.insert( make_pair( defaultNameKey, make_pair(globalHash, domain) ) ) ;
      }
    }
  }

  StdString domid1 = domain->getDomainOutputName();
  info(1)<<"DOMID1 : " + domid1 << endl ;

  if (domain->type == CDomain::type_attr::unstructured)
  {
    if (SuperClassWriter::useCFConvention)
      writeUnstructuredDomain(domain) ;
    else
      writeUnstructuredDomainUgrid(domain) ;
    return ;
  }

  CContext* context = CContext::getCurrent() ;

  if (domain->IsWritten(this->filename)) return;
  domain->checkAttributes();

  if (domain->isEmpty())
    if (SuperClass::type==MULTI_FILE) return;

  std::vector<StdString> dim0, dim1;
  StdString domid = domain->getDomainOutputName();
  info(1)<<"DOMID : " + domid << endl ;
and here is the output that this generates:
cat xios_server_02.out

-> info : Service default_writer_id created   service size : 4   service rank : 2 on rank pool 2
-> info : Service default_writer_id created partition : 0   service size : 4   service rank : 2 on rank pool 2
-> info : Context default_pool_id__default_writer_id_0__nemo created, on local rank 2 and global rank 2
-> info : CServerContext::createIntercomm : No overlap ==> context in server mode
-> info : DOMdefID : grid_T
-> info : DOMnotin : grid_T
-> info : DOMID1 : grid_T
-> info : DOMID : grid_T
-> info : DOMdefID : grid_T
-> info : DOMhash : 18446744073246414776 ::: 735921135
-> info : DOMID3 : nemo__domain_undef_id_72
-> info : DOMID1 : nemo__domain_undef_id_72
-> info : DOMID : nemo__domain_undef_id_72
My guess is that XIOS2 used to match only on the name, whereas XIOS3 tries to be more thorough, which leads to this change in behaviour.
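To make the mechanism above concrete, here is a small standalone illustration (not XIOS code) of the name-plus-hash lookup, using the two hash values from my log: a domain is reused only when both its output name and its attribute hash match an entry already stored.

  #include <cstdint>
  #include <iostream>
  #include <map>
  #include <string>

  int main()
  {
      // name -> attribute hash, mimicking relDomains_ (which also stores the domain pointer)
      std::multimap<std::string, std::uint64_t> relDomains;
      relDomains.insert({"grid_T", 18446744073246414776ULL}); // hash stored when grid_T was first written

      std::string   key  = "grid_T";      // second domain to write: same output name...
      std::uint64_t hash = 735921135ULL;  // ...but a different attribute hash (see the DOMhash line above)

      bool elementIsInMap = false;
      auto range = relDomains.equal_range(key);
      for (auto it = range.first; it != range.second; ++it)
          if (it->second == hash) elementIsInMap = true;   // a match would reuse the existing ids

      if (!elementIsInMap)  // no match: the domain is renamed (nemo__domain_undef_id_72) and "x" is redefined
          std::cout << "hash mismatch -> domain gets a new *_undef_id_* name\n";
      return 0;
  }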
comment:5 Changed 5 months ago by acc
Ah, a solution of sorts. The mismatch is coming from the hash of the attributes associated with the domain in computeAttributesHash (node/domain.cpp). By changing:
return distributedHash + globalHash;
to just:
return distributedHash ;
the XIOS3 test now behaves like the XIOS2 one, e.g.:
-> info : Service default_writer_id created   service size : 4   service rank : 2 on rank pool 2
-> info : Service default_writer_id created partition : 0   service size : 4   service rank : 2 on rank pool 2
-> info : Context default_pool_id__default_writer_id_0__nemo created, on local rank 2 and global rank 2
-> info : CServerContext::createIntercomm : No overlap ==> context in server mode
-> info : DOMdefID : grid_T
-> info : DOMnotin : grid_T
-> info : DOMID1 : grid_T
-> info : DOMID : grid_T
-> info : DOMdefID : grid_T
-> info : DOMhash : -567768377 ::: -567768377
-> info : DOMID2 : grid_T
-> info : DOMID1 : grid_T
-> info : DOMID : grid_T
-> info : DOMdefID : grid_T
-> info : DOMhash : -567768377 ::: -567768377
-> info : DOMID2 : grid_T
-> info : DOMID1 : grid_T
-> info : DOMID : grid_T
I'm just not sure how safe this "fix" is in general.
comment:6 Changed 5 months ago by acc
OK, a better fix: the hash mismatch in the original code was coming from the long_name attribute associated with the domains; one was 'grid T' and the other was 'grid T inner'. The purpose of the excludedAttr list is to hold those attributes that should not be included in the hash, so the minimal fix is to exclude the long_name attribute from the checks:
../node/domain.cpp:

       vector<StdString> excludedAttr;
       //excludedAttr.push_back("name");
       // internal attributs
  +    excludedAttr.insert(excludedAttr.end(), { "long_name" });
       excludedAttr.insert(excludedAttr.end(), { "ibegin", "jbegin", "ni", "nj", "i_index", "j_index" });
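To illustrate why this is enough: the two domains are identical apart from long_name, so any attribute hash that folds long_name in separates them, while one that skips it lets them match. A toy standalone example of the principle (not the real XIOS hashing code):

  #include <functional>
  #include <iostream>
  #include <string>
  #include <utility>
  #include <vector>

  // Toy attribute hash: a sum of per-attribute hashes, optionally skipping one excluded name.
  using Attrs = std::vector<std::pair<std::string, std::string>>;

  std::size_t attrsHash(const Attrs& attrs, const std::string& excluded)
  {
      std::size_t h = 0;
      for (const auto& a : attrs)
      {
          if (a.first == excluded) continue;                        // behaves like an excludedAttr entry
          h += std::hash<std::string>{}(a.first + "=" + a.second);  // additive, like attrs_hash +=
      }
      return h;
  }

  int main()
  {
      Attrs full  = { {"name", "grid_T"}, {"long_name", "grid T"} };
      Attrs inner = { {"name", "grid_T"}, {"long_name", "grid T inner"} };

      std::cout << (attrsHash(full, "")          == attrsHash(inner, ""))          << "\n"; // 0: hashes differ
      std::cout << (attrsHash(full, "long_name") == attrsHash(inner, "long_name")) << "\n"; // 1: hashes match
      return 0;
  }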
Longer term, there probably ought to be a way of adding to this exclusion list via the XML.
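Something along these lines, perhaps; note that the variable id and the helper below are entirely made up for illustration, and no such option exists in XIOS today:

  #include <sstream>
  #include <string>
  #include <vector>

  // Hypothetical only: suppose a context variable such as
  //   <variable id="hash_excluded_attributes" type="string">long_name,standard_name</variable>
  // existed; its value could be split like this and appended to excludedAttr
  // before the hash is computed. The variable id and this helper are invented.
  std::vector<std::string> splitExcludedAttrs(const std::string& csv)
  {
      std::vector<std::string> out;
      std::stringstream ss(csv);
      std::string item;
      while (std::getline(ss, item, ','))
          if (!item.empty()) out.push_back(item);
      return out;
  }

  // e.g., in node/domain.cpp:
  //   for (const auto& a : splitExcludedAttrs(userValue))
  //     excludedAttr.push_back(a);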
comment:7 Changed 5 months ago by acc
For completeness, here is the additional debug line added temporarily to attribute_map.cpp to help find which attribute was different:
attribute_map.cpp:

       {
         if (!el.second->isEmpty())
         {
  +        info(1) << "ATTR: " + el.first + " : " << el.second->dump() << endl;
           attrs_hash += el.second->computeHash();
         }
       }
comment:8 Changed 3 months ago by jderouillat
The timing was indeed not good, but you found the right thing to do in excluding the long_name.
I'll commit the fix and then close the issue.
An additional observation: if a different name is assigned to the inner grid, i.e.:
then a file is successfully created with two sets of dimensions and coordinates:
etc.