Opened 5 years ago

Closed 5 years ago

#98 closed defect (fixed)

Issue with DHT when the number of indexes is lower than the number of clients

Reported by: rlacroix Owned by: mhnguyen
Priority: critical Component: XIOS
Version: 2.0 Keywords: indexes, client, DHT
Cc:

Description

This can be reproduced using test_client.exe. The axis has 5 elements, but as soon as you use more than 5 clients, the axis size used when computing the indexes becomes equal to the number of clients.

The problem lies somewhere in the CClientClientDHTInt class or in one of the classes it uses, but I'm not familiar with this code, so it would be easier if you could have a look at the issue.

Change History (8)

comment:1 Changed 5 years ago by mhnguyen

This bug only appears if we have a distributed element (axis, domain) whose number of global indexes is less than the number of clients. Another (rare) case in which this bug can happen is when the number of servers is greater than the number of clients.
Fixed in r906.

comment:2 Changed 5 years ago by rlacroix

I think it also affected non-distributed axes. I will check if it fixes the problem on the NEMO test case I use and report back.

Thanks Ha!

comment:3 follow-up: Changed 5 years ago by rlacroix

When using r906 I see a crash when XIOS is computing the indexes and when using r907 the initial problem is still here (the size of the axis becomes equal to the number of clients).

comment:4 in reply to: ↑ 3 Changed 5 years ago by rlacroix

Replying to rlacroix:

When using r906 I see a crash when XIOS is computing the indexes and when using r907 the initial problem is still here (the size of the axis becomes equal to the number of clients).

Were you able to reproduce the problem with r907?

comment:5 Changed 5 years ago by rlacroix

Here is how to clearly see the problem:

    // Compute the global index of grid from global index of each element.
    for (int i = 0; i < srcRank.size(); ++i)
    {
      size_t ssize = 1;
      int rankSrc = srcRank[i];
      std::vector<std::vector<size_t>* > globalIndexOfElementTmp(nbElement);
      std::vector<size_t> currentIndex(nbElement,0);
      for (int idx = 0; idx < nbElement; ++idx)
      {
        // Just display the size of each element of the grid
        error << rankSrc << " --> " << globalElementIndexOnServer[idx][rankSrc].size() << std::endl;
        ssize *= (globalElementIndexOnServer[idx][rankSrc]).size();
        globalIndexOfElementTmp[idx] = &(globalElementIndexOnServer[idx][rankSrc]);
      }
      globalIndexOnServer[rankSrc].resize(ssize);
      // ... (remainder of the loop body omitted)
    }

Execute: mpirun -np 5 ../bin/test_client.exe : -np 1 ../bin/xios_server.exe and look at the output for the first client:

0 --> 2000
0 --> 5
0 --> 2000
0 --> 5

Then execute: mpirun -np 6 ../bin/test_client.exe : -np 1 ../bin/xios_server.exe and look at the new output for the first client:

0 --> 1700
0 --> 6
0 --> 1700
0 --> 6

The first value is the domain size. It is smaller in the second case because the domain is distributed, so this is expected. The second value is the axis size. It changes from 5 to 6, which is wrong: the axis is not distributed and is supposed to have only 5 elements.
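
To show the effect on the grid size, here is a minimal sketch of the accumulation done by `ssize` in the snippet above: the number of grid indexes for one server rank is the product of the sizes of its elements. The helper name `gridSize` is hypothetical and not part of XIOS; the sizes are the ones displayed above.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical helper (not part of XIOS): mirrors the
// `ssize *= ...size()` accumulation in the snippet above.
std::size_t gridSize(const std::vector<std::size_t>& elementSizes)
{
  std::size_t ssize = 1;
  for (std::size_t s : elementSizes)
    ssize *= s;  // one factor per element of the grid (domain, axis, ...)
  return ssize;
}
```

With 6 clients, `gridSize({1700, 6})` gives 10200, whereas a 5-element axis should give `gridSize({1700, 5})`, i.e. 8500.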

I hope it makes things clearer.

Last edited 5 years ago by rlacroix

comment:6 Changed 5 years ago by mhnguyen

To calculate the indexes to send from a client to the corresponding server, XIOS uses an algorithm based on a distributed hash table (DHT):

  1. First of all, for each element of the grid, its indexes are "scattered" internally by XIOS (even if it is not distributed): each client holds a piece of information about the global indexes of this element. For example, an axis of size 5 with global indexes 0,1,2,3,4 can be "scattered" among clients so that client 0 holds index 0, client 1 holds index 1, and so on. If there are more clients than the size of the axis, some clients hold the same index, e.g. with 6 clients, client 5 holds index 0.

With the known distribution on the server side, each client can compute which server the index it is holding belongs to; this information is used to initialize the DHT.

  2. Each client holds some indexes of an element if this element is distributed (by the user); otherwise, it holds all indexes of this element. By using the DHT, each client can calculate which server it should send these indexes to.
  3. The DHT returns the list of servers to which an index belongs. The size of this list is usually one. However, when the element size is less than the number of clients, this list is longer, but it contains only duplicates of a single server.
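
The steps above can be illustrated with a small model. This is only a sketch under assumptions, not XIOS code: the names `IndexToServers`, `buildTable` and `countEntries` are hypothetical, a plain `std::map` stands in for the distributed table, the scattering is assumed to be round-robin (client c holds index c modulo the element size, which only covers all indexes when there are at least as many clients as indexes), and the servers are assumed to use a block distribution.

```cpp
#include <cassert>
#include <cstddef>
#include <map>
#include <vector>

// Hypothetical model (not XIOS code): for each global index, the list of
// servers recorded by the clients that hold this index.
using IndexToServers = std::map<std::size_t, std::vector<int>>;

// Assumes nbClients >= nbIndexes and nbIndexes, nbServers > 0.
IndexToServers buildTable(std::size_t nbIndexes, std::size_t nbClients,
                          std::size_t nbServers)
{
  IndexToServers table;
  for (std::size_t client = 0; client < nbClients; ++client)
  {
    // Round-robin scattering (assumed): client 5 of 6 holds index 0 again.
    std::size_t index = client % nbIndexes;
    // Block distribution of the indexes over the servers (assumed).
    int server = static_cast<int>(index * nbServers / nbIndexes);
    table[index].push_back(server);
  }
  return table;
}

// Number of (index, server) entries obtained when every index is looked up.
std::size_t countEntries(const IndexToServers& table)
{
  std::size_t n = 0;
  for (const auto& entry : table)
    n += entry.second.size();
  return n;
}
```

In this model, `countEntries(buildTable(5, 5, 1))` is 5, while `countEntries(buildTable(5, 6, 1))` is 6 because index 0 is recorded twice for the same server, which matches the change from 5 to 6 reported in comment:5.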

ssize counts these duplicates, which gives a wrong size.
To correct this bug, I added a check to make sure the right size is returned (r914).
With the tests that I have, I can verify that the right size is returned, but maybe that is not enough.
Remi, can you please retest NEMO for me?
Thanks.
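
The kind of check described in this comment could look like the following sketch. It is not the actual r914 change, which is not shown in this ticket; `uniqueServers` is a hypothetical helper that drops repeated servers from the list returned for one index, so that each index is counted once per server.

```cpp
#include <cassert>
#include <cstddef>
#include <set>
#include <vector>

// Hypothetical helper (not the r914 code): keeps the first occurrence of
// each server and drops the duplicates, preserving the original order.
std::vector<int> uniqueServers(const std::vector<int>& servers)
{
  std::set<int> seen;
  std::vector<int> result;
  for (int s : servers)
  {
    if (seen.insert(s).second)  // true only the first time s is seen
      result.push_back(s);
  }
  return result;
}
```

For the 6-client case above, the list for index 0 would be `{0, 0}`, and `uniqueServers({0, 0})` has size 1, so the axis contributes 5 entries instead of 6.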

comment:7 Changed 5 years ago by rlacroix

As far as I can tell the problem is fixed on my NEMO test case. Thanks!

Last edited 5 years ago by rlacroix

comment:8 Changed 5 years ago by mhnguyen

  • Keywords indexes client DHT added
  • Resolution set to fixed
  • Status changed from new to closed

Thanks.
So I think this ticket can be closed.
