Version 4 (modified by frrh, 4 years ago)

Running eORCA1 on a single Cray XC40 PE

Background notes

Higher resolutions (e.g. eORCA025) cannot run on a single node, so we concentrate solely on eORCA1 for the purposes of this work.

Tim Graham has run a gyre ORCA2 configuration successfully on a single PE.

Here we aim to go up a level.

Procedures and observations

It turns out that if we take an eORCA1 NEMO-CICE GO6-type configuration and run it with a 1x1 decomposition and no separate XIOS server PEs, the job fails with a very clear message from icbinit (iceberg initialisation) that it is not possible to run with 1 PE in the X direction.

Note: we have to adjust the aprun command line options manually to get what we need, because the logic in the existing controls (in suite.rc) breaks down when running on 1x1 (and possibly on any odd number of) PEs.
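As a rough illustration of the kind of manual launch line involved, here is a minimal sketch that builds an aprun command for a single executable. It is not the logic from suite.rc, and the function name build_aprun, the executable name ./nemo.exe and the per-node limit of 36 are all placeholders assumed for the example. Only the standard aprun options -n (total PEs) and -N (PEs per node) are used.

```python
import shlex


def build_aprun(total_pes, max_pes_per_node, executable):
    """Return an aprun argument list for a single-executable launch.

    Hypothetical helper: the names and defaults here are illustrative
    assumptions, not taken from the suite.
    """
    if total_pes < 1 or max_pes_per_node < 1:
        raise ValueError("PE counts must be positive")
    # Never ask for more PEs per node than there are PEs in total,
    # which matters in the 1x1 case where only one PE exists.
    pes_per_node = min(total_pes, max_pes_per_node)
    return ["aprun", "-n", str(total_pes), "-N", str(pes_per_node), executable]


# 1x1 decomposition with no separate XIOS server PEs.
# 36 and ./nemo.exe are placeholder values.
cmd = build_aprun(1 * 1, 36, "./nemo.exe")
print(shlex.join(cmd))
```

Clamping the per-node count to the total is one simple way to avoid asking for an inconsistent layout when only one PE is requested; whether that is what goes wrong in the existing controls has not been checked.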

The reasons for the iceberg failure are not clear (why would the iceberg code be any different from the main NEMO or CICE code?). That seems odd but I don't propose to pursue it.

So try switching off icebergs….

Well, this seems to submit and start running but times out with no sign that it has got very far (no time.step file, etc.). It's not clear if things are failing somewhere in the NEMO code, in CICE or in the IO.

Try extending the wall-clock limit and shortening the total model run length to 6 hours…

This seems to abort in XIOS with an allocation problem.

How about we add separate XIOS procs back to the job?

Setting up to run with 8 separate XIOS processes in detached mode… that fails too.

It seems that regardless of how long we give the model to run, it reads (at least some of) the NEMO namelists and then just hangs.

Tim says this smacks of an XIOS problem.

He suggests switching off XIOS (in the external libraries control of fcm_make_ocean) and deactivating key_iomput.

So I do this and set the run to go for 6 hours (8 x 45-minute timesteps in this case).
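The run-length arithmetic can be checked with a small sketch. The helper name steps_for_run is hypothetical and is not part of NEMO or the suite.

```python
def steps_for_run(run_hours, timestep_minutes):
    """Number of whole timesteps needed to cover run_hours of model time."""
    run_minutes = run_hours * 60
    if run_minutes % timestep_minutes != 0:
        raise ValueError("run length is not a whole number of timesteps")
    return run_minutes // timestep_minutes


# 6 hours of model time with a 45-minute timestep.
print(steps_for_run(6, 45))  # 8
```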

This actually seems to work and sure enough at the end of the run we have a single NEMO restart file for the whole domain.

Conclusions

So it seems we can run eORCA1 on a single PE with the caveats that:

  • we must turn off icebergs
  • we must not compile with key_iomput
  • we must turn off XIOS completely from the compilation (i.e. not merely leave it to run in attached mode).
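The three caveats above can be written down as a simple pre-flight check. This is a sketch only: the function name, its arguments and the example key key_foo are hypothetical, and it does not read any real suite or namelist files.

```python
def single_pe_problems(cpp_keys, icebergs_on, xios_linked):
    """Return a list of reasons a configuration breaks the single PE caveats.

    cpp_keys    -- space-separated CPP keys used for the ocean build
    icebergs_on -- True if icebergs are switched on
    xios_linked -- True if XIOS is still included in the compilation
    """
    problems = []
    if icebergs_on:
        problems.append("icebergs are switched on")
    if "key_iomput" in cpp_keys.split():
        problems.append("key_iomput is set")
    if xios_linked:
        problems.append("XIOS is still included in the compilation")
    return problems


# key_foo is a placeholder standing in for the other keys in a build.
print(single_pe_problems("key_foo key_iomput", True, True))
print(single_pe_problems("key_foo", False, False))  # []
```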

However:

  • Single PE configurations force the code through physically different code paths in numerous places. So, particularly with optimisation switched on, we wouldn't get the same results; hence there's no guarantee that you can do any useful tests by comparing a single PE run with a multiple PE run. Further, any discrepancies might be more likely to come from the single PE code path, since this is likely to have undergone less serious scientific validation, if any, than the multi PE code path.
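One generic reason why different code paths need not give identical results is that floating-point addition is not associative, so the same sum evaluated in a different order can differ in the last bits. The sketch below shows this in isolation; it is an illustration, not a diagnosis of anything observed in these runs.

```python
import math

# The same three numbers summed in two different orders.
left_to_right = (0.1 + 0.2) + 0.3
right_to_left = 0.1 + (0.2 + 0.3)

# Bitwise comparison fails even though the values agree to rounding error.
print(left_to_right == right_to_left)  # False
print(math.isclose(left_to_right, right_to_left, rel_tol=1e-12))  # True
```

This is why any comparison between single PE and multiple PE runs would have to use a tolerance rather than expect bit-identical output, and even then a difference would not show which path is at fault.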

So not exactly an unqualified success but better than we might have hoped and potentially giving us something to work with should we need it.