#1522 closed Bug (fixed)
Bad MPP decomposition with key_nemocice_decomp
Reported by: | joakim | Owned by: | nemo |
---|---|---|---|
Priority: | normal | Milestone: | |
Component: | OCE | Version: | v3.6 |
Severity: | | Keywords: | CICE MPI MPP OPA decomposition v3.6 |
Cc: | | | |
Description
With key_nemocice_decomp, NEMO changes how it calculates the size of the subdomains in nemogcm.F90 and mppini.F90.
In some cases this gives some subdomains a negative jpi and/or jpj, which raises MPI errors and crashes NEMO.
For example, with 1680 cores NEMO automatically sets jpi=16, jpj=72, jpnij=1680, jpni=105 and jpnj=8. The subdomains on the eastern boundary then end up with jpi=-14, and NEMO crashes.
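To see where a negative width can come from, here is a minimal Python sketch of the east-west arithmetic. It assumes, rather than quotes from the source, that the local size follows the NEMO form jpi = (jpiglo - 2*jpreci + (jpni-1))/jpni + 2*jpreci with jpreci=1, that neighbouring subdomains overlap by two points, and that the eastern subdomain receives whatever is left over. The value jpiglo=1442 (the ORCA025 east-west size) is also an assumption: the report does not state it, but under these formulas it reproduces both jpi=16 and the eastern width of -14. The helper names are hypothetical.

```python
"""Sketch of the east-west subdomain arithmetic behind the crash (assumed formulas)."""


def local_size(n_glo: int, n_proc: int, halo: int = 1) -> int:
    # Assumed NEMO-style formula: jpi = (jpiglo - 2*jpreci + (jpni-1)) // jpni + 2*jpreci
    return (n_glo - 2 * halo + (n_proc - 1)) // n_proc + 2 * halo


def last_width(n_glo: int, n_proc: int, halo: int = 1) -> int:
    # Assumption: each of the first n_proc-1 subdomains advances by
    # (local size - 2*halo) points, and the last one gets the remainder.
    step = local_size(n_glo, n_proc, halo) - 2 * halo
    return n_glo - (n_proc - 1) * step


def check_decomposition(n_glo: int, n_proc: int, halo: int = 1) -> None:
    # Hypothetical guard: refuse any layout whose last subdomain has no interior point.
    width = last_width(n_glo, n_proc, halo)
    if width <= 2 * halo:
        raise ValueError(
            f"{n_proc} subdomains leave the last one with width {width}; "
            "use fewer processors in this direction"
        )


if __name__ == "__main__":
    print(local_size(1442, 105))  # 16
    print(last_width(1442, 105))  # -14
    try:
        check_decomposition(1442, 105)
    except ValueError as err:
        print("rejected:", err)
```

The exact expressions in nemogcm.F90 and mppini.F90 may differ in detail (for example in how the halo is counted), so treat the thresholds as illustrative. The point is that a pre-flight check of this kind would turn the MPI crash into a clear error message.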
Commit History (0)
(No commits)
Change History (12)
comment:1 Changed 9 years ago by charris
comment:2 Changed 9 years ago by joakim
Hi Chris
Yes, I've dug into the code and it seems that NEMO produces much better MPI decompositions when using LIM instead.
I'm still in the process of making NEMO run with CICE in a regional "southern ocean only" configuration, so I'm not yet sure how many processors I'll be using.
I might never even use jpnij=1680.
I'll see what I can come up with, but it does seem that the key_nemocice_decomp parts of the code need more work to prevent, or at least warn the user about, decompositions in which jpi or jpj becomes negative.
/Joakim
comment:3 Changed 8 years ago by nicolasmartin
- Keywords MPI added; mpi removed
comment:4 Changed 8 years ago by nicolasmartin
- Keywords CICE added; cice removed
comment:5 Changed 8 years ago by nicolasmartin
- Keywords nemo_v3_6* added
comment:6 Changed 8 years ago by nicolasmartin
- Keywords MPP nemo_v3_6_alpha added; mpp removed
comment:7 Changed 8 years ago by nicolasmartin
- Keywords nemo_v3_6_alpha removed
comment:8 Changed 8 years ago by nicolasmartin
- Keywords decomposition added; subdomain removed
comment:9 Changed 7 years ago by clevy
- Resolution set to fixed
- Status changed from new to closed
comment:10 Changed 6 years ago by nemo
- Keywords release-3.6* added; nemo_v3_6* removed
comment:11 Changed 6 years ago by nemo
- Keywords release-3.6* removed
comment:12 Changed 2 years ago by nemo
- Keywords OPA v3.6 added
I'm aware that you can get problems with high processor counts. This is because for NEMO you specify the number of processors E-W and N-S, whereas for CICE you specify the block size (equivalent to jpi-2 and jpj-2) directly. So in this case CICE decides that for jpi=16 (block_size_x=14) it only needs 104 processors E-W, which is actually the more sensible choice, because otherwise you are using more processors than you need (regardless of whether you are using key_nemocice_decomp). I'm not sure of the best way to persuade NEMO to do the same. Some better error trapping is required, but really the answer here is just to use jpni=104. Hopefully we could get the land-suppression option to do this automatically, but that would probably require some code changes to the mpp_init2 routine to ensure that ilci/ilcj are set correctly. There could also be assumptions in the NEMO code that the boundaries are always on the first/last processors.
More generally, are you really planning to run with this kind of decomposition for ORCA025, or are you just testing scaling etc.?
Chris
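The contrast drawn above is that CICE is given the block size and works out how many blocks it needs, so no block can come out empty or negative. Below is a minimal sketch of that block-size-first layout. It assumes plain ceiling division and uses a hypothetical interior size of 100 points that is not taken from the ticket; it is not CICE's actual code, and the function names are hypothetical.

```python
"""Sketch: derive the block count from a fixed block size (CICE-style, assumed)."""


def blocks_needed(n_interior: int, block_size: int) -> int:
    # Ceiling division: the smallest number of blocks that covers every interior point.
    return -(-n_interior // block_size)


def block_widths(n_interior: int, block_size: int) -> list[int]:
    # All blocks are full except possibly the last, which holds the remainder
    # (always between 1 and block_size points when n_interior >= 1).
    n_blocks = blocks_needed(n_interior, block_size)
    widths = [block_size] * (n_blocks - 1)
    widths.append(n_interior - block_size * (n_blocks - 1))
    return widths


if __name__ == "__main__":
    print(blocks_needed(100, 14))  # 8
    print(block_widths(100, 14))   # [14, 14, 14, 14, 14, 14, 14, 2]
```

Going the other way, as NEMO does, fixes the processor count first and derives the size from it, which is where a surplus of processors can leave the last subdomain with nothing (or less than nothing) to hold.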