Opened 5 years ago

Closed 18 months ago

#1600 closed Defect (fixed)

memory leaks in NEMO3.6 stable

Reported by: clem Owned by: mocavero
Priority: normal Milestone:
Component: OCE Version: release-3.6
Severity: minor Keywords:
Cc: mocavero

Description

I think NEMO has some serious memory leaks which can result in simulation crashes if the time integration is long enough. This is certainly what I have been experiencing for quite a while now in a very high resolution simulation (see ticket #1561).

I tested the memory use on ORCA2-LIM3 (on the IBM machine ADA from Idris, with Intel processors and the ifort compiler). I ran a couple of simulations (both single-processor and multi-processor). Sebastien Masson also tested GYRE on his Mac OS X machine and found leaks too. Here are my main findings:

Simulation 63 processors (nn_fsbc=1 and 5):
If the coupling with the surface boundary is made every time step (nn_fsbc=1), the max virtual memory (VM max) of one chosen processor increases steadily (by 100 Mb after about 15,000 time steps), while the max physical memory (RSS max) increases by 150 Mb.
If the coupling with the atmosphere is made every 5 time steps (nn_fsbc=5), VM max and RSS max also increase, but at a slower pace (by 15 Mb and 50 Mb, respectively).
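
For readers who want to reproduce this kind of measurement, here is a minimal sketch of one way to sample the two peaks. It assumes a Linux system, where /proc/<pid>/status reports VmPeak (peak virtual memory) and VmHWM (peak resident set size) in kB. The ticket does not say how the figures above were obtained, so the function names and the sampling method are illustrative assumptions only.

```python
import re


def parse_peak_memory(status_text):
    """Extract (VmPeak, VmHWM) in kB from the text of a Linux /proc/<pid>/status file.

    A field that is missing from the text is returned as None.
    """
    fields = {}
    for line in status_text.splitlines():
        match = re.match(r"^(VmPeak|VmHWM):\s+(\d+)\s+kB", line)
        if match:
            fields[match.group(1)] = int(match.group(2))
    return fields.get("VmPeak"), fields.get("VmHWM")


def read_own_peak_memory(path="/proc/self/status"):
    """Read the peaks of the current process (Linux only; the path is an assumption)."""
    with open(path) as handle:
        return parse_peak_memory(handle.read())
```

Logging these two values every few hundred time steps on one chosen rank gives a curve that can be compared between the nn_fsbc=1 and nn_fsbc=5 runs.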

By looking at the memory use of several routines called in step.F90, it looks like the leak does not come from the surface module SBC (unexpectedly) but from the "pure" OPA routines. It seems to involve many routines (LDF, TRA, ZDF, etc.), so I have not yet been able to clearly identify which ones are problematic.
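
The per-routine bookkeeping described above can be sketched as follows. This is not the instrumentation actually used in step.F90; it is a hypothetical Python illustration of the idea: read a memory probe before and after each routine call and accumulate the difference under the routine's name. The class name, the `probe` callable, and the routine labels are all assumptions made for the example.

```python
from collections import defaultdict


class RoutineMemoryLedger:
    """Accumulate memory growth per routine, given a probe returning current usage in kB."""

    def __init__(self, probe):
        self.probe = probe                  # hypothetical callable: () -> memory in kB
        self.growth_kb = defaultdict(int)   # routine name -> accumulated growth in kB

    def call(self, name, routine, *args, **kwargs):
        """Run `routine` and charge any change in probed memory to `name`."""
        before = self.probe()
        result = routine(*args, **kwargs)
        self.growth_kb[name] += self.probe() - before
        return result

    def ranked(self):
        """Return (name, growth) pairs, largest growth first."""
        return sorted(self.growth_kb.items(), key=lambda item: item[1], reverse=True)
```

Wrapping each call made during a time step in `ledger.call(...)` and printing `ledger.ranked()` after many steps shows which routines the growth is attributed to. Note that growth measured this way only says where memory was requested, not where it should have been released.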

Simulation 1 processor (nn_fsbc=1):
I do not see the same behavior in the single-processor run. VM max and RSS max also increase, but much more slowly and at a regular pace (~0.1 Mb every ~1,000 time steps). Moreover, the leak is always in zdf_tke (I have not found out why).
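
A regular pace like this can be quantified with a straight-line fit of memory against time step. The sketch below is an ordinary least-squares slope in plain Python; the function name and the sample values in the usage note are illustrative assumptions, not data from these runs.

```python
def leak_rate(steps, memory_mb):
    """Least-squares slope of memory (Mb) versus time step, i.e. Mb gained per step."""
    n = len(steps)
    if n < 2 or n != len(memory_mb):
        raise ValueError("need two equally long series with at least two samples")
    mean_step = sum(steps) / n
    mean_mem = sum(memory_mb) / n
    # Sum of squared deviations of the time steps (denominator of the slope).
    sxx = sum((x - mean_step) ** 2 for x in steps)
    if sxx == 0:
        raise ValueError("all samples were taken at the same time step")
    # Sum of cross deviations (numerator of the slope).
    sxy = sum((x - mean_step) * (y - mean_mem) for x, y in zip(steps, memory_mb))
    return sxy / sxx
```

For example, samples taken every 1,000 steps that each grow by 0.1 Mb give a slope of about 1e-4 Mb per step; multiplying the slope by the planned run length gives a rough estimate of the total growth to expect.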

I am clearly out of my comfort zone here. We (Seb and I) have checked the allocations of temporary arrays. I also checked for regular bugs with a bunch of ifort debugging options and with valgrind, but no bugs showed up.


Commit History (0)

(No commits)

Attachments (1)

1600-memory leaks in NEMO3.6 stable.pdf (1.7 MB) - added by mocavero 3 years ago.
Analysis with the Allinea tool


Change History (8)

comment:1 Changed 5 years ago by acc

Given the number of routines implicated and the apparent absence of a major problem in the single-processor case, the message-passing components are probable sources. The first task would be to distinguish between NEMO's internal messages and XIOS's. Assuming these tests were done with key_iomput, can one be repeated without key_iomput? Alternatively, keeping key_iomput but comparing detached vs attached modes might be informative.

comment:2 Changed 5 years ago by clem

I get the same behavior without key_iomput (at least for ORCA2_LIM3).

comment:3 Changed 5 years ago by clevy

  • Owner changed from NEMO team to HPC working group...

comment:4 Changed 4 years ago by clevy

After discussion within the HPC working group, Martin Schreiber suggests using the memory leak detection tools of Allinea: http://www.allinea.com/memory-debugging-advanced-memory-debugger-and-memory-leak-detection-c-c-and-f90-applications
Those are indeed available at the French computing centres (TGCC and IDRIS) and should be helpful for investigating the problem (including pausing the simulation, etc.)

Changed 3 years ago by mocavero

Analysis with the Allinea tool

comment:5 Changed 3 years ago by mocavero

The analysis of the memory leaks has been performed on the ATHENA system (at CMCC) with the Allinea tool. The test of an old revision (6287) of NEMO3.6 stable showed the presence of memory leaks when XIOS1 was used in attached mode. After discussion within the HPC-WG, it was decided to extend the analysis to NEMO-XIOS2. NEMO3.6 stable rev. 7654 has been analyzed with both XIOS1 and XIOS2. No memory leaks have been found, either with or without XIOS (both 1 and 2). Results of the analysis are reported in the attached file. A comparison between the two revisions is needed to understand the main changes. Similar tests on other HPC systems could be useful before closing the ticket.

comment:6 Changed 3 years ago by clevy

  • Cc mocavero added
  • Owner changed from HPC working group... to mocavero
  • Status changed from new to assigned
  • Type changed from Bug to Defect

comment:7 Changed 18 months ago by mocavero

  • Resolution set to fixed
  • Severity set to minor
  • Status changed from assigned to closed