Opened 5 years ago

Closed 5 years ago

#91 closed defect (fixed)

Default buffer size seems to be too small

Reported by: aclsce
Owned by: rlacroix
Priority: major
Component: XIOS
Version: 2.0
Keywords: buffer size
Cc:

Description

I encounter the following problem: my run freezes when I define bounds lon and bounds lat via the xios_set_domain_attr function (there is no problem without the bounds definition).
The freeze disappears when I add the following to my iodef.xml:

<variable id="optimal_buffer_size" type="string">performance</variable>
<variable id="buffer_size_factor" type="double">2.0</variable>
<variable id="min_buffer_size" type="int">10000000</variable>

It looks like a bad estimation of the buffer size.
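
For readers unsure where these three lines belong, here is a minimal sketch of the surrounding iodef.xml structure. It assumes the usual XIOS layout, in which runtime parameters live in a variable group named "parameters" inside the "xios" context. The context and group ids are an assumption based on typical XIOS configuration files, not something stated in this ticket, so check them against your own iodef.xml.

```xml
<!-- Assumed placement: inside the root <simulation> element of iodef.xml -->
<context id="xios">
  <variable_definition>
    <!-- The group id "parameters" is assumed from typical XIOS setups -->
    <variable_group id="parameters">
      <!-- Workaround values reported in this ticket -->
      <variable id="optimal_buffer_size" type="string">performance</variable>
      <variable id="buffer_size_factor" type="double">2.0</variable>
      <variable id="min_buffer_size" type="int">10000000</variable>
    </variable_group>
  </variable_definition>
</context>
```

If your iodef.xml already defines an "xios" context, only the three variable lines need to be added to its existing parameter group.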

Change History (9)

comment:1 Changed 5 years ago by rlacroix

  • Owner changed from developer to rlacroix
  • Status changed from new to assigned

Hello Arnaud,

Could you share a test case I can use to reproduce the problem?

Rémi

comment:2 Changed 5 years ago by aclsce

Hi Remi,

Here is a test case with the NEMO ORCA1_LIM3_PISCES configuration on 227 CPUs (221 NEMO + 6 XIOS servers) on Ada: /workgpfs/rech/gzi/rgzi016/TEST_CHRISTIAN_XIOS2_VALID_for_IDRIS

1) This does not work:
llsubmit job_nemo
The run freezes.

2) This works after adding the following to iodef.xml:

<variable id="optimal_buffer_size" type="string">performance</variable>
<variable id="buffer_size_factor" type="double">2.0</variable>
<variable id="min_buffer_size" type="int">10000000</variable>

Arnaud

comment:3 Changed 5 years ago by rlacroix

I'm not sure the problem is really caused by a bad estimation of the buffer size. I'm still investigating but it could be something more serious like a synchronization issue.

comment:4 follow-up: Changed 5 years ago by rlacroix

It took me a while but I think I understand why there is a deadlock. The buffer size is correct but it does have an impact on the problem. As far as I can tell the problem really is in the communication protocol but it shouldn't be hard to fix.

comment:5 in reply to: ↑ 4 Changed 5 years ago by rlacroix

Replying to rlacroix:

As far as I can tell the problem really is in the communication protocol but it shouldn't be hard to fix.

In fact the issue might be more complex than expected. The problem I found was indeed quite easy to solve but another problem arose and this one is tricky. It will probably have to be discussed during the next team meeting.

comment:6 Changed 5 years ago by rlacroix

I have pushed two commits that should help fix this issue: r884 and r885.

The root cause is not really fixed but for now those changes should make deadlocks quite unlikely.

comment:7 Changed 5 years ago by rlacroix

I have a commit ready to fix the underlying issue with the communication protocol (cf. https://github.com/RemiLacroix-IDRIS/XIOS/compare/master) but I would like to test it to be sure the performance impact is acceptable. Unfortunately, the bug reported in ticket #98 prevents me from running big test cases for now, so I'm delaying this commit.

comment:8 Changed 5 years ago by rlacroix

It seems that the NEMO test case requires the fixed communication protocol; r884 and r885 are not sufficient to work around the problem.

comment:9 Changed 5 years ago by rlacroix

  • Resolution set to fixed
  • Status changed from assigned to closed

The problem is fixed by r917. The performance impact should be limited but it might be noticeable.
