Opened 7 years ago
Closed 7 years ago
#91 closed defect (fixed)
Default buffer size seems to be too small
Reported by: | aclsce | Owned by: | rlacroix |
---|---|---|---|
Priority: | major | Component: | XIOS |
Version: | 2.0 | Keywords: | buffer size |
Cc: |
Description
I encountered the following problem: my run freezes when I define the longitude and latitude bounds (bounds lon and bounds lat) via the xios_set_domain_attr function. There is no problem when the bounds are not defined.
However, the run completes without problems when I add the following to my iodef.xml:
<variable id="optimal_buffer_size" type="string">performance</variable>
<variable id="buffer_size_factor" type="double">2.0</variable>
<variable id="min_buffer_size" type="int">10000000</variable>
It looks like the buffer size is badly estimated...
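For readers unfamiliar with these three variables, here is a minimal Python sketch of how they could interact. It assumes (this ticket does not confirm it) that "performance" sizes each server's buffer to hold everything sent to that server in one step, that "memory" sizes it for the largest single event only, that the estimate is then multiplied by buffer_size_factor, and that min_buffer_size acts as a floor. The function name and the per-event byte counts are hypothetical; this is not the XIOS implementation.

```python
from typing import Dict, List


def estimate_buffer_sizes(events_per_server: Dict[int, List[int]],
                          optimal_buffer_size: str = "performance",
                          buffer_size_factor: float = 1.0,
                          min_buffer_size: int = 0) -> Dict[int, int]:
    """Hypothetical model of per-server client buffer sizing (not XIOS code).

    events_per_server maps a server rank to the sizes, in bytes, of the
    events a client sends to that server during one step.
    """
    sizes = {}
    for server, events in events_per_server.items():
        if optimal_buffer_size == "performance":
            # Assumed: room for everything sent to this server in one step.
            needed = sum(events)
        elif optimal_buffer_size == "memory":
            # Assumed: room for the single largest event only.
            needed = max(events, default=0)
        else:
            raise ValueError(f"unknown mode: {optimal_buffer_size!r}")
        # Assumed: apply the safety factor, then enforce the minimum size.
        sizes[server] = max(int(needed * buffer_size_factor), min_buffer_size)
    return sizes
```

For example, `estimate_buffer_sizes({0: [400, 1200, 800]})` returns `{0: 2400}`, and the "memory" mode returns `{0: 1200}`. Passing `buffer_size_factor=2.0, min_buffer_size=10_000_000` as in the workaround above returns `{0: 10000000}`: under this model the floor, not the estimate, decides the buffer size.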
Change History (9)
comment:1 Changed 7 years ago by rlacroix
- Owner changed from developer to rlacroix
- Status changed from new to assigned
comment:2 Changed 7 years ago by aclsce
Hi Remi,
Test case: NEMO ORCA1_LIM3_PISCES configuration on 227 CPUs (221 NEMO + 6 XIOS servers) on Ada: /workgpfs/rech/gzi/rgzi016/TEST_CHRISTIAN_XIOS2_VALID_for_IDRIS
1) This does not work:
llsubmit job_nemo
The run freezes.
2) This works after adding the following to iodef.xml:
<variable id="optimal_buffer_size" type="string">performance</variable>
<variable id="buffer_size_factor" type="double">2.0</variable>
<variable id="min_buffer_size" type="int">10000000</variable>
Arnaud
comment:3 Changed 7 years ago by rlacroix
I'm not sure the problem is really caused by a bad estimation of the buffer size. I'm still investigating, but it could be something more serious, like a synchronization issue.
comment:4 follow-up: ↓ 5 Changed 7 years ago by rlacroix
It took me a while, but I think I understand why there is a deadlock. The buffer size is correct, but it does have an impact on the problem. As far as I can tell, the problem really is in the communication protocol, but it shouldn't be hard to fix.
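How a correctly sized buffer can still influence a deadlock can be illustrated with a generic toy model. It is hypothetical and is not the XIOS communication protocol: two processes each send n messages to the other through a bounded buffer of a given capacity and only then start receiving. With blocking sends, the exchange completes only if each buffer can hold all n messages; otherwise both processes wait on a full buffer forever. The names `Proc` and `run` are illustrative only.

```python
from dataclasses import dataclass


@dataclass
class Proc:
    to_send: int  # messages this process still has to send
    to_recv: int  # messages this process still has to receive


def run(n_msgs: int, capacity: int) -> bool:
    """Return True if the toy exchange completes, False if it deadlocks."""
    procs = [Proc(n_msgs, n_msgs), Proc(n_msgs, n_msgs)]
    # queues[i] counts messages sent by process i not yet received by its peer.
    queues = [0, 0]
    while any(p.to_send or p.to_recv for p in procs):
        progress = False
        for i, p in enumerate(procs):
            peer = 1 - i
            if p.to_send:
                # Blocking send: succeeds only if the outgoing buffer has room.
                if queues[i] < capacity:
                    queues[i] += 1
                    p.to_send -= 1
                    progress = True
            elif p.to_recv and queues[peer] > 0:
                # Receiving starts only once all sends have completed.
                queues[peer] -= 1
                p.to_recv -= 1
                progress = True
        if not progress:
            return False  # neither process can move: deadlock
    return True
```

Here `run(4, 4)` returns `True` and `run(5, 4)` returns `False`. In this toy model a larger buffer masks the send-before-receive ordering flaw rather than fixing it; only changing the protocol (for example, interleaving sends and receives) removes the dependence on capacity.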
comment:5 in reply to: ↑ 4 Changed 7 years ago by rlacroix
Replying to rlacroix:
As far as I can tell the problem really is in the communication protocol but it shouldn't be hard to fix.
In fact, the issue might be more complex than expected. The problem I found was indeed quite easy to solve, but another problem arose and this one is tricky. It will probably have to be discussed during the next team meeting.
comment:6 Changed 7 years ago by rlacroix
comment:7 Changed 7 years ago by rlacroix
I have a commit ready to fix the underlying issue with the communication protocol (cf. https://github.com/RemiLacroix-IDRIS/XIOS/compare/master), but I would like to test it to make sure the performance impact is acceptable. Unfortunately, the bug reported in ticket #98 prevents me from running big test cases for now, so I'm delaying this commit.
comment:8 Changed 7 years ago by rlacroix
comment:9 Changed 7 years ago by rlacroix
- Resolution set to fixed
- Status changed from assigned to closed
The problem is fixed by r917. The performance impact should be limited but it might be noticeable.
Hello Arnaud,
Could you share a test case I can use to reproduce the problem?
Rémi