-
Notifications
You must be signed in to change notification settings - Fork 304
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
HPCC-32789 Fix potential race deadlock in Thor splitter #19235
HPCC-32789 Fix potential race deadlock in Thor splitter #19235
Conversation
Jira Issue: https://hpccsystems.atlassian.net//browse/HPCC-32789 Jirabot Action Result: |
d77e68b
to
cbe2b51
Compare
cbe2b51
to
0b2b661
Compare
thorlcr/thorutil/thbuf.cpp
Outdated
@@ -1464,6 +1469,15 @@ class CSharedWriteAheadBase : public CSimpleInterface, implements ISharedSmartBu | |||
} | |||
rowsRead++; | |||
const void *retrow = rowSet->getRow(row++); | |||
if (lastWasNull) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@ghalliday - views on leaving this sanity checking in?
thorlcr/thorutil/thbuf.cpp
Outdated
@@ -1395,6 +1399,7 @@ class CSharedWriteAheadBase : public CSimpleInterface, implements ISharedSmartBu | |||
Owned<CRowSet> outputOwnedRows; | |||
CRowSet *rowSet; | |||
unsigned row, rowsInRowSet; | |||
bool lastWasNull=false; // for sanity check only (there should never be two consequetive nulls) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This should be set back to false in reset() - otherwise a splitter in a child query that has no records will probably trigger the failure. Alternatively we could remove this code as discussed.
If the splitter reading arms (COutput) were reading from the same page (CRowSet chunk) as the write ahead was writing to, then the write ahead could expand that row set and cause the reader to read unexpected row data (e.g. null rows). This caused the splitter arm to premuturely finish, leaving the splitter unbalanced and stalling as the writeahead blocked soon afterwards since the finished arm was too far behind. The bug was that the row set should never expand. It is pre-sized to avoid that. However, the condition of 'fullness' was incorrect, relying only on a dynamic calculation of total row size. The fix is to also check that the number of rows does not exceed the capacity. NB: The bug predates the new splitter code, however the new splitter implementation also changed the way the splitter arms interacted with writeahead. Previously the arm would call writeahead once it hit max explicitly, rather than blocking in the underlying ISharedSmartBuffer implementation. But it would still be possible I think to hit this bug (albeit less likely). Signed-off-by: Jake Smith <jake.smith@lexisnexisrisk.com>
0b2b661
to
35a34fd
Compare
Jirabot Action Result: |
If the splitter reading arms (COutput) were reading from the same page (CRowSet chunk) as the write ahead was writing to, then the write ahead could expand that row set and cause the reader to read unexpected row data (e.g. null rows).
This caused the splitter arm to premuturely finish, leaving the splitter unbalanced and stalling as the writeahead blocked soon afterwards since the finished arm was too far behind.
The bug was that the row set should never expand. It is pre-sized to avoid that. However, the condition of 'fullness' was incorrect, relying only on a dynamic calculation of total row size. The fix is to also check that the number of rows does not exceed the capacity.
NB: The bug predates the new splitter code, however the new splitter implementation also changed the way the splitter arms interacted with writeahead. Previously the arm would call writeahead once it hit max explicitly, rather than blocking in the underlying ISharedSmartBuffer implementation.
But it would still be possible I think to hit this bug (albeit less likely).
Type of change:
Checklist:
Smoketest:
Testing: