HPCC-32789 Fix potential race deadlock in Thor splitter #19235

jakesmith
Member

@jakesmith jakesmith commented Oct 25, 2024

If the splitter reading arms (COutput) were reading from the same page (CRowSet chunk) as the write ahead was writing to, then the write ahead could expand that row set and cause the reader to read unexpected row data (e.g. null rows).
This caused the splitter arm to prematurely finish, leaving the splitter unbalanced and stalling as the writeahead blocked soon afterwards since the finished arm was too far behind.

The bug was that the row set should never expand. It is pre-sized to avoid that. However, the condition of 'fullness' was incorrect, relying only on a dynamic calculation of total row size. The fix is to also check that the number of rows does not exceed the capacity.
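
To illustrate the corrected fullness condition, here is a minimal standalone sketch. It is not the actual CRowSet code: `FixedRowChunk`, `isFullBySizeOnly`, `isFull`, `tryAdd` and the limits used are hypothetical names chosen for illustration. It assumes a chunk that is pre-sized once and must report itself full when either the byte budget or the row capacity is reached, so that it never has to grow while a reader may be looking at it.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical stand-in for a pre-sized row chunk; not the real CRowSet.
class FixedRowChunk
{
public:
    FixedRowChunk(std::size_t maxRows, std::size_t maxBytes)
        : maxRows(maxRows), maxBytes(maxBytes)
    {
        rows.reserve(maxRows); // pre-size once so the storage never needs to grow
    }

    // Size-only test: the kind of condition the description says was insufficient.
    bool isFullBySizeOnly() const { return totalBytes >= maxBytes; }

    // Combined test: full when either the byte budget or the row capacity is reached.
    bool isFull() const { return totalBytes >= maxBytes || rows.size() >= maxRows; }

    // Refuses the row (returns false) instead of expanding when the chunk is full.
    bool tryAdd(std::size_t rowBytes)
    {
        if (isFull())
            return false;
        rows.push_back(rowBytes);
        totalBytes += rowBytes;
        assert(rows.size() <= maxRows); // the chunk must never exceed its pre-sized capacity
        return true;
    }

    std::size_t rowCount() const { return rows.size(); }

private:
    std::size_t maxRows;
    std::size_t maxBytes;
    std::size_t totalBytes = 0;
    std::vector<std::size_t> rows; // stores row sizes only, for illustration
};
```

With many small rows, the size-only test can still say "not full" after the row capacity has been reached, which is the case where a growable container would reallocate underneath a concurrent reader.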

NB: The bug predates the new splitter code. However, the new splitter implementation also changed the way the splitter arms interact with writeahead: previously an arm would explicitly call writeahead once it hit max, rather than blocking in the underlying ISharedSmartBuffer implementation.
But I think it would still be possible to hit this bug (albeit less likely).

Type of change:

  • This change is a bug fix (non-breaking change which fixes an issue).
  • This change is a new feature (non-breaking change which adds functionality).
  • This change improves the code (refactor or other change that does not change the functionality)
  • This change fixes warnings (the fix does not alter the functionality or the generated code)
  • This change is a breaking change (fix or feature that will cause existing behavior to change).
  • This change alters the query API (existing queries will have to be recompiled)

Checklist:

  • My code follows the code style of this project.
    • My code does not create any new warnings from compiler, build system, or lint.
  • The commit message is properly formatted and free of typos.
    • The commit message title makes sense in a changelog, by itself.
    • The commit is signed.
  • My change requires a change to the documentation.
    • I have updated the documentation accordingly, or...
    • I have created a JIRA ticket to update the documentation.
    • Any new interfaces or exported functions are appropriately commented.
  • I have read the CONTRIBUTORS document.
  • The change has been fully tested:
    • I have added tests to cover my changes.
    • All new and existing tests passed.
    • I have checked that this change does not introduce memory leaks.
    • I have used Valgrind or similar tools to check for potential issues.
  • I have given due consideration to all of the following potential concerns:
    • Scalability
    • Performance
    • Security
    • Thread-safety
    • Cloud-compatibility
    • Premature optimization
    • Existing deployed queries will not be broken
    • This change fixes the problem, not just the symptom
    • The target branch of this pull request is appropriate for such a change.
  • There are no similar instances of the same problem that should be addressed
    • I have addressed them here
    • I have raised JIRA issues to address them separately
  • This is a user interface / front-end modification
    • I have tested my changes in multiple modern browsers
    • The component(s) render as expected

Smoketest:

  • Send notifications about my Pull Request position in Smoketest queue.
  • Test my draft Pull Request.

Testing:

Jira Issue: https://hpccsystems.atlassian.net//browse/HPCC-32789

Jirabot Action Result:
Workflow Transition To: Merge Pending
Updated PR

@jakesmith jakesmith force-pushed the HPCC-32789-newsplitter-deadlock branch from d77e68b to cbe2b51 Compare October 25, 2024 08:35
@jakesmith jakesmith changed the title HPCC-32789 Fix potential race deadlock in new Thor splitter HPCC-32789 Fix potential race deadlock in Thor splitter Oct 25, 2024
@jakesmith jakesmith requested a review from ghalliday October 25, 2024 08:35
@jakesmith jakesmith force-pushed the HPCC-32789-newsplitter-deadlock branch from cbe2b51 to 0b2b661 Compare October 25, 2024 09:08
@@ -1464,6 +1469,15 @@ class CSharedWriteAheadBase : public CSimpleInterface, implements ISharedSmartBu
}
rowsRead++;
const void *retrow = rowSet->getRow(row++);
if (lastWasNull)
Member Author

@ghalliday - views on leaving this sanity checking in?

@@ -1395,6 +1399,7 @@ class CSharedWriteAheadBase : public CSimpleInterface, implements ISharedSmartBu
Owned<CRowSet> outputOwnedRows;
CRowSet *rowSet;
unsigned row, rowsInRowSet;
bool lastWasNull=false; // for sanity check only (there should never be two consecutive nulls)
Member

This should be set back to false in reset() - otherwise a splitter in a child query that has no records will probably trigger the failure. Alternatively we could remove this code as discussed.
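
A rough sketch of the point about reset(), assuming the sanity check is a simple flag that trips on a second consecutive null. `ConsecutiveNullCheck` and `note` are hypothetical names for illustration and are not part of CSharedWriteAheadBase.

```cpp
#include <stdexcept>

// Hypothetical helper; not the actual Thor splitter code.
class ConsecutiveNullCheck
{
public:
    // Call for each row handed to a reader; throws on a second consecutive null.
    void note(const void *row)
    {
        if (row == nullptr)
        {
            if (lastWasNull)
                throw std::logic_error("two consecutive null rows");
            lastWasNull = true;
        }
        else
            lastWasNull = false;
    }

    // Clear the flag between executions, so a null that ended the previous
    // run is not counted against the first null of the next run.
    void reset() { lastWasNull = false; }

private:
    bool lastWasNull = false;
};
```

Without the reset() call, two back-to-back runs that each return only a terminating null would look like two consecutive nulls, which is the child query case described above.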

If the splitter reading arms (COutput) were reading from the same
page (CRowSet chunk) as the write ahead was writing to, then the
write ahead could expand that row set and cause the reader to
read unexpected row data (e.g. null rows).
This caused the splitter arm to prematurely finish, leaving the
splitter unbalanced and stalling as the writeahead blocked soon
afterwards since the finished arm was too far behind.

The bug was that the row set should never expand. It is pre-sized
to avoid that. However, the condition of 'fullness' was incorrect,
relying only on a dynamic calculation of total row size. The fix
is to also check that the number of rows does not exceed the
capacity.

NB: The bug predates the new splitter code, however the new
splitter implementation also changed the way the splitter arms
interacted with writeahead. Previously the arm would call
writeahead once it hit max explicitly, rather than blocking in
the underlying ISharedSmartBuffer implementation.
But it would still be possible I think to hit this bug (albeit
less likely).

Signed-off-by: Jake Smith <jake.smith@lexisnexisrisk.com>
@jakesmith jakesmith force-pushed the HPCC-32789-newsplitter-deadlock branch from 0b2b661 to 35a34fd Compare October 25, 2024 10:50
@jakesmith jakesmith requested a review from ghalliday October 25, 2024 11:13
@ghalliday ghalliday merged commit dd68be7 into hpcc-systems:candidate-9.6.x Oct 25, 2024
53 checks passed
Jirabot Action Result:
Added fix version: 9.6.60
Added fix version: 9.8.34
Workflow Transition: 'Resolve issue'
