-
Notifications
You must be signed in to change notification settings - Fork 425
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
TEZ-4580: Slow preemption of new containers when re-use is enabled #374
Conversation
🎊 +1 overall
This message was automatically generated. |
if (delayedContainers.remove(container)) { | ||
LOG.info("Removed {} from delayed containers", container.getContainer().getId()); | ||
} else { | ||
LOG.warn("Unknown container {} sent for removal. Ignoring.", container.getContainer().getId()); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
A valid case here is when a new container is allocated - added to delayedContainers, it is polled (removed) from queue but if there is no pending task request releaseContainer()
is invoked. Please suggest if the log level should be changed in info
or for both the newly added logs should be at debug
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: I think we should change the log level to debug and print the only if LOG.isDebugEnabled() is true,
Rest of the code LGTM .
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
not sure about the log levels, let's keep the simple remove path on debug, as looks like the happy preemption path:
LOG.debug("Removed {} from delayed containers", container.getContainer().getId());
isDebugEnabled is not necessary as long as we use the {} formatting (to prevent unnecessary string creation)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Agreed.
Thank you @abstractdog for the review ! :)
…nmerged) apache#374 (cherry picked from commit aea86ca)
…enabled (unmerged) apache#374 (#24) (cherry picked from commit aea86ca)
…enabled (unmerged) apache#374 (#24) (cherry picked from commit aea86ca)
@abstractdog can you please take a look when you are free. Thanks in advance! :) |
@@ -2163,6 +2166,17 @@ void addDelayedContainer(Container container, | |||
} | |||
} | |||
|
|||
void removeDelayedContainer(HeldContainer container) { | |||
if (container != null) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this null check is not needed, as it's called from a block
if (delayedContainer != null) {
thanks @himanshu-mishra for taking care of this according to the description of TEZ-1742:
after checking and trying out the unit test and the fix here, it seems like this patch what's exactly fixes the abovementioned problem apart from the assertion error fix, the unit test indeed proves that releaseAssignedContainer is called as many times as needed according to the percentage config in a single round, so this looks good to me only left a minor comment regarding a null check |
Thanks @simhadri-g , @abstractdog for review. I have incorporated feedback as following:
Also adding a note that as the the changes fixes slow preemption, it leads to performance gain mentioned in TEZ-1742 when container re-use is enabled, which is enabled by default in Tez. |
🎊 +1 overall
This message was automatically generated. |
🎊 +1 overall
This message was automatically generated. |
@himanshu-mishra : minor checkstyle left, can you please address: |
🎊 +1 overall
This message was automatically generated. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM +1
When container reuse is enabled, preemption of lower priority containers that are not yet assigned to task, takes long time as they are released one at a time, and not the number of containers based when tez.am.preemption.percentage is high added in https://issues.apache.org/jira/browse/TEZ-1742.
Further investigation lead to following conclusion:
Warn log / Assertion error thrown because in preemptIfNeeded(), when releasing new containers, the loop counter is being decremented with each
releaseUnassignedContainers
, leading to looping only half number of times. By using another counter, assertion passes because of condition method returns with checkif (numPendingRequestsToService < 1) {
.In releaseContainer(), the container is not getting removed from
delayedContainers
queue and only fromheldContainers
map, hence same container is being picked up for release in every iteration till next cycle ofDelayedContainerManager
finds out that the container is not inheldContainers
and skips it with logSkipping delayed container as container is no longer running, containerId=...
This change adds a method in
DelayedContainerManager
to allow removal of delayed container and invokes it in releaseContainer method, which so far only removed it fromheldContainers
map.