-
Notifications
You must be signed in to change notification settings - Fork 1.3k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Enhance SegmentStatusChecker to honor no-queryable servers and instance assignment config #14536
Enhance SegmentStatusChecker to honor no-queryable servers and instance assignment config #14536
Conversation
664d301
to
5697b33
Compare
Codecov ReportAttention: Patch coverage is
Additional details and impacted files@@ Coverage Diff @@
## master #14536 +/- ##
============================================
+ Coverage 61.75% 63.86% +2.11%
- Complexity 207 1610 +1403
============================================
Files 2436 2705 +269
Lines 133233 151009 +17776
Branches 20636 23324 +2688
============================================
+ Hits 82274 96439 +14165
- Misses 44911 47354 +2443
- Partials 6048 7216 +1168
Flags with carried forward coverage won't be shown. Click here to find out more. ☔ View full report in Codecov by Sentry. |
if (numEVConsumingReplicas == 0) { // it's a immutable segment | ||
minEVImmutableReplicas = Math.min(minEVImmutableReplicas, numEVOnlineReplicas); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is problematic because there might be a scenario in which all replicas for segment with IS in CONSUMING state are OFFLINE or ERROR. For this scenario, numEVConsumingReplicas is indeed zero and we should report it as zero.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
How about changing L365 as
if (numISConsumingReplicas == 0 && numEVConsumingReplicas == 0) { // it's a immutable segment
This should also cover the scenario that a mutable segment just transit to a immutable segment:
numISConsumingReplicas == 0 && numEVConsumingReplicas == 0
: it's must be a immutable segmentnumISConsumingReplicas != 0 || numEVConsumingReplicas != 0
: it's possible that this segment just transit to a immutable sgement, but we treat it as a mutable segment
5697b33
to
1f10cbf
Compare
if (minEVReplicas < maxISReplicas) { | ||
LOGGER.warn("Table {} has at least one segment running with only {} replicas, below replication threshold :{}", | ||
tableNameWithType, minEVReplicas, maxISReplicas); | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@sajjad-moradi This log need to be deleted, as
- This log will always (and falsely) being printed out when online and consuming segments have different replica group.
- L388-L392 already log the partial online segments. This log is redundant
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can we add some new logging instead? Meanwhile, it would be great if we have a list of segments having min replicas and print them out as well? I feel sometimes the debuggability of alerts on this metric is not so great. nvm [L388-L392] should be good
4c5224a
to
bf098d3
Compare
Discussed internally that we should have an instance config cache and fetch on demand. Meanwhile, do we want to have 2 versions of the metric that maybe config-controlled or both available? In case anyone relies on the old. |
86898be
to
a39185a
Compare
a39185a
to
7ef229c
Compare
2a90e0e
to
f13578b
Compare
f13578b
to
57e8bdc
Compare
_serverStarters.add(startOneServer(i)); | ||
BaseServerStarter serverStarter = startOneServer(i); | ||
_serverStarters.add(serverStarter); | ||
_helixAdmin.enableInstance(getHelixClusterName(), serverStarter.getInstanceId(), true); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ControllerPeriodicTasksIntegrationTest.testSegmentStatusChecker()
will fail without this.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What has been changed that causes this test to fail without adding this line? Are we explicitly checking for HELIX_ENABLED
set to true?
What concerns me is that we shouldn't need to explicitly enable an instance, and we are not doing so when adding a new instance to the cluster.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
SegmentStatusChecker.java L440 check if a server is queryable by checking if HELIX_ENABLED
set to true
private boolean isServerQueryable(ServerQueryInfo serverInfo) {
return serverInfo != null
&& serverInfo.isHelixEnabled()
&& !serverInfo.isQueriesDisabled()
&& !serverInfo.isShutdownInProgress();
}
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't think explicitly enabling is required. In InstanceConfig
:
public boolean getInstanceEnabled() {
return _record.getBooleanField(InstanceConfigProperty.HELIX_ENABLED.name(),
HELIX_ENABLED_DEFAULT_VALUE);
}
HELIX_ENABLED_DEFAULT_VALUE
is true
here.
Can you double check?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Oh, I previously use
boolean helixEnabled = record.getBooleanField(
InstanceConfig.InstanceConfigProperty.HELIX_ENABLED.name(), false);
where default value is false
that's why it was needed
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Updated.
all other testing succeeded. Only
full log
|
* This is a helper class that fetch server information from Helix/ZK. It caches the server information to avoid | ||
* repeated ZK access. This class is NOT thread-safe. | ||
*/ | ||
public class ServerInfoFetcher { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The name is too generic. I suggest to rename it to something like ServerQueryStateFetcher
.
Also this is only used in SegmentStatusChecker class. Let's make it private to that class (move it to that class), and later if there's any use elsewhere, it can be made public.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It not only contains state (helix_enabled, query_disabled, shutdown_in_progress), but also contains some other information (tables on that server, instance tags, and potentially @jasperjiaguo will add more). So I rename it as ServerQueryInfoFetcher
.
In addition, @jasperjiaguo will use this ServerQueryInfoFetcher
very soon in his other PRs, and it's under util
package, it's OK to be public. There are some other util class such as SegmentIntervalUtils
, AutoAddInvertedIndex
, they are used by only one class, but also be public.
@@ -80,6 +82,8 @@ public class SegmentStatusChecker extends ControllerPeriodicTask<SegmentStatusCh | |||
|
|||
private long _lastDisabledTableLogTimestamp = 0; | |||
|
|||
private ServerInfoFetcher _serverInfoFetcher; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Let's not have this as a class member. The reason is the cache you have in the class is not cleared. If you create an ServerInfoFetcher
object in each run of the segment status checker, there's no need to worry about the cache eviction.
d4d5962
to
24660fb
Compare
@sajjad-moradi @jasperjiaguo comments addresses. PTAL. |
24660fb
to
bbba882
Compare
numEVReplicas++; | ||
String serverInstanceId = entry.getKey(); | ||
String segmentState = entry.getValue(); | ||
if (isServerQueryable(serverQueryInfoFetcher.getServerQueryInfo(serverInstanceId)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
(totally optional) maybe encapsulate this in the serverQueryInfoFetcher as well
String serverInstanceId = entry.getKey(); | ||
String segmentState = entry.getValue(); | ||
if (isServerQueryable(serverQueryInfoFetcher.getServerQueryInfo(serverInstanceId)) | ||
&& (segmentState.equals(SegmentStateModel.ONLINE) || segmentState.equals(SegmentStateModel.CONSUMING))) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
(totally optional) We can probably make this a set called acceptableStates and do acceptableStates.contains(segmentState)
@Jackie-Jiang could you please review this? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is this PR still WIP? I don't see the change related to checking instance assignment config
private PinotHelixResourceManager _pinotHelixResourceManager; | ||
private Map<String, ServerQueryInfo> _cache; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
(minor) They can be final
return new ServerQueryInfo(instanceId, tags, null, helixEnabled, queriesDisabled, shutdownInProgress); | ||
} | ||
|
||
public static class ServerQueryInfo { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We shouldn't need setters for this class, and we can make all fields final
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Seems most of the fields are not used. Are you planning to use them?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Seems most of the fields are not used. Are you planning to use them?
Yes, @jasperjiaguo will need those fields very soon
private boolean _helixEnabled; | ||
private boolean _queriesDisabled; | ||
private boolean _shutdownInProgress; | ||
private ServerQueryInfo(String instanceName, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
(format) Add an empty line, and reformat the constructor per Pinot Style
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
done
boolean helixEnabled = record.getBooleanField( | ||
InstanceConfig.InstanceConfigProperty.HELIX_ENABLED.name(), false); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Use instanceConfig.getInstanceEnabled()
.
We can simplify the ServerQueryInfo
to combine all 3 booleans, and call it queryEnabled
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Use instanceConfig.getInstanceEnabled().
done
We can simplify the ServerQueryInfo to combine all 3 booleans, and call it queryEnabled
Although in this PR it's OK to combine all the 3 boolean into a single one, it's better to keep them separate. We may will have a PR to emit a metric about "how many or how many percentage of servers for a table that were marked as QUERIES_DISABLED
". We has some situation where some server been configured as QUERIES_DISABLED
during debugging/testing but we forget to recover the config and it cause problem some time later, so we may need emit a metric for that specifically.
_cache = new HashMap<>(); | ||
} | ||
|
||
public ServerQueryInfo getServerQueryInfo(String instanceId) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Annotate the return as @Nullable
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
done
return _cache.computeIfAbsent(instanceId, this::getServerQueryInfoOndemand); | ||
} | ||
|
||
private ServerQueryInfo getServerQueryInfoOndemand(String instanceId) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
private ServerQueryInfo getServerQueryInfoOndemand(String instanceId) { | |
@Nullable | |
private ServerQueryInfo fetchServerQueryInfo(String instanceId) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
done
numEVReplicas++; | ||
String serverInstanceId = entry.getKey(); | ||
String segmentState = entry.getValue(); | ||
if (isServerQueryable(serverQueryInfoFetcher.getServerQueryInfo(serverInstanceId)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Check segment state first because the overhead for that is much smaller
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
done
@Jackie-Jiang we don't need to check instance assignment config specifically, if it has different number of replicas for OFFLINE & CONSUMING segments, we just need to avoid to use |
ref: #14538
This PR fix enhance SegmentStatusChecker in two scenarios:
If user has instance assignment config, a user may make offline segments and consuming segments have different replica groups (e.g. use more replicas for consuming segments, but less replicas for online segments).
In this situation, when calculate
PERCENT_OF_REPLICAS
, it's problematic to use globalmaxISReplicas
as denominator and globalminEVReplicas
as nominator. For example, if user config replica groups for immutable segments as 3, and 5 for mutable segments,PERCENT_OF_REPLICAS
will always be 60% even if all replicas are up, which will cause some false negative alerts.This PR change the logic to calculate EVReplicasUpPercent for each segment then emit the minimal percentage.
When a sever is configured as
queriesDisabled
orshutdownInProgress
, broker does not send queries to the server. We need technically treat those replicas that on the no-queryable server as "OFFLINE" even if it shows "ONLINE" in helix