Skip to content

[CI] NodeEnvironmentTests testIndexCompatibilityChecks failing #124388

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
elasticsearchmachine opened this issue Mar 7, 2025 · 9 comments · Fixed by #127646
Closed

[CI] NodeEnvironmentTests testIndexCompatibilityChecks failing #124388

elasticsearchmachine opened this issue Mar 7, 2025 · 9 comments · Fixed by #127646
Labels
:Core/Infra/Core Core issues without another label low-risk An open issue or test failure that is a low risk to future releases Team:Core/Infra Meta label for core/infra team >test-failure Triaged test failures from CI

Comments

@elasticsearchmachine
Copy link
Collaborator

Build Scans:

Reproduction Line:

./gradlew ":server:test" --tests "org.elasticsearch.env.NodeEnvironmentTests.testIndexCompatibilityChecks" -Dtests.seed=14F69652DC3A5F7C -Dtests.jvm.argline="-Des.entitlements.enabled=true" -Dtests.locale=eu -Dtests.timezone=Pacific/Johnston -Druntime.java=24

Applicable branches:
main

Reproduces locally?:
N/A

Failure History:
See dashboard

Failure Message:

java.lang.AssertionError: 
Expected: (a string containing "Cannot start this node" and a string containing "it holds metadata for indices with version [4.29.21]" and a string containing "Revert this node to version [8.19.0]")
     but: a string containing "Revert this node to version [8.19.0]" was "Cannot start this node because it holds metadata for indices with version [4.29.21] with which this node of version [9.1.0] is incompatible. Revert this node to version [9.0.51] and delete any indices with versions earlier than [8.0.0] before upgrading to version [9.1.0]. If all such indices have already been deleted, revert this node to version [9.0.51] and wait for it to join the cluster to clean up any older indices from its metadata."

Issue Reasons:

  • [main] 4 failures in test testIndexCompatibilityChecks (1.6% fail rate in 244 executions)
  • [main] 2 failures in step part-1 (1.7% fail rate in 118 executions)
  • [main] 3 failures in pipeline elasticsearch-pull-request (2.6% fail rate in 116 executions)

Note:
This issue was created using new test triage automation. Please report issues or feedback to es-delivery.

@elasticsearchmachine elasticsearchmachine added :Core/Infra/Core Core issues without another label >test-failure Triaged test failures from CI labels Mar 7, 2025
@elasticsearchmachine
Copy link
Collaborator Author

This has been muted on branch main

Mute Reasons:

  • [main] 4 failures in test testIndexCompatibilityChecks (1.6% fail rate in 244 executions)
  • [main] 2 failures in step part-1 (1.7% fail rate in 118 executions)
  • [main] 3 failures in pipeline elasticsearch-pull-request (2.6% fail rate in 116 executions)

Build Scans:

@elasticsearchmachine elasticsearchmachine added Team:Core/Infra Meta label for core/infra team needs:risk Requires assignment of a risk label (low, medium, blocker) labels Mar 7, 2025
@elasticsearchmachine
Copy link
Collaborator Author

Pinging @elastic/es-core-infra (Team:Core/Infra)

@mosche
Copy link
Contributor

mosche commented Mar 10, 2025

@thecoop Could you have a look, if I revert #124197 this passes again.

./gradlew ":server:test" --tests "org.elasticsearch.env.NodeEnvironmentTests.testIndexCompatibilityChecks" -Dtests.seed=14F69652DC3A5F7C -Dtests.jvm.argline="-Des.entitlements.enabled=true" -Dtests.locale=eu -Dtests.timezone=Pacific/Johnston -Druntime.java=24

@mosche mosche added medium-risk An open issue or test failure that is a medium risk to future releases and removed needs:risk Requires assignment of a risk label (low, medium, blocker) labels Mar 10, 2025
georgewallace pushed a commit to georgewallace/elasticsearch that referenced this issue Mar 11, 2025
@prdoyle
Copy link
Contributor

prdoyle commented May 2, 2025

@mosche - is the idea that NodeEnvironment.bestDowngradeVersion is returning the wrong version? Is it really supposed to suggest 8.19.0 rather than 9.0.51?

@mosche
Copy link
Contributor

mosche commented May 2, 2025

I'm not sure, haven't had a chance to look into this yet @prdoyle. On my list for next week when being on duty

@prdoyle
Copy link
Contributor

prdoyle commented May 2, 2025

I'm looking at it now. Here's my take so far:

The getBestDowngradeVersion does not (and currently cannot?) implement what its javadocs say it does. The javadocs:

We return (N-1).last, unless the user is trying to upgrade from a (N-1).last.x release, in which case we return the last installed version.

This requires us to identify the difference between major versions N and N-1, which is not exposed by the BuildVersion class. Therefore, we are returning the last installed version for anything on or after (N-1).last, which includes earlier N.x versions.

The main question I have is: is is possible to introduce an index version incompatibility in a minor ES version bump?

  • If it's impossible, then the difference is academic, because the failure reported here happens only when the current version is, say, N.1 and the previousNodeVersion is N.0. In this case, an index version incompatibility should be impossible so the fix would be to change the randomized test so it doesn't pick earlier versions with the same major as the current version. The javadocs are not wrong, but could be a bit clearer.
  • If it's possible, then bestDowngradeVersion's javadocs need tweaking to account for older versions with the current major. It should be returning the newer of previousNodeVersion and minimumCompatibilityVersion (which is what its code already does) and the unit tests should be changed accordingly.

In either case, I believe the code inside bestDowngradeVersion is correct, so this could be downgraded to low-risk because it's the tests that are wrong; however, the fix is relatively simple, so we probably should just get it done.

@prdoyle
Copy link
Contributor

prdoyle commented May 2, 2025

I'm proposing this for the javadocs of getBestDowngradeVersion:

     * Get a useful version string to direct a user's downgrade operation
     * <p>
     * Assuming that the index was compatible with {@code previousNodeVersion},
     * the user should downgrade to that {@code previousNodeVersion},
     * unless it's prior to the minimum compatible version,
     * in which case the user should downgrade to that instead.
     * (If the index version is so old that the minimum compatible version is incompatible with the index,
     * then the cluster was already borked before the node upgrade began,
     * and we can't probably help them without more info than we have here.)

@prdoyle
Copy link
Contributor

prdoyle commented May 2, 2025

Alright, based on the above, I'm proposing a fix in #127646.

@prdoyle prdoyle added low-risk An open issue or test failure that is a low risk to future releases and removed medium-risk An open issue or test failure that is a medium risk to future releases labels May 2, 2025
@prdoyle
Copy link
Contributor

prdoyle commented May 2, 2025

Moved to low-risk because I think the functionality is correct and the tests are wrong.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
:Core/Infra/Core Core issues without another label low-risk An open issue or test failure that is a low risk to future releases Team:Core/Infra Meta label for core/infra team >test-failure Triaged test failures from CI
Projects
None yet
Development

Successfully merging a pull request may close this issue.

3 participants