Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

perf: tune heuristics for choosing the next package to solve #10146

Merged
merged 1 commit into from
Feb 8, 2025

Conversation

radoering
Copy link
Member

@radoering radoering commented Feb 4, 2025

  • Do not consider anymore if a dependency has a specific marker or not, since there is no good explanation for this criterion.
  • Consider if a package has dependencies and especially how many dependencies with a constraint with an upper bound, because these are more likely to cause conflicts later.

Pull Request Check List

Related to: #9956
Requires: python-poetry/poetry-core#833

  • Added tests for changed code.
  • Updated documentation for changed code.

This reduces the time to lock the first example from #9956 (comment) from over 400 s to less than 30 s while not increasing the time for any other example. See #9956 (comment) for alternative (simpler but less explainable) variants, which all had a negative influence on at least one example.

Summary by Sourcery

Enhancements:

  • Prioritize packages with dependencies, especially those with upper-bound constraints, when selecting the next package to resolve.

Copy link

sourcery-ai bot commented Feb 4, 2025

Reviewer's Guide by Sourcery

This pull request improves the heuristics for selecting the next package to resolve during dependency resolution. It prioritizes packages with dependencies, especially those with upper-bound constraints, as these are more likely to cause conflicts. The changes also remove a less effective criterion based on specific markers.

Class diagram for updated Preference class

classDiagram
    class Preference {
        +enum Priority
        -_provider
        -_solution
        -_dependency_cache
        +_get_min(dependency) tuple
    }

    class Priority {
        <<enumeration>>
        NO_CHOICE
        DIRECT_ORIGIN
        USE_LATEST
        LOCKED
        DEFAULT
    }

    note for Preference "Modified _get_min to consider:
    - Number of dependencies
    - Upper bound constraints
    - Removed marker specificity check"
Loading

File-Level Changes

Change Details Files
Modified the package selection logic to prioritize dependencies with upper bound constraints and the existence of dependencies.
  • Removed the check for specific markers on dependencies.
  • Added logic to count dependencies with upper bound constraints.
  • Prioritized packages with more dependencies having upper bounds.
  • Prioritized packages with dependencies over those without.
  • Modified the return tuple of _get_min to include the new criteria.
src/poetry/mixology/version_solver.py
Fixed an incompatibility message in a test case.
  • Reordered the dependencies in the error message to match the actual conflict.
tests/mixology/version_solver/test_unsolvable.py
Cached the package method of the pool.
  • Cached the _pool.package method using functools.cache and renamed it to get_package_from_pool.
  • Used the cached method get_package_from_pool when completing a package.
src/poetry/puzzle/provider.py
Updated the poetry-core dependency to a specific branch.
  • Changed the poetry-core dependency to a specific branch on a fork.
pyproject.toml

Tips and commands

Interacting with Sourcery

  • Trigger a new review: Comment @sourcery-ai review on the pull request.
  • Continue discussions: Reply directly to Sourcery's review comments.
  • Generate a GitHub issue from a review comment: Ask Sourcery to create an
    issue from a review comment by replying to it. You can also reply to a
    review comment with @sourcery-ai issue to create an issue from it.
  • Generate a pull request title: Write @sourcery-ai anywhere in the pull
    request title to generate a title at any time. You can also comment
    @sourcery-ai title on the pull request to (re-)generate the title at any time.
  • Generate a pull request summary: Write @sourcery-ai summary anywhere in
    the pull request body to generate a PR summary at any time exactly where you
    want it. You can also comment @sourcery-ai summary on the pull request to
    (re-)generate the summary at any time.
  • Generate reviewer's guide: Comment @sourcery-ai guide on the pull
    request to (re-)generate the reviewer's guide at any time.
  • Resolve all Sourcery comments: Comment @sourcery-ai resolve on the
    pull request to resolve all Sourcery comments. Useful if you've already
    addressed all the comments and don't want to see them anymore.
  • Dismiss all Sourcery reviews: Comment @sourcery-ai dismiss on the pull
    request to dismiss all existing Sourcery reviews. Especially useful if you
    want to start fresh with a new review - don't forget to comment
    @sourcery-ai review to trigger a new review!
  • Generate a plan of action for an issue: Comment @sourcery-ai plan on
    an issue to generate a plan of action for it.

Customizing Your Experience

Access your dashboard to:

  • Enable or disable review features such as the Sourcery-generated pull request
    summary, the reviewer's guide, and others.
  • Change the review language.
  • Add, remove or edit custom review instructions.
  • Adjust other review settings.

Getting Help

Copy link

@sourcery-ai sourcery-ai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hey @radoering - I've reviewed your changes - here's some feedback:

Overall Comments:

  • Please add tests to verify the new package selection heuristics, particularly around packages with upper bound constraints being prioritized.
Here's what I looked at during the review
  • 🟢 General issues: all looks good
  • 🟢 Security: all looks good
  • 🟢 Testing: all looks good
  • 🟡 Complexity: 1 issue found
  • 🟢 Documentation: all looks good

Sourcery is free for open source - if you like our reviews please consider sharing them ✨
Help me be more useful! Please click 👍 or 👎 on each comment and I'll use the feedback to improve your reviews.

src/poetry/mixology/version_solver.py Show resolved Hide resolved
@radoering radoering force-pushed the perf-next-package-heuristics branch from 365263e to add03e5 Compare February 4, 2025 18:29
@dimbleby
Copy link
Contributor

dimbleby commented Feb 5, 2025

Consider if a package has dependencies and especially how many dependencies with a constraint with an upper bound, because these are more likely to cause conflicts later.

Interesting idea! It makes some sort of intuitive sense, which is nice.

If what you have makes things better, and you have had enough of trying things out here, then feel free to ignore the rest of this - which is mostly speculation / curiosity.

  • I wonder why you are only checking whether a package has any dependencies and not the count? If this is a good signal: then I would expect that having 20 dependencies is a better signal than having 2?

  • Previously we used "has many versions available" as a sort of proxy for "is boto3" but without having to be quite so specific about it. I guess now it is "has no dependencies" that handles this case without quite saying so: presumably we now do not select urllib3 until very late

  • but in that case, does our reversal of "how many versions are available" still make sense? The original fewest-versions-first still sounds likely to find conflicts sooner. So if we now have found a different way of avoiding the known pathological cases then perhaps our reversal of that original algorithm choice should be undone

  • though I wonder how often the fourth value of this tuple is ever used as a tie-break anyway. Especially if we were to count dependencies, rather than only use a boolean has-dependencies, perhaps it rarely comes into play?

  • also: if it is "has dependencies" that is sufficient to handle the boto3 / urllib3 case - well then I wonder when "has upper bound dependencies" is even making a difference?

If what you have makes things better - or sometimes much better and mostly not worse - then perhaps it is not worth trying too hard to optimize further.

But if you feel like experimenting I guess I am curious about the effect of these three changes (either independently or together): (i) drop the count of upper-bound-dependencies (ii) len() rather than bool() of all dependencies (iii) un-reverse our ordering based on count of available versions

@radoering
Copy link
Member Author

you have had enough of trying things out here

This feels like something you have never tried enough. I have considered some of the above points, but not others - and did not measure/document it systematically. I think I will do some further tests.

@radoering
Copy link
Member Author

After having done some additional measurements and analysis, it still looks like the current version is the best version considering the known examples:

  • I wonder why you are only checking whether a package has any dependencies and not the count? If this is a good signal: then I would expect that having 20 dependencies is a better signal than having 2?

It seems to be a worse signal than the number of available versions (especially for #9956 (comment)). The only thing that is sure is that no dependencies is a clear sign that it should be chosen last. I think it is quite the same as with "number of available versions": Only one available version is a clear sign that it should be chosen first, but otherwise it is unclear.

  • Previously we used "has many versions available" as a sort of proxy for "is boto3" but without having to be quite so specific about it. I guess now it is "has no dependencies" that handles this case without quite saying so: presumably we now do not select urllib3 until very late

That is correct.

  • but in that case, does our reversal of "how many versions are available" still make sense? The original fewest-versions-first still sounds likely to find conflicts sooner. So if we now have found a different way of avoiding the known pathological cases then perhaps our reversal of that original algorithm choice should be undone

Good point. I had not thought about that. However, the original algorithm still seems to be worse for #9956 (comment).

  • though I wonder how often the fourth value of this tuple is ever used as a tie-break anyway. Especially if we were to count dependencies, rather than only use a boolean has-dependencies, perhaps it rarely comes into play?

Even if it does not come into play often it is still worth for determinism. See

In order to provide results that are as deterministic as possible
and consistent between `poetry lock` and `poetry update`, the return value
of two different dependencies should not be equal if possible.

But as it seems the boolean has-dependencies (so that the fourth value comes into play more often) does better anyway.

  • also: if it is "has dependencies" that is sufficient to handle the boto3 / urllib3 case - well then I wonder when "has upper bound dependencies" is even making a difference?

"has dependencies" is sufficient to handle boto3/urllib3, but it is not sufficient to handle other similar cases:

  • boto3/botocore vs. urllib3
    • boto3: 3 deps, 3 upper bound
    • botocore: 4 deps, 4 upper bound
    • urllib3: no deps
  • sphinx/sphinx-rtd-theme vs. docutils
    • sphinx: 17 deps, 1 upper bound
    • sphinx-rtd-theme: 3 deps, 3 upper bound
    • docutils: no deps
  • dbt-semantic-interfaces/opentelemetry-api vs. importlib-metadata
    • dbt-semantic-interfaces: 9 deps, 9 upper bound
    • opentelemetry-api: 2 deps, 1 upper bound
    • importlib-metadata: 2 deps, 0 upper bound

To put it in a nutshell, I would keep the current state of the PR since it seems to be the best considering the currently known examples.

@dimbleby
Copy link
Contributor

dimbleby commented Feb 6, 2025

Interesting stuff, thanks for investigating!

I still wonder whether simplification - or at least compactification - is possible. eg if we think that the actual count of dependencies is uninteresting: perhaps we do not care either about the actual count of has-upper-bound dependencies. Just dealing first with the packages that have any upper bound dependency may be enough? (and then with the packages that have any sort of dependency, and finally with leaf packages).

then the tuple could be simplified to a triple with middle element 0/1/2 eg as int(has-upper-bound-dependencies) + int(has-any-dependencies)

and I confess I am surprised if it remains favourable to go most-available-versions-first. At the time I thought that making that swap was likely giving up a little performance, in most cases, in order to get it back on the known terrible case. But data trumps intuition (though I wonder whether we have enough examples to call it compelling...)

@radoering
Copy link
Member Author

then the tuple could be simplified to a triple with middle element 0/1/2 eg as int(has-upper-bound-dependencies) + int(has-any-dependencies)

It think it had to be 0 if has_upper_bound_deps else (1 if has_deps else 2) or the inverse of your version.

I tried and it makes no difference in many cases, but it makes #9956 (comment) slightly slower (28 s vs 31 s) and #9956 (comment) significantly slower (480 s vs 620 s).

and I confess I am surprised if it remains favourable to go most-available-versions-first.

I assume that there are more smaller boto3 vs. urllib3 cases and the gain in these cases is much higher than the loss in the standard case.

though I wonder whether we have enough examples to call it compelling...

We surely not have enough examples but as long as we have no counter-examples, I think it is the best to work with what we have:

  • It definitively makes sense to consider if there are dependencies with an upper bound.
  • It definitively makes sense to consider if there are any dependencies at all.
  • All the other decisions highest/lowest number of versions first, bool or integer number of dependencies are based on 1-2 examples but as long as we do not have a counter-example I think it makes sense to stick with what is best for these.

@dimbleby
Copy link
Contributor

dimbleby commented Feb 7, 2025

Yeah. I feel slighly uncomfortable about the possibility of overfitting to a few examples, or to the state of the world as it is exactly today. But not enough to disagree with you.

- Do not consider anymore if a dependency has a specific marker or not, since there is no good explanation for this criterion.
- Consider if a package has dependencies and especially how many dependencies with a constraint with an upper bound, because these are more likely to cause conflicts later.
@radoering radoering force-pushed the perf-next-package-heuristics branch from add03e5 to f23ec65 Compare February 8, 2025 05:34
@radoering radoering requested a review from a team February 8, 2025 10:25
@abn abn merged commit 5dc292b into python-poetry:main Feb 8, 2025
53 checks passed
@radoering radoering mentioned this pull request Feb 9, 2025
4 tasks
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants