
Search query limited to 10000 records #357

Open
tanganellilore opened this issue Mar 7, 2024 · 4 comments

@tanganellilore

Hi team,

I've noticed a bug with search queries; it is probably connected to this issue and to this old, not-yet-migrated one: https://issues.sonatype.org/browse/NEXUS-16917.

If I perform a search query on a large repository (more than 10000 elements), even with pagination, I receive an error from the API like this:

RemoteTransportException[[159FCCBA-DE3F55B4-695C3AB7-3D759962-AA738D59][local[1]][indices:data/read/search[phase/query]]]; nested: QueryPhaseExecutionException[Result window is too large, from + size must be less than or equal to: [10000] but was [10050]. See the scroll api for a more efficient way to request large data sets. This limit can be set by changing the [index.max_result_window] index level parameter.]; }

The log above is just one example; the real output is very long and repetitive.
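
For reference, here is roughly what my paging loop looks like (a minimal Python sketch; the instance URL and repository name are placeholders, and I'm assuming the standard /service/rest/v1/search/assets endpoint):

```python
import requests

# Page through /service/rest/v1/search/assets using the continuationToken.
# Nexus paginates with an opaque token rather than from/size, but the
# embedded Elasticsearch still enforces index.max_result_window (10000),
# so the loop fails once enough pages have been fetched.
NEXUS_URL = "http://localhost:8081"   # placeholder instance URL
REPO = "my-docker-repo"               # placeholder repository name

params = {"repository": REPO}
seen = 0
while True:
    resp = requests.get(f"{NEXUS_URL}/service/rest/v1/search/assets", params=params)
    resp.raise_for_status()           # raises once the server returns an error
                                      # like the one quoted above
    data = resp.json()
    seen += len(data["items"])
    token = data.get("continuationToken")
    if not token:
        break
    params["continuationToken"] = token

print(f"retrieved {seen} assets")
```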

Any suggestions on how to solve it?

Thanks

P.S. The repo has multiple folders with a lot of Docker images, and I need to extract all of them.

@nblair (Contributor) commented Mar 12, 2024

Thanks for opening an issue @tanganellilore. The limit applied to search responses is intentional - such large datasets don't scale well for a system with an embedded database and embedded search engine. Without that in place, it's a recipe for OOM, which can cause the application to fail unexpectedly and result in database/index corruption and/or data loss.

What is your use case for queries that have such large result sets?

@elmbrain

We have the same problem. The repository contains many artifacts and we need to search across all of them. Users should be able to decide how to limit the output. Previously, setting the index.max_result_window parameter in the Elasticsearch configuration file worked, and it was a revelation to us that it is now broken. It's unclear why the parameter was hardcoded directly in the code. Please expose it at the configuration-file level so that it can be changed.
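
For reference, this is the kind of override that previously worked (a sketch; the exact file location, and whether the embedded Elasticsearch still reads it, depend on the Nexus version):

```yaml
# elasticsearch.yml -- index-level default for the result window.
# The Elasticsearch default is 10000; per this thread, newer Nexus
# versions set the value in code, so this override no longer takes effect.
index.max_result_window: 50000
```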

@tanganellilore (Author)

Hi @nblair,
I only noticed your answer now, sorry for the delay.
I simply need to export the "metadata" of all assets in all repos and subfolders (checksum, last download date, etc.) and save it in an external DB, to track changes and deletions for an internal process (without using the audit webhook).

In my case I have a big repo with a lot of subfolders, each holding around 30-100 assets, so the repo as a whole has more than 10k elements.
Via the API we can't simply get a list of subpaths in the repo to iterate over (which would reduce the number of assets per query), and that is why I hit this error on the call.

I understand that the limit is there to avoid OOM, but via the API there is no way to work around it.

For my use case I ended up using a Groovy script that can be called and returns this type of object per repository, but I notice that we get this warning there as well.
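
In case it helps others, here is a rough Python sketch of the chunk-by-folder idea (the instance URL, repository name, and folder prefixes are placeholders; whether the q parameter honours a trailing wildcard should be checked against your Nexus version's search API docs):

```python
import requests

NEXUS_URL = "http://localhost:8081"        # placeholder instance URL
REPO = "my-docker-repo"                    # placeholder repository name
FOLDERS = ["team-a", "team-b", "team-c"]   # placeholder path prefixes; Nexus has
                                           # no REST call to enumerate subpaths,
                                           # so the list must come from elsewhere

def assets_for_prefix(prefix):
    """Yield every asset matching the prefix, one continuation page at a time."""
    params = {"repository": REPO, "q": f"{prefix}*"}
    while True:
        resp = requests.get(f"{NEXUS_URL}/service/rest/v1/search/assets", params=params)
        resp.raise_for_status()
        data = resp.json()
        yield from data["items"]
        token = data.get("continuationToken")
        if not token:
            return
        params["continuationToken"] = token

for folder in FOLDERS:
    for asset in assets_for_prefix(folder):
        # each item carries fields such as path, downloadUrl, and checksum
        print(asset["path"], asset.get("checksum", {}).get("sha1"))
```

As long as each prefix matches fewer than 10000 assets, every individual query stays inside the result window.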
