Elasticsearch Snapshots? #435

Open
matan129 opened this issue Nov 17, 2022 · 2 comments


matan129 commented Nov 17, 2022

Is your feature request related to a problem? Please describe.
Nope. The use case is new, but somewhat related to this project: I have an Elasticsearch cluster with large indices that are snapshotted to S3, and I was wondering whether I could somehow use luceneRDD to load the data directly from S3. Currently, Spark queries Elasticsearch heavily, which puts a lot of strain on the cluster. Usually I just need a full dump of the data anyway, so I don't need sophisticated ES query capabilities when dumping the data from ES to Spark.

Describe the solution you'd like
Ideally? sparkRDD.fromEs(<es_connection>). Jokes aside, Elasticsearch snapshots are basically saved as "dumb dumps" of the Lucene index of every shard in the Elasticsearch index. I thought we might be able to parse these files with luceneRDD.

Describe alternatives you've considered
N/A

Owner

zouzias commented Nov 17, 2022

If you have your data in S3, you can read it from S3 into a Spark DataFrame and then instantiate a LuceneRDD from that DataFrame. See for example here: https://github.com/zouzias/spark-lucenerdd-aws/blob/master/src/main/scala/org/zouzias/spark/lucenerdd/aws/indexing/WikipediaIndexingExample.scala#L36
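
A minimal sketch of that flow, assuming the data already exists in S3 in a Spark-readable format such as Parquet (not as an ES snapshot repository). The bucket, path, and application name are hypothetical placeholders, and the job is assumed to run with the spark-lucenerdd and hadoop-aws dependencies on the classpath:

```scala
import org.apache.spark.sql.SparkSession
import org.zouzias.spark.lucenerdd._          // implicit conversions to Lucene documents
import org.zouzias.spark.lucenerdd.LuceneRDD

object S3ToLuceneRDD {
  def main(args: Array[String]): Unit = {
    // Application name is an arbitrary placeholder.
    val spark = SparkSession.builder()
      .appName("s3-to-lucenerdd")
      .getOrCreate()

    // Hypothetical location of a Parquet export of the data.
    // Assumption: this is a plain Parquet dataset, not an ES snapshot.
    val df = spark.read.parquet("s3a://my-bucket/exports/my-index/")

    // Index the DataFrame rows with Lucene, partition by partition.
    val luceneRDD = LuceneRDD(df)

    // Sanity check: number of indexed documents.
    println(s"Indexed ${luceneRDD.count()} documents")

    spark.stop()
  }
}
```

This only covers data that Spark can read directly; it does not parse Elasticsearch snapshot files.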

Speaking of which, if you heavily batch query your Elasticsearch cluster from Spark, you can easily put a lot of pressure on ES.

Hope this helps.

PS. If your data is snapshotted using an internal ES snapshot representation, the above solution will not work. You must have a copy of your data that Spark can read easily. In the past, it was common practice to keep a backup of the ES indices to prevent data loss. Maybe ES is more stable these days.

Author

matan129 commented Nov 17, 2022 via email
