Elasticsearch Snapshots? #435

matan129 · 2022-11-17T09:40:43Z

Is your feature request related to a problem? Please describe.
Nope. The use case is new, but kind of related to this project - I have an Elasticsearch cluster with large indices that are being snapshotted to S3. I was wondering if I could somehow leverage luceneRDD to load the data directly from S3;
currently, I have Spark heavily query Elasticsearch, which puts a lot of strain on the cluster. Usually I just need a full dump of the data anyway, so I don't need sophisticated ES query capabilities when dumping the data from ES to Spark.

Describe the solution you'd like
Ideally? sparkRDD.fromEs(<es_connection>). Jokes aside - basically, Elasticsearch snapshots are saved as "dumb dumps" of the Lucene index of every shard in the Elasticsearch index. I thought we might be able to parse these files with luceneRDD.

Describe alternatives you've considered
N/A

zouzias · 2022-11-17T14:47:54Z

If you have your data in S3, you can read your data from S3 to a Spark DataFrame and then instantiate a LuceneRDD from your DataFrame. See for example here: https://github.com/zouzias/spark-lucenerdd-aws/blob/master/src/main/scala/org/zouzias/spark/lucenerdd/aws/indexing/WikipediaIndexingExample.scala#L36
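A minimal sketch of that approach, assuming the data has already been exported to S3 in a format Spark can read (Parquet here). The bucket, path, and application name are hypothetical placeholders, and the `s3a://` scheme assumes the Hadoop AWS connector and credentials are configured on the cluster:

```scala
import org.apache.spark.sql.SparkSession
import org.zouzias.spark.lucenerdd._
import org.zouzias.spark.lucenerdd.LuceneRDD

object S3ToLuceneRDD {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("s3-to-lucenerdd") // hypothetical application name
      .getOrCreate()

    // Hypothetical location: a Parquet export of the data, NOT an ES snapshot
    val df = spark.read.parquet("s3a://my-bucket/exports/my-index/")

    // Build a Lucene index per Spark partition from the DataFrame rows
    val luceneRDD = LuceneRDD(df)
    luceneRDD.cache()

    // Force indexing and report how many documents were indexed
    println(s"Indexed documents: ${luceneRDD.count()}")

    spark.stop()
  }
}
```

This only works for data that Spark can already load as a DataFrame; it does not parse Elasticsearch snapshot files.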

Speaking of which, if you heavily batch query your Elasticsearch cluster from Spark, you can easily put a lot of pressure on ES.

Hope this helps.

PS. If your data is snapshotted in Elasticsearch's internal snapshot representation, the above solution will not work. You must have a copy of your data in a format that Spark can read easily. In the past, it was common practice to keep a backup of the ES indices to prevent data loss. Maybe these days things are more stable with ES.

matan129 · 2022-11-17T15:38:35Z

Hi, thanks for the response. The data in S3 is in Elasticsearch's own format, not something standard like Parquet. AFAIK, ES's format is just a Lucene file, so I was wondering if this library could be used for parsing it.
