Elasticsearch Snapshots? #435
If you have your data in S3, you can read it from S3 into a Spark DataFrame and then instantiate a LuceneRDD from that DataFrame. See for example: https://github.com/zouzias/spark-lucenerdd-aws/blob/master/src/main/scala/org/zouzias/spark/lucenerdd/aws/indexing/WikipediaIndexingExample.scala#L36
Speaking of which, if you heavily batch-query your Elasticsearch cluster from Spark, you can easily put a lot of pressure on ES.
Hope this helps.
PS. If your data are snapshotted using an internal ES snapshot representation, the above solution will not work. You need a copy of your data in a format that Spark can read directly. In the past, it was common practice to keep a backup of ES indices to prevent data loss; maybe these days things are more stable with ES.
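The suggested workaround can be sketched roughly as follows. This assumes the data has also been exported to S3 in a Spark-readable format such as Parquet (not as an ES snapshot), that spark-lucenerdd is on the classpath, and that the bucket path and field names are hypothetical placeholders:

```scala
import org.apache.spark.sql.SparkSession
import org.zouzias.spark.lucenerdd.LuceneRDD
import org.zouzias.spark.lucenerdd._

object S3IndexingSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("s3-lucenerdd-sketch")
      .getOrCreate()

    // Read a Spark-friendly dump of the data from S3
    // (hypothetical bucket/path; an ES snapshot will NOT work here).
    val df = spark.read.parquet("s3a://my-bucket/exported-index/")

    // Index the DataFrame with LuceneRDD; subsequent queries run
    // inside Spark without touching the ES cluster at all.
    val luceneRDD = LuceneRDD(df)

    // Example query over a hypothetical "title" field, top 10 hits.
    val hits = luceneRDD.termQuery("title", "spark", 10)
    hits.foreach(println)

    spark.stop()
  }
}
```

This keeps the batch query load off the ES cluster entirely, at the cost of maintaining a parallel Spark-readable export of the data.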
Hi, thanks for the response. The data in S3 is in Elasticsearch's own format, not something standard like Parquet. AFAIK, ES's snapshot format is essentially Lucene files, so I was wondering if this library could be used for parsing them.
Is your feature request related to a problem? Please describe.
Nope. The use case is new, but kind of related to this project: I have an Elasticsearch cluster with large indices that are being snapshotted to S3. I was wondering if I could somehow leverage LuceneRDD to load the data directly from S3; currently, I have Spark heavily query Elasticsearch, which puts a lot of strain on the cluster. Usually I just need a full dump of the data anyway, so I don't need sophisticated ES query capabilities when dumping the data from ES to Spark.

Describe the solution you'd like
Ideally? `sparkRDD.fromEs(<es_connection>)`. Jokes aside: basically, Elasticsearch snapshots are saved as "dumb dumps" of the Lucene index of every shard in the Elasticsearch index. I thought we might be able to parse these files with LuceneRDD.

Describe alternatives you've considered
N/A