[QST] Some question about spark.rapids.sql.format.parquet.multiThreadedRead #5383

Answered by tgravescs
JustPlay asked this question in General

  1. partition = task in Spark. By default each task is sized to read a certain amount of data (128 MB, controlled by spark.sql.files.maxPartitionBytes), so if you have a lot of small files, a single task can be assigned several of those small files (see the config sketch after this list).
  2. No; in my testing I didn't see any difference in large-file performance with it on.
  3. Once next() is called, it starts reading the files in the background thread pool and then blocks until the next file is ready. The files are read in the same order the CPU version would read them (see the sketch below the list).
  4. A thread reads all of the file it is assigned, i.e. you get a different thread per file. If it's a large file, the task would generally be assigned only a portion of the file (a 128 MB chunk). The config for number of threads and then how many to have…
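
For reference, a minimal sketch of the knobs touched on in points 1 and 4, assuming the config names from the spark-rapids docs of that era (check the docs for your plugin version, as some of these have since been renamed):

```scala
// Sketch of the relevant settings in a spark-shell session; names follow
// the spark-rapids 22.x documentation and may differ in other releases.
spark.conf.set("spark.sql.files.maxPartitionBytes", "128m") // per-task read target (point 1)
spark.conf.set("spark.rapids.sql.format.parquet.reader.type", "MULTITHREADED")
// Size of the background thread pool that reads files (point 4):
spark.conf.set("spark.rapids.sql.format.parquet.multiThreadedRead.numThreads", "20")
// Cap on how many files a single task may read in parallel at once:
spark.conf.set("spark.rapids.sql.format.parquet.multiThreadedRead.maxNumFilesParallel", "2147483647")
```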
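
And a minimal sketch, not the plugin's actual code, of the ordered-consumption pattern point 3 describes: every read is kicked off on a background pool right away, but results are consumed strictly in submission order, so the output matches what the CPU reader would produce. readFile and process are hypothetical stand-ins for the real I/O and downstream processing.

```scala
import java.util.concurrent.{Callable, Executors, Future}

object OrderedMultiThreadedRead {
  // Hypothetical stand-ins for the real file I/O and downstream work.
  def readFile(path: String): Array[Byte] = Array.emptyByteArray
  def process(bytes: Array[Byte]): Unit = ()

  def main(args: Array[String]): Unit = {
    val files = Seq("part-00000.parquet", "part-00001.parquet", "part-00002.parquet")
    val pool  = Executors.newFixedThreadPool(4) // cf. ...multiThreadedRead.numThreads

    // Kick off every file read immediately; each runs on a pool thread.
    val pending: Seq[Future[Array[Byte]]] = files.map { f =>
      pool.submit(new Callable[Array[Byte]] { def call(): Array[Byte] = readFile(f) })
    }

    // Consume strictly in submission order: get() blocks until that specific
    // file is ready, even if files submitted later have already finished.
    pending.foreach(fut => process(fut.get()))
    pool.shutdown()
  }
}
```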

Answer selected by sameerz

This discussion was converted from issue #876 on April 28, 2022 23:10.