What if you call ".persist()" twice?
.persist() is very useful for saving a lot of time when you use Spark. Basically, it tells Spark to keep in memory (or both in memory and on disk) the contents of an RDD once an action forces it to be computed.
from pyspark import StorageLevel

myRDD = myRDD.persist(StorageLevel.MEMORY_AND_DISK)  # mark the RDD for caching
myRDD.count()  # action that materializes (and stores) the RDD
Using .count() you force Spark to compute each tuple, and with .persist() you force Spark to store the result. To check the storage status, you can connect to http://SPARKIP:4040/storage/. If you reuse the same variable name for your RDD and call .persist() several times, you end up with persisted RDDs on your system that you can no longer unpersist:
myRDD = myRDD.persist(StorageLevel.MEMORY_AND_DISK)
myRDD.count()
myRDD = myRDD.join(myRDD).persist(StorageLevel.MEMORY_AND_DISK)  # the RDD persisted above loses its reference here
myRDD.count()
The RDD persisted by the first line of code can no longer be unpersisted!
Take care, because you are leaking your cluster's memory, and the only way to clear it is to restart your notebook/application!
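One way to avoid the leak (a minimal sketch, assuming a running SparkContext and an existing myRDD) is to keep a separate reference to each persisted RDD and call .unpersist() on the old one before you rebind the name:

from pyspark import StorageLevel

# Keep an explicit reference to the persisted RDD.
persistedRDD = myRDD.persist(StorageLevel.MEMORY_AND_DISK)
persistedRDD.count()

# Build the new RDD under a different name.
joinedRDD = persistedRDD.join(persistedRDD).persist(StorageLevel.MEMORY_AND_DISK)
joinedRDD.count()

# The first RDD can still be evicted because we kept its reference.
persistedRDD.unpersist()

This way every call to .persist() stays paired with a reference you can later .unpersist(), and nothing is left stranded in the storage tab.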