What if you ".persist()" twice

andreacondorelli edited this page Sep 1, 2015 · 1 revision

.persist() is very useful for saving time when you use Spark. It tells Spark to keep an RDD in memory (or in memory and on disk) once it has been computed, so that later actions reuse the stored data instead of recomputing it.

myRDD = myRDD.persist(StorageLevel.MEMORY_AND_DISK)

myRDD.count()

.persist() on its own is lazy: nothing is stored until an action runs. Calling .count() forces Spark to compute every tuple, and because the RDD is marked as persisted, Spark stores it at the same time. To check the storage status, open http://SPARKIP:4040/storage/. If you reuse the same variable name for a new RDD and call .persist() again, the earlier RDD stays persisted, but you no longer hold a reference through which to unpersist it:

myRDD = myRDD.persist(StorageLevel.MEMORY_AND_DISK)

myRDD.count()

myRDD = myRDD.join(myRDD).persist(StorageLevel.MEMORY_AND_DISK)

myRDD.count()

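The RDD persisted by the first line can no longer be unpersisted: myRDD now points to the joined RDD, and nothing in your code refers to the original one!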

Take care: you are leaking your own memory, and the only way to clear it is to restart your notebook/application!
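One way to avoid the leak is to keep a reference to the old RDD until the new one is stored, and then unpersist the old one explicitly. The sketch below is only an illustration of that idea: replace_persisted is a hypothetical helper name (it is not part of Spark), and it assumes nothing beyond the standard RDD methods .persist(), .count() and .unpersist().

```python
def replace_persisted(old_rdd, transform, storage_level):
    """Persist transform(old_rdd), materialize it, then release old_rdd.

    Hypothetical helper, not a Spark API. `transform` is any function
    that takes an RDD and returns a new RDD.
    """
    new_rdd = transform(old_rdd).persist(storage_level)
    # An action is needed here: persist() alone does not store anything.
    new_rdd.count()
    # We still hold a reference to the old RDD, so it can be released.
    old_rdd.unpersist()
    return new_rdd
```

With this helper, the second step of the example above could be written as myRDD = replace_persisted(myRDD, lambda r: r.join(r), StorageLevel.MEMORY_AND_DISK). The old RDD is unpersisted only after the new one has been counted, so the join can still read its input from the cache.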