Skip to content

Latest commit

 

History

History
47 lines (36 loc) · 1.8 KB

README.md

File metadata and controls

47 lines (36 loc) · 1.8 KB

Scala scripts

getting started in scala and building small scripts... master test

sc1
Logo: EPFL computer since building École polytechnique fédérale de Lausanne

variables / types / operations

val / var xxx : Int, String, Array ...
val = Immutable / Var = Mutable

links

https://www.scala-lang.org/ https://www.dezyre.com/article/scala-vs-python-for-apache-spark/213 https://www.analyticsvidhya.com/blog/2016/10/spark-dataframe-and-operations/ https://learn.datacamp.com/courses/introduction-to-scala https://docs.databricks.com/index.html https://www.analyticsvidhya.com/blog/2016/09/comprehensive-introduction-to-apache-spark-rdds-dataframes-using-pyspark/

SPARK

  • three data representation viz:
    • RDD (Resilient Distributed Database) - immutable

      • operations:
      1. Transformation (T) - operation applied on a RDD to create a new RDD
      2. Action (A)- operation applied on RDD that perform computation and send the result back to driver
      • Examples:
      • Map (T) - operation on each element of RDD - returns new RDD
      • Reduce (A) - Reduce by key
      • Apache Spark documentation
      • RDDs use Shared Variable : whenever a task is sent by a driver to executors program in a cluster, a copy of shared variable is sent to each node in a cluster Types:
      1. Broadcast - save the copy across al node
      2. Accumulator - aggregating the information
      • how to create RDD:
        1. Existing Storage - ex- a list to RDD
        2. External Sources - ex shared file system, HDFS, HBase data source offering a Haddop Input Format.
    • Dataframe ** faster > RDD (it as metadata)

    • Dataset