
Apache-Spark-3-for-Data-Engineering-Analytics

Introduction to Spark:

  1. PySpark is a library for running Python applications using Apache Spark's capabilities; in other words, PySpark is the Python API for Spark.
  2. Spark is not a programming language:
     a. You can write Spark applications in Java, Scala, R, and Python.
     b. PySpark lets you write Python-based data processing applications that execute in parallel on a distributed cluster (see the sketch after this list).
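As a minimal sketch of what such an application looks like (the app name and data here are illustrative, not from this repo), a PySpark program creates a SparkSession and then transforms data in parallel:

```python
from pyspark.sql import SparkSession

# Create (or reuse) the entry point for a PySpark application.
spark = SparkSession.builder.appName("WordLengths").getOrCreate()

# Distribute a small Python list across the cluster and transform it in parallel.
words = spark.sparkContext.parallelize(["spark", "is", "distributed"])
lengths = words.map(len).collect()
print(lengths)  # [5, 2, 11]

spark.stop()
```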

Apache Spark is an analytics engine for powerful, large-scale distributed data processing and machine learning applications.
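To illustrate the machine learning side, here is a hedged sketch using Spark's built-in MLlib DataFrame API (the tiny training dataset is invented purely for illustration):

```python
from pyspark.sql import SparkSession
from pyspark.ml.classification import LogisticRegression
from pyspark.ml.linalg import Vectors

spark = SparkSession.builder.appName("MLlibSketch").getOrCreate()

# Tiny invented dataset: each row is (label, feature vector).
train = spark.createDataFrame(
    [(0.0, Vectors.dense(0.0, 1.1)),
     (1.0, Vectors.dense(2.0, 1.0)),
     (0.0, Vectors.dense(0.1, 1.2)),
     (1.0, Vectors.dense(2.2, 0.9))],
    ["label", "features"])

# Fit a logistic regression model; training runs distributed across the cluster.
model = LogisticRegression(maxIter=10).fit(train)
print(model.coefficients)

spark.stop()
```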

Basic setup for PySpark on Ubuntu for distributed machine learning. Prerequisites:

  1. An Ubuntu system
  2. Access to a terminal or command line
  3. A user with sudo or root permissions

Required Packages:

  1. Apache Spark
  2. Java (JDK)
  3. PySpark
  4. FindSpark
  5. Spark SQL (bundled with PySpark)
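Once these packages are installed, a quick verification sketch uses FindSpark to locate the Spark install and runs a small Spark SQL query end to end (the SPARK_HOME path below is an assumed example; adjust it to your install location):

```python
import findspark

# Point findspark at the Spark install; the path is an assumed example.
findspark.init("/opt/spark")

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("SetupCheck").getOrCreate()

# Run a trivial Spark SQL query to confirm the whole stack works.
spark.createDataFrame([(1, "ok")], ["id", "status"]).createOrReplaceTempView("check")
spark.sql("SELECT id, status FROM check").show()

spark.stop()
```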