kevin-chao-com/Spark-for-Machine-Learning-AI

Latest commit

b1fbb86 · Feb 15, 2024

Spark-for-Machine-Learning-AI

These are my lesson notes and exercises for a LinkedIn course, Spark-for-Machine-Learning-AI.

  1. Introduction to Spark and MLlib
  2. Data Preparation and Transformation
    • Numeric:
      • MinMaxScaler
      • StandardScaler
      • Bucketizer
    • Text:
      • Tokenizer
      • HashingTF
  3. Clustering
    • K-means
    • Hierarchical clustering with Bisecting K-means
  4. Classification
    • Naive Bayes
    • Multilayer perceptron
    • Decision trees
  5. Regression
    • Linear regression
    • Decision tree regression
    • Gradient-boosted tree regression (required significant time to build the model)
  6. Recommendations
    • Collaborative Filtering
      • In Spark: uses the Alternating Least Squares (ALS) method
    • Content-Based Filtering
  7. Tips for using Spark MLlib:
    • (1) Processing:
      • Collect, reformat, and transform data
        • Load data into Spark DataFrames
        • Include headers (column names) in the text file
        • Use inferSchema=True
        • Use StringIndexer to map from string to numeric indexes
    • (2) Model Building:
      • Apply machine learning algorithms to training data
        • Split data into training and test sets
        • Fit models using training data
        • Create predictions by applying a transform to the test data
    • (3) Validation:
      • Assess the quality of models built in step 2
        • Use MLlib evaluators:
          • MulticlassClassificationEvaluator
          • RegressionEvaluator
        • Experiment with multiple algorithms
        • Vary hyperparameters
    • Other suggestions:
      • (1) MLlib Docs:
        • Detailed API documentation and examples
      • (2) Kaggle:
        • Data sets and articles
      • (3) AWS Data Sets:
        • Big data and public data sets

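
To make the numeric transformers listed under Data Preparation and Transformation more concrete, here is a plain-Python sketch of the arithmetic behind min-max scaling, standardization, and bucketizing. It does not call Spark, and the function names are hypothetical. It uses the usual textbook definitions, which may differ from MLlib's defaults. For example, Spark's `StandardScaler` only centers the data when `withMean` is enabled.

```python
from bisect import bisect_right
from statistics import mean, stdev


def min_max_scale(values, lo=0.0, hi=1.0):
    """Rescale linearly so the smallest value maps to lo and the largest to hi."""
    v_min, v_max = min(values), max(values)
    if v_max == v_min:
        # Constant column: map every value to the midpoint of the target range.
        return [0.5 * (lo + hi) for _ in values]
    return [lo + (v - v_min) / (v_max - v_min) * (hi - lo) for v in values]


def standardize(values):
    """Subtract the mean and divide by the sample standard deviation."""
    mu, sigma = mean(values), stdev(values)
    return [(v - mu) / sigma for v in values]


def bucketize(values, splits):
    """Return, for each value, the index of the bucket [splits[i], splits[i+1])."""
    return [bisect_right(splits, v) - 1 for v in values]


values = [2.0, 4.0, 6.0, 8.0]
print(min_max_scale(values))
print(standardize(values))
print(bucketize(values, [float("-inf"), 3.0, 7.0, float("inf")]))
```

In this sketch the infinite outer splits play the role of catch-all buckets, so every value falls into some bucket.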