Spark Analytics Project: Jaccard Index Calculation

Overview

This project is a Scala implementation that calculates the Jaccard Index over a large dataset using Apache Spark and Hadoop. The Jaccard Index is a statistical measure of the similarity between two sets, defined as the size of their intersection divided by the size of their union. The implementation leverages Spark's distributed computing capabilities to process large datasets efficiently.

Project Details

Code Description

The provided Scala code performs the following tasks:

  1. Read Data:

    • Reads data from Hadoop HDFS:
      • category_file - Contains document IDs and their associated categories.
      • terms_file - Contains document IDs and the terms present in each document.
      • stems_file - Maps term IDs to their respective stems.
  2. Data Processing:

    • STEMS: Maps term IDs to stems.
    • CATEGORIES: Maps document IDs to categories.
    • TERMS: Maps document IDs to terms and flattens them for processing.
    • COGROUP: Co-groups categories and terms based on document IDs.
    • INTERSECTION: Calculates the intersection between categories and terms for each document.
    • JOIN: Joins the intersections with category counts and term counts to compute the Jaccard Index.
  3. Calculate Jaccard Index:

    • The Jaccard Index is computed for each term-category pair using the formula:

      Jaccard Index = (number of documents containing both the term and the category) / (number of documents containing either the term or the category)

      For example, a term appearing in 10 documents and a category appearing in 8, with 4 documents in common, give 4 / (10 + 8 - 4) = 4/14 ≈ 0.29.

  4. Output:

    • The results are saved back to Hadoop HDFS in a specified output directory (a sketch of the full pipeline follows this list).
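
The README describes the stages but not the code itself. The following is a minimal, self-contained sketch of the same pipeline, not the project's actual source: it assumes tab-separated docId/value input lines, unique (docId, category) pairs, and the example HDFS paths used later in this document. The stems_file mapping from term IDs to stems is omitted for brevity.

    import org.apache.spark.{SparkConf, SparkContext}

    object SparkAnalytics {
      def main(args: Array[String]): Unit = {
        val sc = new SparkContext(
          new SparkConf().setAppName("JaccardIndex").setMaster("local[*]"))

        val base = "hdfs://localhost:9000/reuters" // assumed layout

        // CATEGORIES: one (docId, category) pair per line.
        val categories = sc.textFile(s"$base/category_file")
          .map(_.split("\t")).map(a => (a(0), a(1)))

        // TERMS: docId followed by its terms; flattened and de-duplicated.
        val terms = sc.textFile(s"$base/terms_file")
          .flatMap { line =>
            val a = line.split("\t")
            a.tail.map(term => (a(0), term))
          }.distinct()

        // Documents per category and per term (the denominator inputs).
        val categoryCounts = categories.map(_.swap).mapValues(_ => 1L).reduceByKey(_ + _)
        val termCounts     = terms.map(_.swap).mapValues(_ => 1L).reduceByKey(_ + _)

        // COGROUP + INTERSECTION: count documents containing both a given
        // term and a given category.
        val bothCounts = categories.cogroup(terms)
          .flatMap { case (_, (cats, ts)) =>
            for (c <- cats; t <- ts) yield ((t, c), 1L)
          }.reduceByKey(_ + _)

        // JOIN: attach per-term and per-category document counts, then
        // Jaccard = both / (termDocs + categoryDocs - both).
        val jaccard = bothCounts
          .map { case ((t, c), both) => (t, (c, both)) }
          .join(termCounts)
          .map { case (t, ((c, both), tCount)) => (c, (t, both, tCount)) }
          .join(categoryCounts)
          .map { case (c, ((t, both, tCount), cCount)) =>
            (t, c, both.toDouble / (tCount + cCount - both))
          }

        jaccard.saveAsTextFile(s"$base/JaccardIndex")
        sc.stop()
      }
    }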

Dependencies

  • Java JDK 8 or higher
  • Apache Spark: a version compatible with your Hadoop setup.
  • Hadoop: a version compatible with your Spark setup.
  • SBT (Scala Build Tool): For building and running the Scala code.
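
The build definition is not shown in this README; a minimal build.sbt for a project of this shape might look like the following (the Scala and Spark versions are placeholders and should match your cluster):

    // build.sbt — versions are illustrative; match them to your installation.
    name := "jaccard_index_analytics"
    version := "1.0"
    scalaVersion := "2.12.18"

    // "provided": spark-submit supplies the Spark jars at runtime.
    libraryDependencies += "org.apache.spark" %% "spark-core" % "3.5.1" % "provided"

Running sbt package then produces the JAR that is later passed to spark-submit.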

Configuration

  • Spark Configuration:

    • The Spark master is set to local[*] for local execution.
    • Adjust Spark settings (e.g., executor memory and the number of executors) to match your cluster specifications; a sketch follows below.
  • Hadoop Configuration:

    • Ensure Hadoop is set up correctly and that paths to HDFS files are accurate.
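
As a concrete illustration, these settings would typically be expressed through a SparkConf. The keys below are standard Spark properties; the values are examples only:

    import org.apache.spark.{SparkConf, SparkContext}

    // Example tuning; replace local[*] with your cluster's master URL
    // (e.g. yarn) and size the executors for your hardware.
    val conf = new SparkConf()
      .setAppName("JaccardIndex")
      .setMaster("local[*]")
      .set("spark.executor.memory", "4g")
      .set("spark.executor.instances", "4")
    val sc = new SparkContext(conf)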

Running the Code

  1. Build the Project:

    • Compile the Scala code and create a JAR file using SBT.
  2. Submit the Spark Job:

    • Use spark-submit to run the application, adjusting paths and settings as required:
      spark-submit \
        --class SparkAnalytics \
        --master local[*] \
        path/to/your/jarfile.jar
  3. Verify Output:

    • The output is saved to HDFS in the directory specified by the output path:
      • Example: hdfs://localhost:9000/reuters/JaccardIndex
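
To inspect the result, the standard HDFS shell can list and sample the saved part files (the path matches the example above):

    hdfs dfs -ls /reuters/JaccardIndex
    hdfs dfs -cat /reuters/JaccardIndex/part-* | head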
