This project is a Scala-based implementation for calculating the Jaccard Index over large datasets using Apache Spark and Hadoop. The Jaccard Index is a statistical measure of the similarity between two sets; this implementation leverages Spark's distributed computing capabilities to compute it efficiently at scale.
The provided Scala code performs the following tasks:

- **Read Data:** reads three inputs from Hadoop HDFS:
  - `category_file`: contains document IDs and their associated categories.
  - `terms_file`: contains document IDs and the terms present in each document.
  - `stems_file`: maps term IDs to their respective stems.
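As an illustrative sketch of the read step: the actual file formats are not specified in this README, so a simple whitespace-separated layout is assumed below. With Spark, each file would be loaded via `sc.textFile("hdfs://...")`, but the per-line parsing is the same on plain strings:

```scala
// Hypothetical parser for category_file lines, assuming the (unverified)
// layout "<docId> <category>". Adjust to the real file format.
def parseCategoryLine(line: String): (String, String) = {
  // split on the first run of whitespace only, so categories may contain spaces
  val Array(docId, category) = line.trim.split("\\s+", 2)
  (docId, category)
}
```

The same pattern applies to `terms_file` and `stems_file`, with the value column interpreted as a term or stem respectively.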
- **Data Processing:**
  - `STEMS`: maps term IDs to stems.
  - `CATEGORIES`: maps document IDs to categories.
  - `TERMS`: maps document IDs to terms and flattens them for processing.
  - `COGROUP`: co-groups categories and terms by document ID.
  - `INTERSECTION`: computes the intersection of categories and terms for each document.
  - `JOIN`: joins the intersections with category counts and term counts to compute the Jaccard Index.
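The co-group and intersection steps can be sketched with plain Scala collections as a stand-in for the Spark RDD operations (the document IDs, categories, and terms below are made-up examples, and the data shapes are assumptions):

```scala
// Stand-in for the CATEGORIES and TERMS datasets (docId -> values).
val categories: Map[String, Set[String]] =
  Map("d1" -> Set("grain"), "d2" -> Set("grain", "trade"))
val terms: Map[String, Set[String]] =
  Map("d1" -> Set("wheat"), "d2" -> Set("wheat", "tariff"))

// COGROUP: pair both datasets by document ID (mirrors RDD.cogroup).
val cogrouped: Map[String, (Set[String], Set[String])] =
  (categories.keySet ++ terms.keySet).map { docId =>
    docId -> (categories.getOrElse(docId, Set.empty),
              terms.getOrElse(docId, Set.empty))
  }.toMap

// INTERSECTION: count documents containing each (term, category) pair.
val pairCounts: Map[(String, String), Int] =
  cogrouped.values
    .flatMap { case (cats, ts) => for (t <- ts; c <- cats) yield (t, c) }
    .groupBy(identity)
    .map { case (pair, occs) => pair -> occs.size }
```

In the actual job these would be distributed `cogroup`, `flatMap`, and `reduceByKey`-style operations over RDDs rather than in-memory maps.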
- **Calculate Jaccard Index:** the index is computed for each term-category pair using the formula:

  Jaccard Index = (number of documents containing both the term and the category) / (number of documents containing either the term or the category)
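The formula above can be worked through on plain Scala sets (the document IDs are made up for illustration):

```scala
// Hypothetical example: which documents contain the term vs. the category.
val docsWithTerm: Set[String] = Set("d1", "d2", "d3")
val docsWithCategory: Set[String] = Set("d2", "d3", "d4")

// Jaccard Index = |intersection| / |union|
val jaccard: Double =
  docsWithTerm.intersect(docsWithCategory).size.toDouble /
  docsWithTerm.union(docsWithCategory).size

// 2 shared documents out of 4 distinct documents -> 0.5
```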
- **Output:** the results are saved back to Hadoop HDFS in a specified directory.

Prerequisites:
- Java JDK 8 or higher
- Apache Spark: Compatible version with your Hadoop setup.
- Hadoop: Version compatible with your Spark setup.
- SBT (Scala Build Tool): For building and running the Scala code.
- **Spark Configuration:**
  - The master is set to `local[*]` for local execution.
  - Adjust Spark configurations (e.g., memory and number of executors) to match your cluster specifications.
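As an illustrative sketch of a local setup (only the `local[*]` master and the `SparkAnalytics` name come from this README; the memory setting is an assumption):

```scala
import org.apache.spark.{SparkConf, SparkContext}

// Example configuration fragment; tune the settings for your cluster.
val conf = new SparkConf()
  .setAppName("SparkAnalytics")
  .setMaster("local[*]")                  // use all local cores
  .set("spark.executor.memory", "2g")     // illustrative value, not from the project

val sc = new SparkContext(conf)
```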
- **Hadoop Configuration:**
  - Ensure Hadoop is set up correctly and that paths to HDFS files are accurate.
- **Build the Project:**
  - Compile the Scala code and create a JAR file using SBT.
- **Submit the Spark Job:**
  - Use `spark-submit` to run the application, adjusting paths and settings as required:

    ```shell
    spark-submit \
      --class SparkAnalytics \
      --master local[*] \
      path/to/your/jarfile.jar
    ```
- **Verify Output:**
  - The output is saved to HDFS in the directory specified by `output`, for example:
    `hdfs://localhost:9000/reuters/JaccardIndex`