This project implements a parallel inverted index using the Map-Reduce paradigm inspired by Google's MapReduce framework, implemented with Pthreads. Given a set of input text files, the program extracts unique words, determines which files contain them, and outputs a structured index for efficient word lookups.
Running with 4 Mapper threads (M=4) and 4 Reducer threads (R=4):
- Execution time: 0.39 seconds
- Speedup: 3.03x
-
Implements Map-Reduce for parallel word indexing
-
Uses Pthreads for parallelization
-
Outputs results in sorted files categorized alphabetically
-
Supports dynamic workload distribution for efficiency
-
Compatible with Docker for easy testing and deployment
-
Mapper Phase:
-
Reads assigned files
-
Extracts words and converts them to lowercase
-
Outputs {word, file_ID} pairs
-
-
Reducer Phase:
-
Aggregates {word, {file_IDs}} entries
-
Groups words by their starting letter
-
Sorts words by frequency across files
-
-
Output:
-
Generates one file per alphabet letter (a.txt, b.txt, etc.)
-
Words inside each file are sorted by frequency (descending)
-
If multiple words have the same frequency, they are sorted alphabetically
-
-
Compile using Makefile:
make
-
Run the program:
./map_reduce <Map_num_threads> <Reduce_num_threads> <input_file>
Map_num_threads
: Number of threads for the Mapper phaseReduce_num_threads
: Number of threads for the Reducer phaseinput_file
: Path to the input file containing the list of text files to index
-
To run the program with 2 Mapper threads, 3 Reducer threads, and the input file
input.txt
:./map_reduce 2 3 input.txt
-
Where
input.txt
contains the following:3 file1.txt file2.txt file3.txt
-
The program will output the inverted index in the following format:
-
a.txt
and:[2] are:[1] as:[1]
-
b.txt
blue:[1 2] birds:[1] bright:[1] brightly:[3] by:[2]
-