This repository contains solutions to two assignments:
- Assignment 1: Implementation of a Search Engine for a given corpus.
- Assignment 2: Implementation of a Recommender System.
Assignment 1 involved developing a search engine capable of retrieving relevant information from a corpus of tweets. The project applied foundational Information Retrieval concepts, including indexing, querying, and ranking, to deliver accurate and efficient search results.
- Case Folding: Converts all characters to lowercase to ensure consistency and reduce redundancy.
- Tokenization:
- Sentence Tokenization: Splits text into sentences to preserve contextual meaning.
- Word Tokenization: Splits sentences into individual words for granular analysis.
- Stop Word Removal: Removes common words that do not contribute significant meaning.
- Stemming and Lemmatization:
- Stemming: Reduces words to their root forms.
- Lemmatization: Produces meaningful dictionary forms of words.
- TF-IDF Vectorization: Converts text to numerical form, capturing term importance in a document and across the corpus.
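The notebook contains the full pipeline; as a minimal sketch of the preprocessing and vectorization steps above (assuming NLTK and scikit-learn, with a hypothetical two-tweet corpus):

```python
import nltk
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer
from nltk.tokenize import sent_tokenize, word_tokenize
from sklearn.feature_extraction.text import TfidfVectorizer

nltk.download("punkt")       # tokenizer models
nltk.download("stopwords")   # stop-word lists

stop_words = set(stopwords.words("english"))
stemmer = PorterStemmer()    # WordNetLemmatizer() would produce dictionary forms instead

def preprocess(tweet: str) -> str:
    tweet = tweet.lower()                                    # case folding
    tokens = []
    for sentence in sent_tokenize(tweet):                    # sentence tokenization
        for word in word_tokenize(sentence):                 # word tokenization
            if word.isalpha() and word not in stop_words:    # stop-word removal
                tokens.append(stemmer.stem(word))            # stemming
    return " ".join(tokens)

tweets = ["Loving the new phone camera!", "Traffic was terrible this morning..."]  # hypothetical tweets
corpus = [preprocess(t) for t in tweets]

# TF-IDF vectorization: sparse document-term matrix weighting terms by importance
vectorizer = TfidfVectorizer()
tfidf_matrix = vectorizer.fit_transform(corpus)
```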
- Dictionary:
- Used for inverted and positional indexing.
- Supports fast term lookups when analyzing query intent.
- Positional Index:
- Stores document IDs and term positions to support precise phrase queries (see the sketch after this list).
- List:
- Used for n-gram matching in language models.
- Sparse Matrix:
- Stores TF-IDF vectors to save memory and optimize similarity computations.
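How these structures fit together is best seen in the notebook; below is a minimal, illustrative sketch of the dictionary-backed positional index (the document IDs, tokens, and helper names are assumptions):

```python
from collections import defaultdict

# Positional index: term -> {doc_id: [positions]}
positional_index = defaultdict(lambda: defaultdict(list))

def index_document(doc_id: int, tokens: list[str]) -> None:
    for position, term in enumerate(tokens):
        positional_index[term][doc_id].append(position)

def phrase_match(doc_id: int, phrase: list[str]) -> bool:
    """True if the phrase terms appear at consecutive positions in the document."""
    first, rest = phrase[0], phrase[1:]
    for start in positional_index[first].get(doc_id, []):
        if all(start + offset + 1 in positional_index[term].get(doc_id, [])
               for offset, term in enumerate(rest)):
            return True
    return False

index_document(0, ["new", "phone", "camera"])
index_document(1, ["traffic", "terrible", "morning"])
print(phrase_match(0, ["phone", "camera"]))   # True
```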
- Cosine Similarity: Measures text similarity regardless of document length.
- Likelihood Scoring: Considers the probability of query terms occurring together in text.
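A hedged sketch of how the two ranking signals could be computed, using scikit-learn for the TF-IDF/cosine part and a simple add-one-smoothed unigram model for the likelihood part (the notebook's exact smoothing and score combination may differ):

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

docs = ["new phone camera", "terrible traffic morning", "camera traffic jam"]  # toy preprocessed docs
query = "phone camera"

# Cosine similarity between the TF-IDF query vector and every document vector
tfidf = TfidfVectorizer()
doc_vectors = tfidf.fit_transform(docs)
query_vector = tfidf.transform([query])
cosine_scores = cosine_similarity(query_vector, doc_vectors).ravel()

# Query-likelihood scoring with add-one smoothing over unigram counts
counts = CountVectorizer().fit(docs)
term_counts = counts.transform(docs).toarray()
vocab_size = term_counts.shape[1]
doc_lengths = term_counts.sum(axis=1, keepdims=True)
term_probs = (term_counts + 1) / (doc_lengths + vocab_size)

query_ids = [counts.vocabulary_[t] for t in query.split() if t in counts.vocabulary_]
likelihood_scores = np.prod(term_probs[:, query_ids], axis=1)

# Toy combination of the two signals; the actual weighting is a design choice
ranking = np.argsort(-(cosine_scores + likelihood_scores))
print(ranking)
```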
Assignment 2 involved developing a Book Recommendation System that suggests books to users based on their interests. The system leverages a dataset of books, user profiles, and ratings to provide personalized recommendations, applying data preprocessing, cleaning, and machine learning techniques to improve recommendation accuracy.
- Books Table:
- Removed unnecessary columns (e.g., image URLs).
- Replaced null values and invalid data with appropriate defaults (e.g., mode of publication year).
- Standardized ISBN values.
- Users Table:
- Replaced invalid ages with the mean of valid ages.
- Split location into city, state, and country columns.
- Ratings Table:
- Removed invalid and duplicate rows.
- Ensured data consistency with the Books table.
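A condensed pandas sketch of the cleaning steps above, using tiny in-memory frames in place of the real dataset (the Book-Crossing-style column names are assumptions):

```python
import pandas as pd

# Toy frames standing in for the Books, Users, and Ratings tables
books = pd.DataFrame({
    "ISBN": [" 0439554934", "0971880107"],
    "Book-Title": ["Harry Potter", "Wild Animus"],
    "Year-Of-Publication": ["1999", "bad-value"],
    "Image-URL-S": ["http://example/a.jpg", "http://example/b.jpg"],
})
users = pd.DataFrame({
    "User-ID": [1, 2, 3],
    "Age": [250, 34, None],
    "Location": ["toronto, ontario, canada", "porto, , portugal", "madrid, , spain"],
})
ratings = pd.DataFrame({
    "User-ID": [1, 1, 2],
    "ISBN": ["0439554934", "0439554934", "no-such-isbn"],
    "Book-Rating": [8, 8, 5],
})

# Books: drop image-URL columns, fill bad years with the mode, standardize ISBNs
books = books.drop(columns=[c for c in books.columns if c.startswith("Image-URL")])
books["Year-Of-Publication"] = pd.to_numeric(books["Year-Of-Publication"], errors="coerce")
books["Year-Of-Publication"] = books["Year-Of-Publication"].fillna(books["Year-Of-Publication"].mode()[0])
books["ISBN"] = books["ISBN"].str.strip().str.upper()

# Users: replace invalid ages with the mean of valid ages, split location into parts
valid = users["Age"].between(5, 100)
users.loc[~valid, "Age"] = users.loc[valid, "Age"].mean()
users[["City", "State", "Country"]] = users["Location"].str.split(",", n=2, expand=True)

# Ratings: drop duplicates and keep only rows whose ISBN exists in the Books table
ratings = ratings.drop_duplicates().loc[lambda df: df["ISBN"].isin(books["ISBN"])]
```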
A. Popularity-Based Recommendations
- Globally Popular Books: Ranked books by total ratings.
- Location-Based Popularity: Filtered books by user location and sorted by ratings.
- Author/Publisher-Based Recommendations: Recommended books by the same author or publisher.
- Yearly Popular Books: Grouped and ranked books by year of publication.
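As a rough illustration of these popularity rankings with pandas (the toy frame and column names below are assumptions):

```python
import pandas as pd

# One row per (user, book) rating, already joined with book metadata and user location
data = pd.DataFrame({
    "Book-Title":  ["A", "A", "A", "B", "B", "C"],
    "Book-Author": ["X", "X", "X", "Y", "Y", "X"],
    "Country":     ["canada", "canada", "spain", "canada", "spain", "spain"],
    "Year-Of-Publication": [1999, 1999, 1999, 2001, 2001, 1999],
    "Book-Rating": [8, 9, 7, 10, 6, 5],
})

# Globally popular: rank titles by number of ratings, with mean rating as a tiebreaker
global_popular = (data.groupby("Book-Title")["Book-Rating"]
                      .agg(num_ratings="count", avg_rating="mean")
                      .sort_values(["num_ratings", "avg_rating"], ascending=False))

# Location-based: the same ranking restricted to users from one country
popular_in_canada = (data[data["Country"] == "canada"]
                         .groupby("Book-Title")["Book-Rating"].count()
                         .sort_values(ascending=False))

# Author-based: other titles by the author of a given book
same_author = data.loc[data["Book-Author"] == "X", "Book-Title"].unique()

# Yearly popular: most-rated title for each publication year
yearly_popular = (data.groupby(["Year-Of-Publication", "Book-Title"]).size()
                      .reset_index(name="num_ratings")
                      .sort_values("num_ratings", ascending=False)
                      .drop_duplicates("Year-Of-Publication"))

print(global_popular)
```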
B. Weighted Average Rating Recommendations
- Used a weighted score formula to recommend books with high overall ratings.
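The notebook defines the exact formula; a common choice, assumed here only for illustration, is the IMDB-style weighted rating, which pulls a book's average rating toward the global mean unless it has enough ratings:

```python
import pandas as pd

# Per-book rating statistics (toy numbers; real values come from the ratings table)
stats = pd.DataFrame({
    "Book-Title": ["A", "B", "C"],
    "avg_rating": [9.2, 8.7, 6.0],
    "num_ratings": [12, 480, 3500],
})

C = (stats["avg_rating"] * stats["num_ratings"]).sum() / stats["num_ratings"].sum()  # global mean rating
m = stats["num_ratings"].quantile(0.50)  # minimum ratings threshold (an assumed cutoff)

# IMDB-style weighted rating: W = (v / (v + m)) * R + (m / (v + m)) * C
v, R = stats["num_ratings"], stats["avg_rating"]
stats["weighted_rating"] = (v / (v + m)) * R + (m / (v + m)) * C

recommended = stats.sort_values("weighted_rating", ascending=False)
print(recommended[["Book-Title", "weighted_rating"]])
```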
C. Collaborative Filtering
- Built a user-item matrix and used cosine similarity to recommend books.
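A minimal item-based sketch of this step, assuming a pivoted user-item matrix and scikit-learn's cosine similarity (the toy data and names are illustrative):

```python
import pandas as pd
from sklearn.metrics.pairwise import cosine_similarity

# Toy ratings: (user, book, rating)
ratings = pd.DataFrame({
    "User-ID":     [1, 1, 2, 2, 3, 3],
    "Book-Title":  ["A", "B", "A", "C", "B", "C"],
    "Book-Rating": [8, 7, 9, 6, 8, 9],
})

# User-item matrix: rows are books, columns are users, cells are ratings (0 = unrated)
item_user = ratings.pivot_table(index="Book-Title", columns="User-ID",
                                values="Book-Rating", fill_value=0)

# Item-item cosine similarity on the rating vectors
similarity = pd.DataFrame(cosine_similarity(item_user),
                          index=item_user.index, columns=item_user.index)

def similar_books(title: str, k: int = 2) -> pd.Series:
    """Books whose user-rating patterns are closest to the given title."""
    return similarity[title].drop(title).sort_values(ascending=False).head(k)

print(similar_books("A"))
```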
D. Correlation-Based Recommendations
- Computed a correlation matrix to find similar books.
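For illustration, the same idea with pandas' built-in Pearson correlation over a toy user-item matrix (min_periods guards against correlations computed from tiny overlaps):

```python
import pandas as pd

# Toy user-item matrix: rows are users, columns are book titles (NaN = unrated)
user_item = pd.DataFrame(
    {"A": [8, 9, None, 7], "B": [7, None, 8, 6], "C": [None, 6, 9, 2]},
    index=[1, 2, 3, 4],
)

# Pearson correlation between every pair of book rating columns
corr_matrix = user_item.corr(method="pearson", min_periods=2)

# Books whose ratings move together with book "A"
similar_to_a = corr_matrix["A"].drop("A").sort_values(ascending=False)
print(similar_to_a)
```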
E. Nearest Neighbor Recommendations
- Trained a Nearest Neighbors model on a sparse matrix of user-book ratings.
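A brief sketch of this step with scikit-learn's NearestNeighbors on a SciPy sparse matrix (toy ratings; the metric and algorithm choices are assumptions):

```python
import numpy as np
from scipy.sparse import csr_matrix
from sklearn.neighbors import NearestNeighbors

# Toy book-user rating matrix: rows are books, columns are users (0 = unrated)
book_user = csr_matrix(np.array([
    [8, 0, 9, 0],
    [7, 8, 0, 0],
    [0, 9, 6, 2],
    [8, 0, 8, 1],
], dtype=float))
titles = ["A", "B", "C", "D"]

# Brute-force nearest neighbours under cosine distance on the sparse ratings
model = NearestNeighbors(metric="cosine", algorithm="brute")
model.fit(book_user)

distances, indices = model.kneighbors(book_user[0], n_neighbors=3)
for dist, idx in zip(distances.ravel()[1:], indices.ravel()[1:]):  # skip the book itself
    print(f"{titles[idx]}: cosine distance {dist:.3f}")
```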
F. Content-Based Recommendations
- Used TF-IDF feature vectors for unigrams and bigrams of book titles.
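A minimal content-based sketch over hypothetical titles, using a TfidfVectorizer with ngram_range=(1, 2) for unigrams and bigrams:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

titles = [
    "Harry Potter and the Chamber of Secrets",
    "Harry Potter and the Goblet of Fire",
    "The Hobbit",
    "A Brief History of Time",
]  # hypothetical book titles

# TF-IDF over unigrams and bigrams of the titles
vectorizer = TfidfVectorizer(ngram_range=(1, 2), stop_words="english")
title_vectors = vectorizer.fit_transform(titles)

# Recommend the titles most similar in wording to a chosen one
similarity = cosine_similarity(title_vectors)
query_index = 0
ranked = similarity[query_index].argsort()[::-1][1:]  # best matches first, skipping the title itself
print([titles[i] for i in ranked])
```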
G. Hybrid Recommendations
- Combined content-based and collaborative filtering methods for enhanced recommendations.
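The exact blending used in the notebook isn't reproduced here; as one hedged possibility, a weighted average of normalized content-based and collaborative scores for the same candidate books:

```python
import numpy as np

# Hypothetical, already-computed scores for the same candidate books
# (e.g. from the content-based and collaborative sketches above), scaled to [0, 1]
content_scores = np.array([0.90, 0.10, 0.40, 0.05])
collab_scores  = np.array([0.20, 0.80, 0.50, 0.10])
titles = ["A", "B", "C", "D"]

# Simple weighted blend of the two signals; alpha is a tunable assumption
alpha = 0.6
hybrid_scores = alpha * content_scores + (1 - alpha) * collab_scores

ranking = np.argsort(-hybrid_scores)
print([titles[i] for i in ranking])
```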
- IPython Notebook: For interactive coding.
- scikit-learn: For machine learning models.
- NumPy and SciPy: For numerical operations.
- Pandas: For data manipulation.
- Matplotlib and Seaborn: For data visualization.