This project focuses on key Information Retrieval (IR) tasks, including:
- Positional Indexing – Phrase query searching using a positional index.
- TF-IDF Weighting & Cosine Similarity – Calculating term weights and identifying relevant documents.
- Inverted Index Construction – Indexing TED Talk descriptions for efficient retrieval.
📌 Datasets Used:
- Positional Index Sample Data (Manually provided).
- Document-Term Matrix (For TF-IDF calculations).
- TED Talks Dataset (`ted_main.csv`) – Extracting and processing text descriptions.
📌 Programming Language: Python 3
📌 Libraries Used: NLTK, NumPy, pandas, math, csv
- Queries Evaluated:
  - "fools rush in"
  - "fools rush in" AND "angels fear to tread"
- Task: Identify matching documents and positions from a given positional index.
- Issue Identified: The index may have inconsistencies affecting search accuracy.
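A phrase query over a positional index can be sketched as follows. This is a minimal illustration, not the project's own code: the `phrase_search` helper, the `term -> {doc_id: [positions]}` layout, and the toy index are all assumptions made for the example, and the data is not the manually provided sample index.

```python
from typing import Dict, List

# Assumed layout: term -> {doc_id: sorted list of word positions}.
PositionalIndex = Dict[str, Dict[int, List[int]]]


def phrase_search(index: PositionalIndex, phrase: str) -> Dict[int, List[int]]:
    """Return {doc_id: [start positions]} where the phrase occurs contiguously."""
    terms = phrase.lower().split()
    if not terms or any(t not in index for t in terms):
        return {}

    # Only documents that contain every term can hold the phrase.
    candidates = set(index[terms[0]])
    for t in terms[1:]:
        candidates &= set(index[t])

    matches: Dict[int, List[int]] = {}
    for doc in sorted(candidates):
        # A start position p matches if the k-th following term sits at p + k.
        starts = [
            p
            for p in index[terms[0]][doc]
            if all(p + k in index[t][doc] for k, t in enumerate(terms[1:], start=1))
        ]
        if starts:
            matches[doc] = starts
    return matches


# Toy data for illustration only.
toy_index: PositionalIndex = {
    "fools": {1: [1, 17], 2: [4]},
    "rush": {1: [2, 30], 2: [9]},
    "in": {1: [3, 8], 2: [10]},
    "angels": {1: [10]},
    "fear": {1: [11]},
    "to": {1: [12]},
    "tread": {1: [13]},
}

first = phrase_search(toy_index, "fools rush in")          # {1: [1]}
second = phrase_search(toy_index, "angels fear to tread")  # {1: [10]}
both = sorted(set(first) & set(second))                    # [1]
```

The AND of two phrase queries is taken here as the intersection of their matching document sets. Document 2 contains all three words of the first phrase but not at consecutive positions, so it is excluded.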
- Task: Compute TF-IDF weights.
- Cosine Similarity: Determine the most relevant document using cos(θ) = (A • B) / (||A|| ||B||), where A and B are the TF-IDF weight vectors being compared.
- Goal: Rank document similarity based on weighted term importance.
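The TF-IDF and cosine steps above might look like the sketch below. The weighting scheme is an assumption, since the README does not state which variant was used: log-scaled term frequency `1 + log10(tf)` multiplied by `idf = log10(N / df)`. The function names and the small count matrix are likewise illustrative.

```python
import math


def tfidf_weights(counts):
    """counts: one row per term, each row holding raw counts per document."""
    n_docs = len(counts[0])
    weights = []
    for row in counts:
        df = sum(1 for c in row if c > 0)            # document frequency
        idf = math.log10(n_docs / df) if df else 0.0  # assumed idf variant
        weights.append(
            [(1 + math.log10(c)) * idf if c > 0 else 0.0 for c in row]
        )
    return weights


def cosine(a, b):
    """cos(theta) = (A . B) / (||A|| ||B||); returns 0.0 for a zero vector."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


# Toy term-by-document counts (3 terms x 3 documents), for illustration only.
counts = [
    [2, 0, 1],
    [0, 3, 1],
    [1, 1, 1],  # appears in every document, so its idf is 0
]
W = tfidf_weights(counts)
doc_vectors = [list(col) for col in zip(*W)]  # one vector per document

query = [1.0, 1.0, 0.0]  # hypothetical query weighting the first two terms
scores = [cosine(query, d) for d in doc_vectors]
best = max(range(len(scores)), key=scores.__getitem__)  # index 2
```

The third document wins because it is the only one containing both query terms. The term that occurs in every document contributes nothing to any score, which is the intended effect of the idf factor.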
- Preprocessing: Tokenization, Lowercasing, Stopword Removal, Stemming.
- Index Output Files:
  - `TED_term_index.csv` – Term-to-ID mapping with document frequency.
  - `TED_doc_index.csv` – Document-to-ID mapping (TED Talk URLs).
  - `TED_inverted_index.csv` – Term ID → (Doc ID, Term Frequency).
- Query Processing: Boolean AND retrieval for:
  - 'climate' AND 'change'
  - 'climate' AND 'fuel'
  - 'artificial' AND 'intelligence'
  - 'giant' AND 'troll'
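The preprocessing, indexing, and Boolean AND steps can be sketched roughly as below. Several simplifications are assumed: the index is kept in memory as `term -> {doc_id: term frequency}` rather than written to the three CSV files, a regex tokenizer and a tiny hardcoded stopword set stand in for NLTK's tokenizer and stopword corpus, and the three descriptions are made up rather than taken from `ted_main.csv`. Function names are hypothetical.

```python
import re
from collections import Counter, defaultdict

try:
    from nltk.stem import PorterStemmer
    _stem = PorterStemmer().stem
except ImportError:  # identity fallback keeps the sketch runnable without NLTK
    def _stem(word):
        return word

# Small stand-in for NLTK's English stopword list (which needs a corpus download).
STOPWORDS = {"a", "an", "and", "the", "of", "to", "in", "is", "on", "for"}


def preprocess(text):
    """Lowercase, tokenize on letters, drop stopwords, then stem."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return [_stem(t) for t in tokens if t not in STOPWORDS]


def build_inverted_index(docs):
    """docs: {doc_id: description}. Returns {term: {doc_id: term frequency}}."""
    index = defaultdict(dict)
    for doc_id, text in docs.items():
        for term, tf in Counter(preprocess(text)).items():
            index[term][doc_id] = tf
    return index


def boolean_and(index, *words):
    """Doc IDs containing every query word (after the same preprocessing)."""
    postings = []
    for word in words:
        terms = preprocess(word)
        if not terms:  # the word was a stopword or contained no letters
            return []
        postings.append(set(index.get(terms[0], {})))
    return sorted(set.intersection(*postings)) if postings else []


# Made-up descriptions, for illustration only.
docs = {
    0: "Climate change is reshaping the fuel industry.",
    1: "Artificial intelligence and the climate.",
    2: "A talk on artificial intelligence.",
}
index = build_inverted_index(docs)

r1 = boolean_and(index, "climate", "change")             # [0]
r2 = boolean_and(index, "climate", "fuel")               # [0]
r3 = boolean_and(index, "artificial", "intelligence")    # [1, 2]
r4 = boolean_and(index, "giant", "troll")                # []
```

Query words go through the same `preprocess` function as the documents, so a stemmed index term such as the Porter stem of "climate" still matches the unstemmed query word.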