The "Questions" project is part of Harvard's CS50 AI course. It builds an AI system that answers questions by performing document retrieval followed by passage retrieval over a text corpus. The system uses term frequency-inverse document frequency (tf-idf) to identify the documents and passages most relevant to a user query. The project offers hands-on practice with natural language processing (NLP) and information retrieval techniques.
The primary goal of this project is to implement a question-answering system that efficiently identifies and returns the most relevant passages from a set of documents using tf-idf scoring.
The project is implemented in Python using the NLTK library. The main steps include:
- Loading Files: Load all text files from a specified directory.
- Tokenization: Convert documents into a list of words, filtering out punctuation and stopwords.
- Computing IDFs: Calculate Inverse Document Frequency (IDF) values for each word in the corpus.
- Query Processing: Tokenize and process user queries.
- Document Scoring: Score documents based on tf-idf and identify top matches.
- Sentence Extraction and Scoring: Extract sentences from top documents and score them based on query relevance.
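The tokenization, IDF, and document-scoring steps above can be sketched roughly as follows. This is a minimal illustration, not the project's actual code: the function names (`tokenize`, `compute_idfs`, `top_files`), the three-document toy corpus, and the natural-log IDF formula are assumptions, and a regular expression with a tiny hand-written stopword set stands in for NLTK's tokenizer and stopword list so that the sketch runs without any downloads.

```python
import math
import re
from collections import Counter

# Tiny illustrative stopword set (assumption); the project uses NLTK's list.
STOPWORDS = {"a", "an", "and", "are", "how", "in", "is", "of",
             "the", "to", "was", "what", "when"}


def tokenize(text):
    """Lowercase the text and return its words, dropping punctuation and stopwords."""
    words = re.findall(r"[a-z0-9]+", text.lower())
    return [w for w in words if w not in STOPWORDS]


def compute_idfs(documents):
    """Map each word to ln(total documents / documents containing the word)."""
    total = len(documents)
    doc_counts = Counter()
    for words in documents.values():
        doc_counts.update(set(words))  # count each word once per document
    return {word: math.log(total / n) for word, n in doc_counts.items()}


def top_files(query, documents, idfs, n=1):
    """Rank documents by the summed tf-idf of the query words they contain."""
    scores = {}
    for name, words in documents.items():
        tf = Counter(words)  # term frequency within this document
        scores[name] = sum(tf[w] * idfs.get(w, 0.0) for w in query)
    return sorted(scores, key=scores.get, reverse=True)[:n]


# Made-up toy corpus, for illustration only.
corpus = {
    "python.txt": "Python 3.0 was released in 2008.",
    "neural.txt": "Neurons in a neural network connect through weighted edges.",
    "learning.txt": "Supervised learning includes classification and regression.",
}
documents = {name: tokenize(text) for name, text in corpus.items()}
idfs = compute_idfs(documents)
query = set(tokenize("When was Python 3.0 released?"))
best = top_files(query, documents, idfs)
print(best)  # ['python.txt']
```

Treating the query as a set means each query word contributes once per document, weighted by how often it occurs there and how rare it is across the corpus.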
- Setup Environment: Ensure you have Python and NLTK installed, then download the necessary NLTK data.

    pip install nltk

    import nltk
    nltk.download('punkt')
    nltk.download('stopwords')

- Prepare Corpus: Place your text files in a directory (e.g., corpus).
- Run the Project: Call the main function with the corpus directory and queries.

    path_to_corpus_directory = 'corpus'
    queries = [
        "What are the types of supervised learning?",
        "How do neurons connect in a neural network?",
        "When was Python 3.0 released?"
    ]
    main(path_to_corpus_directory, queries)

Example queries:
- "What are the types of supervised learning?"
- "How do neurons connect in a neural network?"
- "When was Python 3.0 released?"
For each query, the system prints the most relevant answer along with the source document.
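One way to keep each answer paired with its source document is to carry the document name alongside every candidate sentence while scoring. The sketch below is an assumption about how that could look, not the project's actual code: `best_passage`, the toy sentences, and the IDF values are hypothetical, and the scoring rule (sum of IDF values of matched query words, with ties broken by the share of sentence words that appear in the query) is one plausible reading of "query relevance".

```python
def best_passage(query_words, sentences_by_doc, idfs):
    """Return (sentence, document name) for the highest-scoring sentence.

    Each sentence is scored by the summed IDF of the query words it contains;
    ties are broken by the fraction of its words that are query words.
    """
    best, best_key = None, (-1.0, -1.0)
    for doc, sentences in sentences_by_doc.items():
        for sentence, words in sentences:
            if not words:
                continue  # skip sentences with no usable tokens
            idf_sum = sum(idfs.get(w, 0.0) for w in query_words & set(words))
            density = sum(w in query_words for w in words) / len(words)
            key = (idf_sum, density)
            if key > best_key:
                best, best_key = (sentence, doc), key
    return best


# Hypothetical pre-tokenized sentences and IDF values, for illustration only.
sentences_by_doc = {
    "python.txt": [
        ("Python 3.0 was released in 2008.", ["python", "3.0", "released", "2008"]),
        ("Python is a programming language.", ["python", "programming", "language"]),
    ],
    "neural.txt": [
        ("Neurons connect through weighted edges.",
         ["neurons", "connect", "weighted", "edges"]),
    ],
}
toy_idfs = {"python": 0.4, "3.0": 1.1, "released": 1.1, "neurons": 1.1}
query_words = {"python", "3.0", "released"}

answer, source = best_passage(query_words, sentences_by_doc, toy_idfs)
print(f"{answer} (source: {source})")
```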
For more details on the project, please visit the CS50 AI Project page.