This project is a custom search engine built with Rust, designed specifically for books from gutenberg.org. It parses HTML files downloaded from gutenberg.org and uses TF-IDF (Term Frequency-Inverse Document Frequency) to score document relevance, providing accurate and efficient search results. The search engine consists of two main components: a backend server and a frontend interface. The backend server, implemented in Rust, processes HTML files, tokenizes the content, and calculates TF-IDF scores to rank documents against the search terms. The frontend interface lets users enter search queries and view the ranked results.
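At its core, tf-idf multiplies how often a term appears in a document by how rare that term is across the whole corpus. A rough Rust sketch of that scoring (illustrative names, not the project's actual functions):

```rust
/// Term frequency: how often `term` occurs in a document, relative to its length.
fn tf(term: &str, doc_tokens: &[String]) -> f64 {
    if doc_tokens.is_empty() {
        return 0.0;
    }
    let count = doc_tokens.iter().filter(|t| t.as_str() == term).count();
    count as f64 / doc_tokens.len() as f64
}

/// Inverse document frequency: terms that appear in fewer documents weigh more.
fn idf(term: &str, all_docs: &[Vec<String>]) -> f64 {
    let docs_with_term = all_docs
        .iter()
        .filter(|doc| doc.iter().any(|t| t == term))
        .count();
    // +1 in the denominator avoids division by zero for unseen terms.
    (all_docs.len() as f64 / (1.0 + docs_with_term as f64)).ln()
}

/// Score a document for a multi-term query by summing per-term tf-idf.
fn score(query: &[String], doc_tokens: &[String], all_docs: &[Vec<String>]) -> f64 {
    query
        .iter()
        .map(|term| tf(term, doc_tokens) * idf(term, all_docs))
        .sum()
}
```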
- HTML Parsing: Efficiently parses HTML files downloaded from gutenberg.org.
- Tokenization: Breaks down the content into individual tokens (words) for analysis.
- TF-IDF Scoring: Uses TF-IDF to score and rank documents based on their relevance to the search query.
- Multithreaded Indexer: Parses documents concurrently using `rayon` for improved performance (see the sketch after this list).
- Rust Backend: Built in Rust with the `tiny_http` library to serve search requests.
- Frontend Interface: Provides a simple frontend built with vanilla HTML, CSS, and JavaScript.
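The multithreaded indexing step can be pictured roughly like this (a sketch using `rayon`'s parallel iterators; the real indexer's tokenization and data types will differ):

```rust
use rayon::prelude::*;
use std::collections::HashMap;
use std::fs;
use std::path::PathBuf;

/// Tokenize and count terms for every HTML file in parallel.
fn index_pages(paths: Vec<PathBuf>) -> Vec<(PathBuf, HashMap<String, usize>)> {
    paths
        .into_par_iter() // rayon parallelizes the per-file work
        .filter_map(|path| {
            let html = fs::read_to_string(&path).ok()?;
            let mut counts = HashMap::new();
            // Naive tokenization: lowercase runs of alphanumeric characters.
            for token in html
                .to_lowercase()
                .split(|c: char| !c.is_alphanumeric())
                .filter(|t| !t.is_empty())
            {
                *counts.entry(token.to_string()).or_insert(0) += 1;
            }
            Some((path, counts))
        })
        .collect()
}
```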
Running the webserver:
cargo run serve
To index the HTML files and save the index to a file:
cargo run index file
To load and view the files indexed by the engine:
cargo run load
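Under the hood, the `serve` command amounts to an HTTP loop along these lines (a sketch assuming the `tiny_http` crate; the bound address and placeholder response are illustrative):

```rust
use std::io::Read;
use tiny_http::{Method, Response, Server, StatusCode};

fn main() {
    // Bind the HTTP server; the real project may use another address or port.
    let server = Server::http("0.0.0.0:8080").expect("failed to bind server");

    for mut request in server.incoming_requests() {
        if request.method() == &Method::Post && request.url() == "/api/search" {
            // The search query arrives as plain text in the request body.
            let mut query = String::new();
            let _ = request.as_reader().read_to_string(&mut query);

            // Here the engine would tokenize `query`, rank documents by
            // tf-idf, and serialize the ranked results as JSON.
            let body = r#"{"results": []}"#; // placeholder response
            let _ = request.respond(Response::from_string(body));
        } else {
            let _ = request.respond(Response::from_string("not found").with_status_code(StatusCode(404)));
        }
    }
}
```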
NOTE: Set the `domain` variable in `./frontend/script.js` to `0.0.0.0:8080`.
Running the application with Docker is simple:
docker compose up --build
(add the -d flag to run detached)
Read more about Docker Compose here.
NOTE: Set the `domain` variable in `./frontend/script.js` to `localhost:8080`.
There are two options for setting up the project.
- Use my files as documents (easiest)
- Set up your own search engine files
- Extract the pages directory locally:
tar -xvf ./cache/pages.tar.gz .
- Re-index the documents:
cargo run parse file
- Start the HTTP server locally:
cargo run serve
- Create a list of URLs that you want to index. Each URL must lead to an HTML file from the www.gutenberg.org website. Store the URL and title, separated by a semicolon, in ./cache/urls.txt. For example:
https://www.gutenberg.org/cache/epub/57532/pg57532-images.html ; Passages from the Life of a Philosopher
https://www.gutenberg.org/cache/epub/69512/pg69512-images.html ; The calculus of logic
https://www.gutenberg.org/cache/epub/55280/pg55280-images.html ; An Enquiry into the Life and Legend of Michael Scot
....
- Create a pages directory:
mkdir -p ./pages/
- Download each HTML file and name it file<INDEX>.html, where INDEX is the line number of the URL in ./cache/urls.txt. Store each file in the ./pages/ directory (see the sketch after this list for one way to automate this).
- Re-index the documents:
cargo run parse file
- Start the HTTP server locally:
cargo run serve
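One way to automate the download step is a small helper along these lines (a sketch that shells out to `curl` and assumes 1-based line numbering; adjust it to the indexer's actual naming convention):

```rust
use std::fs;
use std::process::Command;

fn main() -> std::io::Result<()> {
    fs::create_dir_all("./pages")?;
    let urls = fs::read_to_string("./cache/urls.txt")?;

    for (i, line) in urls.lines().enumerate() {
        // Each line is "<url> ; <title>"; only the URL part is needed here.
        let url = line.split(';').next().unwrap_or("").trim();
        if url.is_empty() {
            continue;
        }
        // Line numbers are assumed to start at 1, matching file<INDEX>.html.
        let out = format!("./pages/file{}.html", i + 1);
        // Shell out to curl; any HTTP client would work just as well.
        Command::new("curl").args(["-L", "-o", out.as_str(), url]).status()?;
    }
    Ok(())
}
```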
Searching is performed by sending a POST request to the backend, with the search query as plain text in the request body:
POST /api/search
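For example, assuming the server is listening on localhost:8080:
curl -X POST --data "calculus of logic" http://localhost:8080/api/search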
The following is a sample response:
{
"results": [
{
"url": "https://example.com",
"title": "Example Domain",
"tf_idf_score": 0.00234
},
{
"url": "https://anotherexample.com",
"title": "Another Example",
"tf_idf_score": 0.00234
}
]
}
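The response shape maps directly onto a pair of structs, for example (a sketch assuming the `serde` and `serde_json` crates, which are an assumption here rather than confirmed project dependencies):

```rust
use serde::Deserialize;

#[derive(Debug, Deserialize)]
struct SearchResponse {
    results: Vec<SearchResult>,
}

#[derive(Debug, Deserialize)]
struct SearchResult {
    url: String,
    title: String,
    tf_idf_score: f64,
}

/// Parse the JSON body returned by /api/search.
fn parse_response(json: &str) -> serde_json::Result<SearchResponse> {
    serde_json::from_str(json)
}
```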
- Term Frequency–Inverse Document Frequency (tf-idf): https://en.wikipedia.org/wiki/Tf%E2%80%93idf and https://www.geeksforgeeks.org/understanding-tf-idf-term-frequency-inverse-document-frequency/
- Kubernetes Website Repository: https://github.com/kubernetes/website