This project is a custom search engine built with Rust, designed specifically for books from gutenberg.org. It parses HTML files downloaded from gutenberg.org and uses TF-IDF (Term Frequency-Inverse Document Frequency) to score document relevance, providing accurate and efficient search results. The search engine consists of two main components: a backend server and a frontend interface. The backend server, implemented in Rust, processes HTML files, tokenizes the content, and calculates TF-IDF scores to rank documents against the search terms. The frontend interface lets users enter search queries and view the ranked results.
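At its core, tf-idf multiplies how often a term appears in a document by how rare that term is across the whole corpus. A rough Rust sketch of that scoring (illustrative names, not the project's actual functions):

```rust
/// Term frequency: how often `term` occurs in a document, relative to its length.
fn tf(term: &str, doc_tokens: &[String]) -> f64 {
    if doc_tokens.is_empty() {
        return 0.0;
    }
    let count = doc_tokens.iter().filter(|t| t.as_str() == term).count();
    count as f64 / doc_tokens.len() as f64
}

/// Inverse document frequency: terms that appear in fewer documents weigh more.
fn idf(term: &str, all_docs: &[Vec<String>]) -> f64 {
    let docs_with_term = all_docs
        .iter()
        .filter(|doc| doc.iter().any(|t| t == term))
        .count();
    // +1 in the denominator avoids division by zero for unseen terms.
    (all_docs.len() as f64 / (1.0 + docs_with_term as f64)).ln()
}

/// Score a document for a multi-term query by summing per-term tf-idf.
fn score(query: &[String], doc_tokens: &[String], all_docs: &[Vec<String>]) -> f64 {
    query
        .iter()
        .map(|term| tf(term, doc_tokens) * idf(term, all_docs))
        .sum()
}
```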
- HTML Parsing: Efficiently parses HTML files downloaded from gutenberg.org.
- Tokenization: Breaks down the content into individual tokens (words) for analysis.
- TF-IDF Scoring: Uses TF-IDF to score and rank documents based on their relevance to the search query.
- Multithreaded Indexer: Parses documents concurrently using `rayon` for improved performance (see the sketch after this list).
- Rust Backend: Built in Rust with the `tiny_http` library to serve search requests.
- Frontend Interface: Provides a simple frontend built with vanilla HTML, CSS, and JavaScript.
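The multithreaded indexing step can be pictured roughly like this (a sketch using `rayon`'s parallel iterators; the real indexer's tokenization and data types will differ):

```rust
use rayon::prelude::*;
use std::collections::HashMap;
use std::fs;
use std::path::PathBuf;

/// Tokenize and count terms for every HTML file in parallel.
fn index_pages(paths: Vec<PathBuf>) -> Vec<(PathBuf, HashMap<String, usize>)> {
    paths
        .into_par_iter() // rayon parallelizes the per-file work
        .filter_map(|path| {
            let html = fs::read_to_string(&path).ok()?;
            let mut counts = HashMap::new();
            // Naive tokenization: lowercase runs of alphanumeric characters.
            for token in html
                .to_lowercase()
                .split(|c: char| !c.is_alphanumeric())
                .filter(|t| !t.is_empty())
            {
                *counts.entry(token.to_string()).or_insert(0) += 1;
            }
            Some((path, counts))
        })
        .collect()
}
```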
Running the webserver:
cargo run serve
To index the HTML files and save the index to a file:
cargo run index file
To load and view the files indexed by the engine:
cargo run load
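Under the hood, the `serve` command amounts to an HTTP loop along these lines (a sketch assuming the `tiny_http` crate; the bound address and placeholder response are illustrative):

```rust
use std::io::Read;
use tiny_http::{Method, Response, Server, StatusCode};

fn main() {
    // Bind the HTTP server; the real project may use another address or port.
    let server = Server::http("0.0.0.0:8080").expect("failed to bind server");

    for mut request in server.incoming_requests() {
        if request.method() == &Method::Post && request.url() == "/api/search" {
            // The search query arrives as plain text in the request body.
            let mut query = String::new();
            let _ = request.as_reader().read_to_string(&mut query);

            // Here the engine would tokenize `query`, rank documents by
            // tf-idf, and serialize the ranked results as JSON.
            let body = r#"{"results": []}"#; // placeholder response
            let _ = request.respond(Response::from_string(body));
        } else {
            let _ = request.respond(Response::from_string("not found").with_status_code(StatusCode(404)));
        }
    }
}
```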
NOTE: Set the `domain` variable in `./frontend/script.js` to `0.0.0.0:8080`.
Running the application with Docker is simple:
docker compose up --build
(add the -d flag to run detached)
Read more about Docker Compose here.
NOTE: Set the `domain` variable in `./frontend/script.js` to `localhost:8080`.
There are two options for setting up the project.
- Use my files as documents (easiest)
- Set up your own search engine files
- Extract the pages directory locally:
tar -xvf ./cache/pages.tar.gz .
- Re-index the documents:
cargo run parse file
- Start the HTTP server locally:
cargo run serve
- Create a list of URLs that you want to index. Each URL must lead to an HTML file from the www.gutenberg.org website. Store the URL and title, separated by a semicolon, in ./cache/urls.txt. For example:
https://www.gutenberg.org/cache/epub/57532/pg57532-images.html ; Passages from the Life of a Philosopher
https://www.gutenberg.org/cache/epub/69512/pg69512-images.html ; The calculus of logic
https://www.gutenberg.org/cache/epub/55280/pg55280-images.html ; An Enquiry into the Life and Legend of Michael Scot
....
- Create a pages directory:
mkdir -p ./pages/
- Download each HTML file and name it file<INDEX>.html, where INDEX is the line number of the URL in ./cache/urls.txt. Store each file in the ./pages/ directory (see the sketch after this list for one way to automate this).
- Re-index the documents:
cargo run parse file
- Start the HTTP server locally:
cargo run serve
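One way to automate the download step is a small helper along these lines (a sketch that shells out to `curl` and assumes 1-based line numbering; adjust it to the indexer's actual naming convention):

```rust
use std::fs;
use std::process::Command;

fn main() -> std::io::Result<()> {
    fs::create_dir_all("./pages")?;
    let urls = fs::read_to_string("./cache/urls.txt")?;

    for (i, line) in urls.lines().enumerate() {
        // Each line is "<url> ; <title>"; only the URL part is needed here.
        let url = line.split(';').next().unwrap_or("").trim();
        if url.is_empty() {
            continue;
        }
        // Line numbers are assumed to start at 1, matching file<INDEX>.html.
        let out = format!("./pages/file{}.html", i + 1);
        // Shell out to curl; any HTTP client would work just as well.
        Command::new("curl").args(["-L", "-o", out.as_str(), url]).status()?;
    }
    Ok(())
}
```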
Searching is performed by sending a POST request to the backend, with the search query as plain text in the request body:
POST /api/search
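For example, assuming the server is listening on localhost:8080:
curl -X POST --data "calculus of logic" http://localhost:8080/api/search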
The following is a sample response:
{
"results": [
{
"url": "https://example.com",
"title": "Example Domain",
"tf_idf_score": 0.00234
},
{
"url": "https://anotherexample.com",
"title": "Another Example",
"tf_idf_score": 0.00234
}
]
}
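The response shape maps directly onto a pair of structs, for example (a sketch assuming the `serde` and `serde_json` crates, which are an assumption here rather than confirmed project dependencies):

```rust
use serde::Deserialize;

#[derive(Debug, Deserialize)]
struct SearchResponse {
    results: Vec<SearchResult>,
}

#[derive(Debug, Deserialize)]
struct SearchResult {
    url: String,
    title: String,
    tf_idf_score: f64,
}

/// Parse the JSON body returned by /api/search.
fn parse_response(json: &str) -> serde_json::Result<SearchResponse> {
    serde_json::from_str(json)
}
```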
- Term Frequency–Inverse Document Frequency (tf-idf): https://en.wikipedia.org/wiki/Tf%E2%80%93idf and https://www.geeksforgeeks.org/understanding-tf-idf-term-frequency-inverse-document-frequency/
- Kubernetes Website Repository: https://github.com/kubernetes/website