Skip to content

A custom search engine built with Rust. It parses HTML files and utilizes TF-IDF scoring to rank document relevance based on search queries. The project includes a Rust-based backend server and vanilla HTML/CSS for the web frontend.

License

Notifications You must be signed in to change notification settings

KjetilIN/rs-search-engine

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

54 Commits
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 

Repository files navigation

Rust Search Engine

Created by Kjetil Indrehus

version Rust

This project is a custom search engine built with Rust, designed specifically for gutenberg.org books. It parses HTML files from the Kubernetes website and leverages TF-IDF (Term Frequency-Inverse Document Frequency) for scoring the relevance of documents, providing accurate and efficient search results. The search engine is composed of two main components: the backend server and the frontend interface. The backend server, implemented in Rust, processes HTML files, tokenizes the content, and calculates TF-IDF scores to determine the relevance of documents based on search terms. The frontend interface allows users to input search queries and view the ranked search results.

Screencast.from.24.mai.2024.kl.17.35.+0200.webm

Key Features

  • HTML Parsing: Efficiently parses HTML files from the Kubernetes website.
  • Tokenization: Breaks down the content into individual tokens (words) for analysis.
  • TF-IDF Scoring: Uses TF-IDF to score and rank documents based on their relevance to the search query.
  • Multithreaded Indexer: Parses documents concurrently using rayon for improved performance.
  • Rust Backend: Utilizes Rust and the tiny_http library to serve search requests.
  • Frontend Interface: Provides a frontend created by vanilla HTML/CSS.

Run Locally

Running the webserver:

cargo run serve

For indexing the HTML files to a file:

cargo run index file

For loading and viewing the files for the engine:

cargo run load

NOTE Set domain variable in ./frontend/script.js to 0.0.0.0:8080

Run with Docker

Running the application with docker is simple:

docker compose up --build 

(add -d option for running detached)

Screenshot from 2024-06-16 17-04-50

Read more about Docker Compose here.

NOTE Set domain variable in ./frontend/script.js to localhost:8080

Setup

There is two options for setting up the project.

  1. Use my files as documents (easiest)
  2. Setup your own search engine files

1. Use my files as document

  1. Unzip the pages directory locally:
tar -xvf ./cache/pages.tar.gz .
  1. Re-index the documents
cargo run parse file 
  1. Start the HTTP server locally
cargo run serve 

2. Setup your own search engine files

  1. Create a list of urls that you want to index. Each url must lead to a html file form the www.gutenberg.org website. Store them with the url and title separated with a semicolon in ./cache/urls.txt. For example:
https://www.gutenberg.org/cache/epub/57532/pg57532-images.html ; Passages from the Life of a Philosopher
https://www.gutenberg.org/cache/epub/69512/pg69512-images.html ; The calculus of logic
https://www.gutenberg.org/cache/epub/55280/pg55280-images.html ; An Enquiry into the Life and Legend of Michael Scot
....
  1. Create a pages directory
mkdir -p ./pages/
  1. Download each html file and set the name of each file equal to file<INDEX>.html, where INDEX is equal to the line number of the file in ./cache/urls.txt. Store each file in the ./pages/ directory.
  2. Re-index the documents
cargo run parse file 
  1. Start the HTTP server locally
cargo run serve 

Search API

Searching is executed by sending a POST request to the backend, with the search query as plain text

Endpoint

POST /api/search

Response

The following is a sample response from the output:

{
    "results": [
        {
            "url": "https://example.com",
            "title": "Example Domain",
            "tf_idf_score": 0.00234 
        },
        {
            "url": "https://anotherexample.com",
            "title": "Another Example",
            "tf_idf_score": 0.00234 
        }
    ]
}

Resources

Term Frequency–Inverse Document Frequency (tf-idf)
https://en.wikipedia.org/wiki/Tf%E2%80%93idf
https://www.geeksforgeeks.org/understanding-tf-idf-term-frequency-inverse-document-frequency/

Kubernetes Website Repository:
https://github.com/kubernetes/website

About

A custom search engine built with Rust. It parses HTML files and utilizes TF-IDF scoring to rank document relevance based on search queries. The project includes a Rust-based backend server and vanilla HTML/CSS for the web frontend.

Topics

Resources

License

Stars

Watchers

Forks