PDF Knowledge Base Builder and Search Interface

This repository provides a suite of Python scripts to create a searchable knowledge base from PDF files, enabling efficient and intuitive text search with or without embeddings. The primary focus is simplicity and portability, avoiding the need for a formal database while offering advanced search capabilities.

Main interface

Data folders

Graphical application to process data folders (folders with PDF files)

News

Automatic detection of platform (Linux or Windows) to set the default PDF viewer (Okular in Linux and Acrobat Reader in Windows). (March 9, 2025)
Added option in the main application BM25-String-Embed-Rerank-PDF-Search.py to toggle cropping the PDF view. Some PDF files crop incorrectly, so this option gives the user the capability viewing the uncropped PDF view for problematic PDFs. (March 4, 2025)
Graphical application to process folders in batches, create-JSON-EMB.py, now has a toggle for processing files in batches (uses more memory. If the script crashes disable this option). Also added an option to make the use of the Fastembed library optional (if the Fastembed library is not installed, the option to create EMB files will be disabled). (March 4, 2025)
Now the creation of the database (JSON and EMB) files is easier! New graphical application to process folders in batches create-JSON-EMB.py. Binaries for this new script soon!. (February 2, 2025)
Windows and Linux binaries updated (January 22, 2025)
Fixed critical bug that kept the script BM25-String-Embed-Rerank-PDF-Search.py from running. Updated binaries soon (January 17, 2025)
Windows and Linux binaries are available now! Please see the Releases section.
Now the user can specify and manage which folders to load at runtime. List of folders is saved in the file folders.ini

Features

Quickly search through a library of PDF files with an intuitive interface.
Easily navigate and visualize search results using the built-in GUI.
Allows opening search results in an external PDF viewer (Okular by default).
Extracts text from PDFs and saves them as JSON files.
Supports embedding generation for enhanced search capabilities.
Optional integration with the fastembed library for embedding-based search.
Tested in Linux and Windows 10

Installation

Requirements

Install the required libraries:

pip install pymupdf4llm
pip install PyQt5
pip install bm25s

Optional Libraries

To enable embedding-based search:

pip install fastembed

Pros of using fastembed:

Enables embedding-based search for semantic relevance.
Supports reranking with embeddings.

Cons if fastembed is not installed:

Only BM25 and simple text search methods are available.

External Dependency

The script assumes the Okular PDF viewer is installed on your system. If not, install it using your package manager (e.g., sudo apt install okular on Ubuntu).

Workflow

1. Minimum Workflow (JSON files only)

Run create-JSON-EMB.py to extract text from PDFs into JSON files. This is usually a long process but it needs to be done only once. Use the "Add Folders" button to add folders with PDF files to the queue. Uncheck the checkboxes from the "Process EMB" column. Press the "Start Processing" button to start the JSON and EMB creation.
Use the BM25-String-Embed-Rerank-PDF-Search.py script to search the JSON files using BM25 or simple text search.

2. Recommended Workflow (JSON + EMB files)

Run create-JSON-EMB.py to create JSON files from PDFs. This is usually a long process but it needs to be done only once. Use the "Add Folders" button to add folders with PDF files to the queue. Make sure that the checkboxes from the "Process JSON" and "Process EMB" columns are checked. Press the "Start Processing" button to start the JSON and EMB creation.
Use BM25-String-Embed-Rerank-PDF-Search.py for full search functionality, including embedding-based search and reranking.

Scripts Overview

`create-JSON-EMB.py`

Extracts text from PDF files into JSON format, chunked by pages.
Outputs one JSON file per PDF, named after the original PDF.
Generates embeddings for the text in JSON files using fastembed.
Produces .emb files corresponding to the JSON files. These contain embeddings but exclude text for reduced size.
The script checks for already processed PDFs, so if the user adds more PDFs to a folder, running the script again will only add the missing JSON files.
The script checks for already processed JSON files, so if the user adds more PDFs and processes them into JSON files, running the script again will only add the missing EMB files.

`count tokens-words.py` (not required for the workflow)

Analyzes JSON files to calculate statistics like word count, token count, and token distribution.
Outputs summaries and ASCII histograms for insights into document structure.

`BM25-String-Embed-Rerank-PDF-Search.py`

Main GUI application for searching the knowledge base.
Supports BM25, simple text search, and embedding-based search (if EMB files are available and fastembed is installed).
Allows opening search result pages in the external PDF viewer for easier browsing.

GUI Usage

Components

Search Input Box: Type your search keywords here. Press Enter to execute the search.
Text Results Box: Displays the most relevant text, including:
- Filename
- Page number
- Relevance score
PDF Viewer: Displays the corresponding PDF page with highlighted keywords.
Dropdown Boxes:
- Search Method: Select one of the following methods:
  - BM25: Default search using BM25.
  - Simple Text Search: Searches for exact keywords and phrases.
  - Embeddings Search: Uses semantic search based on embeddings (requires fastembed and EMB files).
- Reranking Method: Options include:
  - No Reranking
  - Minimal Span-Based Scoring
  - Exact Text Search
  - Embeddings Rerank (requires fastembed)
- Behavior:
  - When Simple Text Search or Embeddings Search is selected, the reranking options are disabled.
  - When BM25 is selected, reranking options are enabled.
Navigation Buttons:
- <-- and -->: Navigate between search results.
- + and -: Adjust text font size.
- Ctrl++, Ctrl+-, Ctrl+0: Zoom in, zoom out, or reset zoom for the PDF viewer.

PDF Viewer Integration

Click Behavior: Clicking on the PDF view within the GUI will open the currently displayed page in the external PDF viewer (Okular by default).
Configuring the PDF Viewer:
1. Open BM25-String-Embed-Rerank-PDF-Search.py in a text editor.
2. Locate the following section:
```
subprocess.run(["okular", self.current_pdf_path, "-p", str(self.current_page)])
```
3. Replace "okular" with the command for your preferred PDF viewer. For example:
  - Evince: ["evince", "--page-label=" + str(self.current_page), self.current_pdf_path]
  - Adobe Reader: ["acrord32", "/A", "page=" + str(self.current_page), self.current_pdf_path]
  - Foxit Reader: ["foxitreader", "-page", str(self.current_page), self.current_pdf_path]
4. Save the file.

Note: Different PDF viewers may have varying command-line options for specifying the page to open. Consult the documentation for your preferred viewer for the correct syntax.

Keyboard Shortcuts

Enter: Perform a search.
Alt+Left / Alt+Right: Navigate results.
PageUp / PageDown: Navigate through PDF pages in the viewer.
Ctrl+Left, Ctrl+Right, Ctrl+Up, Ctrl+Down: Scroll the PDF view.
Ctrl++, Ctrl+-, Ctrl+0: Adjust PDF zoom.

Tips

Organize your PDFs in folders and keep separate JSON and EMB files for each folder.
For best performance, use fastembed to enable embedding-based searches and reranking.
If the create-JSON-EMB.py script freezes while processing PDFs, optimize or recreate the problematic PDFs.

Notes

There is a delay when the application starts as it downloads the embedding model. This happens only once. Subsequent runs will not have this delay.
This delay also happens when using the BM25 search method with the Embedding type reranking. This happens only once. Subsequent runs will not have this delay.
There is also a delay when loading the PDF library at the application start. This depends on the size of the PDF library.
Searching with BM25 search with Embedding type reranking has a delay that depends on the number of search hits, as the application has to create embeddings for all the search hits after pressing Enter to submit the query.

Example

Place PDFs in a folder.

Run:

python create-JSON-EMB.py
python BM25-String-Embed-Rerank-PDF-Search.py

Use the GUI to search your knowledge base.

Name	Name	Last commit message	Last commit date
Latest commit Topping1 Update README.md Mar 10, 2025 40fa0ae · Mar 10, 2025 History 46 Commits
old	old	Update readme.md	Feb 3, 2025
BM25-String-Embed-Rerank-PDF-Search.py	BM25-String-Embed-Rerank-PDF-Search.py	Automatic detection of platform (linux or windows)	Mar 10, 2025
LICENSE	LICENSE	Initial commit	Dec 27, 2024
README.md	README.md	Update README.md	Mar 10, 2025
count tokens-words.py	count tokens-words.py	Add files via upload	Jan 6, 2025
create-JSON-EMB.py	create-JSON-EMB.py	added batch processing and optional fastembed	Mar 5, 2025
screenshot-BM25-PDF-Search.jpg	screenshot-BM25-PDF-Search.jpg	Add files via upload	Jan 6, 2025

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

PDF Knowledge Base Builder and Search Interface

News

Features

Installation

Requirements

Optional Libraries

External Dependency

Workflow

1. Minimum Workflow (JSON files only)

2. Recommended Workflow (JSON + EMB files)

Scripts Overview

`create-JSON-EMB.py`

`count tokens-words.py` (not required for the workflow)

`BM25-String-Embed-Rerank-PDF-Search.py`

GUI Usage

Components

PDF Viewer Integration

Keyboard Shortcuts

Tips

Notes

Example

About

Releases 7

Packages

Languages

License

Topping1/BM25-PDF-Search

Folders and files

Latest commit

History

Repository files navigation

PDF Knowledge Base Builder and Search Interface

News

Features

Installation

Requirements

Optional Libraries

External Dependency

Workflow

1. Minimum Workflow (JSON files only)

2. Recommended Workflow (JSON + EMB files)

Scripts Overview

create-JSON-EMB.py

count tokens-words.py (not required for the workflow)

BM25-String-Embed-Rerank-PDF-Search.py

GUI Usage

Components

PDF Viewer Integration

Keyboard Shortcuts

Tips

Notes

Example

About

Resources

License

Stars

Watchers

Forks

Releases 7

Packages 0

Languages

`create-JSON-EMB.py`

`count tokens-words.py` (not required for the workflow)

`BM25-String-Embed-Rerank-PDF-Search.py`

Packages