This repository contains the proof-of-concept code for my bachelor thesis on the topic "Rust for contextualizing industrial semantic data with language models", submitted on January 13, 2025, at Stuttgart University of Applied Sciences, Germany. The thesis is available in German only.
The goal of the thesis was to show whether language models are able to contextualize semantic data from sensors and other IoT devices. The models were used exactly as provided on Hugging Face, without any modification; therefore, no fine-tuning was involved. The main implementation was done entirely in Rust, while the score evaluation was done in Python using Pandas and Seaborn plots.
The RAG pipeline used here is not optimized in any way and only serves to demonstrate its impact, especially with regard to external documentation that may be needed to answer questions requiring additional context or information.
- Ensure you have Rust and Cargo installed on your system.
- Set up a Qdrant vector database, e.g., using Docker (a minimal example command is shown after this list).
- Set up a local LLM inference server on port 1337 that supports the OpenAI-compatible API. The servers used in this project were mistral.rs and llama.cpp. For example, with llama.cpp, the following command loads the given model with all layers offloaded to the GPU:
.\llama-server.exe --port 1337 --n-gpu-layers 15000 -m "D:\llms\Mistral-Nemo-Instruct-2407-Q6_K.gguf"
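The Qdrant instance mentioned above can be started with the official Docker image; the command below is a minimal sketch that exposes the default REST port 6333 and gRPC port 6334 (which of the two the Rust binaries connect to depends on the client configuration):

docker run -p 6333:6333 -p 6334:6334 qdrant/qdrant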
Use the vector_store binary to insert points from the PDF and JSON files:
cargo run --bin vector_store
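To verify that the points were actually inserted, the REST API of a locally running Qdrant instance can be queried for its collections; the collection name created by vector_store is not assumed here:

curl http://localhost:6333/collections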
To ask a single question using RAG, use the augment_sense binary:
cargo run --bin augment_sense -- -p "Your question here"
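To compare a RAG-augmented answer against the plain model, the same question can be sent directly to the inference server through its OpenAI-compatible endpoint. The request below is only a sketch (Unix-shell quoting shown; adapt for PowerShell, and add a model field if the server requires one):

curl http://localhost:1337/v1/chat/completions -H "Content-Type: application/json" -d '{"messages": [{"role": "user", "content": "Your question here"}]}'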
This PoC makes use of a quiz file to process many questions into a single response output JSON file. This file can then be used to evaluate the responses using the implemented G-Eval (with Coherence, Consistency, Fluency and Relevance prompt files in the /prompts dir), BERTScore and SemScore metrics.
Running answer_questions.ps1 runs inference on all questions 10 times (or as otherwise defined in the script) to produce 10 separate output JSON files. These can then be evaluated step by step with evaluate_outputs.ps1, which processes all JSON files in a given directory.
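A typical run might look like the following; note that the output directory name and the way the directory is passed to evaluate_outputs.ps1 are assumptions, so check the script parameters before running:

.\answer_questions.ps1
.\evaluate_outputs.ps1 .\outputs  # directory argument is hypothetical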
My evaluation in Python can be found in the Jupyter notebook.