This repository contains a Python script designed to conduct research on the comparative effectiveness of Large Language Models (LLMs) in evaluating manual test cases. The research aims to explore whether AI models can match or exceed human engineers in assessing the quality of test cases, and to investigate the impact of LLM reasoning capabilities and prompt engineering on evaluation quality.
This project uses Langchain and Ollama to evaluate test cases against predefined quality criteria, providing a structured, data-driven approach to test case assessment.
This research endeavors to answer the following key questions:
- Quality Comparison: Can AI models write higher-quality test cases than human software engineers?
- Reasoning Impact: Does the inherent reasoning capability of different LLMs significantly influence the quality of test case evaluations they produce?
- Prompt Engineering vs. Model Reasoning: Can prompt engineering applied to a standard LLM achieve comparable reasoning quality to a more advanced LLM, specifically in the context of test case evaluation?
The codebase is structured as follows:

- `data/`: This directory contains data files:
  - `raw_data.json`: Raw data of test cases imported from Google Sheets.
  - `cleaned_data.json`: Cleaned and structured copy of the raw data, containing the manual test cases to be evaluated.
  - `data/evaluations/[model_name]`: Directory to store evaluation results. NOTE: The folder and file structure is the same for every model; results are simply grouped into a folder named after the processing model for better organization.
    - `archive/processed_results.json`: Archive folder containing processed test case evaluations, kept separate from the other files; essentially a copy of `success.json`.
    - `remaining.json`: JSON file storing the remaining test cases that are left to be processed.
    - `success.json`: JSON file storing successful test case evaluations.
    - `failed.json`: JSON file storing details of failed test case evaluations.
  - `data/results/[model_name]`: Directory to store calculated statistics. NOTE: The folder and file structure is the same for every model; results are simply grouped into a folder named after the processing model for better organization.
    - `stats.json`: JSON file storing group-wise stats calculated over the evaluation score of each test case.
    - `adv_stats.json`: JSON file storing the group-wise relationships between the scores, computed by performing ANOVA and t-tests.
- `modules/`: This directory houses Python modules:
  - `helper.py`: Contains helper functions for data loading, chunking, and saving.
  - `stats_helper.py`: Contains helper functions for calculating statistics over the evaluated results (a minimal sketch appears after this list):
    - A function that calculates descriptive statistics for each group in the dataset.
    - A function that performs statistical tests (ANOVA and pairwise t-tests) on the dataset.
    - A function that calculates comprehensive statistics and comparative metrics for the test case groups.
  - `langchain_helper.py`: Sets up the Langchain components (see the chain-creation sketch after this list), including:
    - Pydantic model definitions for structured output parsing.
    - Prompt template for guiding LLM evaluation.
    - Chain creation using Langchain and Ollama.
- `main.py`: The main script to run the test case evaluation process.
- `groc_main.py`: Same script as `main.py`, except that it uses the Groq API for the test case evaluation process. NOTE: Kindly ensure you import your API keys when using Groq:
  - Step 1: Rename the `example_api_key.py` file to `api_keys.py`
  - Step 2: Paste all your API keys into it
- `calc_stats.py`: Script to calculate the stats of each group based on the evaluation scores produced by either of the above scripts, i.e. `main.py` or `groc_main.py`. NOTE: Kindly ensure all test cases have been evaluated before running this stats script.
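As referenced above, here is a minimal sketch of what the group-wise statistics helpers in `modules/stats_helper.py` could look like, using pandas and SciPy. The function names and the `group`/`score` column names are assumptions made for illustration; the actual module may be organized differently.

```python
# Illustrative sketch only; not the repository's actual stats_helper.py.
from itertools import combinations

import pandas as pd
from scipy import stats


def descriptive_stats(df: pd.DataFrame, group_col: str = "group", score_col: str = "score") -> pd.DataFrame:
    """Descriptive statistics for each group in the dataset."""
    return df.groupby(group_col)[score_col].agg(["count", "mean", "median", "std", "min", "max"])


def statistical_tests(df: pd.DataFrame, group_col: str = "group", score_col: str = "score") -> dict:
    """One-way ANOVA across all groups, followed by pairwise Welch t-tests."""
    groups = {name: g[score_col].to_numpy() for name, g in df.groupby(group_col)}
    f_stat, p_value = stats.f_oneway(*groups.values())
    pairwise = {}
    for a, b in combinations(groups, 2):
        res = stats.ttest_ind(groups[a], groups[b], equal_var=False)
        pairwise[f"{a} vs {b}"] = {"t_stat": float(res.statistic), "p_value": float(res.pvalue)}
    return {"anova": {"f_stat": float(f_stat), "p_value": float(p_value)}, "pairwise_t_tests": pairwise}
```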
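Similarly, the chain-creation sketch below shows one way the components in `modules/langchain_helper.py` might be wired together: a Pydantic output model, a prompt template, and a `ChatOllama` chain. The class names, fields, and prompt wording here are simplified assumptions, not the repository's actual definitions.

```python
# Illustrative sketch of a prompt -> LLM -> structured-output chain.
from langchain_core.output_parsers import PydanticOutputParser
from langchain_core.prompts import ChatPromptTemplate
from langchain_ollama import ChatOllama
from pydantic import BaseModel, Field


class CriterionScore(BaseModel):
    score: int = Field(description="Score from 1 to 5 for this criterion")
    reason: str = Field(description="Short justification for the score")


class TestCaseEvaluation(BaseModel):
    # The real model defines more criteria; two are shown here for brevity.
    coverage: CriterionScore
    clarity: CriterionScore


def get_chain(model_name: str = "mistral:7b"):
    """Build an evaluation chain for one locally served Ollama model."""
    parser = PydanticOutputParser(pydantic_object=TestCaseEvaluation)
    prompt = ChatPromptTemplate.from_template(
        "You are a senior QA engineer. Evaluate the following manual test case "
        "against the quality criteria and answer in the required format.\n"
        "{format_instructions}\n\nTest case:\n{test_case}"
    ).partial(format_instructions=parser.get_format_instructions())
    llm = ChatOllama(model=model_name, temperature=0)
    return prompt | llm | parser
```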
To run this script, you'll need to set up your environment with the necessary dependencies and models.
The following tools and libraries are required for this Python project to evaluate the quality of test cases:

- Python 3.x: Download Python
- Git: Download Git
- VS Code (optional but recommended): Download VS Code
- Ollama: Ensure you have Ollama installed and running. Ollama is used to host and serve the LLMs locally. You can download it from https://ollama.com/.
- LLMs: Pull the required LLMs using Ollama. The script is configured to use `llama3.2:3b` or `mistral:7b`. You can pull these models using the Ollama CLI:
  - `ollama pull llama3.2:3b`
  - `ollama pull mistral:7b`
- langchain-groq: In case you want to use the Groq API, refer to the official documentation from GroqCloud (a brief, hypothetical sketch follows this list).
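For orientation, here is a brief, hypothetical sketch of swapping in the Groq backend. The `GROQ_API_KEY` variable name and the model name are illustrative assumptions, not taken from the repository; only the `api_keys.py` convention comes from the setup step described above.

```python
# Hypothetical sketch: using langchain-groq instead of a local Ollama model.
from langchain_groq import ChatGroq

from api_keys import GROQ_API_KEY  # created by renaming example_api_key.py

# The model name below is only an example; use any model available on GroqCloud.
llm = ChatGroq(model="llama-3.1-8b-instant", api_key=GROQ_API_KEY, temperature=0)
```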
Follow these steps to clone the repository, set up your virtual environment, install dependencies, and run the project locally.
Open your terminal and run:
git clone https://github.com/mdazlaanzubair/deep-seeking-test-cases.git
cd deep-seeking-test-cases
It is recommended to use a virtual environment to manage dependencies.

- On Windows: create the environment with `python -m venv env`, then activate it with `.\env\Scripts\activate`.
- On macOS/Linux: create the environment with `python3 -m venv env`, then activate it with `source env/bin/activate`.
If a `requirements.txt` file is available, install the dependencies by running:

- On Windows: `pip install -r requirements.txt`
- On macOS/Linux: `pip3 install -r requirements.txt`

Alternatively, you can install the necessary packages manually:

- On Windows: `pip install langchain langchain-ollama langchain-groq tqdm pandas numpy matplotlib seaborn scipy`
- On macOS/Linux: `pip3 install langchain langchain-ollama langchain-groq tqdm pandas numpy matplotlib seaborn scipy`
- Prepare Test Case Data: Ensure your test case data is in the `data/cleaned_data.json` file and conforms to the expected structure (as implied by the script's input variables).
- Run the main script:
  - On Windows: `python main.py`
  - On macOS/Linux: `python3 main.py`
- Monitor Progress: The script uses `tqdm` to display progress bars for chunk and test case processing in the console.
- View Results: After execution, the evaluation results will be saved in the `data/evaluations/` directory:
  - `success.json`: Contains detailed JSON outputs for each successfully evaluated test case, including scores for coverage, clarity, edge cases, non-functional coverage, and justifications.
  - `failed.json`: Contains details of any test cases that failed during evaluation, including error messages and raw LLM output (if available).
- Model Selection: The `main.py` script uses `mistral:7b` as the active model by default (`active_model = models_list[2]`). To change the model, modify `models_list` and the index used for `active_model` in `main.py` (see the configuration sketch after this list). Ensure the model you select has been pulled via Ollama.
- Chunk Size: The `chunk_data` function in `modules/helper.py` is set to chunk test cases into groups of 3 (`chunk_data(test_cases, 3)`). Adjust the chunk size as needed for performance or batch-processing preferences.
- Prompt Template: The prompt template used for evaluation is defined in `modules/langchain.py` within the `get_prompt()` function. You can customize this prompt to adjust the evaluation criteria, instructions, or scoring scale.
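As a rough illustration of the configuration points above, the snippet below shows how the model selection and chunking calls might look near the top of `main.py`. Only `chunk_data`, `active_model = models_list[2]`, and the chunk size of 3 come from this README; the `load_data` helper name and the middle entry of `models_list` are placeholders.

```python
# Illustrative only; mirror your local main.py rather than copying this verbatim.
from modules.helper import chunk_data, load_data  # load_data is an assumed helper name

models_list = ["llama3.2:3b", "another-model:latest", "mistral:7b"]  # middle entry is a placeholder
active_model = models_list[2]  # mistral:7b by default; change the index to switch models

test_cases = load_data("data/cleaned_data.json")
chunks = chunk_data(test_cases, 3)  # adjust the second argument to change the batch size
```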
The script expects the input data in `data/cleaned_data.json` to be a list of dictionaries, where each dictionary represents a test case and includes the following keys (input variables for the prompt):

- `software_name`
- `software_desc`
- `test_case_id`
- `test_module`
- `test_feature`
- `test_case_title`
- `test_case_description`
- `pre_conditions`
- `test_steps`
- `test_data`
- `expected_outcome`
- `severity_status`
- `group` (this field is used for grouping and is removed before passing the test case to the LLM in order to avoid bias)
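For illustration only, a single entry in `data/cleaned_data.json` might look like the following; every value here is hypothetical, and only the keys come from the list above.

```json
{
  "software_name": "Example Shop",
  "software_desc": "A hypothetical e-commerce web application.",
  "test_case_id": "TC-001",
  "test_module": "Authentication",
  "test_feature": "Login",
  "test_case_title": "Login with valid credentials",
  "test_case_description": "Verify that a registered user can log in with a valid email and password.",
  "pre_conditions": "User account exists and is active.",
  "test_steps": "1. Open the login page. 2. Enter valid credentials. 3. Click 'Login'.",
  "test_data": "email: user@example.com, password: ********",
  "expected_outcome": "User is redirected to the dashboard.",
  "severity_status": "High",
  "group": "human_written"
}
```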
The output is structured in JSON format as defined by the Pydantic models in `modules/langchain.py`, providing scores and reasons for each evaluation criterion.
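As a rough, hypothetical example of that output shape (the actual criteria and field names are defined by the Pydantic models in the repository), a single evaluation record might look like this:

```json
{
  "test_case_id": "TC-001",
  "coverage": {"score": 4, "reason": "Covers the primary login flow but not account lockout."},
  "clarity": {"score": 5, "reason": "Steps and expected outcome are unambiguous."},
  "edge_cases": {"score": 2, "reason": "No invalid-credential or empty-field scenarios."},
  "non_functional_coverage": {"score": 1, "reason": "No performance or security considerations."}
}
```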
- Virtual Environment: Ensure that your virtual environment is activated. You should see `(env)` in your terminal prompt.
- Dependency Installation: Verify that you have a stable internet connection and the necessary permissions to install packages.
- LangChain / Ollama Setup: Make sure that your Ollama service or local model is correctly configured and running. Refer to the langchain-ollama documentation for additional guidance.
This project is licensed under the MIT License - see the LICENSE file for details.