This repository contains the scripts and code used to analyze and predict grades based on ChatGPT interactions. The primary dataset includes HTML files of ChatGPT prompts/answers and a Jupyter notebook (`assignment.ipynb`) containing the assignment questions. The project's main Jupyter notebook covers the following steps:
- Data Extraction and Imputation: Extracted the text of the ChatGPT prompts from the HTML files into a JSON file. For IDs with missing or unreadable text, the text was imputed from the files whose size matches the mode file size (a minimal extraction sketch follows this list).
- Question-Answer Pair Extraction: Analyzed the chat texts to extract question-answer pairs; questions are marked by "Anonymous" and responses by "ChatGPTChatGPT".
- Reading Assignment Questions: Extracted the assignment questions from `assignment.ipynb`, specifically from the "source" field of its markdown cells.
- Data Visualization: Visualized the score data to understand its distribution and identify missing values.
- Similarity Calculation: Computed similarities between assignment questions and user prompts, adding this data to the JSON file for each ID and each question.
- Histograms of Similarities: Plotted histograms of similarities for each question and each ID, calculating average similarity as a predictive feature.
- Linear Regression Models: Trained multiple linear regression models to predict grades from features such as average similarity, prompt length, number of prompts, average sentiment, response length, the frequency of the word "error" in prompts, and the number of back-to-back prompts containing "error".
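As an illustration of the first two steps above, here is a minimal sketch that pulls the visible text out of each HTML export with BeautifulSoup and splits it into question-answer pairs on the "Anonymous" / "ChatGPTChatGPT" markers. The `html_files/` folder and `chats.json` output name are assumptions made for the example, not necessarily the names used in the notebook.

```python
import json
import re
from pathlib import Path

from bs4 import BeautifulSoup  # pip install beautifulsoup4


def extract_text(html_path: Path) -> str:
    """Return the visible text of one exported ChatGPT conversation."""
    soup = BeautifulSoup(html_path.read_text(encoding="utf-8"), "html.parser")
    return soup.get_text(separator="\n")


def split_pairs(chat_text: str) -> list:
    """Split a conversation into question-answer pairs using the two markers."""
    # re.split with a capturing group keeps the markers, so the result alternates
    # marker, text, marker, text, ... after the leading page header.
    segments = re.split(r"(Anonymous|ChatGPTChatGPT)", chat_text)
    pairs, question = [], None
    for i in range(1, len(segments) - 1, 2):
        speaker, text = segments[i], segments[i + 1].strip()
        if speaker == "Anonymous":
            question = text
        elif question is not None:
            pairs.append({"question": question, "answer": text})
            question = None
    return pairs


chats = {}
for path in sorted(Path("html_files").glob("*.html")):  # assumed folder name
    chats[path.stem] = split_pairs(extract_text(path))

Path("chats.json").write_text(json.dumps(chats, indent=2), encoding="utf-8")
```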
This project adopts a comprehensive and multifaceted approach to predict the scores based on ChatGPT interactions. The methodology employed in this project encompasses various stages of data processing, feature extraction, and predictive modeling, each contributing to the overarching goal of understanding and predicting user scores.
- Data Extraction: The project begins with the extraction of ChatGPT interactions from HTML files. Special attention is given to handling malformed files: missing data is imputed using the text from files with the mode file size, ensuring a complete dataset for analysis.
- Data Visualization: Initial exploration of the `scores.csv` data involves visualization to assess the distribution of scores and identify missing values, which are subsequently imputed using the mean of the column (see the score-exploration sketch at the end of this section).
- Extraction of Question-Answer Pairs: The project focuses on extracting question and answer pairs from ChatGPT interactions, identifying questions with "Anonymous" and responses with "ChatGPTChatGPT".
- Assignment Question Analysis: Questions from `assignment.ipynb` are extracted and analyzed, particularly from the "source" field of its markdown cells.
- Similarity Calculation: A key aspect of the methodology involves calculating the similarities between the text of the assignment questions and the user prompts, integrating this information into the dataset for each user ID (see the similarity sketch at the end of this section).
- Development of Linear Regression Models: Features such as average similarity, prompt length, number of prompts, average sentiment, the length of GPT responses, the frequency of the word "error" in prompts, and the number of back-to-back prompts containing "error" are used to train linear regression models (see the regression sketch at the end of this section).
- Performance Evaluation: Each model's effectiveness is assessed using Mean Squared Error (MSE) and R-squared values, allowing for a comparative analysis of different predictive features.
- Comparative Analysis of Features: The project identifies which features have the most significant impact on predicting scores. It was observed that the number of prompts and the total number of words in prompts show relatively better predictive performance.
- Iterative Approach: The project's methodology is iterative, constantly refining the features and models based on the insights gained from data analysis and model evaluations.
Overall, the project methodology is characterized by its data-driven approach, leveraging natural language processing techniques and statistical modeling to derive meaningful insights and predictions from ChatGPT interactions.
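To make the pipeline concrete, the remaining sketches show one plausible way to implement the main steps; variable, column, and file names are assumptions for illustration rather than the notebook's actual identifiers. First, exploring `scores.csv` and imputing missing grades with the column mean (assuming an `id` column and a numeric `grade` column):

```python
import matplotlib.pyplot as plt
import pandas as pd

scores = pd.read_csv("scores.csv")

# Count missing values per column, then fill missing grades with the column mean.
print(scores.isna().sum())
scores["grade"] = scores["grade"].fillna(scores["grade"].mean())

# Histogram of the grade distribution.
scores["grade"].plot.hist(bins=20, edgecolor="black")
plt.xlabel("grade")
plt.ylabel("count")
plt.title("Distribution of scores")
plt.show()
```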
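Next, a sketch of the similarity step: assignment questions are read from the markdown cells of `assignment.ipynb`, and each student's prompts are compared against them. TF-IDF cosine similarity is used here as one plausible text-similarity measure; the notebook may use a different one. `chats.json` is the output of the extraction sketch shown earlier.

```python
import json

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# A notebook file is plain JSON; markdown cells keep their text in the "source" field.
with open("assignment.ipynb", encoding="utf-8") as f:
    notebook = json.load(f)
questions = [
    "".join(cell["source"])
    for cell in notebook["cells"]
    if cell["cell_type"] == "markdown"
]

with open("chats.json", encoding="utf-8") as f:
    chats = json.load(f)

vectorizer = TfidfVectorizer(stop_words="english")
similarities = {}
for student_id, pairs in chats.items():
    prompts = [pair["question"] for pair in pairs]
    if not prompts:
        continue
    matrix = vectorizer.fit_transform(questions + prompts)
    # Rows 0..len(questions)-1 are assignment questions, the rest are prompts.
    sims = cosine_similarity(matrix[: len(questions)], matrix[len(questions):])
    # Best-matching prompt per question, averaged into one feature per student ID.
    similarities[student_id] = float(sims.max(axis=1).mean())
```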
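Finally, a sketch of training and scoring one single-feature linear regression model; it reuses the `scores` frame and the `similarities` dict from the previous sketches, and the 80/20 split and `id` column name are assumptions.

```python
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import train_test_split

# Attach the engineered feature to the grades; any other feature column
# (prompt counts, word counts, sentiment, ...) can be swapped in here.
features = scores.assign(
    average_similarity=scores["id"].astype(str).map(similarities)
).dropna(subset=["average_similarity"])

X = features[["average_similarity"]]
y = features["grade"]

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

model = LinearRegression().fit(X_train, y_train)
predictions = model.predict(X_test)

print("MSE:", mean_squared_error(y_test, predictions))
print("R^2:", r2_score(y_test, predictions))
```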
The experimental findings are supported by various figures; the following table summarizes the model performances.
| Feature | Mean Squared Error | R-squared Score |
|---|---|---|
| Average Similarities | 41.99 | -0.40 |
| Total Number of Words in Prompts | 56.67 | -0.89 |
| Number of Prompts | 55.60 | -0.85 |
| Average Sentiment | 40.74 | -0.36 |
| Average Prompt Length | 41.57 | -0.39 |
| Average Response Length | 38.43 | -0.28 |
| Frequency of "error" in Prompts | 43.16 | -0.44 |
| Back-to-Back "error" Counts in Prompts | 45.08 | -0.50 |
The observed results demonstrate varied levels of accuracy and effectiveness across the different features used to predict scores. Despite the careful approach taken in the project, the limited size of the dataset and the narrow spread of the grades (half of the scores exceed 72) cause all of the models to exhibit relatively large prediction errors.
Milad Bafarassat: As the sole contributor to this project, I was responsible for all aspects, including data extraction and preprocessing, feature engineering, model development, analysis, and documentation. My role encompassed the entire pipeline from initial data handling to final model evaluation and reporting of findings.