CS412 (Machine Learning) Term Project

Student working on ML homework with some AI help. Image credit: DALL-E

Overview of the Repository

This repository contains the scripts and code used to analyze and predict grades based on students' ChatGPT interactions. The primary dataset consists of HTML exports of ChatGPT prompts/answers and a Jupyter notebook (assignment.ipynb) containing the assignment questions. The project's main Jupyter notebook is available in this repository and on Google Colab.

Key Components

  1. Data Extraction and Imputation: Extracted the text from the HTML files of ChatGPT conversations into a JSON file. For IDs with missing text, imputed the text from the files whose size matches the mode file size.
    • Question-Answer Pair Extraction: Parsed the chat texts into question-answer pairs; user questions begin with the marker "Anonymous" and responses with "ChatGPTChatGPT" (a sketch follows this list).
    • Reading Assignment Questions: Extracted the assignment questions from assignment.ipynb, specifically from the "source" field of its markdown cells.
  2. Data Visualization: Visualized the score data to understand its distribution and identify null entries.
  3. Similarity Calculation: Computed similarities between the assignment questions and the user prompts, adding this data to the JSON file for each ID and each question.
  4. Histograms of Similarities: Plotted histograms of the similarities for each question and each ID, and computed the average similarity as a predictive feature.
  5. Linear Regression Models: Trained multiple linear regression models to predict grades from features such as average similarity, prompt length, number of prompts, average sentiment, response length, frequency of the word "error" in prompts, and frequency of "error" in back-to-back prompts.
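
As a minimal sketch of the question-answer splitting in item 1, assuming each exported chat is plain text in which user turns start with the literal marker "Anonymous" and model turns with "ChatGPTChatGPT" (the function name and structure are illustrative, not the project's exact code):

```python
import re

def extract_qa_pairs(chat_text: str) -> list[tuple[str, str]]:
    """Split a chat transcript into (question, answer) pairs.

    Assumes user turns are introduced by the marker "Anonymous" and
    model turns by "ChatGPTChatGPT", as in the exported chat text.
    """
    # Split on either speaker marker; the capturing group keeps the markers.
    parts = re.split(r"(Anonymous|ChatGPTChatGPT)", chat_text)
    pairs, question = [], None
    for marker, body in zip(parts[1::2], parts[2::2]):
        body = body.strip()
        if marker == "Anonymous":
            question = body
        elif question is not None:
            pairs.append((question, body))
            question = None
    return pairs
```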

Methodology

This project takes a multi-stage approach to predicting scores from ChatGPT interactions. The methodology covers data processing, feature extraction, and predictive modeling, each contributing to the overall goal of understanding and predicting user scores.

Data Processing and Preparation

  • Data Extraction: The project begins by extracting the ChatGPT interactions from the HTML files. Malformed files receive special handling: their missing text is imputed from the files whose size matches the mode file size, ensuring a complete dataset for analysis (a sketch follows this list).
  • Data Visualization: Initial exploration of the scores.csv data visualizes the distribution of scores and flags missing values, which are imputed with the mean of the column.
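
A minimal, hypothetical sketch of these two steps, assuming the chat exports sit in an html/ folder and the grades live in a grade column of scores.csv (the folder, file, and column names are assumptions):

```python
import json
from pathlib import Path

import matplotlib.pyplot as plt
import pandas as pd
from bs4 import BeautifulSoup

# 1) Extract the visible text of every exported chat into one JSON file.
chats = {}
for path in Path("html").glob("*.html"):  # assumed folder layout
    soup = BeautifulSoup(path.read_text(encoding="utf-8"), "html.parser")
    chats[path.stem] = soup.get_text(separator="\n")
Path("chats.json").write_text(json.dumps(chats), encoding="utf-8")

# 2) Plot the score distribution, then mean-impute missing grades.
scores = pd.read_csv("scores.csv")  # assumed file and column names
scores["grade"].hist(bins=20)
plt.xlabel("grade")
plt.show()
scores["grade"] = scores["grade"].fillna(scores["grade"].mean())
```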

Feature Engineering

  • Extraction of Question-Answer Pairs: Question and answer pairs are extracted from the ChatGPT interactions, identifying questions by the marker "Anonymous" and responses by "ChatGPTChatGPT".
  • Assignment Question Analysis: The assignment questions are extracted from assignment.ipynb, specifically from the "source" field of its markdown cells.
  • Similarity Calculation: A key step is computing the similarity between the text of each assignment question and each user prompt, and integrating this information into the dataset for every user ID (see the sketch below).
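
The sketch below covers both steps, using TF-IDF cosine similarity as one plausible metric (an assumption; the notebook file name comes from the text, everything else is illustrative). An .ipynb file is plain JSON, so the markdown cells can be read directly:

```python
import json

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Pull the assignment questions out of the notebook's markdown cells;
# a notebook is JSON with a top-level "cells" list.
with open("assignment.ipynb", encoding="utf-8") as f:
    nb = json.load(f)
questions = [
    "".join(cell["source"])
    for cell in nb["cells"]
    if cell["cell_type"] == "markdown"
]

def question_prompt_similarities(questions: list[str], prompts: list[str]):
    """Cosine similarity between every assignment question and every
    user prompt (TF-IDF is an assumed choice of representation)."""
    matrix = TfidfVectorizer().fit_transform(questions + prompts)
    q_vecs, p_vecs = matrix[: len(questions)], matrix[len(questions):]
    return cosine_similarity(q_vecs, p_vecs)  # shape: (n_questions, n_prompts)
```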

Predictive Modeling

  • Development of Linear Regression Models: Linear regression models are trained on features such as average similarity, prompt length, number of prompts, average sentiment, GPT response length, frequency of the word "error" in prompts, and frequency of "error" in back-to-back prompts (a sketch follows this list).
  • Performance Evaluation: Each model's effectiveness is assessed with Mean Squared Error (MSE) and R-squared, allowing a comparative analysis of the different predictive features.
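
A sketch of the per-feature evaluation loop, assuming the engineered features are collected in a pandas DataFrame with one column per feature (the train/test split strategy and all names are assumptions):

```python
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import train_test_split

def evaluate_single_feature_models(features: pd.DataFrame, grades: pd.Series):
    """Fit one univariate linear regression per feature column and
    report held-out MSE and R-squared for each."""
    results = {}
    for name in features.columns:
        X_train, X_test, y_train, y_test = train_test_split(
            features[[name]], grades, test_size=0.2, random_state=42
        )
        model = LinearRegression().fit(X_train, y_train)
        preds = model.predict(X_test)
        results[name] = {
            "mse": mean_squared_error(y_test, preds),
            "r2": r2_score(y_test, preds),
        }
    return results
```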

Insights and Conclusions

  • Comparative Analysis of Features: The project identifies which features contribute most to predicting scores. Per the results table below, average response length and average sentiment yield the lowest error, while the number of prompts and the total number of words in prompts perform worst.
  • Iterative Approach: The methodology is iterative, refining the features and models based on the insights gained from data analysis and model evaluation.

Overall, the project methodology is characterized by its data-driven approach, leveraging natural language processing techniques and statistical modeling to derive meaningful insights and predictions from ChatGPT interactions.

Results

The experimental findings are supported by the figures below; the following table summarizes model performance.

Performance of the Models:

| Feature | Mean Squared Error | R-squared Score |
| --- | --- | --- |
| Average Similarities | 41.99 | -0.40 |
| Total Number of Words in Prompts | 56.67 | -0.89 |
| Number of Prompts | 55.60 | -0.85 |
| Average Sentiment | 40.74 | -0.36 |
| Average Prompt Length | 41.57 | -0.39 |
| Average Response Length | 38.43 | -0.28 |
| Frequency of "error" in Prompts | 43.16 | -0.44 |
| Back-to-Back "error" Counts in Prompts | 45.08 | -0.50 |

The results show varied accuracy across features. Despite the careful approach taken in the project, the small dataset and the narrow spread of scores (half of the scores exceed 72) leave the models with relatively large prediction errors; every R-squared value is negative, meaning no single-feature model outperforms simply predicting the mean score.

Figures (Features vs Grades):

Fig 1. Average Similarities of Prompts/Assignment Questions vs Grades
Fig 2. Number of Words in Prompts vs Grades
Fig 3. Number of Prompts vs Grades
Fig 4. Average Sentiment of Prompts vs Grades
Fig 5. Average Prompt Length vs Grades
Fig 6. Average GPT Response Length vs Grades
Fig 7. Frequency of "error" in Prompts vs Grades
Fig 8. Back-to-Back "error" Counts in Prompts vs Grades


Team Contributions

Milad Bafarassat: As the sole contributor to this project, I was responsible for all aspects, including data extraction and preprocessing, feature engineering, model development, analysis, and documentation. My role encompassed the entire pipeline from initial data handling to final model evaluation and reporting of findings.

About

This repository contains the code and scripts developed as the term project for the Machine Learning (CS412) course during the Fall 2023 semester at Sabanci University.
