Eedi - Mining Misconceptions in Mathematics. See my Kaggle solution.
Make a virtual environment and install dependencies.
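For example, using Python's built-in venv module (the .venv name and activation command assume a Unix shell):

python -m venv .venv
source .venv/bin/activate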
pip install -e .
Copy .env.example to .env and add your OpenAI key (only needed for the paraphrase and synthetic-generation steps).
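For example (OPENAI_API_KEY is an assumed variable name; check .env.example for the exact key it expects):

cp .env.example .env
echo "OPENAI_API_KEY=sk-..." >> .env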
Download dataset.
./scripts/download_data.sh
Use OpenAI gpt-4o-mini to paraphrase the questions and the misconceptions to increase dataset size. For each question and misconception, 4 extra paraphrases are created. This costs about $0.36.
python eedi/paraphrase.py --dataset-dir=data
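For reference, a minimal sketch of what this step does per item (the prompt wording and the paraphrase helper are illustrative, not the actual eedi/paraphrase.py code):

import os
from openai import OpenAI

client = OpenAI(api_key=os.environ["OPENAI_API_KEY"])

def paraphrase(text: str, n: int = 4) -> list[str]:
    # Ask gpt-4o-mini for n rewordings that preserve the mathematical meaning.
    resp = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[
            {"role": "system", "content": "Paraphrase math text without changing its meaning."},
            {"role": "user", "content": f"Give {n} paraphrases of the following, one per line:\n{text}"},
        ],
    )
    return resp.choices[0].message.content.strip().splitlines()[:n]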
Note: this only needs to be done once; you can download the paraphrased data here.
Use OpenAI gpt-4o to generate synthetic data, further increasing dataset size. Some details (a sketch of the prompt construction follows the command below):
- For misconceptions present in train, use a 1-shot prompt built from the actual train row, then let the model generate 3 things: question, correct answer, and wrong answer.
- For misconceptions not present in train, use 2 examples hardcoded in the prompt (2-shot), then let the model generate 5 things: subject, construct, question, correct answer, and wrong answer.
- Misconceptions themselves are never generated or altered.
- From light skimming, quite a few generated rows are incorrect. This might be because I did not use reasoning during generation (too expensive and slow).
- Synthetic generation costs about $30.
- There are around 31,500 synthetic rows and 4,300 original (non-synthetic) rows.
python eedi/generate_synthetic.py --dataset-dir=data
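A sketch of the prompt branching described above (field names, prompt text, and HARDCODED_TWO_SHOT_EXAMPLES are illustrative, not the actual eedi/generate_synthetic.py code):

HARDCODED_TWO_SHOT_EXAMPLES = "..."  # two fully worked examples, written by hand

def build_prompt(misconception: str, train_row: dict | None) -> str:
    # 1-shot branch: the misconception has a real train row to anchor on.
    if train_row is not None:
        return (
            f"Misconception: {misconception}\n"
            f"Example question: {train_row['question']}\n"
            f"Example correct answer: {train_row['correct_answer']}\n"
            f"Example wrong answer: {train_row['wrong_answer']}\n"
            "Write a new question, correct answer, and wrong answer that a "
            "student holding this misconception would produce."
        )
    # 2-shot branch: no train row, so the model must also invent subject and construct.
    return (
        HARDCODED_TWO_SHOT_EXAMPLES
        + f"\nMisconception: {misconception}\n"
        + "Write a subject, construct, question, correct answer, and wrong answer."
    )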
Note: this only needs to be done once; you can download the synthetic data here.
Finetune the embedding model with hard negative mining. First, download the paraphrased and synthetic datasets.
./scripts/download_paraphrased_data.sh
./scripts/download_synthetic_data.sh
Edit the training script as needed, then run it.
./scripts/train.sh
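For orientation, a minimal sketch of the hard-negative-mining idea using sentence-transformers (the base model, k, batch size, and toy data are all assumptions; the real configuration lives in scripts/train.sh):

from sentence_transformers import SentenceTransformer, InputExample, losses, util
from torch.utils.data import DataLoader

model = SentenceTransformer("BAAI/bge-small-en-v1.5")  # base model is an assumption

def mine_hard_negatives(questions, positives, misconceptions, k=2):
    # Embed with the current model; for each question keep the top-k most
    # similar misconceptions that are NOT the true one as hard negatives.
    q_emb = model.encode(questions, convert_to_tensor=True, normalize_embeddings=True)
    m_emb = model.encode(misconceptions, convert_to_tensor=True, normalize_embeddings=True)
    hits = util.semantic_search(q_emb, m_emb, top_k=k + 1)
    examples = []
    for question, positive, hit in zip(questions, positives, hits):
        for h in hit:
            negative = misconceptions[h["corpus_id"]]
            if negative != positive:  # skip the true misconception
                examples.append(InputExample(texts=[question, positive, negative]))
    return examples

# Toy data; in practice these come from the paraphrased + synthetic datasets.
questions = ["Simplify 1/2 + 1/3."]
positives = ["Adds the numerators and the denominators when adding fractions"]
all_misconceptions = positives + [
    "Believes multiplying two negatives gives a negative answer",
    "Confuses the area of a shape with its perimeter",
]

train_examples = mine_hard_negatives(questions, positives, all_misconceptions)
loader = DataLoader(train_examples, shuffle=True, batch_size=2)
loss = losses.MultipleNegativesRankingLoss(model)  # mined negative + in-batch negatives
model.fit(train_objectives=[(loader, loss)], epochs=1, warmup_steps=10)

MultipleNegativesRankingLoss scores the mined negative alongside all other in-batch texts, so each batch contributes many negatives beyond the explicitly mined one; re-mining between epochs with the updated model is a common refinement.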