Eedi - Mining Misconceptions in Mathematics. See my Kaggle solution.
Make a virtual environment and install dependencies.
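For example, using Python's built-in venv module (the .venv name and activation command assume a Unix shell):

python -m venv .venv
source .venv/bin/activate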
pip install -e .
Copy .env.example to .env and add your OpenAI key (only needed for the paraphrase and synthetic-generation steps).
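For example (OPENAI_API_KEY is an assumed variable name; check .env.example for the exact key it expects):

cp .env.example .env
echo "OPENAI_API_KEY=sk-..." >> .env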
Download dataset.
./scripts/download_data.sh
Use OpenAI gpt-4o-mini to paraphrase the questions and the misconceptions to increase dataset size. For each question and misconception, 4 extra paraphrases are created. This costs about $0.36.
python eedi/paraphrase.py --dataset-dir=data
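For reference, a minimal sketch of what this step does per item (the prompt wording and the paraphrase helper are illustrative, not the actual eedi/paraphrase.py code):

import os
from openai import OpenAI

client = OpenAI(api_key=os.environ["OPENAI_API_KEY"])

def paraphrase(text: str, n: int = 4) -> list[str]:
    # Ask gpt-4o-mini for n rewordings that preserve the mathematical meaning.
    resp = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[
            {"role": "system", "content": "Paraphrase math text without changing its meaning."},
            {"role": "user", "content": f"Give {n} paraphrases of the following, one per line:\n{text}"},
        ],
    )
    return resp.choices[0].message.content.strip().splitlines()[:n]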
Note: this only needs to be done once; you can download the paraphrased data here.
Use OpenAI gpt-4o to generate synthetic data, further increasing dataset size. Some details (a sketch of the prompt construction follows the command below):
- For misconceptions present in train, use a 1-shot prompt built from the actual train row, then let the model generate 3 things: question, correct answer, and wrong answer.
- For misconceptions not present in train, use 2 examples hardcoded in the prompt (2-shot), then let the model generate 5 things: subject, construct, question, correct answer, and wrong answer.
- Misconceptions themselves are never generated or altered.
- From light skimming, quite a few generated rows are incorrect. This might be because I did not use reasoning during generation (too expensive and slow).
- Synthetic generation costs about $30.
- There are around 31,500 synthetic rows and 4,300 original (non-synthetic) rows.
python eedi/generate_synthetic.py --dataset-dir=data
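A sketch of the prompt branching described above (field names, prompt text, and HARDCODED_TWO_SHOT_EXAMPLES are illustrative, not the actual eedi/generate_synthetic.py code):

HARDCODED_TWO_SHOT_EXAMPLES = "..."  # two fully worked examples, written by hand

def build_prompt(misconception: str, train_row: dict | None) -> str:
    # 1-shot branch: the misconception has a real train row to anchor on.
    if train_row is not None:
        return (
            f"Misconception: {misconception}\n"
            f"Example question: {train_row['question']}\n"
            f"Example correct answer: {train_row['correct_answer']}\n"
            f"Example wrong answer: {train_row['wrong_answer']}\n"
            "Write a new question, correct answer, and wrong answer that a "
            "student holding this misconception would produce."
        )
    # 2-shot branch: no train row, so the model must also invent subject and construct.
    return (
        HARDCODED_TWO_SHOT_EXAMPLES
        + f"\nMisconception: {misconception}\n"
        + "Write a subject, construct, question, correct answer, and wrong answer."
    )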
Note: this only needs to be done once; you can download the synthetic data here.
Finetune the embedding model with hard negative mining. First, download the paraphrased and synthetic datasets.
./scripts/download_paraphrased_data.sh
./scripts/download_synthetic_data.sh
Edit the training script as needed, then run it.
./scripts/train.sh
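For orientation, a minimal sketch of the hard-negative-mining idea using sentence-transformers (the base model, k, batch size, and toy data are all assumptions; the real configuration lives in scripts/train.sh):

from sentence_transformers import SentenceTransformer, InputExample, losses, util
from torch.utils.data import DataLoader

model = SentenceTransformer("BAAI/bge-small-en-v1.5")  # base model is an assumption

def mine_hard_negatives(questions, positives, misconceptions, k=2):
    # Embed with the current model; for each question keep the top-k most
    # similar misconceptions that are NOT the true one as hard negatives.
    q_emb = model.encode(questions, convert_to_tensor=True, normalize_embeddings=True)
    m_emb = model.encode(misconceptions, convert_to_tensor=True, normalize_embeddings=True)
    hits = util.semantic_search(q_emb, m_emb, top_k=k + 1)
    examples = []
    for question, positive, hit in zip(questions, positives, hits):
        for h in hit:
            negative = misconceptions[h["corpus_id"]]
            if negative != positive:  # skip the true misconception
                examples.append(InputExample(texts=[question, positive, negative]))
    return examples

# Toy data; in practice these come from the paraphrased + synthetic datasets.
questions = ["Simplify 1/2 + 1/3."]
positives = ["Adds the numerators and the denominators when adding fractions"]
all_misconceptions = positives + [
    "Believes multiplying two negatives gives a negative answer",
    "Confuses the area of a shape with its perimeter",
]

train_examples = mine_hard_negatives(questions, positives, all_misconceptions)
loader = DataLoader(train_examples, shuffle=True, batch_size=2)
loss = losses.MultipleNegativesRankingLoss(model)  # mined negative + in-batch negatives
model.fit(train_objectives=[(loader, loss)], epochs=1, warmup_steps=10)

MultipleNegativesRankingLoss scores the mined negative alongside all other in-batch texts, so each batch contributes many negatives beyond the explicitly mined one; re-mining between epochs with the updated model is a common refinement.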