Repository for Learning Performance-Improving Code Edits (paper, website).
🚨 Benchmarking programs is easy; but benchmarking programs in a reproducible and deterministic manner is very hard.
🚨 LLMs are not great at program optimization out-of-the-box (at least for competitive programming problems)
We perform extensive experiments to evaluate and improve Large Language Models (LLMs) for program optimization. We built a custom evaluation framework that benchmarks program execution time in a highly reliable manner, and we provide a dataset annotated with execution time information from our environment.
When measuring average program speedup, our fine-tuned version of CodeLlama13B outperforms GPT4 and the best human programmer. Using self-play for program optimization, we also obtain a fine-tuned version of GPT3.5 that is even stronger.
- PIE is based on IBM CodeNet. Huge thanks to the authors of CodeNet for making their curated dataset available!
Our Train/Val/Test splits are located here. There is also a train_with_synthetic.jsonl file which contains an additional ~1.4K pairs generated via self-play. We also have the subsets train_hq_only.jsonl and train_hq_and_synthetic.jsonl, which contain only high-quality pairs and high-quality pairs + synthetic pairs, respectively.
Testcases:
- Merged test cases containing both public and generated test cases: these test cases were the ones used for experiments in the paper.
- Public test cases. These test cases are sourced from IBM CodeNet.
- Generated test cases. These test cases are sourced from alphacode.
The column tests in the jsonl files contains the indices of the test cases that should be used for benchmarking models.
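As a quick illustration of reading a split and its tests column (any field names other than tests are assumptions and may differ in the released jsonl files):

import json

# Load one of the PIE splits (jsonl: one example per line).
with open("train.jsonl") as f:
    examples = [json.loads(line) for line in f]

example = examples[0]
# `tests` holds the indices of the merged test cases to benchmark with.
test_indices = example["tests"]
print(f"{len(test_indices)} test cases selected for this example")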
Benchmarking programs is easy, but benchmarking programs in a reproducible and deterministic manner is very hard. Reproducibility matters because we want to compare the performance of different models on the same set of programs irrespective of a researcher's server configuration. Without it, you can even benchmark the exact same program twice and mistakenly conclude that one run is much faster than the other.
We built a custom evaluation framework that benchmarks program execution time in a highly reliable manner: an execution sandbox based on the gem5 simulator. As long as a program terminates and does not time out, benchmarking results are deterministic. For our experiments, we use a simulated Intel Skylake CPU. We provide an easy-to-use docker image and API that can be used to reproduce our results and that other researchers can continue to use for program optimization research.
The environment follows an API similar to the gym API for reinforcement learning. After importing the module and running make, the docker image is automatically pulled on first use and a container is created. The environment then provides a convenient abstraction for interacting with the simulator. More information is located at gem5.
It is possible that on a different architecture the gem5 simulator runs slower or faster than it did for us, so results could be influenced by more-frequent or less-frequent timeouts. Generally this should only affect programs on the threshold of timing out, and it should affect more-aggressive optimizations (often "better" models) less than less-aggressive optimizations.
import simulator
# pulls the image from docker hub if it doesn't exist and sets up a connection with a running container
env = simulator.make(arch='X86-skylake', optimization_flag='-O3')
# example sending a program to benchmark within the environment
gem5_benchmarking_results = env.submit_single_submission(...)
Programs can typically be written in many ways with different performance profiles. When training a model to predict performance-improving edits with a large dataset, it may be trained on a mix of large and small improvements, without any information on which improvements are more desirable than others. We introduce performance tags during training by associating each “fast” program with a tag indicating the optimal achievable performance across all solutions in the dataset.
Specifically, the tag indicates how close that program is to peak performance on a binned scale {1, 2, ..., 10}. Then at test time, we prompt the model with a test input and a maximal score tag “10/10”, directing it to generate a highly-optimized solution.
The performance tags are available for the training dataset and can be used to train models with performance-conditioning. We also provide our fine-tuning code which adds the prompts during training and inference.
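As a rough sketch of the tagging idea (the exact binning used for the released tags may differ), a fast program's runtime can be mapped to a tag by comparing it to the best runtime seen for that problem:

import math

def performance_tag(program_runtime: float, best_runtime: float, num_bins: int = 10) -> str:
    # Illustrative only: bin how close this program is to the fastest known
    # solution for the same problem, on a 1..num_bins scale.
    closeness = best_runtime / program_runtime  # 1.0 means peak performance
    bin_idx = max(1, min(num_bins, math.ceil(closeness * num_bins)))
    return f"{bin_idx}/{num_bins}"

# At training time the tag is attached to each "fast" program;
# at test time the prompt always asks for the maximal tag "10/10".
print(performance_tag(program_runtime=2.0, best_runtime=1.5))  # -> 8/10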
In an attempt to boost the performance of our models, we also investigate the use of self-play for program optimization as a data augmentation technique. Because there is a limited set of programs in our dataset, we use an off-the-shelf language model to generate new programs and a high-quality fine-tuned model to generate new optimizations. After taking rigorous steps to ensure the generated programs are semantically novel and the optimizations are non-trivial, we use the generated programs and optimizations to create new program optimization pairs.
The self-play notion comes from the fact that one model is used to generate the new programs to solve and the other model is used to propose the optimizations.
Our best model without self-play was a fine-tuned GPT3.5 Turbo trained with 4,085 high-quality pairs. We were able to sample 3,314 novel programs and obtain 1,485 high-quality optimizations.
Using these additional 1,485 optimizations helped improve the performance of our fine-tuned model. We also performed an ablation by adding 1,485 next-best programs from the PIE dataset for fine-tuning GPT3.5 turbo, but these pairs led to performance degradation.
We provide our scripts for sampling programs and detecting semantic duplicates and the self-play data itself.
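One simple way to flag semantic duplicates, shown here only as a hedged illustration rather than our exact pipeline (the released scripts live in data_augmentation/), is to compare program outputs on a shared set of inputs:

import subprocess
from typing import List, Tuple

def output_signature(binary_path: str, test_inputs: List[str], timeout: float = 2.0) -> Tuple[str, ...]:
    # Run a compiled program on each input and record its outputs;
    # two programs with identical signatures are likely semantic duplicates.
    outputs = []
    for stdin_text in test_inputs:
        try:
            result = subprocess.run([binary_path], input=stdin_text,
                                    capture_output=True, text=True, timeout=timeout)
            outputs.append(result.stdout.strip())
        except subprocess.TimeoutExpired:
            outputs.append("<timeout>")
    return tuple(outputs)

# Generated programs whose signatures collide with an existing program can be discarded.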
We provide a docker image at yimengzeng/pie:torch201 which contains all of the dependencies for finetuning the model. You can also refer to docker/Dockerfile for the specific packages required to replicate the environment.
To finetune codellama with the entire PIE dataset and the non-performance-conditioned prompt, run
bash finetuning/train.sh
To finetune codellama with the performance-conditioned prompt, change the --prompt_template_name flag to "code_opt_w_speedup_pctile".
More details are located in the finetuning directory.
The script openai_finetuning/finetune_openai.py was used to finetune GPT3.5 Turbo. Its usage is as follows:
python finetune_openai.py PATH_TO_CONFIG.yaml
More details and an example config file are located in the openai_finetuning directory.
A notebook that can be used to prepare the retrieval dataset is retrieval/retrieval.ipynb. Given a training dataset and the test set examples to optimize, it retrieves the K most similar training pairs for each test set example. The retrieved pairs are then used to prompt the model for optimized outputs.
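As a hedged sketch of what such retrieval can look like (the notebook may use a different representation; TF-IDF is used here purely for illustration):

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def retrieve_k_similar(train_slow_programs, test_program, k=4):
    # Return indices of the k training pairs whose "slow" program is most
    # similar to the test program under TF-IDF cosine similarity.
    vectorizer = TfidfVectorizer()
    train_vecs = vectorizer.fit_transform(train_slow_programs)
    test_vec = vectorizer.transform([test_program])
    scores = cosine_similarity(test_vec, train_vecs)[0]
    return scores.argsort()[::-1][:k]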
To generate prompts for the models, please follow the details in the paper. Additional utilities for constructing prompts are located in finetuning/templates and the finetuning/utils/prompter.py module.
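For intuition, a hypothetical prompt layout is sketched below; it is not the exact template used in the paper, which lives in finetuning/templates and is applied by finetuning/utils/prompter.py:

# Hypothetical layout for illustration only.
PROMPT_TEMPLATE = (
    "Below is a program. Optimize the program and provide a faster version.\n\n"
    "### Program:\n{slow_code}\n\n"
    "### Optimized version ({tag} on the performance scale):\n"
)
prompt = PROMPT_TEMPLATE.format(slow_code="...", tag="10/10")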
Samples from our fine-tuned models are located here.
To sample optimized programs using the finetuned model with the text-generation-inference tool, first replace the PATH_TO_MODEL field with the actual path of the finetuned model in server.sh, and then serve the model by running
bash finetuning/server.sh
To sample from the model just served with default parameters as in the paper, run
bash finetuning/sample.sh
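If you prefer to query the served model directly, text-generation-inference exposes a /generate endpoint; the port and sampling parameters below are illustrative (the defaults used in the paper are set in finetuning/sample.sh):

import requests

# Assumes the server started by finetuning/server.sh is listening on localhost:8080.
response = requests.post(
    "http://127.0.0.1:8080/generate",
    json={
        "inputs": "<prompt containing the slow program>",
        "parameters": {"max_new_tokens": 1024, "temperature": 0.7, "do_sample": True},
    },
)
print(response.json()["generated_text"])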
More details are located in the finetuning directory.
We used prompt-lib to sample from OpenAI's endpoints.
The directory data_augmentation contains the scripts used to sample and filter novel competitive programming problems for PIE. The script data_augmentation/data_augmentation_driver_final.sh contains the final parameters we used to sample the problems. More details are located in the data_augmentation directory.
The evaluation driver is located in gem5/gem5_eval.py. This script requires a yaml configuration file to be passed in as an argument to --config_path. Example usage from the project directory would be:
export PYTHONPATH=$PYTHONPATH:$(pwd)
python gem5/gem5_eval.py --config_path PATH_TO_EXPERIMENT_CONFIG.yaml
Results from our experiments can be located in this google drive folder.
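For reference, program speedup is the ratio of the original program's runtime to the optimized program's runtime; below is a minimal sketch of aggregating it over benchmarked pairs (the field names here are placeholders, not the schema of the released results files):

def speedup(slow_runtime_sec: float, fast_runtime_sec: float) -> float:
    # Speedup = runtime of the original program / runtime of the optimized one.
    return slow_runtime_sec / fast_runtime_sec

results = [{"slow": 1.20, "fast": 0.40}, {"slow": 0.90, "fast": 0.90}]
mean_speedup = sum(speedup(r["slow"], r["fast"]) for r in results) / len(results)
print(f"mean speedup: {mean_speedup:.2f}x")  # -> 2.00x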
More details are located in the gem5 directory.
@inproceedings{pie_iclr_2024_spotlight,
title={\href{https://openreview.net/pdf?id=ix7rLVHXyY}{Learning Performance-Improving Code Edits}},
author={Shypula, Alexander and Madaan, Aman and Zeng, Yimeng and Alon, Uri and Gardner, Jacob and Hashemi, Milad and Neubig, Graham and Ranganathan, Parthasarathy and Bastani, Osbert and Yazdanbakhsh, Amir},
booktitle={The Twelfth International Conference on Learning Representations (ICLR)},
year={2024}
}