CURE: Code-Aware Neural Machine Translation for Automatic Program Repair

A PyTorch implementation of the ICSE 2021 paper "CURE: Code-Aware Neural Machine Translation for Automatic Program Repair" by Nan Jiang, Thibaud Lutellier, and Lin Tan.

File Structure

  • results: This folder contains all the bugs in both the Defects4J and QuixBugs benchmarks that CURE fixed. Each file contains the buggy line, CURE's patch, and the developer's patch
  • candidate_patches: This folder contains all the candidate patches CURE generated for bugs in each benchmark
  • data: This folder contains the vocabulary file, subword tokenizer, some training data examples, and the GPT PL model pre-trained on code.
    • vocabulary
      • subword.txt: the subword tokenizer model needed by subword-nmt (see the tokenization sketch after this list)
      • vocabulary.txt: the vocabulary file used in CURE's paper
    • models: This folder is used to save the models
      • code_gpt.pt: the saved GPT PL model pre-trained on code
    • patches: This folder is used to save the generated patches
      • gpt_conut_1.txt: an example file that contains the candidate patches generated by a GPT-CoNuT model, including 100 patches for each QuixBugs bug.
      • gpt_fconv_1.txt: an example file that contains the candidate patches generated by a GPT-FConv model, including 100 patches for each QuixBugs bug.
    • data: This folder is used to save the training data and validation data
      • CURE uses the source code training data shared by the prior work CoCoNuT
  • src: This folder includes the source code for CURE's APR model
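
As a rough illustration of how the subword tokenizer in data/vocabulary/subword.txt is applied, here is a minimal sketch using subword-nmt's Python API. The Java statement is a made-up example, and the exact segmentation depends on the merges learned in subword.txt.

```python
# Minimal sketch (not from the repository): apply CURE's BPE subword model
# with subword-nmt. The Java statement below is a made-up example.
import codecs
from subword_nmt.apply_bpe import BPE

with codecs.open('data/vocabulary/subword.txt', encoding='utf-8') as codes:
    bpe = BPE(codes)

# Rare tokens are split into subwords joined by the '@@ ' separator;
# the exact splits depend on the learned merges.
print(bpe.process_line('int maxValue = Math.max(a, b);'))
```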

Dependencies

  • Python 3.8
  • PyTorch 1.4.0
  • NumPy 1.18.1
  • Huggingface transformers 2.10.0
  • subword-nmt

Usage

To train a GPT-CoNuT model, run src/trainer/gpt_conut_trainer.py. Some settings you may need to change:

  • vocab_file: the path to the vocabulary file used by the model
  • train_file: the path to the training data
  • valid_file: the path to the validation data
  • gpt_file: the path to the saved GPT PL model
  • hyper_parameter: the hyper-parameter of the model (including the number of encoder/decoder layers, dropout rate, etc.)
  • save_dir: the directory to save the model, default: data/models/

To train a GPT-FConv model, run src/trainer/gpt_fconv_trainer.py. Some settings you may need to change (a hypothetical configuration sketch follows this list):

  • vocab_file: the path to the vocabulary file used by the model
  • train_file: the path to the training data
  • valid_file: the path to the validation data
  • gpt_file: the path to the saved GPT PL model
  • hyper_parameter: the hyper-parameter of the model (including the number of encoder/decoder layers, dropout rate, etc.)
  • save_dir: the directory to save the model, default: data/models/
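
Both trainers are configured by editing the settings inside the script rather than through command-line flags. The sketch below is a hypothetical configuration assuming the file layout above: the training/validation file names and the hyper-parameter keys and values are illustrative, not the scripts' exact names.

```python
# Hypothetical settings for src/trainer/gpt_conut_trainer.py; the GPT-FConv
# trainer takes the same settings. Names and values are illustrative only.
vocab_file = 'data/vocabulary/vocabulary.txt'  # vocabulary used in the paper
gpt_file = 'data/models/code_gpt.pt'           # pre-trained GPT PL model
train_file = 'data/data/train.txt'             # assumed training-data path
valid_file = 'data/data/valid.txt'             # assumed validation-data path
save_dir = 'data/models/'                      # default save directory
hyper_parameter = {                            # illustrative keys and values
    'encoder_layers': 4,
    'decoder_layers': 4,
    'dropout': 0.1,
}
```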

To prepare input for new test data, see data/data/prepare_testing_data.py; make sure you read the accompanying readme file and follow its three steps to prepare the test input.

To generate patches, run src/tester/generator.py. Some settings you may need to change:

  • vocab_file: the path to the vocabulary file used by the model
  • input_file: the input data to the model for generating patches, with each line describing one bug in the format: buggy line <CTX> surrounding function. See candidate_patches/QuixBugs/quixbugs_bpe.txt for reference (a sketch of building such a line follows this list)
  • identifier_txt_file: the valid identifiers for each bug, with each line being a list of valid identifiers separated by spaces. See candidate_patches/QuixBugs/identifier.txt for reference
  • identifier_token_file: the tokenized identifiers for each bug, with each line being a list of valid identifiers tokenized by camel case, underscores, and subwords, separated by \t. See candidate_patches/QuixBugs/identifier.tokens for reference
  • output_file: the path to the output result
  • beam_size: the number of candidate patches generated by each model
  • model_file: the path to the saved APR model (CURE's trained models are available at https://zenodo.org/record/7030145#.YwvXfFvMI5l)
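
As a sketch of the input format, a single line for input_file could be assembled as below. The Java fragment and the output path are hypothetical, and real inputs are subword-tokenized first (see candidate_patches/QuixBugs/quixbugs_bpe.txt).

```python
# Illustrative only: write one bug in the "buggy line <CTX> surrounding
# function" format described above. The snippet and path are hypothetical;
# real inputs are BPE-tokenized first.
buggy_line = 'return n * factorial ( n ) ;'
context = ('public static int factorial ( int n ) '
           '{ if ( n == 0 ) return 1 ; return n * factorial ( n ) ; }')
with open('data/data/new_test_input.txt', 'w') as f:
    f.write(buggy_line + ' <CTX> ' + context + '\n')
```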

data/patches/gpt_conut_1.txt and data/patches/gpt_fconv_1.txt are example candidate patches generated by the GPT-CoNuT and GPT-FConv models for the QuixBugs benchmark.

To validate the candidate patches generated by the models, first run src/validation/rerank.py, which reranks the patches generated by all the models and dumps the result into data/patches/reranked_patches.json. Then run src/validation/validate_quixbugs.py or src/validation/validate_defects4j.py, which run the unit test cases (provided by QuixBugs or Defects4J) to validate the candidate patches. The final result is dumped into data/patches/validated_patches.json.
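
Put together, the validation step might look like this sketch, assuming the scripts are run from the repository root with their default settings:

```python
# Minimal sketch of the rerank-then-validate pipeline described above,
# assuming default settings and a QuixBugs run.
import subprocess

# 1. Rerank the candidate patches from all models
subprocess.run(['python', 'src/validation/rerank.py'], check=True)
# -> writes data/patches/reranked_patches.json

# 2. Run the benchmark's unit tests against the reranked patches
subprocess.run(['python', 'src/validation/validate_quixbugs.py'], check=True)
# -> writes data/patches/validated_patches.json
```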

If you use CURE for academic purposes, please cite:

@inproceedings{jiang2021cure,
  author={Jiang, Nan and Lutellier, Thibaud and Tan, Lin},
  booktitle={2021 IEEE/ACM 43rd International Conference on Software Engineering (ICSE)}, 
  title={CURE: Code-Aware Neural Machine Translation for Automatic Program Repair}, 
  year={2021},
  pages={1161-1173},
  doi={10.1109/ICSE43902.2021.00107}
}
