diff --git a/_publications/add_from_arxiv.py b/_publications/add_from_arxiv.py
index c9cfde73..0d4454a4 100644
--- a/_publications/add_from_arxiv.py
+++ b/_publications/add_from_arxiv.py
@@ -8,7 +8,7 @@ def _first_non_stopword(title: str) -> str:
-    for word in re.split("\W", title.lower()):
+    for word in re.split(r"\W", title.lower()):
         if word in ("a", "an", "the", "is", "are", "what", "who", "your"):
             continue
         return word
@@ -30,7 +30,7 @@ def get_info(paper_id: str, out_dir: str) -> None:
     )
     tmpl = textwrap.dedent(
-        f"""
+        f"""\
 ---
 layout: publication
 title: "{paper.title}"
@@ -38,7 +38,7 @@ def get_info(paper_id: str, out_dir: str) -> None:
 conference:
 year: {paper.published.year}
 additional_links:
-- {{name: "ArXiV", url: "https://arxiv.org/abs/{paper_id}"}}
+ - {{name: "ArXiV", url: "https://arxiv.org/abs/{paper_id}"}}
 tags: ["TODO"]
 ---
 {summary}
diff --git a/_publications/ahmed2024studying.markdown b/_publications/ahmed2024studying.markdown
index 1677b84c..2996a1bf 100644
--- a/_publications/ahmed2024studying.markdown
+++ b/_publications/ahmed2024studying.markdown
@@ -1,4 +1,3 @@
-
 ---
 layout: publication
 title: "Studying LLM Performance on Closed- and Open-source Data"
@@ -6,7 +5,7 @@ authors: Toufique Ahmed, Christian Bird, Premkumar Devanbu, Saikat Chakraborty
 conference:
 year: 2024
 additional_links:
-- {name: "ArXiV", url: "https://arxiv.org/abs/2402.15100"}
+ - {name: "ArXiV", url: "https://arxiv.org/abs/2402.15100"}
 tags: ["Transformers"]
 ---
 Large Language models (LLMs) are finding wide use in software engineering practice. These models are extremely data-hungry, and are largely trained on open-source (OSS) code distributed with permissive licenses. In terms of actual use however, a great deal of software development still occurs in the for-profit/proprietary sphere, where the code under development is not, and never has been, in the public domain; thus, many developers, do their work, and use LLMs, in settings where the models may not be as familiar with the code under development. In such settings, do LLMs work as well as they do for OSS code? If not, what are the differences? When performance differs, what are the possible causes, and are there work-arounds? In this paper, we examine this issue using proprietary, closed-source software data from Microsoft, where most proprietary code is in C# and C++. We find that performance for C# changes little from OSS --> proprietary code, but does significantly reduce for C++; we find that this difference is attributable to differences in identifiers. We also find that some performance degradation, in some cases, can be ameliorated efficiently by in-context learning.
diff --git a/_publications/chen2023supersonic.markdown b/_publications/chen2023supersonic.markdown
index 33e2ff37..053333e2 100644
--- a/_publications/chen2023supersonic.markdown
+++ b/_publications/chen2023supersonic.markdown
@@ -1,4 +1,3 @@
-
 ---
 layout: publication
 title: "Supersonic: Learning to Generate Source Code Optimizations in C/C++"
@@ -6,7 +5,7 @@ authors: Zimin Chen, Sen Fang, Martin Monperrus
 conference:
 year: 2023
 additional_links:
-- {name: "ArXiV", url: "https://arxiv.org/abs/2309.14846"}
+ - {name: "ArXiV", url: "https://arxiv.org/abs/2309.14846"}
 tags: ["optimization"]
 ---
 Software optimization refines programs for resource efficiency while preserving functionality. Traditionally, it is a process done by developers and compilers. This paper introduces a third option, automated optimization at the source code level. We present Supersonic, a neural approach targeting minor source code modifications for optimization. Using a seq2seq model, Supersonic is trained on C/C++ program pairs ($x_{t}$, $x_{t+1}$), where $x_{t+1}$ is an optimized version of $x_{t}$, and outputs a diff. Supersonic's performance is benchmarked against OpenAI's GPT-3.5-Turbo and GPT-4 on competitive programming tasks. The experiments show that Supersonic not only outperforms both models on the code optimization task but also minimizes the extent of the change with a model more than 600x smaller than GPT-3.5-Turbo and 3700x smaller than GPT-4.
diff --git a/_publications/ding2023static.markdown b/_publications/ding2023static.markdown
index a4070318..9d0c4fc8 100644
--- a/_publications/ding2023static.markdown
+++ b/_publications/ding2023static.markdown
@@ -1,4 +1,3 @@
-
 ---
 layout: publication
 title: "A Static Evaluation of Code Completion by Large Language Models"
@@ -6,7 +5,7 @@ authors: Hantian Ding, Varun Kumar, Yuchen Tian, Zijian Wang, Rob Kwiatkowski, X
 conference:
 year: 2023
 additional_links:
-- {name: "ArXiV", url: "https://arxiv.org/abs/2306.03203"}
+ - {name: "ArXiV", url: "https://arxiv.org/abs/2306.03203"}
 tags: ["LLM", "static analysis"]
 ---
 Large language models trained on code have shown great potential to increase productivity of software developers. Several execution-based benchmarks have been proposed to evaluate functional correctness of model-generated code on simple programming problems. Nevertheless, it is expensive to perform the same evaluation on complex real-world projects considering the execution cost. On the contrary, static analysis tools such as linters, which can detect errors without running the program, haven't been well explored for evaluating code generation models. In this work, we propose a static evaluation framework to quantify static errors in Python code completions, by leveraging Abstract Syntax Trees. Compared with execution-based evaluation, our method is not only more efficient, but also applicable to code in the wild. For experiments, we collect code context from open source repos to generate one million function bodies using public models. Our static analysis reveals that Undefined Name and Unused Variable are the most common errors among others made by language models. Through extensive studies, we also show the impact of sampling temperature, model size, and context on static errors in code completions.
diff --git a/_publications/eniser2023automatically.markdown b/_publications/eniser2023automatically.markdown
index 584f40a9..cc664bbb 100644
--- a/_publications/eniser2023automatically.markdown
+++ b/_publications/eniser2023automatically.markdown
@@ -1,4 +1,3 @@
-
 ---
 layout: publication
 title: "Automatically Testing Functional Properties of Code Translation Models"
@@ -6,7 +5,7 @@ authors: Hasan Ferit Eniser, Valentin Wüstholz, Maria Christakis
 conference: AAAI
 year: 2023
 additional_links:
-- {name: "ArXiV", url: "https://arxiv.org/abs/2309.12813"}
+ - {name: "ArXiV", url: "https://arxiv.org/abs/2309.12813"}
 tags: ["translation"]
 ---
 Large language models are becoming increasingly practical for translating code across programming languages, a process known as $transpiling$. Even though automated transpilation significantly boosts developer productivity, a key concern is whether the generated code is correct. Existing work initially used manually crafted test suites to test the translations of a small corpus of programs; these test suites were later automated. In contrast, we devise the first approach for automated, functional, property-based testing of code translation models. Our general, user-provided specifications about the transpiled code capture a range of properties, from purely syntactic to purely semantic ones. As shown by our experiments, this approach is very effective in detecting property violations in popular code translation models, and therefore, in evaluating model quality with respect to given properties. We also go a step further and explore the usage scenario where a user simply aims to obtain a correct translation of some code with respect to certain properties without necessarily being concerned about the overall quality of the model. To this purpose, we develop the first property-guided search procedure for code translation models, where a model is repeatedly queried with slightly different parameters to produce alternative and potentially more correct translations. Our results show that this search procedure helps to obtain significantly better code translations.
diff --git a/_publications/li2023hitchhiker.markdown b/_publications/li2023hitchhiker.markdown
index 7d1bb9ba..eb046f44 100644
--- a/_publications/li2023hitchhiker.markdown
+++ b/_publications/li2023hitchhiker.markdown
@@ -1,4 +1,3 @@
-
 ---
 layout: publication
 title: "The Hitchhiker's Guide to Program Analysis: A Journey with Large Language Models"
@@ -6,7 +5,7 @@ authors: Haonan Li, Yu Hao, Yizhuo Zhai, Zhiyun Qian
 conference:
 year: 2023
 additional_links:
-- {name: "ArXiV", url: "https://arxiv.org/abs/2308.00245"}
+ - {name: "ArXiV", url: "https://arxiv.org/abs/2308.00245"}
 tags: ["static analysis"]
 ---
 Static analysis is a widely used technique in software engineering for identifying and mitigating bugs. However, a significant hurdle lies in achieving a delicate balance between precision and scalability. Large Language Models (LLMs) offer a promising alternative, as recent advances demonstrate remarkable capabilities in comprehending, generating, and even debugging code. Yet, the logic of bugs can be complex and require sophisticated reasoning and a large analysis scope spanning multiple functions. Therefore, at this point, LLMs are better used in an assistive role to complement static analysis. In this paper, we take a deep dive into the open space of LLM-assisted static analysis, using use-before-initialization (UBI) bugs as a case study. To this end, we develop LLift, a fully automated agent that interfaces with both a static analysis tool and an LLM. By carefully designing the agent and the prompts, we are able to overcome a number of challenges, including bug-specific modeling, the large problem scope, the non-deterministic nature of LLMs, etc. Tested in a real-world scenario analyzing nearly a thousand potential UBI bugs produced by static analysis, LLift demonstrates an extremely potent capability, showcasing a high precision (50%) and recall rate (100%). It even identified 13 previously unknown UBI bugs in the Linux kernel. This research paves the way for new opportunities and methodologies in the use of LLMs for bug discovery in extensive, real-world datasets.
diff --git a/_publications/li2023starcoder.markdown b/_publications/li2023starcoder.markdown
index 90474f19..416b3924 100644
--- a/_publications/li2023starcoder.markdown
+++ b/_publications/li2023starcoder.markdown
@@ -1,4 +1,3 @@
-
 ---
 layout: publication
 title: "StarCoder: may the source be with you!"
@@ -6,7 +5,7 @@ authors: Raymond Li, Loubna Ben Allal, Yangtian Zi, Niklas Muennighoff, Denis Ko
 conference:
 year: 2023
 additional_links:
-- {name: "ArXiV", url: "https://arxiv.org/abs/2305.06161"}
+ - {name: "ArXiV", url: "https://arxiv.org/abs/2305.06161"}
 tags: ["Transformer"]
 ---
 The BigCode community, an open-scientific collaboration working on the responsible development of Large Language Models for Code (Code LLMs), introduces StarCoder and StarCoderBase: 15.5B parameter models with 8K context length, infilling capabilities and fast large-batch inference enabled by multi-query attention. StarCoderBase is trained on 1 trillion tokens sourced from The Stack, a large collection of permissively licensed GitHub repositories with inspection tools and an opt-out process. We fine-tuned StarCoderBase on 35B Python tokens, resulting in the creation of StarCoder. We perform the most comprehensive evaluation of Code LLMs to date and show that StarCoderBase outperforms every open Code LLM that supports multiple programming languages and matches or outperforms the OpenAI `code-cushman-001` model. Furthermore, StarCoder outperforms every model that is fine-tuned on Python, can be prompted to achieve 40% pass@1 on HumanEval, and still retains its performance on other programming languages. We take several important steps towards a safe open-access model release, including an improved PII redaction pipeline and a novel attribution tracing tool, and make the StarCoder models publicly available under a more commercially viable version of the Open Responsible AI Model license.
diff --git a/_publications/li2023think.markdown b/_publications/li2023think.markdown
index 89ab1a41..441e3d49 100644
--- a/_publications/li2023think.markdown
+++ b/_publications/li2023think.markdown
@@ -1,4 +1,3 @@
-
 ---
 layout: publication
 title: "Think Outside the Code: Brainstorming Boosts Large Language Models in Code Generation"
@@ -6,7 +5,7 @@ authors: Xin-Ye Li, Jiang-Tian Xue, Zheng Xie, Ming Li
 conference:
 year: 2023
 additional_links:
-- {name: "ArXiV", url: "https://arxiv.org/abs/2305.10679"}
+ - {name: "ArXiV", url: "https://arxiv.org/abs/2305.10679"}
 tags: ["generation", "Transformer"]
 ---
 Code generation aims to automatically generate source code from high-level task specifications, which can significantly increase productivity of software engineering. Recently, approaches based on large language models (LLMs) have shown remarkable code generation abilities on simple tasks. However, generate code for more complex tasks, such as competition-level problems, remains challenging. In this paper, we introduce Brainstorm framework for code generation. It leverages a brainstorming step that generates and selects diverse thoughts on the problem to facilitate algorithmic reasoning, where the thoughts are possible blueprint of solving the problem. We demonstrate that Brainstorm significantly enhances the ability of LLMs to solve competition-level programming problems, resulting in a more than 50% increase in the pass@$k$ metrics for ChatGPT on the CodeContests benchmark, achieving state-of-the-art performance. Furthermore, our experiments conducted on LeetCode contests show that our framework boosts the ability of ChatGPT to a level comparable to that of human programmers.
diff --git a/_publications/liu2023code.markdown b/_publications/liu2023code.markdown
index 15cf547a..2009fd2d 100644
--- a/_publications/liu2023code.markdown
+++ b/_publications/liu2023code.markdown
@@ -1,4 +1,3 @@
-
 ---
 layout: publication
 title: "Code Execution with Pre-trained Language Models"
@@ -6,7 +5,7 @@ authors: Chenxiao Liu, Shuai Lu, Weizhu Chen, Daxin Jiang, Alexey Svyatkovskiy,
 conference:
 year: 2023
 additional_links:
-- {name: "ArXiV", url: "https://arxiv.org/abs/2305.05383"}
+ - {name: "ArXiV", url: "https://arxiv.org/abs/2305.05383"}
 tags: ["Transformer", "execution"]
 ---
 Code execution is a fundamental aspect of programming language semantics that reflects the exact behavior of the code. However, most pre-trained models for code intelligence ignore the execution trace and only rely on source code and syntactic structures. In this paper, we investigate how well pre-trained models can understand and perform code execution. We develop a mutation-based data augmentation technique to create a large-scale and realistic Python dataset and task for code execution, which challenges existing models such as Codex. We then present CodeExecutor, a Transformer model that leverages code execution pre-training and curriculum learning to enhance its semantic comprehension. We evaluate CodeExecutor on code execution and show its promising performance and limitations. We also demonstrate its potential benefits for code intelligence tasks such as zero-shot code-to-code search and text-to-code generation. Our analysis provides insights into the learning and generalization abilities of pre-trained models for code execution.
diff --git a/_publications/mohajer2023skipanalyzer.markdown b/_publications/mohajer2023skipanalyzer.markdown
index 858e960a..cbf424e7 100644
--- a/_publications/mohajer2023skipanalyzer.markdown
+++ b/_publications/mohajer2023skipanalyzer.markdown
@@ -1,4 +1,3 @@
-
 ---
 layout: publication
 title: "SkipAnalyzer: A Tool for Static Code Analysis with Large Language Models"
@@ -6,7 +5,7 @@ authors: Mohammad Mahdi Mohajer, Reem Aleithan, Nima Shiri Harzevili, Moshi Wei,
 conference:
 year: 2023
 additional_links:
-- {name: "ArXiV", url: "https://arxiv.org/abs/2310.18532"}
+ - {name: "ArXiV", url: "https://arxiv.org/abs/2310.18532"}
 tags: ["repair"]
 ---
 We introduce SkipAnalyzer, a large language model (LLM)-powered tool for static code analysis. SkipAnalyzer has three components: 1) an LLM-based static bug detector that scans source code and reports specific types of bugs, 2) an LLM-based false-positive filter that can identify false-positive bugs in the results of static bug detectors (e.g., the result of step 1) to improve detection accuracy, and 3) an LLM-based patch generator that can generate patches for the detected bugs above. As a proof-of-concept, SkipAnalyzer is built on ChatGPT, which has exhibited outstanding performance in various software engineering tasks. To evaluate SkipAnalyzer, we focus on two types of typical and critical bugs that are targeted by static bug detection, i.e., Null Dereference and Resource Leak as subjects. We employ Infer to aid the gathering of these two bug types from 10 open-source projects. Consequently, our experiment dataset contains 222 instances of Null Dereference bugs and 46 instances of Resource Leak bugs. Our study demonstrates that SkipAnalyzer achieves remarkable performance in the mentioned static analysis tasks, including bug detection, false-positive warning removal, and bug repair. In static bug detection, SkipAnalyzer achieves accuracy values of up to 68.37% for detecting Null Dereference bugs and 76.95% for detecting Resource Leak bugs, improving the precision of the current leading bug detector, Infer, by 12.86% and 43.13%, respectively. For removing false-positive warnings, SkipAnalyzer can reach a precision of up to 93.88% for Null Dereference bugs and 63.33% for Resource Leak bugs. Additionally, SkipAnalyzer surpasses state-of-the-art false-positive warning removal tools. Furthermore, in bug repair, SkipAnalyzer can generate syntactically correct patches to fix its detected bugs with a success rate of up to 97.30%.
diff --git a/_publications/muennighoff2023octopack.markdown b/_publications/muennighoff2023octopack.markdown
index 3e5483d7..718e7c30 100644
--- a/_publications/muennighoff2023octopack.markdown
+++ b/_publications/muennighoff2023octopack.markdown
@@ -1,4 +1,3 @@
-
 ---
 layout: publication
 title: "OctoPack: Instruction Tuning Code Large Language Models"
@@ -6,7 +5,7 @@ authors: Niklas Muennighoff, Qian Liu, Armel Zebaze, Qinkai Zheng, Binyuan Hui,
 conference:
 year: 2023
 additional_links:
-- {name: "ArXiV", url: "https://arxiv.org/abs/2308.07124"}
+ - {name: "ArXiV", url: "https://arxiv.org/abs/2308.07124"}
 tags: ["dataset", "instruction tuning"]
 ---
 Finetuning large language models (LLMs) on instructions leads to vast performance improvements on natural language tasks. We apply instruction tuning using code, leveraging the natural structure of Git commits, which pair code changes with human instructions. We compile CommitPack: 4 terabytes of Git commits across 350 programming languages. We benchmark CommitPack against other natural and synthetic code instructions (xP3x, Self-Instruct, OASST) on the 16B parameter StarCoder model, and achieve state-of-the-art performance among models not trained on OpenAI outputs, on the HumanEval Python benchmark (46.2% pass@1). We further introduce HumanEvalPack, expanding the HumanEval benchmark to a total of 3 coding tasks (Code Repair, Code Explanation, Code Synthesis) across 6 languages (Python, JavaScript, Java, Go, C++, Rust). Our models, OctoCoder and OctoGeeX, achieve the best performance across HumanEvalPack among all permissive models, demonstrating CommitPack's benefits in generalizing to a wider set of languages and natural coding tasks. Code, models and data are freely available at https://github.com/bigcode-project/octopack.
diff --git a/_publications/olausson2023demystifying.markdown b/_publications/olausson2023demystifying.markdown
index 08466786..8f89853a 100644
--- a/_publications/olausson2023demystifying.markdown
+++ b/_publications/olausson2023demystifying.markdown
@@ -1,4 +1,3 @@
-
 ---
 layout: publication
 title: "Demystifying GPT Self-Repair for Code Generation"
@@ -6,7 +5,7 @@ authors: Theo X. Olausson, Jeevana Priya Inala, Chenglong Wang, Jianfeng Gao, Ar
 conference:
 year: 2023
 additional_links:
-- {name: "ArXiV", url: "https://arxiv.org/abs/2306.09896"}
+ - {name: "ArXiV", url: "https://arxiv.org/abs/2306.09896"}
 tags: ["repair"]
 ---
 Large Language Models (LLMs) have shown remarkable aptitude in code generation but still struggle on challenging programming tasks. Self-repair -- in which the model debugs and fixes mistakes in its own code -- has recently become a popular way to boost performance in these settings. However, only very limited studies on how and when self-repair works effectively exist in the literature, and one might wonder to what extent a model is really capable of providing accurate feedback on why the code is wrong when that code was generated by the same model. In this paper, we analyze GPT-3.5 and GPT-4's ability to perform self-repair on APPS, a challenging dataset consisting of diverse coding challenges. To do so, we first establish a new evaluation strategy dubbed pass@t that measures the pass rate of the tasks against the total number of tokens sampled from the model, enabling a fair comparison to purely sampling-based approaches. With this evaluation strategy, we find that the effectiveness of self-repair is only seen in GPT-4. We also observe that self-repair is bottlenecked by the feedback stage; using GPT-4 to give feedback on the programs generated by GPT-3.5 and using expert human programmers to give feedback on the programs generated by GPT-4, we unlock significant performance gains.
diff --git a/_publications/peng2023generative.markdown b/_publications/peng2023generative.markdown
index f794b7c1..7238aea7 100644
--- a/_publications/peng2023generative.markdown
+++ b/_publications/peng2023generative.markdown
@@ -1,4 +1,3 @@
-
 ---
 layout: publication
 title: "Generative Type Inference for Python"
@@ -6,7 +5,7 @@ authors: Yun Peng, Chaozheng Wang, Wenxuan Wang, Cuiyun Gao, Michael R. Lyu
 conference:
 year: 2023
 additional_links:
-- {name: "ArXiV", url: "https://arxiv.org/abs/2307.09163"}
+ - {name: "ArXiV", url: "https://arxiv.org/abs/2307.09163"}
 tags: ["types"]
 ---
 Python is a popular dynamic programming language, evidenced by its ranking as the second most commonly used language on GitHub. However, its dynamic type system can lead to potential type errors, leading researchers to explore automatic type inference approaches for Python programs. The rule-based type inference approaches can ensure the accuracy of predicted variable types, but they suffer from low coverage problems. Supervised type inference approaches, while feature-agnostic, require large, high-quality annotated datasets and are limited to pre-defined types. As zero-shot approaches, the cloze-style approaches reformulate the type inference problem into a fill-in-the-blank problem. However, their performance is limited. This paper introduces TypeGen, a few-shot generative type inference approach that incorporates static domain knowledge from static analysis. TypeGen creates chain-of-thought (COT) prompts by translating the type inference steps of static analysis into prompts based on the type dependency graphs (TDGs), enabling language models to learn from how static analysis infers types. By combining COT prompts with code slices and type hints, TypeGen constructs example prompts from human annotations. TypeGen only requires very few annotated examples to teach language models to generate similar COT prompts via in-context learning. Moreover, TypeGen enhances the interpretability of results through the use of the input-explanation-output strategy. Experiments show that TypeGen outperforms the best baseline Type4Py by 10.0% for argument type prediction and 22.5% in return value type prediction in terms of top-1 Exact Match by using only five examples. Furthermore, TypeGen achieves substantial improvements of 27% to 84% compared to the zero-shot performance of large language models with parameter sizes ranging from 1.3B to 175B in terms of top-1 Exact Match.
diff --git a/_publications/shrivastava2023repofusion.markdown b/_publications/shrivastava2023repofusion.markdown
index 8cea558a..e450ec90 100644
--- a/_publications/shrivastava2023repofusion.markdown
+++ b/_publications/shrivastava2023repofusion.markdown
@@ -1,4 +1,3 @@
-
 ---
 layout: publication
 title: "RepoFusion: Training Code Models to Understand Your Repository"
@@ -6,7 +5,7 @@ authors: Disha Shrivastava, Denis Kocetkov, Harm de Vries, Dzmitry Bahdanau, Tor
 conference:
 year: 2023
 additional_links:
-- {name: "ArXiV", url: "https://arxiv.org/abs/2306.10998"}
+ - {name: "ArXiV", url: "https://arxiv.org/abs/2306.10998"}
 tags: ["completion"]
 ---
 Despite the huge success of Large Language Models (LLMs) in coding assistants like GitHub Copilot, these models struggle to understand the context present in the repository (e.g., imports, parent classes, files with similar names, etc.), thereby producing inaccurate code completions. This effect is more pronounced when using these assistants for repositories that the model has not seen during training, such as proprietary software or work-in-progress code projects. Recent work has shown the promise of using context from the repository during inference. In this work, we extend this idea and propose RepoFusion, a framework to train models to incorporate relevant repository context. Experiments on single-line code completion show that our models trained with repository context significantly outperform much larger code models as CodeGen-16B-multi ($\sim73\times$ larger) and closely match the performance of the $\sim 70\times$ larger StarCoderBase model that was trained with the Fill-in-the-Middle objective. We find these results to be a novel and compelling demonstration of the gains that training with repository context can bring. We carry out extensive ablation studies to investigate the impact of design choices such as context type, number of contexts, context length, and initialization within our framework. Lastly, we release Stack-Repo, a dataset of 200 Java repositories with permissive licenses and near-deduplicated files that are augmented with three types of repository contexts. Additionally, we are making available the code and trained checkpoints for our work. Our released resources can be found at \url{https://huggingface.co/RepoFusion}.
diff --git a/_publications/silva2023repairllama.markdown b/_publications/silva2023repairllama.markdown
index 8969a41f..42df7795 100644
--- a/_publications/silva2023repairllama.markdown
+++ b/_publications/silva2023repairllama.markdown
@@ -1,4 +1,3 @@
-
 ---
 layout: publication
 title: "RepairLLaMA: Efficient Representations and Fine-Tuned Adapters for Program Repair"
@@ -6,7 +5,7 @@ authors: André Silva, Sen Fang, Martin Monperrus
 conference:
 year: 2023
 additional_links:
-- {name: "ArXiV", url: "https://arxiv.org/abs/2312.15698"}
+ - {name: "ArXiV", url: "https://arxiv.org/abs/2312.15698"}
 tags: ["repair"]
 ---
 Automated Program Repair (APR) has evolved significantly with the advent of Large Language Models (LLMs). Fine-tuning LLMs for program repair is a recent avenue of research, with many dimensions which have not been explored. Existing work mostly fine-tunes LLMs with naive code representations and is fundamentally limited in its ability to fine-tune larger LLMs. To address this problem, we propose RepairLLaMA, a novel program repair approach that combines 1) code representations for APR and 2) the state-of-the-art parameter-efficient LLM fine-tuning technique called LoRA. This results in RepairLLaMA producing a highly effective `program repair adapter' for fixing bugs with language models. Our experiments demonstrate the validity of both concepts. First, fine-tuning adapters with program repair specific code representations enables the model to use meaningful repair signals. Second, parameter-efficient fine-tuning helps fine-tuning to converge and contributes to the effectiveness of the repair adapter to fix data-points outside the fine-tuning data distribution. Overall, RepairLLaMA correctly fixes 125 Defects4J v2 and 82 HumanEval-Java bugs, outperforming all baselines.
diff --git a/_publications/wang2023codet5.markdown b/_publications/wang2023codet5.markdown
index 1c4abb27..a75b04a2 100644
--- a/_publications/wang2023codet5.markdown
+++ b/_publications/wang2023codet5.markdown
@@ -1,4 +1,3 @@
-
 ---
 layout: publication
 title: "CodeT5+: Open Code Large Language Models for Code Understanding and Generation"
@@ -6,7 +5,7 @@ authors: Yue Wang, Hung Le, Akhilesh Deepak Gotmare, Nghi D. Q. Bui, Junnan Li,
 conference:
 year: 2023
 additional_links:
-- {name: "ArXiV", url: "https://arxiv.org/abs/2305.07922"}
+ - {name: "ArXiV", url: "https://arxiv.org/abs/2305.07922"}
 tags: ["Transformer"]
 ---
 Large language models (LLMs) pretrained on vast source code have achieved prominent progress in code intelligence. However, existing code LLMs have two main limitations in terms of architecture and pretraining tasks. First, they often adopt a specific architecture (encoder-only or decoder-only) or rely on a unified encoder-decoder network for different downstream tasks. The former paradigm is limited by inflexibility in applications while in the latter, the model is treated as a single system for all tasks, leading to suboptimal performance on a subset of tasks. Secondly, they often employ a limited set of pretraining objectives which might not be relevant to some downstream tasks and hence result in substantial performance degrade. To address these limitations, we propose ``CodeT5+'', a family of encoder-decoder LLMs for code in which component modules can be flexibly combined to suit a wide range of downstream code tasks. Such flexibility is enabled by our proposed mixture of pretraining objectives to mitigate the pretrain-finetune discrepancy. These objectives cover span denoising, contrastive learning, text-code matching, and causal LM pretraining tasks, on both unimodal and bimodal multilingual code corpora. Furthermore, we propose to initialize CodeT5+ with frozen off-the-shelf LLMs without training from scratch to efficiently scale up our models, and explore instruction-tuning to align with natural language instructions. We extensively evaluate CodeT5+ on over 20 code-related benchmarks in different settings, including zero-shot, finetuning, and instruction-tuning. We observe state-of-the-art (SoTA) model performance on various code-related tasks, such as code generation and completion, math programming, and text-to-code retrieval tasks. Particularly, our instruction-tuned CodeT5+ 16B achieves new SoTA results on HumanEval code generation task against other open code LLMs.
diff --git a/_publications/xia2023universal.markdown b/_publications/xia2023universal.markdown
index ac8789e1..0f20b845 100644
--- a/_publications/xia2023universal.markdown
+++ b/_publications/xia2023universal.markdown
@@ -1,4 +1,3 @@
-
 ---
 layout: publication
 title: "Universal Fuzzing via Large Language Models"
@@ -6,7 +5,7 @@ authors: Chunqiu Steven Xia, Matteo Paltenghi, Jia Le Tian, Michael Pradel, Ling
 conference:
 year: 2023
 additional_links:
-- {name: "ArXiV", url: "https://arxiv.org/abs/2308.04748"}
+ - {name: "ArXiV", url: "https://arxiv.org/abs/2308.04748"}
 tags: ["fuzzing"]
 ---
 Fuzzing has achieved tremendous success in discovering bugs and vulnerabilities in various software systems. Systems under test (SUTs) that take in programming or formal language as inputs, e.g., compilers, runtime engines, constraint solvers, and software libraries with accessible APIs, are especially important as they are fundamental building blocks of software development. However, existing fuzzers for such systems often target a specific language, and thus cannot be easily applied to other languages or even other versions of the same language. Moreover, the inputs generated by existing fuzzers are often limited to specific features of the input language, and thus can hardly reveal bugs related to other or new features. This paper presents Fuzz4All, the first fuzzer that is universal in the sense that it can target many different input languages and many different features of these languages. The key idea behind Fuzz4All is to leverage large language models (LLMs) as an input generation and mutation engine, which enables the approach to produce diverse and realistic inputs for any practically relevant language. To realize this potential, we present a novel autoprompting technique, which creates LLM prompts that are wellsuited for fuzzing, and a novel LLM-powered fuzzing loop, which iteratively updates the prompt to create new fuzzing inputs. We evaluate Fuzz4All on nine systems under test that take in six different languages (C, C++, Go, SMT2, Java and Python) as inputs. The evaluation shows, across all six languages, that universal fuzzing achieves higher coverage than existing, language-specific fuzzers. Furthermore, Fuzz4All has identified 76 bugs in widely used systems, such as GCC, Clang, Z3, CVC5, OpenJDK, and the Qiskit quantum computing platform, with 47 bugs already confirmed by developers as previously unknown.
diff --git a/_publications/yin2022natural.markdown b/_publications/yin2022natural.markdown
index bd44ea68..da39d6cf 100644
--- a/_publications/yin2022natural.markdown
+++ b/_publications/yin2022natural.markdown
@@ -1,4 +1,3 @@
-
 ---
 layout: publication
 title: "Natural Language to Code Generation in Interactive Data Science Notebooks"
@@ -6,7 +5,7 @@ authors: Pengcheng Yin, Wen-Ding Li, Kefan Xiao, Abhishek Rao, Yeming Wen, Kense
 conference:
 year: 2022
 additional_links:
-- {name: "ArXiV", url: "https://arxiv.org/abs/2212.09248"}
+ - {name: "ArXiV", url: "https://arxiv.org/abs/2212.09248"}
 tags: ["notebook", "evaluation"]
 ---
 Computational notebooks, such as Jupyter notebooks, are interactive computing environments that are ubiquitous among data scientists to perform data wrangling and analytic tasks. To measure the performance of AI pair programmers that automatically synthesize programs for those tasks given natural language (NL) intents from users, we build ARCADE, a benchmark of 1082 code generation problems using the pandas data analysis framework in data science notebooks. ARCADE features multiple rounds of NL-to-code problems from the same notebook. It requires a model to understand rich multi-modal contexts, such as existing notebook cells and their execution states as well as previous turns of interaction. To establish a strong baseline on this challenging task, we develop PaChiNCo, a 62B code language model (LM) for Python computational notebooks, which significantly outperforms public code LMs. Finally, we explore few-shot prompting strategies to elicit better code with step-by-step decomposition and NL explanation, showing the potential to improve the diversity and explainability of model predictions.
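
A minimal sketch of what the two add_from_arxiv.py changes at the top of this patch do, kept outside the diff hunks. It assumes plain CPython; the function name mirrors the script, but the indentation inside the f-string and the sample title are illustrative. The raw-string pattern avoids the invalid-escape-sequence warning that "\W" triggers, and the backslash after the opening triple quote drops the leading newline, so generated front matter starts at the "---" delimiter instead of the blank first line this patch removes from the existing markdown files.

# Sketch only: mirrors the patched _first_non_stopword and template logic.
import re
import textwrap


def _first_non_stopword(title: str) -> str:
    # r"\W" avoids the invalid-escape-sequence warning emitted for the plain "\W" pattern.
    for word in re.split(r"\W", title.lower()):
        if word in ("a", "an", "the", "is", "are", "what", "who", "your"):
            continue
        return word


title = "A Static Evaluation of Code Completion by Large Language Models"
print(_first_non_stopword(title))  # -> "static", matching the ding2023static filename

# The trailing backslash removes the leading newline, so the dedented template
# begins directly with the "---" front-matter delimiter rather than a blank line.
tmpl = textwrap.dedent(
    f"""\
    ---
    layout: publication
    title: "{title}"
    ---
    """
)
assert tmpl.startswith("---\n")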