A Case Study of LLM for Automated Vulnerability Repair: Assessing Impact of Reasoning and Patch Validation Feedback (AIware 2024 - Main Track)

Who

Ummay Kulsum, Haotian Zhu, Bowen Xu, Marcelo d'Amorim

Track

AIware 2024 Main Track

Time Zone

The program is currently displayed in (GMT-03:00) Brasilia, Distrito Federal, Brazil.

The GMT offsets shown reflect the offsets at the moment of the conference.

When

Mon 15 Jul 2024 16:10 - 16:20 at Mandacaru - Security and Safety + Round Table + Day1 Closing. Chair(s): Thomas Zimmermann, Ahmed E. Hassan

Abstract

Recent work in automated program repair (APR) proposes the use of reasoning and patch validation feedback to reduce the semantic gap between LLMs and the code under analysis. The idea has been shown to perform well for general APR, but its effectiveness in more specific contexts remains underexplored.

In this work, we assess the impact of providing reasoning and patch validation feedback to LLMs in the context of vulnerability repair, an important and challenging task in security. To support the evaluation, we present VRpilot, an LLM-based vulnerability repair technique based on reasoning and patch validation feedback. VRpilot (1) uses a chain-of-thought prompt to reason about a vulnerability prior to generating patch candidates and (2) iteratively refines prompts according to the output of external tools (e.g., compiler, code sanitizers, test suite) on previously generated patches.

To evaluate performance, we compare VRpilot against the state-of-the-art vulnerability repair techniques for C and Java using public datasets from the literature. Our results show that VRpilot generates, on average, 14% and 7.6% more correct patches than the baseline techniques on C and Java, respectively. We show, through an ablation study, that reasoning and patch validation feedback are critical. We report several lessons from this study and potential directions for advancing LLM-empowered vulnerability repair.
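
The two-step loop described in the abstract (reason first, then refine the prompt with tool output) can be illustrated with a minimal sketch. This is not VRpilot's implementation: the `repair` function, the prompt wording, the `ValidationResult` type, and the `fake_llm` and `fake_sanitizer` stand-ins are all hypothetical names chosen for illustration, and the round limit of 3 is an arbitrary assumption.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass
class ValidationResult:
    ok: bool
    feedback: str  # tool output, e.g. a compiler error or sanitizer report


# Hypothetical interfaces, not taken from the paper.
LLM = Callable[[str], str]                     # prompt -> completion
Validator = Callable[[str], ValidationResult]  # candidate patch -> result


def first_failure(patch: str, validators: List[Validator]) -> Optional[str]:
    """Run validators in order and return the first failure message, if any."""
    for validate in validators:
        result = validate(patch)
        if not result.ok:
            return result.feedback
    return None


def repair(code: str, llm: LLM, validators: List[Validator],
           max_rounds: int = 3) -> Optional[str]:
    # Step 1: ask the model to reason about the vulnerability before patching.
    reasoning = llm("Explain step by step why this code is vulnerable:\n" + code)
    prompt = ("Vulnerable code:\n" + code + "\nAnalysis:\n" + reasoning
              + "\nWrite a patched version.")
    for _ in range(max_rounds):
        patch = llm(prompt)
        failure = first_failure(patch, validators)
        if failure is None:
            return patch  # every validator accepted the candidate
        # Step 2: feed the tool output back into the next prompt.
        prompt += ("\nPrevious patch:\n" + patch + "\nValidation feedback:\n"
                   + failure + "\nRevise the patch.")
    return None  # no candidate passed validation within the round limit


# Toy stand-ins so the sketch runs without a model or a toolchain.
def fake_llm(prompt: str) -> str:
    if prompt.startswith("Explain"):
        return "strcpy may write past the end of buf."
    if "Validation feedback" in prompt:
        return "strncpy(buf, src, sizeof(buf) - 1);"
    return "strcpy(buf, src);"


def fake_sanitizer(patch: str) -> ValidationResult:
    if "strncpy" not in patch:
        return ValidationResult(False, "stack-buffer-overflow reported in strcpy")
    return ValidationResult(True, "")


if __name__ == "__main__":
    print(repair("strcpy(buf, src);", fake_llm, [fake_sanitizer]))
```

In this toy run the first candidate fails the stand-in sanitizer, the failure message is appended to the prompt, and the second candidate passes. A real setup would replace `fake_llm` with a model call and the validators with actual compiler, sanitizer, and test-suite invocations.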

DOI

https://doi.org/10.1145/3664646.3664770

Ummay Kulsum

North Carolina State University

United States

Haotian Zhu

Singapore Management University

Singapore

Bowen Xu

North Carolina State University

United States

Marcelo d'Amorim

North Carolina State University

United States

Session Program

Mon 15 Jul
Displayed time zone: Brasilia, Distrito Federal, Brazil

16:00 - 18:00	Security and Safety + Round Table + Day1 Closing (Main Track / Late Breaking Arxiv Track) at Mandacaru. Chair(s): Thomas Zimmermann (Microsoft Research), Ahmed E. Hassan (Queen’s University)

16:00 5m Paper		An AI System Evaluation Framework for Advancing AI Safety: Terminology, Taxonomy, Lifecycle Mapping. Main Track. Boming Xia (CSIRO's Data61 & University of New South Wales), Qinghua Lu (Data61, CSIRO), Liming Zhu (CSIRO’s Data61), Zhenchang Xing (CSIRO's Data61)
16:05 5m Paper		Measuring Impacts of Poisoning on Model Parameters and Embeddings for Large Language Models of Code. Main Track. Aftab Hussain (University of Houston), Rafiqul Rabin (University of Houston), Amin Alipour (University of Houston)
16:10 10m Paper		A Case Study of LLM for Automated Vulnerability Repair: Assessing Impact of Reasoning and Patch Validation Feedback. Main Track. Ummay Kulsum (North Carolina State University), Haotian Zhu (Singapore Management University), Bowen Xu (North Carolina State University), Marcelo d'Amorim (North Carolina State University)
16:20 5m Paper		Trojans in Large Language Models of Code: A Critical Review through a Trigger-Based Taxonomy. Late Breaking Arxiv Track. Aftab Hussain (University of Houston), Rafiqul Rabin (University of Houston), Toufique Ahmed (University of California at Davis), Bowen Xu (North Carolina State University), Premkumar Devanbu (UC Davis), Amin Alipour (University of Houston)
16:25 25m Live Q&A		Session Q&A and topic discussions Main Track
16:50 60m Panel		Round Table Main Track
17:50 10m Day closing		Day 1 summary and closing Main Track