Reassessing Automatic Evaluation Metrics for Code Summarization Tasks (ESEC/FSE 2021 - Research Papers)

Who

Devjeet Roy, Sarah Fakhoury, Venera Arnaoudova

Track

ESEC/FSE 2021 Research Papers

When

Thu 26 Aug 2021 08:00 - 08:10 - Analytics & Software Evolution—Metrics Chair(s): Christof Ebert
Thu 26 Aug 2021 20:00 - 20:10 - Analytics & Software Evolution—Metrics Chair(s): Tushar Sharma, Alexander Chatzigeorgiou

Abstract

In recent years, research in the domain of source code summarization has adopted data-driven techniques pioneered in machine translation (MT). Automatic evaluation metrics such as BLEU, METEOR, and ROUGE are fundamental to the evaluation of MT systems and have been adopted as proxies of human evaluation in the code summarization domain. However, the extent to which automatic metrics agree with the gold standard of human evaluation has not been evaluated on code summarization tasks. Despite this, marginal improvements in metric scores are often used to discriminate between the performance of competing summarization models.
In this paper, we present a critical exploration of the applicability and interpretation of automatic metrics as evaluation techniques for code summarization tasks. We conduct an empirical study with 226 human annotators to assess the degree to which automatic metrics reflect human evaluation. Results indicate that metric improvements of less than 2 points do not guarantee systematic improvements in summarization quality, and are unreliable as proxies of human evaluation.
When the difference between metric scores for two summarization approaches increases but remains within 5 points, some metrics such as METEOR and chrF become highly reliable proxies, whereas others, such as corpus BLEU, remain unreliable. Based on these findings, we make several recommendations for the use of automatic metrics to discriminate model performance in code summarization.

Link to Preprint

https://sarahfakhoury.com/2021-FSE-Summarization-Metrics.pdf

DOI

https://doi.org/10.1145/3468264.3468588

Devjeet Roy

Washington State University

United States

Sarah Fakhoury

Washington State University

United States

Venera Arnaoudova

Washington State University

United States

Session Program

Thu 26 Aug
Displayed time zone: Athens

08:00 - 09:00	Analytics & Software Evolution—Metrics, Research Papers / Journal First, +12h, Chair(s): Christof Ebert (Vector Consulting)

08:00 10m Research paper		Reassessing Automatic Evaluation Metrics for Code Summarization Tasks (Research Papers). Devjeet Roy (Washington State University), Sarah Fakhoury (Washington State University), Venera Arnaoudova (Washington State University)
08:10 10m Paper		A Defect Estimator for Source Code: Linking Defect Reports with Programming Constructs Usage Metrics (Journal First). Ritu Kapur (University of Sannio), Balwinder Sodhi (Indian Institute of Technology (IIT) Ropar, Punjab, India)
08:20 5m Paper		Explaining Essential and Accidental Code Elements and Their Influences on Code Complexity Increase (Journal First). Vard Antinyan (Volvo Car Group)
08:25 35m Live Q&A		Q&A (Analytics & Software Evolution—Metrics) Research Papers

20:00 - 21:00	Analytics & Software Evolution—Metrics, Journal First / Research Papers, Chair(s): Tushar Sharma (Siemens Research), Alexander Chatzigeorgiou (University of Macedonia)

20:00 10m Research paper		Reassessing Automatic Evaluation Metrics for Code Summarization Tasks (Research Papers). Devjeet Roy (Washington State University), Sarah Fakhoury (Washington State University), Venera Arnaoudova (Washington State University)
20:10 10m Paper		A Defect Estimator for Source Code: Linking Defect Reports with Programming Constructs Usage Metrics (Journal First). Ritu Kapur (University of Sannio), Balwinder Sodhi (Indian Institute of Technology (IIT) Ropar, Punjab, India)
20:20 5m Paper		Explaining Essential and Accidental Code Elements and Their Influences on Code Complexity Increase (Journal First). Vard Antinyan (Volvo Car Group)
20:25 35m Live Q&A		Q&A (Analytics & Software Evolution—Metrics) Research Papers