Exploit Those Code Reviews! Bigger Data for Deeper Learning
Wed 25 Aug 2021 21:20 - 21:25 - Analytics & Software Evolution—Code Reviews and Changes Chair(s): Emad Aghajani
Modern code review (MCR) processes are prevalent in organizations
that develop software because of their benefits for quality
assurance and knowledge transfer. With the rise of collaborative
software development platforms like GitHub and
Bitbucket, millions of projects now share not only their
code but also their review data. Although researchers have
tried to exploit this data for more than a decade, most of
that knowledge remains a buried treasure. A crucial catalyst
for many advances in deep learning, however, has been the availability
of large-scale standard datasets for different learning
tasks. This paper presents the ETCR (Exploit Those Code
Reviews!) infrastructure for mining MCR datasets from any
GitHub project practicing pull-request-based development.
We demonstrate its effectiveness with ETCR-Elasticsearch,
a dataset of >231k review comments for >47k Java file revisions
in >40k pull-requests from the Elasticsearch project.
ETCR is designed with the challenges of deep learning in
mind. Compared to previous datasets, ETCR datasets include
all information needed to link review comments to nodes
in the respective program's Abstract Syntax Tree (AST).
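
To illustrate what linking a review comment to an AST node can mean in practice, here is a minimal sketch in Python. It is not taken from ETCR: the `AstNode` and `ReviewComment` types, their fields, and the `innermost_node` helper are hypothetical, and the sketch assumes that each node records the first and last source line it spans and that a comment is anchored to a single line of a file revision. Under those assumptions, the comment is linked to the deepest node whose line span contains the commented line.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class AstNode:
    """Hypothetical AST node with the source line span it covers."""
    kind: str
    start_line: int
    end_line: int
    children: List["AstNode"] = field(default_factory=list)


@dataclass
class ReviewComment:
    """Hypothetical review comment anchored to one line of a file revision."""
    path: str
    line: int
    body: str


def innermost_node(root: AstNode, line: int) -> Optional[AstNode]:
    """Return the deepest node whose line span contains `line`, or None."""
    if not (root.start_line <= line <= root.end_line):
        return None
    # Prefer a child that also covers the line; fall back to this node.
    for child in root.children:
        hit = innermost_node(child, line)
        if hit is not None:
            return hit
    return root


# Toy tree standing in for the AST of a small Java file (assumed shape).
tree = AstNode("CompilationUnit", 1, 12, [
    AstNode("ClassDeclaration", 3, 12, [
        AstNode("MethodDeclaration", 5, 8, [
            AstNode("ReturnStatement", 7, 7),
        ]),
        AstNode("FieldDeclaration", 10, 10),
    ]),
])

comment = ReviewComment("Foo.java", 7, "Can this return null?")
node = innermost_node(tree, comment.line)
print(node.kind)  # ReturnStatement
```

A line-based lookup like this is the simplest possible choice; a real pipeline would likely also need column offsets and a mapping between diff positions and file lines, which this sketch leaves out.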