Benchmarking Automated GUI Testing for Android against Real-World Bugs (ESEC/FSE 2021 - Research Papers)

Who

Ting Su, Jue Wang, Zhendong Su

Track

ESEC/FSE 2021 Research Papers

Time Zone

Times are shown in (GMT+03:00) Athens, the conference time zone.

The GMT offsets shown reflect the offsets at the moment of the conference.

When

Thu 26 Aug 2021 12:10 - 12:20 - Human Aspects—HCI and Mobile Chair(s): Jürgen Cito
Fri 27 Aug 2021 00:10 - 00:20 - Human Aspects—HCI and Mobile Chair(s): Gustavo Pinto

Abstract

For ensuring the reliability of Android apps, there has been tremendous, continuous progress
on improving automated GUI testing in the past decade.
Specifically, dozens of testing techniques and tools
have been developed and demonstrated to be effective
in detecting crash bugs and to outperform their respective prior work
in the number of detected crashes.
However, an overarching question, "How effectively and thoroughly
can these tools find crash bugs in practice?", has not been
well explored; answering it requires a ground-truth benchmark with real-world bugs.
Since prior studies focus on tool comparisons w.r.t. some selected apps, they cannot
provide direct, in-depth answers to this question.

To complement existing work and tackle the above question,
this paper offers the first ground-truth empirical evaluation of
automated GUI testing for Android.
To this end, we devote substantial manual effort to
set up the Themis benchmark set, including (1) a carefully constructed dataset
with 52 real, reproducible crash bugs (taking
two person-months for its collection and validation), and (2)
a unified, extensible infrastructure with six
recent state-of-the-art testing tools.
The whole evaluation has taken over 10,920 CPU hours.
We find a considerable gap in these tools' ability to
find the collected real bugs: 18 of the 52 bugs cannot be detected by any tool.
Our systematic analysis further identifies five major common
challenges that these tools face, and reveals additional findings
such as factors affecting these tools in bug finding and opportunities
for tool improvements.
Overall, this work offers new concrete insights, most of which
were previously unknown or unstated and difficult to obtain.
Our study presents a new perspective, complementary to prior
studies, for understanding
and analyzing the effectiveness of existing testing tools,
as well as a benchmark for future research on this topic.
The Themis benchmark is publicly available at
https://github.com/the-themis-benchmarks/home.

Link to Preprint

https://tingsu.github.io/files/fse21-themis.pdf

DOI

https://doi.org/10.1145/3468264.3468620

Ting Su

East China Normal University

China

Jue Wang

Nanjing University

China

Zhendong Su

ETH Zurich

Switzerland
