Would You Like a Quick Peek? Providing Logging Support to Monitor Data Processing in Big Data Applications (ESEC/FSE 2021 - Research Papers)

Who

Zehao Wang, Haoxiang Zhang, Tse-Hsun (Peter) Chen, Shaowei Wang

Track

ESEC/FSE 2021 Research Papers

Time Zone

The program is currently displayed in (GMT+03:00) Athens.

The GMT offsets shown reflect the offsets at the moment of the conference.

When

Wed 25 Aug 2021 08:10 - 08:20 - Analysis—Development Tools Chair(s): Gunel Jahangirova
Wed 25 Aug 2021 20:10 - 20:20 - Analysis—Development Tools Chair(s): Rui Abreu

Abstract

To analyze large-scale data efficiently, developers have created various big data processing frameworks (e.g., Apache Spark). These frameworks provide abstractions so that developers can focus on implementing the data analysis logic. In traditional software systems, developers leverage logging to monitor applications and record intermediate states to assist workload understanding and issue diagnosis. However, due to the abstraction and the peculiarity of big data frameworks, there is currently no effective monitoring approach for big data applications. In this paper, we first manually study 1,000 randomly sampled Spark-related questions on Stack Overflow to identify their root causes and the type of information that, if recorded, can assist developers with monitoring and diagnosis. Then, we design an approach, DPLOG, which assists developers with monitoring Spark applications. DPLOG leverages statistical sampling to minimize performance overhead and provides intermediate information and hint/warning messages for each data processing step of a chained method pipeline. We evaluate DPLOG on six benchmarking programs and find that DPLOG has a relatively small overhead (i.e., less than a 10% increase in response time when processing 5GB of data) compared with running without DPLOG, and reduces the overhead by over 500% compared to the baseline. Our user study with 20 developers shows that DPLOG can reduce the time needed to debug big data applications by 63%, and the participants give DPLOG an average of 4.85/5 for its usefulness. The idea of DPLOG may be applied to other big data processing frameworks, and our study sheds light on future research opportunities in assisting developers with monitoring big data applications.
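
To make the idea of sampled, per-step logging in a chained pipeline more concrete, here is a minimal sketch in plain Python. It is not DPLOG's implementation and does not use Spark. The class `PeekPipeline`, the helper `reservoir_sample`, the log entry fields, and the warning text are all hypothetical names chosen for illustration. The sketch also evaluates each step eagerly on an in-memory list, which differs from how a distributed, lazily evaluated framework would behave.

```python
import random
from typing import Any, Callable, Dict, Iterable, List


def reservoir_sample(items: Iterable[Any], k: int, rng: random.Random) -> List[Any]:
    """Pick up to k items uniformly at random in a single pass (reservoir sampling)."""
    sample: List[Any] = []
    for i, item in enumerate(items):
        if i < k:
            sample.append(item)
        else:
            j = rng.randint(0, i)  # randint is inclusive on both ends
            if j < k:
                sample[j] = item
    return sample


class PeekPipeline:
    """Hypothetical chained pipeline that logs a small sample after every step."""

    def __init__(self, data: Iterable[Any], sample_size: int = 3, seed: int = 0):
        self._data = list(data)
        self._k = sample_size
        self._rng = random.Random(seed)
        self.log: List[Dict[str, Any]] = []  # one entry per processing step

    def _record(self, step: str) -> "PeekPipeline":
        # Log only a few sampled records instead of the full intermediate result.
        entry: Dict[str, Any] = {
            "step": step,
            "count": len(self._data),
            "sample": reservoir_sample(self._data, self._k, self._rng),
            "warning": None,
        }
        if not self._data:
            # Illustrative hint: an empty intermediate result is often unintended.
            entry["warning"] = f"step '{step}' produced no records"
        self.log.append(entry)
        return self

    def map(self, fn: Callable[[Any], Any]) -> "PeekPipeline":
        self._data = [fn(x) for x in self._data]
        return self._record("map")

    def filter(self, pred: Callable[[Any], bool]) -> "PeekPipeline":
        self._data = [x for x in self._data if pred(x)]
        return self._record("filter")

    def collect(self) -> List[Any]:
        return list(self._data)


# Usage: the filter threshold is too high, so the second step yields nothing.
pipeline = PeekPipeline(range(1000)).map(lambda x: x * 2).filter(lambda x: x > 5000)
result = pipeline.collect()
for entry in pipeline.log:
    print(entry)
```

In this toy run the log shows 1,000 records after `map` and none after `filter`, together with a warning for the empty step, which is the kind of intermediate information that would otherwise be hidden inside the chained call.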

Link to Preprint

https://petertsehsun.github.io/papers/fse2021_dplog.pdf

DOI

https://doi.org/10.1145/3468264.3468613

Zehao Wang

Concordia University

Canada

Haoxiang Zhang

Huawei

Canada

Tse-Hsun (Peter) Chen

Concordia University

Canada

Shaowei Wang

University of Manitoba

Canada

Session Program

Wed 25 Aug
Displayed time zone: Athens

08:00 - 09:00	Analysis—Development Tools: Demonstrations / Research Papers / Journal First +12h Chair(s): Gunel Jahangirova, USI Lugano

08:00 10m Paper		DIFFBASE: A Differential Factbase for Effective Software Evolution Management (Best Artifact Award) Research Papers Xiuheng Wu Nanyang Technological University, Chenguang Zhu University of Texas at Austin, Yi Li Nanyang Technological University DOI Pre-print
08:10 10m Paper		Would You Like a Quick Peek? Providing Logging Support to Monitor Data Processing in Big Data Applications Research Papers Zehao Wang Concordia University, Haoxiang Zhang Huawei, Tse-Hsun (Peter) Chen Concordia University, Shaowei Wang University of Manitoba DOI Pre-print
08:20 5m Paper		Slicer4J: A Dynamic Slicer for Java Demonstrations Khaled Ahmed University of British Columbia, Mieszko Lis University of British Columbia, Julia Rubin University of British Columbia DOI Pre-print Media Attached
08:25 5m Paper		Information Needs: Lessons for Programming Tools Journal First Thomas LaToza George Mason University DOI Pre-print
08:30 30m Live Q&A		Q&A (Analysis—Development Tools) Research Papers

20:00 - 21:00	Analysis—Development Tools: Journal First / Demonstrations / Research Papers Chair(s): Rui Abreu, University of Porto

20:00 10m Paper		DIFFBASE: A Differential Factbase for Effective Software Evolution Management (Best Artifact Award) Research Papers Xiuheng Wu Nanyang Technological University, Chenguang Zhu University of Texas at Austin, Yi Li Nanyang Technological University DOI Pre-print
20:10 10m Paper		Would You Like a Quick Peek? Providing Logging Support to Monitor Data Processing in Big Data Applications Research Papers Zehao Wang Concordia University, Haoxiang Zhang Huawei, Tse-Hsun (Peter) Chen Concordia University, Shaowei Wang University of Manitoba DOI Pre-print
20:20 5m Paper		Slicer4J: A Dynamic Slicer for Java Demonstrations Khaled Ahmed University of British Columbia, Mieszko Lis University of British Columbia, Julia Rubin University of British Columbia DOI Pre-print Media Attached
20:25 5m Paper		Information Needs: Lessons for Programming Tools Journal First Thomas LaToza George Mason University DOI Pre-print
20:30 30m Live Q&A		Q&A (Analysis—Development Tools) Research Papers