Predicting Node Failures in an Ultra-large-scale Cloud Computing Platform: an AIOps Solution: A Journal First Presentation Proposal (ESEC/FSE 2021 - Journal First)

Who

Yangguang Li, Zhen Ming (Jack) Jiang, Heng Li, Ahmed E. Hassan, Cheng He, Ruirui Huang, Zhengda Zeng, Mian Wang, PIN AN CHEN

Track

ESEC/FSE 2021 Journal First

Time Zone

The program is currently displayed in (GMT+03:00) Athens.

When

Wed 25 Aug 2021 11:10 - 11:20 - Analytics & Software Evolution—Continuous Integration and Delivery Chair(s): Fiorella Zampetti
Wed 25 Aug 2021 23:10 - 23:20 - Analytics & Software Evolution—Continuous Integration and Delivery Chair(s): Gustavo Pinto

Abstract

Many software services are now hosted on cloud computing platforms, such as Amazon EC2, because of benefits such as reduced operational costs. However, node failures in these platforms can impact the availability of their hosted services and potentially lead to large financial losses. Predicting node failures before they occur is crucial, as it enables DevOps engineers to minimize their impact by taking preventative actions. Such predictions are hard because of challenges such as the enormous size of the monitoring data and the complexity of the failure symptoms. AIOps, a recently introduced approach in DevOps, leverages data analytics and machine learning (ML) to improve the quality of computing platforms in a cost-effective manner. The successful adoption of AIOps solutions, however, requires much more than a top-performing ML model: the solutions must also be trustable, interpretable, maintainable, scalable, and evaluated in context. To cope with these challenges, in this paper we report our process of building an AIOps solution for predicting node failures for an ultra-large-scale cloud computing platform at Alibaba. We expect our experiences to be of value to researchers and practitioners who are interested in building and maintaining AIOps solutions for large-scale cloud computing platforms.

Yangguang Li

York University

Zhen Ming (Jack) Jiang

York University

Canada

Heng Li

Polytechnique Montréal

Canada

Ahmed E. Hassan

Queen's University

Canada

Cheng He

Alibaba Group

Ruirui Huang

Alibaba Group, China

Zhengda Zeng

Alibaba Group

Mian Wang

Alibaba Group

PIN AN CHEN

Alibaba

Session Program

Wed 25 Aug
Displayed time zone: Athens

11:00 - 12:00	Analytics & Software Evolution—Continuous Integration and Delivery Research Papers / Journal First +12h Chair(s): Fiorella Zampetti, University of Sannio, Italy

11:00 10m Paper		Accelerating Continuous Integration by Caching Environments and Inferring Dependencies Journal First Keheliya Gallaba McGill University, John Ewart YourBase Inc., Yves Junqueira YourBase Inc., Shane McIntosh McGill University
11:10 10m Paper		Predicting Node Failures in an Ultra-large-scale Cloud Computing Platform: an AIOps Solution: A Journal First Presentation Proposal Journal First Yangguang Li York University, Zhen Ming (Jack) Jiang York University, Heng Li Polytechnique Montréal, Ahmed E. Hassan Queen's University, Cheng He Alibaba Group, Ruirui Huang Alibaba Group, China, Zhengda Zeng Alibaba Group, Mian Wang Alibaba Group, PIN AN CHEN Alibaba
11:20 10m Paper		Automating Serverless Deployments for DevOps Organizations Research Papers Daniel Sokolowski TU Darmstadt, Pascal Weisenburger TU Darmstadt, Guido Salvaneschi University of St. Gallen
11:30 30m Live Q&A		Q&A (Analytics & Software Evolution—Continuous Integration and Delivery) Research Papers

23:00 - 00:00	Analytics & Software Evolution—Continuous Integration and Delivery Journal First / Research Papers Chair(s): Gustavo Pinto, Federal University of Pará (UFPA) and Zup Innovation

23:00 10m Paper		Accelerating Continuous Integration by Caching Environments and Inferring Dependencies Journal First Keheliya Gallaba McGill University, John Ewart YourBase Inc., Yves Junqueira YourBase Inc., Shane McIntosh McGill University
23:10 10m Paper		Predicting Node Failures in an Ultra-large-scale Cloud Computing Platform: an AIOps Solution: A Journal First Presentation Proposal Journal First Yangguang Li York University, Zhen Ming (Jack) Jiang York University, Heng Li Polytechnique Montréal, Ahmed E. Hassan Queen's University, Cheng He Alibaba Group, Ruirui Huang Alibaba Group, China, Zhengda Zeng Alibaba Group, Mian Wang Alibaba Group, PIN AN CHEN Alibaba
23:20 10m Paper		Automating Serverless Deployments for DevOps Organizations Research Papers Daniel Sokolowski TU Darmstadt, Pascal Weisenburger TU Darmstadt, Guido Salvaneschi University of St. Gallen
23:30 30m Live Q&A		Q&A (Analytics & Software Evolution—Continuous Integration and Delivery) Research Papers