TY - CHAP
T1 - Co-designing the failure analysis and monitoring of large-scale systems
AU - Chandra, Abhishek
AU - Prinja, Rohini
AU - Jain, Sourabh
AU - Zhang, Zhi Li
PY - 2008
Y1 - 2008
AB - Large-scale distributed systems provide the backbone for numerous distributed applications and online services. These systems span a multitude of computing nodes at different geographical locations, connected via wide-area networks and overlays. A major concern with such systems is their susceptibility to failures, which lead to service downtime and hence high monetary and business costs. In this paper, we argue that to understand failures in such a system, we need to co-design the monitoring system with the failure analysis system. Whereas existing monitoring systems are not designed specifically for failure analysis, we advocate a new way to design a monitoring system with the goal of uncovering the causes of failures. Similarly, the failure analysis techniques themselves need to go beyond simple statistical analysis of failure events in isolation to serve as an effective tool. Toward this end, we discuss some guiding principles for the co-design of monitoring and failure analysis systems for planetary-scale systems.
UR - http://www.scopus.com/inward/record.url?scp=77956459832&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=77956459832&partnerID=8YFLogxK
U2 - 10.1145/1453175.1453178
DO - 10.1145/1453175.1453178
M3 - Chapter
AN - SCOPUS:77956459832
VL - 36
SP - 10
EP - 15
BT - Performance Evaluation Review
PB - Association for Computing Machinery
ER -