Abstract
The cloud computing paradigm, which allows computing resource requirements to be offloaded to existing data centers, has seen increasing success in recent years across a range of areas such as the Internet of Things, e-commerce, healthcare, and military applications. As developers look to migrate critical applications to the cloud, there has been a pressing need for high reliability and availability in cloud computing. Methods to prevent or recover from unpredictable and apparently non-deterministic bugs (known as Mandelbugs) have been particularly important. In this thesis, we use virtualized software spares and rejuvenation scheduling to maintain a highly reliable cloud-based software platform and combat Mandelbugs in cloud systems. The main challenge of our approach is to develop real-time rejuvenation schedules for interconnected software components whose reliabilities change over time. To address this challenge, we integrate preventive and recovery strategies that mitigate the harmful effects of Mandelbugs by tuning the reliability model of virtual software components and scheduling software rejuvenations in real time. The approach supports the reliability calculation of a primary component with up to two virtual hot spares, where a software component may change state due to its workload or other operating conditions. Furthermore, by using Dynamic Fault Tree analysis, the calculated reliabilities of connected components can be composed to derive the reliability of complex cloud services with many interrelated parts while avoiding the state-space explosion problem. Finally, we present a case study of a cloud-based electronic health records (EHR) system with virtualized software spares. The simulation results show that rejuvenation schedules can be readily generated and updated in real time in typical cloud service scenarios related to software reliability.