Abstract
As a fault-tolerant data storage model, Cloud-RAID (Redundant Array of Independent Disks) uses diverse redundancy techniques to enhance data reliability and availability in an anytime, anywhere data access framework implemented in the cloud environment. The performance and operation of the physical disk drives in a cloud-RAID storage system are managed by, and therefore depend on, a controller. In addition, each disk may exhibit multiple performance levels, ranging from perfect function to complete failure. This inter-component dependence and multistate behavior pose challenges to the reliability analysis of cloud-RAID storage systems. Further complicating the analysis is imperfect coverage (IPC) behavior, where, due to a malfunction of the system recovery mechanism, an uncovered disk fault may propagate and cause extensive damage to the whole system despite the presence of adequate redundancy. Failure to address this behavior leads to inaccurate, often overestimated system reliability results, which in turn mislead system design, operation and optimization activities. Existing works on cloud-RAID system reliability have typically assumed fully reliable fault detection and recovery mechanisms (i.e., perfect fault coverage), an assumption that rarely holds in real-world systems. This dissertation research relaxes this assumption through decision diagram-based combinatorial approaches for the reliability analysis of cloud-RAID storage systems subject to IPC. The proposed methods are applicable to homogeneous or heterogeneous disks with arbitrary types of time-to-failure distributions.
Two different IPC models are considered: element level coverage (ELC), where the effectiveness of the system recovery mechanism, and thus the fault coverage probability, depends on the occurrence of each individual disk fault, and fault level coverage (FLC), where the coverage probability depends on the number of disk faults occurring in a particular group within a certain recovery window. Both binary-state and multistate disk and system models are addressed. Cloud-RAID 5 and cloud-RAID 6 systems are analyzed as case studies to illustrate the effects of dependence, multistate behavior and imperfect coverage. This dissertation research also considers the cloud provider selection problem under both ELC and FLC, which finds the combination of cloud disk providers that minimizes system unreliability or minimizes system cost. Both unconstrained and constrained optimization problems are considered. Several case studies are performed to illustrate the proposed optimization problems and solution methods.
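
To make the effect of imperfect coverage concrete, the following minimal sketch compares system reliability with and without perfect coverage under a simplified ELC model. It is an illustration only and is not the decision diagram-based method developed in this dissertation: it enumerates disk states by brute force, assumes binary-state and statistically independent disks, and ignores the controller dependence. The function name `elc_reliability`, the failure probabilities `q`, the coverage factors `c` and the parameter `tolerated` are all hypothetical names and values chosen for the example.

```python
from itertools import product


def elc_reliability(q, c, tolerated):
    """Reliability of a redundant disk group under a simplified ELC model.

    q[i]      : failure probability of disk i (assumed, illustrative value)
    c[i]      : coverage factor of disk i, i.e. the probability that a
                fault of disk i is detected and recovered (assumed)
    tolerated : number of covered disk failures the group survives
                (1 for a RAID 5-like group, 2 for a RAID 6-like group)

    The system survives only if no fault is uncovered and the number of
    covered faults does not exceed `tolerated`.
    """
    total = 0.0
    # Enumerate every combination of failed (True) / working (False) disks.
    for failed in product((False, True), repeat=len(q)):
        if sum(failed) > tolerated:
            continue  # too many failures even if all of them are covered
        prob = 1.0
        for i, is_failed in enumerate(failed):
            # A failed disk contributes only its covered-failure probability;
            # any uncovered failure brings the whole system down.
            prob *= q[i] * c[i] if is_failed else 1.0 - q[i]
        total += prob
    return total


# Four identical disks, one tolerated failure (RAID 5-like group).
q = [0.1] * 4
perfect = elc_reliability(q, [1.0] * 4, tolerated=1)
imperfect = elc_reliability(q, [0.9] * 4, tolerated=1)
print(f"perfect coverage:   {perfect:.5f}")
print(f"imperfect coverage: {imperfect:.5f}")
```

With these assumed values, the perfect-coverage result is higher than the imperfect-coverage result, which is the overestimation that arises when IPC is ignored. The enumeration grows exponentially with the number of disks, so it is suitable only for small illustrative groups.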