Abstract
Storage area networks (SANs) offer an effective solution to the rapidly growing demand for remote data storage and access. To deliver the desired quality of service, the reliability challenges of SANs must be addressed. A major threat to SAN reliability and performance is cascading failures, where a single incident triggers a chain reaction that causes extensive damage and may even crash the entire system. In this dissertation, we focus on overload-triggered cascading failures, where the overloading of one device (e.g., a switch) causes it to fail and its workload to be reallocated to other devices, which in turn become overloaded, leading to further failures in a domino effect. We first investigate the effects of data loading on the reliability of an individual SAN switch using the proportional-hazards model and the accelerated failure time model. We then investigate the effects of loading on the reliability of an entire SAN through dynamic fault tree and binary decision diagram-based analysis. Furthermore, to enhance SAN reliability, we design proactive, load redistribution-based mitigation strategies that aim to prevent cascading failures during the specified mission time, or at least to alleviate their consequences. Two triggering mechanisms are considered: one based on overall SAN reliability and one based on switch loading. Both load-based and reliability-based node selection rules are explored. Additionally, traffic reallocation strategies are investigated to enhance SAN performance in terms of load balancing and overall response time. The performance metrics of switch utilization, switch response time, and overall response time are analyzed using Jackson queueing networks. The application and effectiveness of the proposed mitigation strategies are demonstrated and compared through detailed case studies of SANs with a mesh topology.
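The overload-triggered domino effect described above can be illustrated with a minimal sketch. It is not the model used in this dissertation: the function name `simulate_cascade`, the fixed capacity threshold, the even split of a failed switch's load among its surviving neighbors, and the four-switch full mesh with its load values are all assumptions chosen for illustration only.

```python
from collections import deque


def simulate_cascade(adjacency, load, capacity, initial_failure):
    """Return the order in which switches fail after one initial failure.

    Illustrative assumptions (not taken from the dissertation):
      * a switch fails as soon as its load exceeds its capacity;
      * a failed switch's load is split evenly among surviving neighbors;
      * if no neighbor survives, the load is simply dropped.
    """
    load = dict(load)          # work on a copy so the caller's data is untouched
    failed_order = []          # failure sequence, in order of occurrence
    failed = set()
    pending = deque([initial_failure])

    while pending:
        node = pending.popleft()
        if node in failed:
            continue
        failed.add(node)
        failed_order.append(node)

        survivors = [n for n in adjacency[node] if n not in failed]
        if not survivors:
            continue
        share = load[node] / len(survivors)
        load[node] = 0.0
        for n in survivors:
            load[n] += share
            # Queue a neighbor for failure once it is pushed past capacity.
            if load[n] > capacity[n] and n not in pending:
                pending.append(n)

    return failed_order


# Hypothetical four-switch full mesh; every switch has capacity 100.
mesh = {
    "A": ["B", "C", "D"],
    "B": ["A", "C", "D"],
    "C": ["A", "B", "D"],
    "D": ["A", "B", "C"],
}
capacity = {name: 100.0 for name in mesh}

heavy = {"A": 90.0, "B": 80.0, "C": 60.0, "D": 40.0}
light = {"A": 60.0, "B": 50.0, "C": 50.0, "D": 40.0}

print(simulate_cascade(mesh, heavy, capacity, "A"))  # ['A', 'B', 'C', 'D']
print(simulate_cascade(mesh, light, capacity, "A"))  # ['A']
```

Under the heavier loading, the failure of switch A pushes B past its capacity, and B's redistributed load then overloads C and D, so the whole mesh is lost. Under the lighter loading, the same initial failure is absorbed by the remaining switches, which is the kind of outcome a proactive load redistribution strategy aims to secure.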