Abstract
Although storage area networks (SANs) are dependable storage solutions, they may suffer from cascading failures caused by overloading, which can degrade the quality of service delivered to users or even incur significant financial losses. This work proposes mitigation strategies based on inverse-proportional load reallocation to alleviate the risk of cascading failures. When a SAN switch reaches its workload threshold, reallocation is triggered among a set of switches selected based on their load or reliability levels. The reliability of individual SAN switches is assessed using the accelerated failure-time model, and the overall SAN reliability is assessed using an analytical model based on binary decision diagrams. Through a detailed case study of a mesh SAN, mitigation strategies using static and dynamic workload thresholds are considered and compared. The effects of the step size used in the dynamic schemes are examined, and the inverse-proportional and proportional reallocation strategies are also compared. Based on the observations from the case study, it is recommended that mesh SANs use the dynamic threshold-based scheme with a lower step value and the load-sensitive node selection rule, as this combination is more effective at mitigating the risk of cascading failures due to overload conditions and at improving the overall SAN reliability. Moreover, the load data of switches is more readily available than the reliability data, making the load-sensitive node selection rule more feasible to implement than the reliability-sensitive selection rule.
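As an illustration of the inverse-proportional idea only, the following minimal Python sketch splits the excess load of an overloaded switch among a set of selected switches so that more lightly loaded switches receive larger shares. The abstract does not state the exact reallocation formula, so the weighting by 1/load, the function name `reallocate_inverse_proportional`, and the switch names and load values are all assumptions made for this example, not the authors' method.

```python
def reallocate_inverse_proportional(excess, loads):
    """Distribute `excess` load among the selected switches.

    Assumption (not from the source): each switch's share of the excess
    is proportional to the inverse of its current load, so lightly
    loaded switches absorb more of the excess.

    excess: load to move off the overloaded switch (hypothetical units)
    loads:  dict mapping switch name -> current load (must be > 0)
    Returns a dict mapping switch name -> load after reallocation.
    """
    if excess < 0:
        raise ValueError("excess must be non-negative")
    if not loads:
        raise ValueError("at least one receiving switch is required")
    if any(load <= 0 for load in loads.values()):
        raise ValueError("all current loads must be positive")

    # Inverse-load weights: a smaller load gives a larger weight.
    weights = {name: 1.0 / load for name, load in loads.items()}
    total_weight = sum(weights.values())

    # Each switch keeps its own load and takes its weighted share.
    return {
        name: load + excess * weights[name] / total_weight
        for name, load in loads.items()
    }


# Hypothetical example: switch "s1" carries half the load of "s2",
# so it receives twice the share of the excess.
new_loads = reallocate_inverse_proportional(3.0, {"s1": 2.0, "s2": 4.0})
```

Under these assumed numbers, "s1" takes two thirds of the excess and "s2" takes one third. A proportional strategy would instead weight each switch by its load itself, sending more of the excess to switches that are already busier.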