“One of the foundations of the Azure cloud computing platform is the availability of the Critical Environment (CE) infrastructure that powers and cools the IT infrastructure in our data centers around the world. For today’s post in our Advancing Reliability series, I asked our head of reliability engineering, Randy Kong from our Cloud Operations and Innovation Engineering team, to explain how we identify and mitigate the various risks associated with these critical systems. What follows is an exciting look at the role that Microsoft and our partners are taking on in this area to ensure first-class reliability for the critical applications that our customers and partners run on the Azure platform.” -Mark Russinovich, CTO, Azure
Many factors can affect the availability of Critical Environment (CE) infrastructure: the reliability of the infrastructure building blocks, the controls during the construction phase of the data center, effective system monitoring and event detection schemes, a robust maintenance program, and operational excellence to ensure that each action is taken only after careful consideration of the associated risks.
Microsoft is not only an industry leader in the design, construction, commissioning and operation of highly available CE infrastructure, but also invests in developing and implementing best practices in reliability engineering, with the aim of increasing and maintaining the availability of the CE infrastructure. In this blog, “reliability engineering” refers to the technical discipline that focuses on proactively identifying and assessing time-latent risks that may affect system functionality over the expected useful life, across the various failure modes, effects, and mechanisms. A simple example: when a data center is cooled with more energy-efficient free-air cooling, we ensure that the CE infrastructure still provides an environment with controlled temperature, humidity and air quality. One of the purposes is to prevent corrosion-related failures, such as electrical short circuits or open circuits, which can occur under unsuitable environmental conditions.
The CE for hyperscale cloud data centers typically consists of a mix of off-the-shelf components from different vendors: devices such as uninterruptible power supplies (UPS), power distribution units (PDU), automatic transfer switches (ATS), air handling units (AHU) and generators. This cross-vendor, cross-technology environment leads to intrinsic complexity and to variation in robustness, which depends mainly on the competencies and experience of each vendor. Monitoring the health of the CE infrastructure is also limited, for example because insufficient information is available about these off-the-shelf components. Monitoring is therefore based mainly on detecting “failure effects” at a high level, for example through an electrical power management system (EPMS) or a building automation system (BAS).
However, detection schemes based on failure effects generally make it difficult to quickly identify and locate the offending component, which can delay the recovery of services when faults occur. The scale of global operations adds further challenges: variations in local conditions can place different stresses on the equipment, which calls into question the effectiveness of the typical time-based approach to preventive maintenance. Where temperature, humidity, or air quality vary more significantly, as they do in some geographic areas, a time-based maintenance schedule can either be wasteful or miss the opportunity to “refresh” system reliability, because it does not account for the rate of degradation under the actual stress conditions.
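To make the last point concrete, here is a minimal sketch of how a fixed maintenance interval could be rescaled by a site's temperature and humidity. It uses Peck's temperature-humidity model, a widely used acceleration model for corrosion-type degradation; this post does not say which model Microsoft uses, so the choice of model is an assumption. The function names, the activation energy of 0.7 eV and the humidity exponent of 2.7 are also illustrative assumptions, not values from this post.

```python
import math

BOLTZMANN_EV_PER_K = 8.617333262e-5  # Boltzmann constant in eV/K


def peck_acceleration(temp_c, rh_pct, ref_temp_c, ref_rh_pct,
                      activation_energy_ev=0.7, humidity_exponent=2.7):
    """Relative degradation rate at a site versus a reference site.

    Peck's model assumes rate ~ RH**n * exp(-Ea / (k * T)).
    The default Ea and n are illustrative assumptions only.
    """
    temp_k = temp_c + 273.15
    ref_temp_k = ref_temp_c + 273.15
    humidity_term = (rh_pct / ref_rh_pct) ** humidity_exponent
    thermal_term = math.exp(
        (activation_energy_ev / BOLTZMANN_EV_PER_K)
        * (1.0 / ref_temp_k - 1.0 / temp_k)
    )
    return humidity_term * thermal_term


def adjusted_interval_days(base_interval_days, acceleration):
    """Shorten (or lengthen) a time-based interval by the acceleration factor."""
    return base_interval_days / acceleration


if __name__ == "__main__":
    # Hypothetical sites: a reference site at 25 C / 50% RH,
    # and a warmer, more humid site at 35 C / 70% RH.
    af = peck_acceleration(35.0, 70.0, ref_temp_c=25.0, ref_rh_pct=50.0)
    print(f"acceleration factor: {af:.2f}")
    print(f"adjusted interval: {adjusted_interval_days(180, af):.0f} days")
```

Under these assumptions, a factor above 1 means the equipment degrades faster than at the reference site, so the same calendar interval leaves it exposed for longer; a factor below 1 means the fixed schedule performs maintenance earlier than needed.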
Investing in CE reliability engineering
To address these types of challenges, Microsoft invested in establishing a CE reliability engineering function. Much of the relevant reliability engineering practice and technology has emerged from the electronics industry over the past few decades, and there are numerous opportunities to customize, expand, and innovate on reliability best practices and solutions for data center CE infrastructure. For off-the-shelf CE components, for example, we have sought close partnerships with CE vendors for application-specific reliability risk assessment during our device selection and qualification phase. During the operational phase, we also work closely with our supplier base to drive continuous improvement through in-depth physical and data analysis, from root cause analysis of individual cases to fleet-wide analysis of health and behavior data. These partnerships give both Microsoft and our vendors clear, data-driven insight into relevant reliability trends, areas of potential risk, and underlying factors, so that effective solutions can be initiated, often proactively, before more severe impacts occur. While improving our understanding of potential CE infrastructure risks, reliability engineering also focuses on research and development (R&D) efforts that can lead to more proactive and effective detection of infrastructure health. Both internal and external partnerships have been established to research and develop relevant approaches, from data mining and machine learning to physics-of-failure (PoF) methods.
As Microsoft continues to innovate in the CE space, reliability engineering plays a key role in ensuring the robustness of these solutions by driving both design-in and build-in reliability through early risk analysis and mitigation. For example, the Azure Sphere-based IoT solution is designed to securely collect data from the mechanical and electrical CE systems. Reliability engineering works closely with internal and external product design and manufacturing partners to apply both analytical and PoF-based testing approaches throughout all phases of solution concept, prototyping, design, process development and product delivery. A typical example is the concern about packaging defects that arise in electronics during manufacture or assembly, or over their lifecycle. A simulation tool based on finite element analysis (FEA) was used to identify thermomechanical stress points while the design still existed only as a drawing on paper, because these stress points can lead to failures within the expected service life. These points are then precisely tracked and characterized (e.g. with strain gauges) during environmental stress tests or during manufacturing process steps that can introduce the corresponding thermomechanical stresses. Even if the system is still functional after these stresses, samples are physically cross-sectioned at critical contact points in order to identify the early development of defects. These in-depth analyses enable concurrent changes to the product or process design in order to reduce the probability of failure and thus increase overall system availability.
Similar design-for-excellence (DFX) strategies are also being explored for the complex CE infrastructure itself, to enable proactive risk identification and prevention, wherever possible before the infrastructure is physically deployed. These CE-related investments and technological advances in reliability engineering will help build additional robustness into Microsoft’s CE infrastructure to meet our customers’ availability expectations for a world-class cloud computing service.