“Microsoft Azure has always been committed to providing highly reliable, available, and recoverable services to our customers. This customer-centric passion is reflected at all levels of the Microsoft Azure Business Continuity Management (BCM) program. At the direction of the Microsoft Board of Directors, we set up a comprehensive risk program in 2007, including the BCM program, to ensure that services can be reliably restored for our customers. For today’s post in our Advancing Reliability series, I asked Robert Arco, the Senior Program Manager who oversees this program, to explain how we approach Business Continuity Management in Azure and how we will continue to improve the program as our platform evolves.” - Mark Russinovich, CTO, Azure
How we define a “service” for our BCM program
If you ask three people what a service is, you might get three different answers. At Microsoft, we define a service (business process or technology) as a means of creating value for customers (first or third party) by enabling the results that customers want to achieve.
To ensure the highest level of reliability, our definition of each service includes:
- People: The people who are responsible for providing the service.
- Process: The methodology used to deliver the service.
- Technology: The tools used to deliver the service, or the technology itself, which is delivered as value.
Customers view our services as product offerings that consist of various bundled services. Each individual service is mapped in our inventory and run through the BCM program to ensure that the people, processes and technologies behind it are resilient to a wide variety of failures.
Our end-to-end program identifies, prioritizes, ranks and tests every service, and it does more than ensure checkbox compliance. We focus on thoroughly understanding how to provide the best service to customers who depend on reliable offerings for their business.
How the BCM program is administered in practice
Through a sophisticated toolset, each service (both internal and external) is uniquely identified and shared with a range of compliance tools that deal with data protection, security, BCM and more. This ensures that each service carries metadata that other tools can use, regardless of its type or criticality.
As part of this process, service records are automatically ported to our BCM administration tool. There they are automatically checked against disaster recovery (DR) requirements that meet certifiable standards and our customer promises. These records contain the familiar elements of a BCM program, including business impact analysis, dependencies, workforce, suppliers, recovery plans, and testing. They also provide insight into potential customer impact, detection capabilities, and readiness for failover.
No set of tools, guidelines, or documents can offer the same level of confidence in service recovery and sustainability as extensive testing. Azure services are tested at various levels, from individual unit tests to complete “region down” scenarios. Every service must present test evidence showing that its recovery meets the defined objectives, both our internal goals and what we guarantee customers in our Service Level Agreements (SLAs). Tabletop tests that only discuss simulated emergencies are not considered acceptable or compliant for our program.
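The pass/fail check described above can be sketched in a few lines of Python. This is an illustrative sketch only, not Microsoft's tooling: the `RecoveryObjectives` and `RecoveryTestEvidence` types, their field names, and the sample numbers are all hypothetical, and it assumes recovery goals are expressed as a maximum recovery time and a maximum data-loss window, both in minutes.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RecoveryObjectives:
    """Hypothetical recovery goals for one service, in minutes."""
    max_recovery_minutes: float   # time allowed to restore the service
    max_data_loss_minutes: float  # window of data that may be lost


@dataclass(frozen=True)
class RecoveryTestEvidence:
    """Hypothetical measurements recorded during a recovery test."""
    service: str
    measured_recovery_minutes: float
    measured_data_loss_minutes: float
    tabletop_only: bool = False  # True if the "test" was only a discussion


def is_compliant(evidence: RecoveryTestEvidence, goals: RecoveryObjectives) -> bool:
    """Return True only when a real test met both recovery goals."""
    if evidence.tabletop_only:
        # Tabletop exercises never count as proof of recovery.
        return False
    return (
        evidence.measured_recovery_minutes <= goals.max_recovery_minutes
        and evidence.measured_data_loss_minutes <= goals.max_data_loss_minutes
    )


goals = RecoveryObjectives(max_recovery_minutes=60, max_data_loss_minutes=5)
drill = RecoveryTestEvidence("example-service", 42, 0)
tabletop = RecoveryTestEvidence("example-service", 10, 0, tabletop_only=True)

print(is_compliant(drill, goals))     # True
print(is_compliant(tabletop, goals))  # False
```

The tabletop case is rejected before any numbers are compared, which mirrors the rule that a discussion-only exercise is never accepted as evidence, however good its figures look.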
Our most robust integrated testing takes place in our “Canary” environment, which consists of two distinct production datacenter regions: one in the eastern United States and the other in the central United States.
Periodically, we test the recovery of services with a full shutdown of a zone or region (simulating a major production outage or catastrophic loss), forcing all services to invoke their recovery plans. These tests not only verify that each service is recoverable, but also exercise our incident response team's processes for managing serious incidents. For availability zones, we test and verify that services remain seamlessly available in the face of a total zone loss. These are end-to-end tests that include detection, response, coordination, and recovery.
All processes, from detection to response and action, are carried out as if a real event were affecting the service. The responders are the normal on-call engineers. In addition, using synthetic workloads, we test capabilities that are the customer's responsibility, such as failing over virtual machines (VMs) to paired regions, to ensure that customer workloads can run in large failure scenarios.
Availability zones – our highest level of seamless availability
As more Azure regions become zone-enabled, our customers gain additional options for SLA-backed high availability and in-region disaster recovery, without needing to fail over outside the region. The benefits include:
- Customers can achieve the highest levels of availability and transparent recovery in a zone failure situation.
- Data is replicated synchronously, so there is no data loss from asynchronous replication to another region.
- No potential for latency caused by the remoteness of a secondary region.
Customers can use in-region high availability, remote multi-region disaster recovery, or both. This “belt and braces” path provides the highest level of assurance that services will remain stable regardless of the impact. Combining the high availability of availability zones with the option to fail over to a remote region protects against even the most catastrophic regional events.
Just as we conduct robust regional disaster recovery testing, we take the same care with our zonal services. In our Canary regions we are able to run end-to-end zone-down drills, which prove our ability to provide the best possible and most reliable service to our customers.
The Microsoft BCM program follows standard industry and government practices: identifying services, calculating impact (recovery time objective or recovery point objective), dependency mapping, detailed disaster recovery plans, and testing of those plans. These plans are reviewed at all levels and verified through extensive end-to-end testing.
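To make the dependency-mapping step concrete, here is a minimal Python sketch of how a dependency map can be turned into a recovery order, so that each service is restored only after the services it relies on. The service names and the `recovery_order` helper are hypothetical and chosen purely for illustration; this is not a description of how Azure's own tooling works. It uses `graphlib.TopologicalSorter` from the Python standard library (Python 3.9 or later).

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+

# Hypothetical dependency map: each service lists the services it depends on.
dependencies = {
    "web-frontend": {"api"},
    "api": {"database", "identity"},
    "database": {"storage"},
    "identity": set(),
    "storage": set(),
}


def recovery_order(deps):
    """Return services ordered so each one comes after its dependencies."""
    # TopologicalSorter treats the dict values as predecessors, so
    # static_order() yields dependencies before the services that need them.
    return list(TopologicalSorter(deps).static_order())


order = recovery_order(dependencies)
print(order)
```

If the map contains a circular dependency, `static_order()` raises `graphlib.CycleError`, which is itself a useful finding: two services that each need the other cannot be recovered in sequence without a documented workaround.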
The program itself has received dozens of industry and government certifications, including ISO 22301, which is the highest standard a program can achieve. So far, Azure is the only cloud service provider to have achieved this certification.
Azure was able to achieve these certifications by ensuring that we have the following elements in place to maintain a successful and value-adding program:
- Leadership support and awareness at all levels.
- Comprehensive guidelines, standards and training documentation.
- Dedicated BCM practitioners with experience in driving a well-engineered program.
- Transparent reporting and gap analysis to support well-founded decisions.
- Comprehensive testing of services to make sure what we measure is correct.
- Modern tools that provide a high level of scalability and enforce compliance with the program.
The Microsoft BCM program is one of the most sophisticated in any industry. Not only has it demonstrated its commitment to meeting regulatory and compliance requirements, but it has also proven to be customer-centric, ensuring highly available and reliable services. Additionally, with availability zones, our customers can get the highest level of transparent service availability without having to perform region-to-region disaster recovery.
While we offer highly resilient customer solutions, our program continues to evolve. In 2021, we expanded our test frequencies and end-to-end test areas to ensure that we can catch any deficiencies and drive corrections to the program. This includes expanded availability zone and region-by-region testing, as well as fault and recovery testing.