Azure SQL Data Warehouse: New Features and New Benchmark
Microsoft’s drive to put Azure SQL Data Warehouse on more equal footing with SQL Server is finally paying off.
Azure SQL Data Warehouse gets less press than its online transaction processing brethren, Azure SQL Database and Azure Cosmos DB. However, it is a powerful cloud engine for processing large volumes of data, and offers a rich set of connectors to the rest of the Azure Big Data services like machine learning and Azure Databricks.
In recent months, Microsoft has made significant feature additions to the platform and improved the performance of the service.
If you aren’t familiar with Azure SQL Data Warehouse, it is what’s known as a massively parallel processing (MPP) system. MPP systems have been around for a long time, but have traditionally been very expensive and required large hardware investments; vendors in this space include Teradata, Netezza and Microsoft with its Analytics Platform System offering. In this model, table data is distributed across compute nodes and results are combined at the head (or control) node. It is a model optimized for large-scale loading of data, as well as reporting.
As you may know, data warehouses are broken down into fact and dimension tables. Fact tables are commonly generated in a transactional system (think point of sale) and then loaded into the data warehouse. Dimension tables contain attributes such as dates or product names. These may change infrequently and are typically much smaller than the fact tables.
In Azure SQL Data Warehouse, fact tables are distributed across nodes using a hash column, while smaller dimensions are replicated to all nodes and larger dimensions use the same hash distribution. The goal of this is to minimize data movement between nodes; while it is extremely fast to read data from the nodes, having to do cross-node look-ups is very costly, so designs typically aim to minimize this.
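As a sketch of how this looks in DDL (the table and column names here are hypothetical), a fact table declares a hash distribution column, while a small dimension is replicated to every node:

```sql
-- Hypothetical fact table, hash-distributed on a join key
CREATE TABLE dbo.FactSales
(
    SaleId    BIGINT         NOT NULL,
    ProductId INT            NOT NULL,
    SaleDate  DATE           NOT NULL,
    Amount    DECIMAL(18, 2) NOT NULL
)
WITH
(
    DISTRIBUTION = HASH(ProductId),
    CLUSTERED COLUMNSTORE INDEX
);

-- Small dimension, copied to every compute node so joins
-- against it never require cross-node data movement
CREATE TABLE dbo.DimProduct
(
    ProductId   INT           NOT NULL,
    ProductName NVARCHAR(100) NOT NULL
)
WITH
(
    DISTRIBUTION = REPLICATE,
    CLUSTERED COLUMNSTORE INDEX
);
```

Choosing the hash column matters: distributing the fact table on the same key the dimension joins on keeps related rows co-located, which is exactly the data-movement minimization described above.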
Introducing Azure SQL Data Warehouse Gen2
There have been steady performance improvements in Azure SQL Data Warehouse since the product was introduced. The first was moving from standard Azure storage to premium (SSD), which happened pretty early in the lifecycle of the service.
Last year, Microsoft also announced Gen2 of the hardware for the product: the “Compute Optimized” tier, which caches data on super-fast local NVMe drives while still storing the larger volume of data on networked premium storage. This allows the service to deliver up to 2 GB per second of local I/O bandwidth, and up to a 5x improvement in query performance.
One of the early complaints about SQL Data Warehouse was that many of the features Microsoft had included in recent editions of SQL Server hadn’t made their way into the service. Microsoft has addressed that with a number of key features.
Query Store
The query store is my favorite addition to the SQL Server portfolio in recent years. Introduced in SQL Server 2016, the query store is a data collection utility that captures query execution plans and runtime performance information, and stores them in the database for later examination and review.
In early incarnations of the service, it was extremely challenging to gather execution plans. The addition of the query store gives administrators the ability to review what happened after the fact, rather than only during live execution.
This data can also be used to isolate issues with changes in data that may require updates to column statistics. Most importantly, the query store includes an easy-to-use interface that is built into SQL Server Management Studio.
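A minimal sketch of using the query store, assuming a warehouse named MyWarehouse (the name is hypothetical); the catalog views are the same ones SQL Server exposes:

```sql
-- Enable the query store on the data warehouse
ALTER DATABASE MyWarehouse SET QUERY_STORE = ON;

-- Review recently captured queries and their runtime statistics
SELECT TOP 10
       q.query_id,
       t.query_sql_text,
       rs.avg_duration,
       rs.last_execution_time
FROM sys.query_store_query AS q
JOIN sys.query_store_query_text AS t
    ON q.query_text_id = t.query_text_id
JOIN sys.query_store_plan AS p
    ON q.query_id = p.query_id
JOIN sys.query_store_runtime_stats AS rs
    ON p.plan_id = rs.plan_id
ORDER BY rs.last_execution_time DESC;
```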
Resource Classes
The ability to allocate server resources based on who is running a query can be important in enterprise workloads. While SQL Data Warehouse has always had an element of workload management through resource classes that could be defined within a session, there was no dynamic control.
In Gen2, resource classes are based on a percentage of memory in proportion to the service level. This means that if you scale your SQL Data Warehouse up or down, the amount of memory in each class changes (the small resource class excepted).
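Resource classes are implemented as database roles, so assigning one is simply a role-membership change; the user name below is hypothetical:

```sql
-- Give a loading user a larger memory grant for its queries
EXEC sp_addrolemember 'largerc', 'LoadUser';

-- Later, drop the user back to the default small resource class
EXEC sp_droprolemember 'largerc', 'LoadUser';
```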
Resource Governor
Another recent introduction is resource governor, a feature that dates back to SQL Server 2008. It allows administrators to control I/O, memory and CPU with a high degree of granularity, based on the incoming user ID or calling program. This feature, along with row-level security (RLS), can be particularly useful for managing multitenant environments.
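For reference, this is the general shape the feature takes in SQL Server; the pool, group, function and login names below are all hypothetical:

```sql
-- Cap CPU and memory for a pool dedicated to reporting sessions
CREATE RESOURCE POOL ReportingPool
WITH (MAX_CPU_PERCENT = 30, MAX_MEMORY_PERCENT = 25);

CREATE WORKLOAD GROUP ReportingGroup
USING ReportingPool;
GO

-- Classifier function routes incoming sessions by login name
CREATE FUNCTION dbo.fn_classifier()
RETURNS SYSNAME
WITH SCHEMABINDING
AS
BEGIN
    IF SUSER_SNAME() = N'report_user'
        RETURN N'ReportingGroup';
    RETURN N'default';
END;
GO

ALTER RESOURCE GOVERNOR
WITH (CLASSIFIER_FUNCTION = dbo.fn_classifier);
ALTER RESOURCE GOVERNOR RECONFIGURE;
```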
Row-Level Security
The name of this feature is somewhat self-explanatory: RLS uses a predicate function, typically based on a client or user ID, to limit access to specific rows. RLS was one of many new security features introduced in SQL Server 2016.
I have personally run into several customer scenarios where the customer wanted to host a multitenant data warehouse in SQL Data Warehouse and was blocked by the lack of RLS. While it was possible to implement multiple tenants without RLS, doing so required re-architecting the schema, which sometimes came at the expense of performance.
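A minimal RLS sketch for a multitenant warehouse, assuming a hypothetical tenant-to-user mapping table dbo.TenantUsers and fact table dbo.FactSales:

```sql
-- Inline table-valued predicate function: a row is visible only
-- when its TenantId maps to the current database user
CREATE FUNCTION dbo.fn_tenant_predicate (@TenantId INT)
RETURNS TABLE
WITH SCHEMABINDING
AS
RETURN
    SELECT 1 AS fn_result
    FROM dbo.TenantUsers AS tu
    WHERE tu.TenantId = @TenantId
      AND tu.UserName = USER_NAME();

-- Bind the predicate to the fact table as a filter, so every
-- query against it is silently restricted to the caller's tenant
CREATE SECURITY POLICY dbo.TenantFilter
ADD FILTER PREDICATE dbo.fn_tenant_predicate(TenantId)
ON dbo.FactSales
WITH (STATE = ON);
```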
Support for Visual Studio Database Projects
Visual Studio has long offered database projects as a way to manage data definition language (DDL) source code for both SQL Server and Azure SQL Database as part of SQL Server Data Tools (SSDT).
This is a key piece of tooling especially for larger organizations with many developers working against the same database, and allows for better source control integration. This support allows for a unified DDL deployment process that integrates with SQL Data Warehouse, and eases refactoring or making changes to the schema. This especially benefits SQL Data Warehouse because of its broad use of external tables for ETL processing.
All of these feature enhancements demonstrate that Microsoft continues to make major investments in Azure SQL Data Warehouse. Microsoft (in conjunction with GigaOm) recently announced a benchmark against Google BigQuery and Amazon Web Services (AWS) Redshift demonstrating a much better price-performance ratio than those two competitors.
This benchmark, along with the added features and the exciting product roadmap, continue to make Azure SQL Data Warehouse an interesting offering for business intelligence workloads in Azure.
Joseph D’Antoni is an Architect and SQL Server MVP with over a decade of experience working in both Fortune 500 and smaller firms. He is currently Principal Consultant for Denny Cherry and Associates Consulting. He holds a BS in Computer Information Systems from Louisiana Tech University and an MBA from North Carolina State University. Joey is the co-president of the Philadelphia SQL Server Users Group. He is a frequent speaker at PASS Summit, TechEd, Code Camps, and SQLSaturday events.