Online transaction processing (OLTP) systems often include high-throughput, high-transaction-rate applications— like order entry and fulfillment — where more transactions per minute and highly available platforms can translate into more satisfied customers and more revenue.
Traditional OLTP platform designs that use complex server configurations with external shared storage arrays to try to improve uptime still struggle with poor overall performance. Alternatively, using high-performance, server-side SSDs like the Micron® 9100 MAX and Microsoft® SQL Server® 2014 AlwaysOn Availability Groups is a simpler solution that can bring new levels of OLTP performance and application durability. In this technical brief, we show how a highly available (HA), two-node SQL Server configuration with two 9100 MAX NVMe™ SSDs per node reached 1.8 million transactions per minute (TPM), and a single-node configuration reached 3.7 million transactions per minute.
Testing 9100 MAX-Equipped Microsoft SQL Platforms
The goal of our testing was to find a simple configuration that delivers as many TPM for OLTP systems as possible. Based on the 9100 MAX’s known I/O capability, we expected high throughput — so we needed a powerful platform to load both the two-node HA cluster as well as the single-node configuration. That’s why we built our load generation platforms using commodity-off-the shelf (COTS) 2U, two-socket servers — each with a pair of Intel® Xeon® E5-2690 v3 CPUs and 256GB of Micron DRAM.
As shown in Table 1, our test database had 5000 warehouses — each consisting of 10 districts of 3000 customers and 100,000 items in stock available to fulfill new orders. This test database (including indexes) required roughly 600GB of memory — which was significantly more than the available system memory, making the storage workload predominantly read-focused.
Table 1: Test Database
We began recording the results for each test run only after the test reached steady-state performance (including a fixed ramp time, duration and exit condition). To help ensure consistent, run-to-run results — and to ensure prior tests did not influence subsequent testing — we began each test by restoring the 9100 MAX SSDs to a fresh-out–of-box (FOB) state.
We measured all results using transactions per minute (TPM) — a common measure of business throughput for databases enabling performance comparisons across database types. We used 193 users, which enabled the platform to reach maximum performance, and allowed us to push the storage and database platform limits.
9100 MAX in a Two-Node, HA Configuration: 1.8 Million TPM and 13.1ms Average Latency
Recognizing that many OLTP deployments rely on HA database clusters, we wanted to understand how a two-node, HA Microsoft SQL cluster (using AlwaysOn Availability Groups) would perform with two 9100 MAX NVMe SSDs per server.
A Simple Configuration
We built our two-node, HA cluster using a pair of basic server platforms: 2U, two-processor COTS servers (each with two Intel Xeon E5-2690 processors and Micron 256GB DRAM). To keep the OS tasks separate from database storage tasks, we installed Windows Server® 2012 R2 to a dedicated boot array in each server’s database, and we added a pair of Micron 1.2TB 9100 MAX NVMe SSDs (for the database, log files and tempdb) in each server. We used Microsoft Storage Spaces for host-based RAID across the pair of 9100 MAX SSDs and linked the database servers via a pair of 40 GbE Ethernet ports. We configured the two nodes as a Windows Server Failover Cluster (WSFC) with:
- Two database servers added as nodes to the WSFC
- A majority node quorum with a file share witness (configured on the domain controller virtual machine)
- A cluster resource group (two server nodes, cluster virtual IP address and virtual network name)
We also used an AlwaysOn Availability Group configuration following Microsoft documentation.1, 2
As shown in Table 2, with a pair of 9100 MAX SSDs and a Storage Spaces RAID 1 configuration in each server, our two-node HA test cluster maintained 1.8 million TPM with an average latency of 13.1ms and a 99th percentile latency of 593ms using the test database with 193 users. It proved to be a very fast, responsive and consistent HA platform!
||Average Latency (ms)
||99th Percentile Latency (ms)
|Two-Node HA cluster
Table 2: Two-Node HA Performance With 193 Users
9100 MAX in a Single-Node Configuration: 3.7 Million TPM and 6.7ms Average Latency
Because HA isn’t a requirement for all deployments, we also wanted to understand the potential performance with a single-node Microsoft SQL OLTP platform and a pair of 9100 MAX SSDs.
Single-Node Configuration Details
For this test, we used essentially half of our two-node HA configuration: a single 2U, two-processor COTS server (with two Intel Xeon E5-2690 processors and 256GB DRAM). As before, we installed the OS to a dedicated hardware RAID 1 boot array and used a pair of Micron 1.2TB 9100 MAX NVMe SSDs in a Microsoft Storage Spaces host-based RAID 1 array for the database, tempdb working space and associated log files.
This single Microsoft SQL Server platform also showed remarkable TPM with low and consistent latency. With the same performance-pushing 193 users, we measured 3.7 million TPM with an average latency of just 6.7ms and a 99th percentile response time of just over 444ms, as shown in Table 3.
||Average Latency (ms)
||99th Percentile Latency (ms)
Table 3: Single-Node OLTP Performance With 193 Users
Understanding Unexpected Results
Initially, the difference between the single-server and HA, dual-server configurations was counterintuitive. So we analyzed network performance by upgrading from a 1x 10 GbE link to 2x 10 GbE links, and then finally to 2x 40 GbE Mellanox® links — with each upgrade having very little impact on application performance. We also examined CPU and 9100 MAX SSD loading, finding that CPU utilization was not being maximized (the mirrored configuration was averaging just 39%), and that the 9100 MAX SSDs were not being taxed. Understanding the average CPU utilization, however, still did not paint a sufficient picture.
To determine what was limiting our HA cluster, we examined CPU loading by core, as shown in Figure 1.
Figure 1: CPU Loading
In particular, note that average CPU utilization was significantly lower on the mirrored configuration with two cores (0 and 23), which showed high utilization. We found that Microsoft SQL Server 2014 has specific threading conditions and limits that directly relate to our use case 3 because:
- Each primary replica uses one log capture thread for each primary database. In addition, it uses one log send thread for each secondary database. Log send threads are released after approximately 15 seconds of inactivity.
- Each secondary replica uses one redo thread for each secondary database. Redo threads are released after approximately 15 seconds of inactivity.
- A backup on a secondary replica holds a thread on the primary replica for the duration of the backup operation.
The Microsoft documentation suggests that we had two threads handling the thread capture and send process from the primary replica to the secondary. The performance of a SQL Server database in an OLTP workload is heavily dependent on the performance of the logging system (known from prior research).
The Bottom Line
OLTP is the backbone workload for many line-of-business applications — from order entry and fulfillment to real-time data acquisition, management and analysis, to large-scale commercial processes. All systems relying on this data need immediate access, with the most uptime possible.
In the past, database (and database storage) administrators wanting to ensure data durability had to suffer with the compromise of a slow, shared storage array or spend hours adjusting configurations to try to reclaim the few IOPS of performance they could find — introducing complexity with little real benefit. Now with the introduction of high-performance NVMe SSDs, like the Micron 9100 MAX, and the simple-to-use data durability features of Microsoft SQL Server 2014, administrators have been freed from traditional, legacy-array burdens.
Using Micron 9100 MAX NVMe SSDs and a pair of simple COTS servers, a dual-node HA Microsoft SQL Server configuration reached 1.8 million OLTP TPM with just 13.1ms average latency. A similarly configured single node (without HA) measured 3.7 million TPM with 6.7ms average latency. These results show that the 9100 MAX SSD can be a cornerstone of high performance OLTP with Microsoft SQL Server — for use in either HA AlwaysOn Availability clusters or in standalone, single-node configurations.