When I reviewed Amazon Aurora last October, I noted that its level of performance was far beyond any I had seen from other open source SQL databases, and it did so at a far lower cost than you would pay for an Oracle database of similar power. Considering that Aurora is a drop-in replacement for the ubiquitous MySQL, Amazon certainly had a winner on its hands.
In an investor call late in 2015, Phil Hardin, director of investor relations at Amazon.com, said the Aurora relational database engine was a technological high point for AWS, and it was “our fastest-growing service ever.” As we’ve long known in tech, pioneers usually wind up with arrows in their backs.
The arrow in Aurora’s back I’ve now evaluated is DeepSQL, another drop-in replacement for MySQL. Deep Information Sciences claims that DeepSQL is faster than Aurora by multiples, depending on the task, and I challenged them to prove it.
How DeepSQL works
DeepSQL uses data structures that are a departure from the b-trees (more or less read-optimized) and log structured merge trees (more or less write-optimized) of classic (circa 1970s) databases. Taking advantage of the evolution in computer architecture over the years (more and faster RAM, more and faster CPUs, faster disks), DeepSQL constantly tries to reach the theoretical minimum disk seek costs: writes should have a seek cost of 0, and reads should have a seek cost of 1. To write without seeking, you can only append to the data log file, then append to the index log files. To read with anything close to one seek, you need to keep indexes and caches and summarizations in memory.
DeepSQL tries to be lockless for reads and writes at the system level, use algorithms that don’t context-switch, and keep its information in logical form that is represented no bigger than it actually is (think compression). To achieve all that — no mean feat — DeepSQL separates its algorithm behavior from its data structures; splits memory and storage into independent structures; introduces kernel scheduling techniques to utilize hardware; introduces a layer to observe and adapt to workloads/resources; uses machine learning to reshape structure and schedule resources; and performs dynamic and continuous online calibration to optimize performance and minimize seeks.
On disk, a DeepSQL data store is little more than a set of value and index log files. The adaptive data structures it creates in memory, however, are a bit different, as shown in the figure below.
The algorithms that manage the engine observe and analyze resources, the workload, and the schema, cardinality, compressions, and distribution of the data. Unsupervised machine learning constantly predicts the optimal resource scheduling and dynamic structure. This continuous online tuning enables maximum resource utilization and adapts dynamically when resources are added or removed, as can happen in a virtualized environment. Deep Information Sciences calls this CASSI, Continuous Adaptive Sequential Summarization of Information.
It took Thomas Hazel, the founder and CTO of Deep Information Sciences, four rather complex diagrams and a half-hour of rather geeky discussion to explain to me how CASSI actually works. I won’t post the diagrams here, as I couldn’t do justice to explaining them, but instead I’ll refer you to the somewhat simpler explanation on the company’s website.
Advantages and disadvantages
There are a number of ways DeepSQL can “shine,” although you won’t really see them if you are running the software on your laptop with a few gigabytes of data. Like Aurora, DeepSQL is best for large data sets running at scale on serious hardware.
One advantage is that DeepSQL can simultaneously run a high transactional load and a high analytic query load — what’s known as Hybrid Transactional/Analytical Processing (HTAP) — without degradation in performance. With DeepSQL you don’t have to move the data to an OLAP database or data warehouse to perform analytics for a high-volume e-commerce site.
Another advantage is that DeepSQL can both ingest data at high speed and handle complicated queries against it using many indexes, the ingest-and-query scenario you tend to see in IoT and bioinformatics. Because of the self-tuning, you don’t need DBAs to optimize the database operation, unlike with MySQL. Take a gander at the figure below:
This is only the first of four pages of tuning parameters for Aurora — although unlike a common or garden MySQL distribution, Aurora comes with good defaults. DeepSQL has one parameter for its engine, the cache size, and you don’t normally touch it: Cache size is adjusted dynamically through CASSI.
That isn’t to say there’s nothing else to touch. The Deep engine coexists with the normal InnoDB and MyISAM engines inside of a MySQL shell. There are times when you will want to configure a table to use the InnoDB engine instead of the Deep engine, primarily because the Deep engine currently lacks support for geographic, full-text, and XA queries, as well as lacking foreign key constraints. How often you’ll use InnoDB will depend on your application.
While there are another half-dozen ways that DeepSQL can claim superiority, I want to mention only one more: fast restarts. In the course of the TPC-H benchmarks that I’ll describe in the next section, it was necessary to stop and restart both DeepSQL and Aurora between queries. We waited more than four minutes for Aurora to restart each time and only a few seconds for DeepSQL to restart.
Benchmark conditions
Deep Information Sciences has been advertising DeepSQL’s advantages over plain MySQL with the InnoDB engine. The goal with this set of benchmarks was to measure DeepSQL’s performance advantages over Aurora for two of the use cases where the Deep engine shines: ingestion followed by a query in the presence of many indexes, and complex business-oriented analytic queries.
Deep Information Sciences proposed, and I approved and supervised, two benchmarks: iiBench, an index/insert benchmark simulating IoT, created by Facebook for random index data distribution; and DBT3 (TPC-H), a decision support benchmark that consists of a suite of business-oriented ad-hoc queries and concurrent data modifications. In addition, iiBench has characteristics of interest to the bioinformatics community, which tends to load a lot of data, then run a complicated query against it.
For iiBench, we specified the default schema and configuration plus three additional indexes, totaling seven indexes; a 16GB cache; and 25 clients making up to 250 million rows. The first phase of iiBench is insert-only with secondary indexing; the second phase, done after a database restart, is a secondary query:
select * from purchases_index limit 10;
select * from purchases_index where productid = <first productid value from prior select>;
For DBT3 (TPC-H), we specified the bog standard benchmark and a 16GB cache.
Since Aurora runs on Amazon only, we picked an Aurora machine size (db.r3.4xlarge) and matched it as closely as possible with an EC2 machine (r3.4xlarge). In both cases, the database had 16 vCPUs, 122GB of RAM, and SSD storage. We also had an m4.4xlarge EC2 client in the same zone driving the benchmarks using the mysql
command line.
Benchmark results
For iiBench ingestion, Aurora took a total of 2.18 times as long as DeepSQL. However, the speed advantage to DeepSQL increased with the total row count.
For the postingestion iiBench query, Aurora took almost 34 minutes to return an answer, and DeepSQL took a little more than 3 minutes, an improvement of more than 10-fold.
For TPC-H, Aurora completed the 22 queries in 1,762 seconds, while DeepSQL took 875 seconds, a twofold speedup. I should note there were glitches on the Aurora side: we had to rerun queries No. 4 and No. 5 to remove excessively long outliers. Neither Deep Information Sciences nor I know why Aurora occasionally takes longer than normal for complex SQL queries, but we didn’t want to make Aurora look bad.
Not every query was a win for DeepSQL, but some were big wins. As you can read in the figures below, DeepSQL completed queries 1, 3, 4, 6, 10, and 11 in about one-fifth of the time that Aurora took, while Aurora was about three times faster for queries 17, 19, and 20. The other queries varied, with DeepSQL taking queries 12, 14, and 22 by about 3.5 times; queries 7, 15, 16, and 18 by 2 to 2.5 times; and queries 2, 5, 8, and 21 by 1.5 times.
Overall, for the two use cases we tested, DeepSQL is the overall winner by at least a factor of two. As you saw in the iiBench trends, DeepSQL’s speed advantage increases as the database becomes larger.
Remember that we bent over backward to make the hardware environments comparable. We also kind of tied one of DeepSQL’s hands behind its back by fixing the value of its cache (to maintain an even playing field) instead of letting it adjust the cache dynamically for optimal performance.
Given that DeepSQL is priced similarly to or less than Aurora for online use and runs on-premise as well as in other clouds, I have to give the nod to DeepSQL based on these benchmarks. I am obligated to point out, however, that benchmarks don’t always reflect real life, however well you try to construct them. If you currently have database loads that benefit from Aurora, it might be worth your while to try them out on DeepSQL to see whether there would be an advantage for you to switch.
What would that experiment cost you? Because DeepSQL is free for development and test purposes, you’d have to download DeepSQL, load it up with your databases on serious hardware, write a few test scripts, and run your typical query patterns. If you run your test on Amazon, you’ll be looking at roughly $1 per hour for EC2 time on the kind of instance we used; my experience is that setting up and running this kind of test takes about a work week, so call it about $120, plus your labor costs. If you run your test on-premises, then assuming that you have servers idle, the cost will be even less.
That sounds like a plan to me. I’d be interested to hear what you find out.
InfoWorld Scorecard | Management (25%) | Performance (25%) | Availability (20%) | Scalability (20%) | Value (10%) | Overall Score (100%) |
---|---|---|---|---|---|---|
DeepSQL 5.6 | 9 | 10 | 10 | 9 | 10 |
By Martin Heller