
The Limitations of On-Demand Warehouses

I recently spent some time digging into the major cloud-based data warehouse offerings to understand them better. I was specifically motivated by accounts from fairly sophisticated technology professionals whose organizations had found the costs of these systems becoming a real problem, and who were trying to understand what options, if any, they had to address it.

It was surprising (shocking, really) to learn that the underlying issues were almost entirely driven by the application of these data warehouses to use cases for which they are simply not a fit. This happens because, most of the time, users don’t have any clear alternatives. In fact, many seem not to notice when the usage of their data warehouse shifts over time to the point where it is probably no longer the right architecture at all.

I believe that the value of the data comes from using it – a lot. Queries, analysis, training models, anomaly detection, reports, alerts: the more you use the data the better your chances are of discovering new insights. On-demand data warehouses were not built for this, but there are now better architectures that are.

How Did We Get Here?

Understanding the greatness (and shortcomings) of the engineering that went into these cloud-based warehouse architectures requires understanding where we came from. For something like 30 years (1970-2000), application development relied almost exclusively on databases for primary storage. Databases are a great tool for operational/transactional data, and in the age of client/server applications you probably didn’t need anything else. With the advent of the Internet, the growth of each application’s potential reach suddenly meant that it was both possible and valuable to collect data at volumes that had never really been seen before. Logs, events, heartbeats, sensor data, biometrics: we began collecting many kinds of data that get big quickly.

This data cannot simply be thrown into a database (specifically, a row store). The underlying architecture of row stores starts to break down at sizes over 100 million rows in an unpartitioned table, requiring progressively more herculean efforts to scale much beyond that. Even when possible, the cost of this kind of solution is enormous, and out of reach for most organizations. Part of the solution was the development of columnar databases that can handle these data volumes, but the entire concept of the database-as-a-monolith meant that the hardware sized for your peak workload was probably very expensive, and idle most of the time.

The solution to this was to develop cloud-native architectures that separate storage from compute. This allows the underlying dataset to be stored in cheap, flat, file-like storage (such as AWS S3) while paying for compute only in bursts, when the system is actually being used. Users could now do something with these large datasets without having to buy an appliance capable of executing fast queries 24/7 at prohibitive expense. These cloud-based data warehouse architectures laid the groundwork for most systems in use today.

The Players

There are four major players in this space, and I find it fascinating that they have developed solutions that are surprisingly different from one another architecturally.

Snowflake developed an architecture that is the closest to a traditional (columnar) database but with massive optimizations centered around the fact that the data is remote. Using highly optimized retrieval and pre-fetching of data, Snowflake often approximates the performance of a data warehouse with local data. 

Redshift (AWS) is another columnar database, in this case based on PostgreSQL, taking a similar approach to “porting” a data warehouse to the cloud. Redshift tends to be more competitive on price, but lacks some of the optimizations implemented by Snowflake, leading to rough edges in worst-case performance and latency, especially as datasets scale. 

BigQuery (Google) is not an indexed data store, but a query engine that uses massively parallel compute to scan partitioned columnar data efficiently. 

Databricks uses Apache Spark under the covers to process Parquet files at query time. This approach is conceptually similar to BigQuery but relies on more local processing, leading to longer query startup latency in many cases.

Note that Snowflake and Redshift provide traditional database indices for optimized queries while BigQuery and Databricks do not. This typically leads to higher per-query cost for the unindexed data stores.

The Pain of Data Warehouse Pricing

Using these systems in practice means figuring out what they will cost, which often turns out to be appallingly difficult. All of them brag about their costs for storing data at rest, which are usually pretty transparent. But once you start using them to execute queries, it can be difficult or impossible to know what any given workload is going to cost before you do it.

BigQuery has by far the simplest pricing model, so I’m going to use it to walk through a few examples. It costs $6.25/TiB queried (scanned, really) to execute a query. Since BigQuery tables can be partitioned, you can gain efficiencies by partitioning along common query dimensions, which reduces your total query cost. For example, 1 TB (I’m going to ignore the distinction between TiB and TB for now) loaded naively (i.e. unpartitioned) would cost $6.25 to query, as would 10 TB of data partitioned into 10 equal-sized buckets (for a dimension included in the query), as would 1,000 TB of data partitioned into 1,000 equal-sized buckets. This makes it appear that you can get massive cost savings by anticipating the right partitioning dimensions, but in practice this turns out not to be true for most interesting scenarios.
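To make that arithmetic concrete, here is a back-of-the-envelope sketch in Python. It models only the scanned-bytes charge described above, treats TB and TiB as interchangeable, assumes perfect partition pruning, and ignores per-query minimums and other billing details.

```python
# Toy model of on-demand, pay-per-byte-scanned pricing with partition pruning.
PRICE_PER_TB_SCANNED = 6.25  # USD per TB scanned

def query_cost(dataset_tb, num_partitions=1, uses_partition_key=True):
    """Cost of one query: if it filters on the partition key, only the matching
    partition is scanned; otherwise the whole table is scanned."""
    scanned_tb = dataset_tb / num_partitions if uses_partition_key else dataset_tb
    return scanned_tb * PRICE_PER_TB_SCANNED

print(query_cost(1))                              # 1 TB, unpartitioned          -> 6.25
print(query_cost(10, num_partitions=10))          # 10 TB in 10 partitions       -> 6.25
print(query_cost(1000, num_partitions=1000))      # 1,000 TB in 1,000 partitions -> 6.25
```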

Unfortunately, you can only gain these efficiencies when you know in advance what queries you will run. By definition, exploratory scenarios (“I have no idea what I am looking for, but I bet if I look long enough I will find something interesting”) cannot be anticipated or optimized for at design time. This is one area where these data warehouses, and really almost all data platforms, break down.

Just to make this distinction as clear as possible: imagine a scenario where you are executing 10 million queries across a 1 TB dataset. Further imagine that you have partitioned things so efficiently that the effective cost of a query using your partition key is zero. But some percentage, let’s say 50%, of your queries don’t hit those partition keys. In that world you have reduced the $62.5M cost of executing those queries naively to $31.25M for the queries that miss your partitions. Those unanticipated queries completely dominate your costs, which means that unless their share of your workload is ~0%, it will get expensive very quickly. Datastores that also provide indexing do better (they have more dimensions of optimization), but the point remains roughly the same.
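The same toy model makes the effect of missed partitions easy to see. The assumptions are the ones stated above: a 1 TB dataset, $6.25/TB scanned, partition-key queries effectively free, and every other query scanning the full table.

```python
# Blended cost of a 10-million-query workload when some queries miss the partition key.
PRICE_PER_TB = 6.25        # USD per TB scanned
DATASET_TB = 1.0           # full table size
TOTAL_QUERIES = 10_000_000

def workload_cost(fraction_missing_partition_key):
    misses = TOTAL_QUERIES * fraction_missing_partition_key  # these scan the whole table
    return misses * DATASET_TB * PRICE_PER_TB

print(f"${workload_cost(1.0):,.0f}")   # no partitioning at all: $62,500,000
print(f"${workload_cost(0.5):,.0f}")   # half the queries miss:  $31,250,000
print(f"${workload_cost(0.1):,.0f}")   # even a 10% miss rate:   $6,250,000
```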

Just to repeat: that workload, which is a very realistic real-world scenario, would cost over $30 million to execute on BigQuery. Obviously you are never going to do that unless you are discovering the location of a gold mine or solving for the meaning of life. This is the thing about these cloud-based data warehouses: they reduce the cost at rest, but at a significant hit to query execution cost.

In other words, they deliver the opposite of what many companies need today.

The Value Comes from Using the Data

At the end of the day the value of your data comes from using it, which in most cases means looking at it a lot. Insights found and lessons learned will be correlated with the total number of queries you have executed. The rate at which you find those insights will be correlated with the number of queries you execute per second. It doesn’t matter how much data you have stored if you can’t use it, and it doesn’t matter that you can use it if it takes so long that you will be dead before you learn anything.

Thinking about it this way, it becomes obvious that as you use your data more the cost of executing queries against it will completely dominate the total cost, and is therefore the term you need to optimize for. These data warehouses, which were built to optimize the cost at rest at the expense of query execution, are exactly the wrong architecture as you begin to query your data frequently (or continuously).
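A rough illustration of why query cost dominates: using assumed prices (roughly $23/TB-month for warm object storage and the $6.25/TB scan rate from the earlier examples; both numbers are my assumptions, not quotes from any vendor’s price list), storage becomes a rounding error once you run more than a handful of full scans per month.

```python
# Storage cost vs. query cost for a 10 TB dataset as query volume grows.
STORAGE_PER_TB_MONTH = 23.0   # USD per TB-month, assumed object storage price
SCAN_PER_TB = 6.25            # USD per TB scanned, assumed query price

def monthly_cost(dataset_tb, full_scans_per_month):
    storage = dataset_tb * STORAGE_PER_TB_MONTH
    queries = dataset_tb * SCAN_PER_TB * full_scans_per_month
    return storage, queries

for scans in (1, 10, 100, 1000):
    storage, queries = monthly_cost(10, scans)
    print(f"{scans:>4} full scans/month: storage ${storage:,.0f}, queries ${queries:,.0f}")
```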

If you started off using a data warehouse as a place to store your data and generate reports, then found yourself using it a bit more, and a bit more, and are now looking at a monthly bill that is becoming unsustainable, you are not alone! This is just a fact of life with cloud-based data warehouses. Like any good booster rocket, yours may have outlived its usefulness.

Using the Data Requires a Different Kind of Architecture

So: where do you go from here? Let’s start by looking at what a query-optimized architecture would look like. First, you definitely don’t want to separate storage and compute. That separation is what allowed warehouses to reduce storage costs at the expense of latency and cost at query execution time, which is the opposite of what we want if we are going to execute millions (billions?) of queries efficiently.

Second, you want to avoid monolithic architectures. Tuning a system to have the right proportion of CPU/RAM/disk and then scaling horizontally will get better returns for executing many queries over very large datasets than trying to get a bigger box with enough resources to run your queries in one place. It is probably impossible to tune the performance of such a monolithic system well, and it will hit a ceiling once you have provisioned the largest server you have access to.

The final piece of the puzzle is to focus on minimizing the cost per query. Heavily indexed data structures let you leverage disk to do the bulk of the work, rather than highly parallel systems of worker threads that leverage compute. Given the several-orders-of-magnitude difference between the cost of disk and CPU (with its associated RAM), moving as much of the workload to disk as possible is always going to be the winning move. That suggests very highly indexed data storage, which is typically provided by a few NoSQL solutions or search-based data storage. In any case, if you index everything, your NoSQL solution will start to resemble search regardless.
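As a minimal sketch of what “index everything” buys you, here is a toy store (not any particular product’s implementation) that maintains an inverted index for every field of every record, so an ad hoc filter on any field is a lookup rather than a scan.

```python
from collections import defaultdict

class IndexedStore:
    """Toy record store that indexes every field on write."""
    def __init__(self):
        self.records = []
        # field -> value -> set of row ids
        self.indexes = defaultdict(lambda: defaultdict(set))

    def insert(self, record):
        row_id = len(self.records)
        self.records.append(record)
        for field, value in record.items():   # index every field, no schema-time choices
            self.indexes[field][value].add(row_id)
        return row_id

    def find(self, field, value):
        """Cheap lookup on any field, with no scan and no partition key needed."""
        return [self.records[i] for i in self.indexes[field][value]]

store = IndexedStore()
store.insert({"host": "web-1", "status": 500, "region": "us-east"})
store.insert({"host": "web-2", "status": 200, "region": "eu-west"})
print(store.find("status", 500))   # works even though nothing was "partitioned" by status
```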

Query-driven Architecture

The kind of architecture we are talking about addresses many of the core concerns discussed earlier. Exploratory analytics, in which you execute ad hoc queries with no advance knowledge of which fields will be used, is no problem on a highly indexed search cluster. Queries will return in approximately the same time regardless of the fields queried, unlike data warehouse solutions that ultimately rely on pre-configured partitions to guarantee performance. Not only can you not anticipate all of the dimensions of your queries in your partitioning structure, some data types do not lend themselves to partitioning at all. A good example is text, which simply cannot be used as a partition key in most architectures, yet is necessary for most interesting real-world systems.

Moving to a query-driven architecture can drive the cost per query down to 1% (or less!) of what it would take to execute using an on-demand data warehouse. This enables a number of powerful use cases:

  • Near real-time analysis: important for applications as diverse as live advertising campaign performance and feedback-based machine learning 

  • System synchronization and federation of large data systems: useful for data co-ops and real-world defense scenarios

  • High user concurrency: required for social networks, gaming and collaborative content generation

  • High-frequency, low-latency updates: allows session-based analytics and live risk assessment for many domains

  • Change-based alerts: enables performance analytics and threat detection

  • Many more

These are capabilities that simply cannot be delivered using a data warehouse-based architecture.

Query-driven Architecture with MinusOneDB

We believe that architectures supporting very high query throughput on large datasets are the next generation of data systems.

To enable this we developed a unique storage platform that uses distributed search as its primary storage, and we index everything. This provides true horizontal scalability and maximally tuned query performance, leveraging disk-based indexes for all fields, including text. While this approach makes writes more expensive (due to the many index updates required), this is almost always the correct tradeoff: massive write throughput tends to be an occasional operation (e.g. a complete data store rebuild) rather than the major daily workload of the cluster.
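The write-side tradeoff can be illustrated with the same toy model as before: when every field is indexed, one logical insert fans out into one index update per field plus the record itself. The numbers below are purely illustrative and not a description of MinusOneDB’s internals.

```python
def physical_writes_per_insert(num_indexed_fields):
    # one posting-list update per indexed field, plus writing the record itself
    return num_indexed_fields + 1

for fields in (5, 20, 100):
    print(f"{fields:>3} indexed fields -> {physical_writes_per_insert(fields)} writes per insert")
```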

Using our implementation of this architecture we’ve demonstrated that it is possible to provide constant-time reads and writes at any scale for linear cost. This is a big leap. The cost reductions and speed improvements are order-of-magnitude level, making this approach a strategic breakthrough with business implications for a very wide swath of organizations and use cases.

As fast as this approach is out of the box (most queries complete in under a second), we can also “overclock” the distributed search architecture for further speed improvements.

We can also reduce the time cost of write operations by applying significantly more CPU to batch write workloads, minimizing wall-clock time. Essentially, we have reversed the choices made by on-demand data warehouses: we pay for CPU at write/build time in order to minimize cost (and maximize performance) at query time.

Horizontal architectures like this pose their own operational challenges: balancing workloads while maintaining reliability and enterprise-grade stability in a data store founded on distributed search is not trivial. Managing these systems is also complex, as there are many moving parts that need to coordinate. We have invested person-decades developing and tuning systems like these in order to deliver the underlying infrastructure for these capabilities with the simplicity of a monolithic architecture.

As complex as the challenges behind the scenes are, the experience for the developer is simple – a single “device” that provides massive horizontal scalability and cost optimization for a query-based workload.

When Should You Make the Switch?

Every organization has its own challenges depending on what stage it is at with collecting, organizing and leveraging data to power its business. For many organizations using a data warehouse allowed them to begin making their data usable at scale for the first time. Reaching this stage is transformative as the fog of war begins to clear, possibly for the first time. However, as the organization collects richer and broader datasets, and as it becomes clear that there are insights to be discovered among the unknown unknowns, query volume and usage will increase. That is the time to move to a query-driven architecture to allow you to leverage the data that you have painstakingly collected to the fullest extent.

None of this is easy; we know, we’ve been there! At the end of the day we believe that making people smarter is the best way to make the world better. Right now too many people are spending too much time fighting with data infrastructure instead of doing what they do best. We’re going to help fix that. Reach out to me at willliam@minusonedb.com and try out a query-driven architecture here.


Author

William Wechtenhiser

Jul 27, 2024
