data lakes: old is new and no free lunch, rinse and repeat

I recently watched a few videos from the dremio-sponsored data lake conference.

It's a good collection of videos about a relatively new topic: data lakes, an architectural focal point for data management.

Some people think data lakes are new, especially vendors selling you data lake tools and consulting. The new hotness is "separating compute and storage," although that's been going on for nearly four decades. And even though data lakes are the new hotness, rumors suggest they are hard to deliver and that ROI is hard to show. There are many reasons this may be true. We should step back and look at data lakes. Data lakes are nothing new, but their implementations have changed.

Let's start with a bit of history, around the late 80s and early 90s, when data warehouses roamed the earth. Data warehouses were hot until they weren't.

Data warehouses were a universal answer to a variety of data management and organizational problems. Today, most people love to make the data warehouse the bogeyman. Data warehouse projects became widow-makers for IT managers. It was always unfair to ask IT managers to smooth over differences in priorities, delivery speeds, and data/analytical needs across different divisions. Although my point of view is not widespread, after many years helping companies with their analytics, it's clear that IT is the wrong place to produce a wide range of analytical products consumed by a wide range of users. Budgets for analytics should be borne by those who need the analytics. A few "data products" can be consolidated into a shared service group like IT for cost efficiency. Where there is a common need or a cost-control mandate, sure, IT may be an OK place to do these things, but in general, it is not and never will be. That's just the way business works.

At least in my world, a data warehouse's inputs and outputs were almost always provided to different data consumers–the data warehouse itself was not the only physical data asset. But this approach and point of view was not the standard design. Data warehouses became hard-to-use silos almost *by definition*. One client hired me to find out why a data warehouse had no users. The primary user said the IT group had turned off his access, and the warehouse did not have the data he needed anyway. Case closed! Many IT managers wanted to control these files to enforce "one version of the truth," but it is not efficient to force IT to own these business issues. You do need one particular place to go for a business measure, but it is not necessarily IT's job to own and publish it.

By providing inputs and outputs, a data warehouse became a "cache" of pre-computed values. Whether it was a database table, a cube, or another proprietary data structure, there was always a cache. It is usually too expensive to recompute a result from raw source data every time. Storage and compute may be cheap, but they are not free. Caching is not a technical issue; think economics. The caches are more convenient and less costly to access. Even in a cloud environment, there is a cost to recompute from the raw data. To build a cache, you have to specify what you want before you need it. Even with automatic caching, you need to be thoughtful. And in the cloud, incremental work is often not capitalizable.

Data virtualization, mostly on-premise, came later, in the late '90s and early 2000s. You could combine data from any source (raw source data, data warehouses, downstream extracts, Excel files on your desktop) and query it without having prepared the data beforehand. Of course, to get anything useful, you would have to reproduce many of the same business data processing steps you need regardless of your data management approach. In some scenarios, this was a huge step forward. The pharmaceutical industry, with vast amounts of unintegrated data and complex formats such as those found in clinical trials, really benefited from this approach. Interestingly enough, to get good, fast results, data virtualization tools always had a giant cache in the middle, along with a query planner and execution engine.
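As a toy illustration (not any vendor's implementation; the sources and schema are invented), the "cache in the middle" pattern can be sketched with Python's standard library: two unintegrated sources are staged into an in-memory SQLite engine, and only then does a single query across them become possible:

```python
import csv
import io
import sqlite3

# Two hypothetical unintegrated sources: a CSV extract and an in-app record set.
csv_extract = io.StringIO("patient_id,site\n1,boston\n2,berlin\n")
trial_results = [(1, "responder"), (2, "non-responder")]

# The "giant cache in the middle": an in-memory engine both sources are staged into.
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE sites (patient_id INTEGER, site TEXT)")
db.execute("CREATE TABLE results (patient_id INTEGER, outcome TEXT)")
db.executemany("INSERT INTO sites VALUES (?, ?)",
               [(int(r["patient_id"]), r["site"]) for r in csv.DictReader(csv_extract)])
db.executemany("INSERT INTO results VALUES (?, ?)", trial_results)

# Now one query can span sources that were never integrated upstream.
rows = db.execute("""
    SELECT s.site, r.outcome
    FROM sites s JOIN results r ON s.patient_id = r.patient_id
    ORDER BY s.patient_id
""").fetchall()
# rows == [("boston", "responder"), ("berlin", "non-responder")]
```

The staging step is exactly the hidden cache the paragraph describes: the tool still had to land the data somewhere before the query planner could do its work.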

Enter the cloud and data lakes.

A data lake is a set of inputs and outputs. For some, it is a cache of intermediate computations. For others, it is a source of raw information. It usually has data in several formats for tool convenience. It often has some metadata, lineage, and other "management" features to help navigate and understand what is available. Typically, a wide variety of tools is available that works with several (though not an infinite number of) data formats. When these types of features are essential to your users, a data lake makes sense.

Today's data lake companies are trying to convince you that data warehouses are evil. In many ways, I agree with them, because most data warehouses were designed wrong. However, the thinking and effort that go into a data warehouse never really go away. Even in a cloud environment, you still pretty much have to do the same work to build a "thing with a cache in the middle." At some point, you have to specify what you want to do to the data to make it ready for use. Business intent and processing are inevitable. There is no free lunch.

Fortunately, newer tools (dremio's, AWS's, Azure's, and many others) make this more accessible than before. Most modern tools recognize that there are many formats, access patterns, and data access needs; one size does not fit all. This point of view alone makes these tools better than the traditional "single ETL tool" and "single DW database" approach of the decade prior.

Data lake companies provide tools and patterns that *are* more useful in a highly complex and distributed (organizationally and technically) environment.

Look at dremio.

Dremio has a great product. I like it. It is cast as a data lake engine because data lakes are still kind-of hot in the market, but it is really a data virtualization tool well suited to a cloud environment. It is highly useful when you want to provide access to data in a wide variety of formats and through a wide variety of access technologies and tools. Yes, there is a finite list of "connectors." At least part of dremio, such as Apache Arrow and Arrow Flight, is open-source, so you can add your own.

dremio has to implement patterns that have been used for decades, even if dremio describes them differently. To make queries fast enough and to lower costs, it has a cache in the middle, although an optional one. It has a C++ core instead of something less efficient, it targets zero-copy transfers through the networking and application stack, and it uses code generation to push computation to different locations.
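The zero-copy idea itself is simple, and you can see it with nothing but Python's standard library. This is an illustration of the concept, not dremio's or Arrow's code; they apply it across processes and the network rather than within one heap:

```python
# Zero-copy in miniature: a memoryview shares the underlying buffer,
# so "transferring" a slice of it moves no bytes at all.
buf = bytearray(b"columnar-data-page")
view = memoryview(buf)

window = view[0:8]            # slicing a memoryview copies nothing
assert bytes(window) == b"columnar"

buf[0:8] = b"COLUMNAR"        # mutate the underlying buffer...
assert bytes(window) == b"COLUMNAR"  # ...and the shared view sees the change

copy = bytes(view)            # by contrast, bytes() materializes a real copy
buf[0] = ord("x")             # later mutations...
assert copy[:8] == b"COLUMNAR"  # ...do not touch the copy
```

Every copy avoided is CPU, memory bandwidth, and latency saved, which is why columnar engines work so hard to keep data in one shared representation end to end.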

Many, if not most, of these features were implemented four decades ago for MPP internetworking and were present in the Ab Initio and Torrent data processing products, if anyone remembers them. Columnar databases with compression were available three decades ago–I used them. Separate compute and storage, break apart the RDBMS into pieces, and retarget them? Check! To be clear, I'm not saying everyone claims these are completely new concepts that have never been done before.

However, newer products like dremio's are better than yesterday's tools. Their mindset and development approach are entirely different. Sure, they are not doing anything new architecturally, but that makes them easy to figure out and use. Under the hood, they must build the same building blocks needed to process data as any product–you cannot escape gravity. But they are doing new things design-wise, and they are making better products. Recognizing these basic ideas should help large enterprises adopt and integrate products like dremio.

The sins of data warehousing and proprietary tools, in general, are many. Proprietary tools most likely still make more money daily than open-source tools, even if open-source tools have higher valuations. Perhaps those valuations reflect their ability to be used by more companies in the long run. Open-source tools are cheaper for the moment, and there are more product choices.

In the long run, no market can sustain a large number of products, so when the Fed finally stops supporting companies and capitalism returns, you may see a shrinking of funds around open-source data management tools.

All is not perfect, but it is better than before. Data lakes can be useful–they were useful 20 years ago, when they existed at companies under different names, e.g., the "input layer" or the "extract layer." Insurance companies loved the extract layer because their source systems were many and complex, and if you could find the right extract, life was easier. I'm hoping tools like dremio's get situated and last in the long run, because they are better.

Companies are building non-open parts of their products to monetize them; they still need income. Like the tools they displaced, these newer tools will be displaced by others unless they get embedded deeply enough at a client or another software company and create a sustainable source of income. Look at Palantir, for example. They have a little open-source, but their core product is behind the firewall. Many of these companies use open-source as a cover for coolness, but their intent is monetized proprietary software. I'm not against that, but we should recognize the situation so we can make smarter decisions about what to use.

The cycle will continue. Rinse and repeat.