The past 10 years of data warehousing has been all wrong…

This is an “ideas” post. One where I am trying to work out some ideas that have been bouncing around my head since this morning.

Essentially, the past 10 years of data warehousing have been wrong. Wrong in the sense that data warehousing has not adapted to newer technologies that would solve its fundamental issues and increase the opportunity for success. Data warehousing projects come in all shapes and sizes. The smaller they are and the more focused the problem they are trying to solve, the more successful they typically are. This is because data warehousing has many non-technical issues that cause it to fail: the failure of the business and of IT to listen and communicate, requirements that change as fast as the business changes (and that’s pretty fast for many areas such as sales and marketing), and weak sponsorship, unsustainable funding, and short-term commitment mentalities. Many of these factors are mitigated by smaller projects.

Hence, the application of “lean” and “agile” methodologies to data warehousing. These approaches are really a learning-by-doing model: do a small amount of work, receive feedback, do a small amount of work, receive feedback, and so forth. Many tiny cycles with feedback help promote alignment between the data warehouse (and its current iteration) and what the business wants, or thinks it wants. These approaches have helped, but at the cost of making very large-scale projects difficult to implement when developers are spread out around the world. So they have helped, but large, complex projects must still be conducted, and it is clear that coordinating a large team is just really hard.

Data warehousing technology has not substantially helped solve these problems. Today, larger databases that run very fast are available, but they are built using the old approach, e.g. data models, ETL, etc. Those components just run faster. That helps, of course, because less time is spent trying to optimize everything and more time is spent on other tasks, such as working with the business. But the current use of technology is not really solving lifecycle issues; it actually makes them worse. You have data modeling teams, ETL teams, architect teams, analyst teams, all of which have to piece together their components and make something large work. It is like building a rocket ship without large government funding.

BigData has stepped in and made other tools available. But they are often applied to a very specific workflow, a specific type of analysis, that can be programmed into what are generally fairly immature tools. So BigData is helping because it loosens up an architect’s thinking about how to put solutions together and about employing non-traditional technologies.

So what would help? Let’s consider a world where compute power for each user is effectively infinite. We are not saying it’s free, just that it’s relatively easy to get enough compute power to solve specific types of problems. Let’s also assume that the non-technical issues will not change; they are an invariant in this scenario. And let’s assume we want to use some elements of technology to address those non-technical issues.

In this scenario, we really need a solution that has a few parts to it.

  • We need better tools & technologies that allow us to deliver solutions at a rapid pace, with significantly more updates than even today’s technologies allow. Let’s assume that the word “update” means both that the data updates frequently and that the structure of the data changes frequently.
  • We need to be able to use one environment so that the people creating the solutions do not have to change toolsets or make diverse toolsets work together. This is one of the reasons why SAS is so popular: you can stay in one toolset.
  • We also need technologies that allow a lifecycle process to work with small teams who combine their solution components more easily and whenever they are ready, versus when large team milestones say that components have to be integrated.
  • We need to support processes that span the globe with people who contribute both technical and domain knowledge. We want to support decoupling the teams.

Let’s imagine a solution then. Let’s assume that every piece of data coming into our technology solution is tagged. This one value is tagged as being part of a healthcare claim and represents a diagnosis code. You tag it as being a diagnosis code, as being part of a claim, as being a number, and so on. You can describe that relationship. Let’s tag all the data this way. Essentially, you are expanding the context of the data. Now let’s assume that we can establish these tags, and hence relationships, between all the data elements, and let’s also assume that we have a tool that can change these relationships dynamically so that we can create new relationships (pathways) between the data. Of course, ETL conceptually does not go away, but let’s assume that ETL becomes more of a process operating at different scales: the data element level, the relationship level, the aggregate-of-tags level, and so on.
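To make this concrete, here is a minimal sketch in Python of what a tagged data element might look like, assuming a simple key+value representation. The DataElement structure and the tag names (“healthcare_claim”, “diagnosis_code”, and so on) are illustrative assumptions, not any real product’s API.

    from dataclasses import dataclass, field

    @dataclass
    class DataElement:
        key: str        # unique identifier for this value
        value: object   # the raw data, e.g. a diagnosis code
        tags: set = field(default_factory=set)   # the expanded context of the value

    # A diagnosis code arriving as part of a healthcare claim
    dx = DataElement(
        key="claim:12345:dx:1",
        value="E11.9",
        tags={"healthcare_claim", "diagnosis_code", "code"},
    )

    # New relationships (pathways) are just more tags, added without reloading anything
    dx.tags.add("clinical_indicator")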

Now, because we have infinite computing resources, we can start assembling the data the way we would like. If we are technologists, perhaps we assemble it in a way that is helpful for putting together a production report. If we are analysts, we might assemble it in a way that helps us determine whether an outcome measure improved based on an intervention (which has its own set of tags). When we assemble, we actually describe how data is grouped together to form a hierarchy of concepts. A DX code is a field that belongs to a claim, or a field that belongs to clinical indicators. Indicators are related to procedures through a probabilistic relationship based on past-seen relationships or programmed relationships.
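A small sketch of what “assembling” could mean, again with made-up keys, values, and tag names: the same pool of tagged elements is grouped one way for a production report and another way for an analysis.

    from collections import defaultdict

    # Each element is just (key, value, tags); the contents are illustrative.
    pool = [
        ("claim:12345:dx:1", "E11.9", {"healthcare_claim", "diagnosis_code", "clinical_indicator"}),
        ("claim:12345:amount", 125.00, {"healthcare_claim", "billed_amount"}),
        ("program:outreach:1", "2013-02-01", {"intervention", "outreach_call"}),
    ]

    def assemble(elements, group_by):
        """Group elements under the first tag in group_by that each element carries."""
        groups = defaultdict(list)
        for key, value, tags in elements:
            for tag in group_by:
                if tag in tags:
                    groups[tag].append((key, value))
                    break
        return groups

    # Technologist view: everything belonging to a claim, for a production report
    report_view = assemble(pool, group_by=["healthcare_claim"])

    # Analyst view: indicators and interventions, for an outcome study
    analysis_view = assemble(pool, group_by=["clinical_indicator", "intervention"])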

Given that we can assemble and reassemble, let’s also imagine that at any time we can copy all of the data and all of the tags. We can go to a master area and say: I would like a copy so I can fiddle with my tags, and if everyone likes my tags, I may contribute them back to the master. And let’s assume that if the master dataset is updated with recent data, I can just merge those data into my working set. Essentially, we have checked out the entire dataset, tracked our changes to it, updated it with changes from other people, and can check our changes back in for others to use, very much like a data change management solution. As the tags evolve, other people can assemble and reassemble the data in new ways.
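Here is a rough sketch of that check-out/check-in workflow applied to tags, assuming the tags for each element are stored as a simple key-to-set mapping; the merge rule used here (union the tag sets) is just one possible choice.

    import copy

    # Master copy of the tags (key -> set of tags); keys and tag names are made up.
    master = {"claim:12345:dx:1": {"healthcare_claim", "diagnosis_code"}}

    # "Check out" a working copy and fiddle with the tags locally
    working = copy.deepcopy(master)
    working["claim:12345:dx:1"].add("chronic_condition")

    # Meanwhile, the master receives recent data
    master["claim:67890:dx:1"] = {"healthcare_claim", "diagnosis_code"}

    def merge(base, changes):
        """Union tag sets key by key; keys present on either side are kept."""
        merged = copy.deepcopy(base)
        for key, tags in changes.items():
            merged[key] = merged.get(key, set()) | tags
        return merged

    # "Check in": the master now holds both the new data and my tag changes
    master = merge(master, working)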

So one solution to help fix data warehousing is to employ BigData technology, but in a way that allows each individual to assemble and analyze the data the way they want to, and, when that individual creates something useful, to share it with others so they can use it. NoSQL databases conceptually give us this capability, especially when the data is represented by something as simple as a key+value. Source code control systems like “git” (a large-scale, distributed version control system) give us a model to shoot for, but at the data warehouse level, and the current crop of ETL programs informs us of the types of changes that need to be made to the data to improve its quality for use.
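For completeness, a sketch of how a tagged element could be laid out as plain key+value rows, one row for the value and one per tag, so that adding a relationship is just adding a row rather than changing a schema; the key scheme is an assumption, not any particular NoSQL product’s convention.

    # One row per value and one per (element, tag) pair in a generic key+value layout
    rows = [
        ("claim:12345:dx:1/value", "E11.9"),
        ("claim:12345:dx:1/tag/healthcare_claim", "1"),
        ("claim:12345:dx:1/tag/diagnosis_code", "1"),
        ("claim:12345:dx:1/tag/clinical_indicator", "1"),
    ]

    # Reassembling is a scan over keys, not a schema change
    dx_tags = {key.split("/tag/")[1] for key, _ in rows if "/tag/" in key}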

Many of the ingredients exist today; we just need the innovation to happen.
