The past 10 years of data warehousing have been all wrong…

This is an “ideas” post. One where I am trying to work out some ideas that have been bouncing around my head since this morning.

Essentially, the past 10 years of data warehousing have been wrong. Wrong in the sense that data warehousing has not adapted to newer technologies that would solve its fundamental issues and increase the opportunity for success. Data warehousing projects come in all shapes and sizes. The smaller they are and the more focused the problem they are trying to solve, the more successful they typically are. This is because data warehousing has many non-technical issues that cause it to fail: the failure of the business and of IT to listen and communicate, requirements that change as fast as the business changes (and that’s pretty fast for areas such as sales and marketing), and problems with sponsorship, sustainable funding and short-term commitment mentalities. Many of these factors are mitigated by smaller projects.

Hence, the application of “lean” and “agile” methodologies to data warehousing. These approaches are really a learning-by-doing model: you do a small amount of work, receive feedback, do a small amount of work, receive feedback, and so forth. Many tiny cycles with feedback help promote alignment between the data warehouse (in its current iteration) and what the business wants, or thinks it wants. These approaches have helped, but with the trade-off that it is difficult to run very large-scale projects where developers are spread out around the world. So they have helped, but large, complex projects must still be conducted, and it is clear that coordinating a large team is just really hard.

Data warehousing technology has not substantially helped solve these problems. Today, larger databases that run very fast are available, but they are built using the old approach (data models, ETL, etc.), so those components just run faster. That helps, of course, because less time is spent trying to optimize everything and more time can be spent on other tasks, such as working with the business. But the current use of technology is not really solving lifecycle issues; it actually makes them worse. You have data modeling teams, ETL teams, architect teams and analyst teams, all of which have to piece together their components and make something large work. It is like building a rocket ship without large government funding.

BigData has stepped in and made other tools available. But they are often applied and targeted at a very specific workflow–a specific type of analysis–that can be programmed into what are generally fairly immature tools. So BigData is helping because it loosens up an architect’s thinking about how to put together solutions and encourages the use of non-traditional technologies.

So what would help? Let’s consider a world where compute power for each user is effectively infinite. We are not saying it’s free, just that it is relatively easy to get enough compute power to solve specific types of problems. Let’s also assume that the non-technical issues will not change; they are an invariant in this scenario. And let’s assume we want to use some elements of technology to address those non-technical issues.

In this scenario, we really need a solution that has a few parts to it.

  • We need better tools & technologies that allow us to deliver solutions at a rapid pace, with significantly more updates than even today’s technologies allow. Let’s assume that the word “update” means both that the data updates frequently and that the structure of the data changes frequently.
  • We need to be able to use one environment so that the people creating the solutions do not have to change toolsets or make diverse toolsets work together. This is one of the reasons why SAS is so popular–you can stay in one toolset.
  • We also need technologies that allow a lifecycle process to work with small teams who can combine their solution components more easily and whenever they are ready–versus when large team milestones dictate that components have to be integrated.
  • We need to support processes that span the globe, with people who contribute both technical and domain knowledge. We want to support decoupled teams.

Let’s imagine a solution then. Let’s assume that every piece of data coming into our technology solution is tagged. This one value is tagged as being part of a healthcare claim and as representing a diagnosis code. You tag it as being a diagnosis code, as being part of a claim, as being a number, etc. You can describe those relationships. Let’s tag all the data this way; essentially, you are expanding the context of the data. Now let’s assume that we can establish these tags, and hence relationships, between all the data elements, and let’s also assume that we have a tool that can change these relationships dynamically so that we can create new relationships (pathways) between the data. Of course, ETL conceptually does not go away, but let’s assume that ETL becomes more of a process operating at different scales: the data-element level, the relationship level, the aggregate tag level, etc.
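To make this concrete, here is a minimal sketch of what a tagged data element might look like: just a key+value pair plus a set of tags that carry its context. The field names and tag values are purely illustrative, not a proposed standard.

```python
from dataclasses import dataclass, field

# Minimal sketch of a tagged data element: a key + value pair plus a set
# of tags that carry its context. All names here are illustrative.
@dataclass
class TaggedElement:
    key: str                                  # unique identifier for the element
    value: object                             # the raw value itself
    tags: set = field(default_factory=set)    # descriptive context

# A diagnosis code captured from a healthcare claim
dx = TaggedElement(
    key="claim:12345:dx1",
    value="E11.9",
    tags={"claim:12345", "diagnosis-code"},
)

# Relationships are just more tags, so they can be added or rewired later
dx.tags.add("related-to:procedure:99213")
print(dx)
```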

Now, because we have infinite computing resources, we can start assembling the data the way we would like. If we are technologists, perhaps we assemble it in a way that is helpful for putting together a production report. If we are analysts, we might assemble it in a way that helps us determine whether an outcome measurement improved based on an intervention (which has its own set of tags). When we assemble, we actually describe how data is grouped together to form a hierarchy of concepts. A DX code is a field that belongs to a claim, or a field that belongs to clinical indicators. Indicators are related to procedures through a probabilistic relationship based on past-seen relationships or programmed relationships.
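As a toy illustration of what “assembling” could mean, the snippet below groups plain tagged records into different views simply by choosing which tags to group on. The elements, tags and views are all hypothetical.

```python
from collections import defaultdict

# Each element is a plain dict: {"key": ..., "value": ..., "tags": {...}}
elements = [
    {"key": "claim:12345:dx1", "value": "E11.9",
     "tags": {"claim:12345", "diagnosis-code", "outcome:measured"}},
    {"key": "claim:12345:proc1", "value": "99213",
     "tags": {"claim:12345", "procedure-code"}},
]

def assemble(elements, tag_prefix):
    """Group elements under every tag that starts with tag_prefix,
    producing one possible hierarchy over the same underlying data."""
    groups = defaultdict(list)
    for el in elements:
        for tag in el["tags"]:
            if tag.startswith(tag_prefix):
                groups[tag].append(el["key"])
    return dict(groups)

# A technologist's view: everything grouped by claim
print(assemble(elements, "claim:"))
# An analyst's view: everything that carries an outcome tag
print(assemble(elements, "outcome:"))
```

The point is that no single grouping is privileged; the same tagged elements can be re-assembled on demand for whoever needs them.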

Given that we can assemble and reassemble, let’s also imagine that at any time we can copy all of the data and all the tags. We can go to a master area and say: I would like to copy it so I can fiddle with my tags, and if everyone likes my tags, I may contribute them back to the master. And let’s assume that if the master dataset is updated with recent data, I can just merge those data into my working set. Essentially, we have checked out the entire dataset, tracked our changes to it, updated it with changes from other people, and can check our changes back in for others to use–very much like a data change management solution. As the tags evolve, other people can assemble and reassemble the data in new ways.
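A crude sketch of that check-out-and-merge cycle, using nothing more than dictionaries of tag sets (the keys and tags are made up), might look like this:

```python
import copy

# Master set of tags per data element (element key -> set of tags)
master = {
    "claim:12345:dx1": {"claim", "diagnosis-code"},
}

# "Check out" a private working copy of the tags
working = copy.deepcopy(master)
working["claim:12345:dx1"].add("chronic-condition")   # my local change

# Meanwhile the master picks up new data from another contributor
master["claim:67890:dx1"] = {"claim", "diagnosis-code"}

# Merge: pull new master entries into my copy, then contribute my tags back
for key, tags in master.items():
    working.setdefault(key, set()).update(tags)
for key, tags in working.items():
    master.setdefault(key, set()).update(tags)

print(master)   # now contains both the new data and my new tag
```

A real implementation would need conflict resolution and change history, which is exactly what git-style tooling already does well for source code.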

So one solution to help fix data warehousing is to employ BigData technology in a way that allows each individual to assemble and analyze the data the way they want to, and, when that individual creates something useful, to share it with others so they can use it. NoSQL databases conceptually give us this capability, especially when the data is represented by something as simple as a key+value. Source code control systems like “git” (a large-scale, distributed management system) give us a model to shoot for, but at the data warehouse level, and the current crop of ETL programs inform us of the types of changes that need to be made to the data to improve its quality for use.

Many of the ingredients exist today; we just need the innovation to happen.

Social media, BigData and Effort-Return

The classic question we ask about marketing, or really any form of outreach, is: given the effort I expend, what is my return? This Effort-Return question is at the heart of ROI, value proposition and the general, down-to-earth question of “Was it worth it?”
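As a back-of-the-envelope framing, Effort-Return is just the familiar ROI ratio; the numbers below are purely hypothetical and only illustrate the calculation.

```python
# Purely hypothetical numbers to illustrate the Effort-Return (ROI) framing.
effort = 100_000              # dollars spent on a social media effort
incremental_return = 130_000  # margin attributed to that effort

roi = (incremental_return - effort) / effort
print(f"ROI: {roi:.0%}")      # 30% -- "was it worth it?" in one number
```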

That’s the essential question my clients have always asked me, and I think that’s the big question being formed around the entire area of social media and BigData. It’s clear that social media is here to stay. The idea that “people like us” create our own content is very powerful. It gives us voice, it gives us a communication platform and it gives us, essentially, power. The power of the consumer voice is amplified.

Instead of 10 people hearing us when we are upset about something, we can have 1,000,000 hear us. That’s power. And the cost of “hearing” the content, of finding it, is dropping dramatically. It is still not free to access content; you still have enormous search costs (we really need contextual search–searching the resources most relevant to me instead of searching the world). But search costs are dropping, and navigation costs are dropping. Every year, those 1,000,000 can listen to and filter another 1,000,000 messages.

BigData has come to our rescue…in a way. It gives more tools to the programmers who want to shape that search data, who want to help us listen and reach out. There’s a lot of hype out there and the technology is moving very fast–so fast that new projects, new frameworks and new approaches are popping up every day.

But is it worth the effort? If so, just how much is it worth? That’s still a key question. The amount of innovation in this area is tremendous, and it’s not unlike the innovation I see occurring in the healthcare space. Everyone is trying something, which means that everything is being tried somewhere.

That’s good. But it’s already pretty clear that we can now communicate through many channels that were unavailable 5 years ago, more easily and more frequently, and we can more readily find things that interest us. The question is whether it really pays off. Do companies benefit by investing now and trying to get ahead, or do they just try to keep pace and meet some, but not all, customer expectations? Will entire companies fall because of social media and BigData?

Those are hard questions to answer, but I think we can look at other patterns out there and see that, even with all the hype today, it will be worth it. Companies should invest. But many should not over-invest. When the internet was first widely used, it was thought that it would change the world. Well, it did. It just took two decades instead of the few years that futurists predicted. But with a higher density of connectivity today, innovations can roll out faster.

But I think that’s where the answer is. If you are in the business of social, then being on that wave and pushing the edge is good. If your business is BigData tools and technologies, then yes, you need to invest and recognize that it will be worth it in the long run if you survive. But many brands (companies) can just strive to keep pace and do okay. There are exceptions, of course. Product quality and features still dominate purchase decisions. Yes, people are moved by viral videos, bad customer service and bad products, but companies with long-standing brands whose products are core can afford to spend to meet expectations rather than having to over-invest. They will do just fine keeping pace and continuing to focus on the product as well as marketing. For example, does Coca-Cola expect to double its market share because it is better at social media than others? Will it grow significantly because of it? It’s not clear, but for some segments of products the spending pattern does not have to be extraordinary. It just needs to keep pace and be reasonable.

This gets us back to the question of social media, BigData and Effort-Return. Effort-Return is important to calculate because brands should not over-invest; they need to manage their investments. Are social media and BigData worth the investment? Absolutely; it’s really a question of degree.

Lean startups

The May issue of HBR has an article on the lean startup model. Essentially, you prototype something, find clients quickly, get feedback and iterate again. The idea of “lean” is that you forgo deep planning and marketing, which may not make sense anyway since most plans change rapidly.

There is a lot of truth in that. I’ve helped startups (even in my early college and grad-school days), and it certainly makes sense to try something and keep iterating. I found that to be true with writing, software, management consulting ideas and a variety of other areas.

However, it’s not universally true. You really do need deep thinking in some cases, in some industries, or in situations where just getting to the first prototype will consume significant capital. Hence, the idea is a good one but should be judiciously applied. That’s not to say that getting continuous feedback is ever bad; it’s just that sometimes you need more than a prototype.

There is an old management principle around innovation. The “learning while doing” model says you cannot know everything, or even 1/10th of what you need to know, so it’s better to get started and learn as you go. That’s the basic concept behind the lean startup (see here for more info on learning-by-doing, which is a concept from the Toyota system).

The concept is bouncing around the technical crowds as well. This article makes the case that you need to “learn fast” versus “fail fast, fail often,” which is in the spirit of the lean startup. In fact, there are now lean canvases that you can put together. While there are a lot of good ideas here, I think the only rule about employing them is “pick your rules carefully.”

Hospital profits and trust in the medical community

The Washington Post had an article covering how hospital profits increase when there are complications with surgery. I do not think that the health care system causes complications intentionally, but for me this links back to arguments made by Lawrence Lessig.

His argument in his book “Republic, Lost” is that the presence of money (profits) in the wrong location (complications related to surgery) causes us to think differently about the relationship between those who provide care and patients (givers and receivers). He believes that the mere presence of money causes us to change our trust relationship with the other party.

There does seem to be some evidence of physician-led abuses in the care community. But the abusers are more than just providers. An entire ecosystem of characters is at work, each trying to get a slice of what is an overwhelmingly large pie in the US economy.

It is a large pie, so we should expect some abuses by all parties involved–including patients! The issue is really about how the presence of money, and of stories like the one above that discuss it, distorts our trust relationships. According to Lessig, it is this distortion that is eroding trust.

Using Lessig’s argument, it is not that we think Providers are needlessly causing complications to obtain more profit, but that the presence of money tied to the wrong incentive causes us to think twice. This is the essence of his argument about “dependency corruption.”

To remove these incentives and restore trust, is the solution an integrated, capitated model like Kaiser Permanente, where Provider & Payer are one and the same, and hence there is a motivation to reduce costs and improve outcomes because those who save dollars get to “cash the check”?

If you believe that this incentive model is the only one that could restore trust, what is the eventual outcome? Could it be that the entire health insurance market will fragment into thousands of small plans, perhaps like Canada, where there is a central payer and then thousands of healthcare plans to fill in the cracks?

Or is the only way to restore trust to go to a national payer system so that a majority of the healthcare delivered would have an integrated incentive?

Are there any in-between models that work?

It’s not clear what will happen, but it does seem that trust, as abstract as it sounds, could lead to major structural shifts in the industry, just as trust in today’s government seems to be greatly diminished (citizens think that the government is captive to the special interest groups and lobbyists who finance campaigns).

Regardless, I think that smart information management technologies can support that type of fragmentation and still be efficient, so we should not let technology limit the best model for healthcare delivery. After all, the new health care law (the Patient Protection and Affordable Care Act) is attempting to create a nationwide individual market (almost overnight), and the plans must meet minimum standards. There will also be little gap-filler plans for those who want delta coverage.

We’ll see.

Twitter and the source of real-time news – is it the new emergency broadcast system?

The Washington Post’s Outlook section mentioned that Twitter has become the first source of news, ahead of the major networks and news organizations. While this is certainly true for many public events, such as the Boston Marathon bombing, it is probably more accurate to say that Twitter is the first source of news for certain segments of news. For public events, and events in places where smartphones can operate such as public spaces or urban settings, Twitter can be the first to report because the witnesses or even the participants can self-report. For other types of news, such as corporate internal news, certain types of business news and government news, finding stories and issues can take time and deeper digging–activities that are not so heavily aligned with Twitter’s instant communication model.

Of course, with the power to be real-time comes the responsibility not to use it for fraudulent purposes. Twitter puts significant “power” into the hands of the individual, and that power can be used for reporting problems and issues, but it can also be used to amplify untruths or fraudulent information. All communication channels have this balance to some degree. With Twitter, the amplification effect and that balance must be managed much more closely.

Although not everyone has a smartphone or receives Twitter alerts, many companies and government groups do monitor Twitter; in essence, it is now a new, socially driven emergency broadcast system.

Large-scale healthcare sensor networks and BigData

Lately, there have been announcements that could make large-scale, healthcare-focused sensor networks much more of a reality. A healthcare monitoring network could drive substantial improvements in care and reductions in cost. Today, if you are in a hospital, you are plugged into a sensor network that is relatively stationary and highly controlled (for obvious reasons). But there are many more healthcare, consumer-level networks that could be created. Here’s a mention of the world’s smallest blood monitoring implant and other heart rate monitoring capabilities based on visual monitoring techniques:

Putting together a BigData solution here means a solution that can scale out. Batch technologies are not the answer here, so frameworks like Hadoop are not directly the primary component. Other analytical and streaming frameworks like Storm, Dempsy, Apache S4, Esper, OpenMDAO, Stormkeeper or the Eclipse M2M framework are needed.

In this case, BigData is about scaling out solutions for sensor networks and piecing together analytical processing nodes to create a workflow that accomplishes the analysis.
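As a rough illustration (and not any particular framework’s API), a scale-out streaming workflow is conceptually a chain of small processing nodes, each handling one step of the analysis. The sensor readings, node names and alert threshold below are made up.

```python
# Toy sketch of a streaming analysis pipeline for sensor readings.
# It mimics the shape of frameworks like Storm or Esper (a chain of
# processing nodes over an unbounded stream); it is not their API, and
# the readings and threshold are made-up illustrations.
import random

def sensor_stream(n=50):
    """Source node: emits (patient_id, heart_rate) readings."""
    for _ in range(n):
        yield ("patient-42", random.randint(50, 120))

def smooth(stream, window=5):
    """Processing node: rolling average over the last `window` readings."""
    buffer = []
    for patient, rate in stream:
        buffer.append(rate)
        buffer = buffer[-window:]
        yield (patient, sum(buffer) / len(buffer))

def alert(stream, threshold=100):
    """Sink node: flag averages above a (made-up) threshold."""
    for patient, avg in stream:
        if avg > threshold:
            print(f"ALERT {patient}: average heart rate {avg:.0f}")

# Wire the nodes together; in a real scale-out deployment each node
# could run on a separate machine and the stream would never end.
if __name__ == "__main__":
    alert(smooth(sensor_stream()))
```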

But healthcare sensor networks are not without their challenges. Here are some links that describe the issues in more detail and the research going on in this area: