yes, yet another bigdata summary post…now it’s a party

Since I am “recovering” data scientist, I thought that once in awhile, it would be good deviate from my more management consulting articles and  eyeball the bigdata landscape to see if something interesting has happened.

What!?! It seems like you cannot read an article without encountering yet another treatise on bigdata or at the very least, descriptions of the “internet of things.”

That’s true, but if you look under the hood, the most important benefits of the bigdata revolution have really been on two fronts. First, recent bigdata technologies have decreased the cost of analytics and this makes analytics more easily available to smaller companies. Second, the bigadata bandwagon has increased awareness that analytics are needed to run the business.  Large companies could long afford the investments in analytics which made corporate size an important competitive attribute. The benefits from analytics should not lead to a blanket and unthoughtful endorsement of analytics. Not every business process, product or  channel needs overwhelming analytics. You want, however, analytics to be part of the standard toolkit for managing the value chain process and decision making.

The ability to process large amounts of data, beyond what mainframes could do, has been with us for years-twenty to thirty years The algorithms developed decades ago are similar to the algorithms and processing schemes pushed in the bigdata world today. Teradata helped created the MPP database and SQL world. AbInitio (still available) and Torrent (with their Orchestrate product sold to IBM eventually) defined the pipeline parallelism and data parallelism data processing toolchain world. Many of the engineers at these two ETL companies came from Thinking Machines. The MPI API defined parallel processing for the scientific world (and before that PVM and before that…).

All of these technologies were available decades ago. Mapreduce is really an old lisp concept of map and fold which was available in parallel from Thinking Machines even earlier. Today’s tools build on the paradigms that these companies created in the first pass of commercialization. As you would expect, these companies built on what had occurred before them. For example, parallel filesystems have been around for a long time and were present on day one in those processing tools mentioned above.

Now that the hype around mapreduce is declining and its limitations are finally becoming widely understood,  people recognize that mapreduce is just one of several parallel processing approaches. Free from the mapreduce-like thinking, bigdata toolchains can finally get down to business. The bigdata toolchains realize that sql query expressions are a good way to express computations. Sql query capabilities are solidly available in most bigdata environments. Technically, many of the bigdata tools provide “manual” infrastructure to build the equivalent sql commands. That is, they provide the parsing, planning and distribution of the queries to independent processing nodes.

I consider the current bigdata “spin” that started a about 1-2 years ago healthy because it  increased the value of other processing schemes such as streaming, real-time query interaction and graphs. To accommodate these processing approaches, the bigdata toolchains have changed significantly. Think SIMD, MIMD, SIPD and all the different variations.

I think the framework developers have realized that these other processing approaches require a general purpose parallel execution engine. An engine that AbInitio and others have had for decades. You need to be able to execute programs using a variety of processing algorithms where you think of the “nodes” as running different types of computations and not just a single mapreduce job. You need general purpose pipeline and data parallelism.

We see this in the following open-source’ish projects:

  • Hadoop now as a real resource and job management subsystem that is a more general parallel job scheduling tool. It is now useful for more genera parallel programming.
  • Apache Tez helps you build general jobs (for hadoop).
  • Apache Flink builds pipeline and data parallel jobs. Its also a general purpose engine e.g. streaming, …
  • Apache Spark builds pipeline and data parallel jobs. Its also a general purpose engine e.g. streaming, ..
  • Apache Cascading/Scalding builds pipeline and data parallel jobs, etc.
  • DataTorrent: streaming and more.
  • Storm: Streaming
  • Kafka: Messaging (with persistency)
  • Scrunch: Based on apache crunch, builds processing pipelines
  • …many of the above available as PaaS on AWS or Azure…

I skipped many others of course and I am completely skipping some of the early sql-ish systems such as hive and I have skipped  visualization, which I’ll hit in another article. Some of these have been around for a few years in various stages of maturity. Most of these implement pipeline parallelism and data parallelism for creating general processing graphs and some provide sql support where that processing approach makes sense.

In addition the underlying engines, what’s new? I think some very important elements: usability. The tools are a heck-of-alot easier to use now. Here’s why.

What made the early-stage (20-30 years ago) parallel processing tools easier to use was that they recognized, due to their experience in the parallel world, that usability by programmers was key. While it is actually fairly easy to get inexpensive scientific and programming talent, programming parallel systems has always been hard. It needs to be easier.

New languages are always being created to help make parallel programming easier. Long ago, HPF and C* among many were commercial variations of the same idea.  Programmers today want to stay within their toolchains because switching toolchains to run a data workflow is hard work and time consuming to develop. Many of today’s bigdata tools allow multiple languages to be used: Java, Python, R, Scala, javascript and more. The raw mapreduce system was very difficult to program and so user-facing interfaces were provided, for example, cascading. Usability is one of the reasons that SAS is so important to the industry. It is also why Microsoft’s research Dryad project was popular. Despite SAS’s quirks, its alot easier to use than many other environments and its more accessible to the users who need to create the analytics.

In the original toolsets from the vendors mentioned earlier in this article, you would program in C++ or a special purpose data management language. It worked fine for those companies who could afford the talent that could master that model. In contrast to today, you can use languages like python or scala to run the workflows and use the language itself to express the computations. The language itself is expressive enough  that you are not using the programming environment as a “library” that you make programming calls to. The language constructs are  translated into the parallel constructs transparently. The newer languages, like lisp of yore, are more functionally oriented. Functional programming languages come with a variety of capabilities that makes this possible. This was the prize that HPF and C* were trying to win. Specialized languages are still being developed that help specify parallelism and data locality without being “embedded” in other modern languages and they to can make it easier to use the new bigdata capabilities.

While the runtimes of these embedded parallel capabilities are still fairly immature in a variety of ways. Using embedded expressions, data scientists can use familiar toolchains, languages and other components to create their analytical workflows easier. Since the new runtimes allow more than just mapreduce, streaming, machine learning and other data mining approaches suddenly becomes much more accessible at large scale in more ways than just using other tools like R and others.

This is actually extremely important. Today’s compute infrastructure should not be built with rigid assumptions about tools, but be “floatable” to new environments where the pace of innovation is strong. New execution engines are being deployed at a fantastic rate and you want to be able to use them to obtain processing advantages. You can only do that if you are using well known tools and technologies and if you have engineered your data (through data governance) to be portable to these environments that often live in the cloud. It is through this approach that you can obtain flexibility.

I won’t provide any examples here, but lookup the web pages for storm and flink for examples. Since sql-like query engines are now available in these environments, this also contributes to the user-friendliness.

Three critical elements are now in play: cost effectiveness, usability and generalness.

Now its a party.

Do sanctions work? Not sure, but they will keep getting more complex

After Russia and Ukraine ran into some issues a few months back, the US gathered international support and imposed sanctions.

Most people think that sanctions sound like a good idea. But do they work?

Whether sanctions work is a deeply controversial topic. You can view sanctions through many different lenses. I will not be able to answer that question in this blog. It is interesting to note that the sanctions against Russia over the Ukraine situation are some of the most complex in history. I think the trend will continue. Here’s why.

Previously, sanctions would be imposed on a country that is doing things the sanctioning entity does not want to happen. Country-wide sanctions are fairly easy to understand and implement. For example, sanctions against Iran for nuclear enrichment. Sanctions in the past could be levelled at an entire country or a category of trade e.g. steel or high performance computers. But they have to be balanced. In the case of Russian and Ukraine, the EU obtains significant amounts of energy from Russia.  Sanctions against the energy sector would hurt both the EU and Russia.

Sanctions today often go against individuals. The central idea is to target individuals who have money at stake. OFAC publishes a list of sanctioned individuals and updates in regularly. If you are on the list, you are not allowed to do business with those sanctioned individuals, that is, you should not conduct financial transactions of any type with that individual (or company).

The new Russian sanctions target certain individuals, a few Russian banks (not all of them), and allows certain forms of transactions. For example, you cannot transact with a loan or debenture longer than 90 days maturity or new issues. Instead of blanket sanctions, its a combination of attributes that apply as to whether a financial transaction can be made.

Why are the Russian sanctions not a blanket “no business” set of sanctions?

By carefully targeting (think targeted marketing) the influences of national policy, the sanctions would hurt the average citizen a bit less, perhaps biting them, but no so much that the average citizen turns against the sanctioning entity. Biting into the influencers and others at the top is part of a newer model of making individuals feel the pain. This approach is being used the anti-money laundering (AML) and regulatory space in the US in order to drive change in the financial services industry e.g. hold a chief compliance officer accountable if a bad AML situation develops.

So given the philosophical change as well as the new information-based tools that allow governments to be more targeted they will keep getting more complex.

Oso Mudslides and BigData

There was much todo about google’s bigdata bad flu forecasts recently in the news. google had tried to forecast flu rates in the US based on search data. That’s a hard issue  to forecast well but doing better will have public benefits by giving public officials and others information to identify pro-active actions.

Lets also think about other places where bigdata, in a non-corporate, non-figure-out-what-customers-will-buy-next way, could also help.

Let’s think about Oso, Washington (Oso landslide area on google maps)

Given my background in geophysics (and a bit of geology), you can look at Oslo, Washington and think…yeah…that was a candidate for a mudslide. Using google earth, its easy to look at the pictures and see the line in the forest where the earth has given way over the years. It looks like the geology of the area is mostly sand and it was mentioned it was glacier related. All this makes sense.

We also know that homeowner’s insurance tries to estimate the risk of a policy before its issued and its safe to assume that the policies either did not cover mudslides or catastrophes of this nature for exactly this reason.

All of this is good hind-sight. How do we do better?

Its pretty clear from the aerial photography that the land across the river was ripe for a slide. The think sandy line, the sparse vegetation and other visual aspects from google earth/maps shows that detail. Its a classic geological situation. I’ll also bet the lithography of the area is sand, alot of sand, and more sand possible on top of hard rock at the base.

So lets propose that bigdata should help give homeowners a risk assessment of their house which they can monitor over time and use to evaluate the potential devastation that could come from a future house purchase. Insurance costs alone should not prevent homeowners from assessing their risks. Even “alerts” from local government officials sometimes fall on deaf ears.

Here’s the setup:

  • Use google earth maps to interpret the images along rivers, lakes and ocean fronts
  • Use geological studies. Its little known that universities and the government have conducted extensive studies in most areas of the US and we could, in theory, make that information more accessible and usable
  • Use aerial photography analysis to evaluate vegetation density and surface features
  • Use land data to understand the terrain e.g. gradients and funnels
  • Align the data with fault lines, historical analysis of events and other factors.
  • Calculate risk scores for each home or identify homes in an area of heightened risk.

Do this and repeat monthly for every home in the US at risk and create a report for homeowners to read.

Now that would be bigdata in action!

This is a really hard problem to solve but if the bigdata “industry” wants to prove that its good at data fusion on a really hard problem that mixes an extremely complex and large amount of disparate data and has public benefit, this would be it.

Tempering our expectations for bigdata in healthcare

Expectations around bigdata’s impact on healthcare is leaping ahead of reality and some good thoughts are being expressed. However, healthcare has already had significant amounts of analytics applied to it. The issue is not that larger sets of data are critical, but that the sharing and integration of the data are the critical parts for better analysis. Bigdata does not necessarily solve these problems although the bigdata fever may help smash through these barriers. Over 15 Blues and most of the major nationals have already purchased data warehouse appliances and advanced systems to speed-up analysis, so its not necessarily performance or scalability that is constraining advances built on data-driven approaches. And just using unstructured text in analytics will not create a leapfrog in better outcomes from data.

We really need to think integration and access. More people performing analysis in clever ways will make a difference. And this means more people than just the few that can access healthcare detailed data: most of which is proprietary and will stay proprietary to companies that collect it. Privacy and other issues prevent widespread sharing of the granular data needed to truly perform analysis and get great results…its a journey.

This makes the PCORI announcements about yet another national data infrastructure (based on a distributed data model concept) and Obama’s directive to get more Medicare data into the world for innovation (see the 2013 Healthcare Datapooloza that just completed in Washington DC) that much more interesting. PCORI is really building a closed network of detailed data using a common data model and distributed analysis while CMS is being pushed to make datasets more available to entrepreneurs and innovators–a bit of the opposite in terms of “access.”

There are innovative ideas out there, in fact, there is no end to them. Bigdata is actually a set of fairly old ideas that are suddently becoming economic to implement. And there is serious lack of useful datasets that are widely available. The CMS datasets are often heavily massaged prior to release in order to conform to HIPAA rules e.g. you cannot provide detailed data at an individual level essentially despite what you think you are getting: just stripping off a name and address off a claim form is sufficient for satisfying HIPAA rules.

So its clear that to get great results, you probably have to follow the PCORI model, but then analysis is really restricted to just a few people who can access those datasets.

That’s not to say that if patients are willing to opt-in to programs that get their healthcare data out there, bigdata does not have alot to offer. Companies using bigdata technology on their proprietary datasets can make a difference and there are many useful ideas to economically go after using bigdata–many of which are fairly obvious and easy to prioritize. But there is not going to suddenly be a large community of people with new access to granular data that could be, and often is, the source of innovation. Let’s face it. Many healthcare companies have had advanced analytics and effectively no real budget constraints for many years and will continue to do so.  So the reason that analytics have not been created deployed more than today is unrelated to technology.

If bigdata hype can help executives get moving and actually innovate (its difficult for executives to innovate versus just react in healthcare) then that’s a good thing and getting momentum will most likely be the largest stimulus to innovation overall. That’s why change management is key when using analytics for healthcare.

Anti-Money Laundering (AML) and Combating Terrorist Funding (CTF) analytics review

In my last blog I reviewed some recent patents in the AML/CTF space. They describe what I consider some very rudimentary analytics workflows–fairly simple scoring and weighting using various a-priori measures. Why are such simple approaches patentable? To give you sense of why I would ask this question, there was a great trumpeting of news around the closing of a $6b money laundering operation at Liberty Reserve. But money laundering (including terrorism funding) is estimated at $500 billion to $1 trillion per year. That’s alot of badness that needs to be stopped. Hopefully smarter is better.

There are predictive analytical solutions to various parts of the AML problem and there is a movement away from rules-only systems (rules are here to stay however since policies must still be applied to predictive results). However, the use of predictive analytics  is slowed because AML analytics boils down to an unsupervised learning problem. Real-world test cases are hard to find (or create!) and the data is exceptional noisy and incomplete. The short message is that its a really hard problem to solve and sometimes simpler approaches just work easier than others. However, in this note, I’ll describe the issues a bit more and talk about where more advanced analytics have come into play. Oh and do not forget, on the other side of the law, criminals are actively and cleverly trying to hide their activity and they know how banks operate.

The use of algorithms for AML analytics is advancing. Since AML analytics can occur at two different levels, the network and the individual level, its pretty clear that graph theory and other techniques that operate on the data in various ways are applicable. AML Analytics is not simply about a prediction that a particular transaction, legal entity or gorup of LE’s are conducting money laundering operations.  Its best to view AML analytics as a collection of techniques from probabilistic matching to graph theory to predictive analytics combining together to identify suspicious transactions or LEs.

If the state AML analytics is relatively maturing, what is the current state? Rather simple actually. Previous systems, including home grown systems, focused on the case management and reporting aspects (that’s reporting as in reporting on the data to help an analyst analyze some flows as well as regulatory reporting). AML Analytics was also typically based on sampling!

Today, bigdata can help to avoid sampling issues. But current investments are focused around the data management aspects because poor data management capabilities have greatly exacerbated the cost of implementing AML solutions. FS institutions desperately need to reduce these costs and comply with what will be an ever-changing area of regulation. “First things first” seems to be the general thrust around AML investments.

Since AML analysis will be based on Legal Entities (people and companies) as well as products, its pretty clear that the unique identification of LEs and the hierarchy/taxonomies/classifications of financial instruments is an important data management capability. Results from AML Analytics can be greatly reduced if the core data is noisy. When you combine the noisy data problem with today’s reality of highly siloed data systems inside of Banks and FS institutions, the scope of trying to implement AML Analytics is quite daunting. Of course, start simple and grow it.

I mentioned above that there are not alot of identifiable cases for training algorithms. While it is possible to flag some transactions and confirm them, companies must report Suspicious Activity Reports (SAR) to the government. Unfortunately, the government does not provide a list of “identified” data back. So it is difficult to formulate a solution using supervised learning approaches. That’s why it is also important to attack the problem from multiple analytical approaches–no one method dominates and you need multiple angles of attack to help tune your false positive rates and manage your workload.

When we look at the underlying data, its important to look at not only the data but also the business rules currently (or proposed) in use. The business rules will help identify how the data is to be used per the policies set by the Compliance Officer. The rules also help orient you on the objectives of the AML program at a specific institution. Since not all institutions transact all types of financial products, the “objectives” of an AML system can be very different. Since the objectives are different, the set of analytics used are also different. For example, smaller companies may wish to use highly iterative what-if scenario analysis to refine the policies/false positive rates by adjusting parameters and thresholds (which feels very univariate). Larger banks need more sophisticated analysis based on more advanced techniques (very multi-variate).

We’ve mentioned rules (a-priori knowledge, etc.) and predictive/data mining models (of all kinds since you can test deviations from peer groups using data mining methods, and predicted versus actual patterns etc.) and graph theory (link analysis). We’ve also mentioned master data management for LEs (don’t forget identity theft) and products as well taxonomies, classifications and ontologies. But we also cannot forget time series analysis for analyzing sequential events. That’s a good bag of data mining tricks to draw from. The list is much longer. I am often reminded of a really great statistics paper called Bump Hunting in High Dimensional Data by Jerome Friedman and Nick Fisher because that’s conceptually what we are really doing. Naturally, criminals wish to hide their bumps and make their transactions look like normal data.

On the data site, we have mentioned a variety of data types. The list below is a good first cut but you also need to recognize that synthesizing data, such as from aggregations (both time based aggregations and LE based aggregations such as transaction->account->person LE->group LE), are also important for the types of analytics mentioned above:

  • LE data (Know Your Customer – KYC)
  • General Ledger
  • Detailed Transaction data
  • Product Data
  • External sources: watch lists, passport lists, identity lists
  • Supplemental: Reference data, classifications, hierarchies, etc.

Clearly, since there are regulatory requirements around SAR (suspicious activity), CTF (currency transactions) and KYC, it is important that the data quality enhancements first focus on those areas.

Anti-Money Laundering patent review

I was recently reviewing some anti-money laundering (AML) patents to see if any had been published recently (published does not mean granted).

Here’s a few links to some patents, some granted some applied for:

All of the patents describe a general purpose system of calculating a risk score. The risk score is based on several factors.

In AML, the key data include:

  • A legal entity (name, location, type)
  • A “location” (typically country) that determines the set of rules and “data lists”  to be applied. This could be the LE’s country or it could be the financial instrument’s country but generally this embodies a jurisdiction area that applies to the AML effort. A “data list” from a country or location is the list of legal entities that are being watched or have been determined to engage in AML operations. So we have a mix of suspected and validated data.
  • A financial instrument / product and its set of attributes such as transactions, amounts, etc.
  • A jurisdiction: the risk assessor’s set of rules. Typically these are rules created by a company or a line of business. These rules help identify an event and should be relatively consistent across an entire enterprise but also vary based on the set of locations where a company may operate. A bank’s Compliance Officer is especially concerned about this area as it also contains policies. The policies represent who needs to do what in which situation.

I have not tried to capture the nature of time in the above list since all of these components can change over time. Likewise, I did not try to capture all of the functions a AML system must perform such as regulatory reporting. We have also ignored whether all of these components are used in batch or real-time to perform a function. Or whether rules engines and workflow are powering some incredibly wonderful AML “cockpit” for an AML analyst at a company.

We assume that the ultimate goal of a AML system is to identify LE’s potentially engaging in AML activities. I write “potentially” because you need to report “suspicious” activities to the Financial Crimes Enforcement Network (FinCEN). We can never know for certain whether all of the data is accurate or that an individual transaction is actually fraudulent. We can, however, use rules, either a-priori or predictive, to identify potential AML events.

The patents describe a method of combining information, using a “computer system” to calculate a AML risk score. The higher the score, the more probable that an LE-FinancialProduct is being used for money laundering. Inherently, this is probabilistic. It’s also no different than any other risk scoring system. You have a bunch of inputs, there is formula or a predictive model, there is an output score. If something scores above a threshold, you do take action, such as report it to the government. Just as a note, there are also strict guidelines about what needs to be reported to the government as well as areas where there is latitude.

The trick in such a system is to minimize false positives–LE-FinancialProduct combinations  identified as money laundering but in reality are not. False positives waste time. So the system tries to create the best possible discrimination.

So now look at the patents using the background I just laid out. They are fairly broad, they described this basic analysis workflow. It’s the same workflow, using the same concepts as credit scoring for FICA scores, or credit scoring for many types of loans, or marketing scoring for lifetime value or next logical product purchasing. In other words, the approach is the same. Okay, these are like many existing patents out there. My reaction is the same: I am incredulous that general patents are issued like they are.

If you look past whether patents are being granted for general concepts, I think it is useful to note that many of these came out around 2005-2006 or so which is a few years after many regulations changed with the Patriot Act and other changes in financial regulations.

So the key thought is yes, patents are being submitted in this area but I think the relatively low number of patent applications in this area reflects that the general workflow is, well, pretty general. Alright, the 2011 patent has some cool “graph/link analysis” but that type of analysis is also a bit 1980s.

Note: I selected a few data concepts from the real-time AML risk scoring patent to give you a feel for the type of data used in AML around the transaction:

  • transaction amount,
  • source of funds such as bank or credit cards,
  • channel used for loading funds such as POS or ATM,
  • velocity such as count and amount sent in the past x days,
  • location information such as number of pre-paid cards purchased from the same zip code, same country, same IP address within x hours,
  • external data sources (.e.g. Interpol List) or internal data source

Opportunities for BigData and Heathcare: Need a little change management here

What are the bigdata opportunities in healthcare? Today, BigData techniques are already employed by startups because BigData technology today can be very cost effectively used to perform analytics  and gives startups an edge on the cost and capabilities front.

Big what are the opportunities in heatlhcare for established companies? I’ll offer the thought that it can be broken into two main categories. The categories reflect the fact that there are in-place data assets that will be in place for quite awhile. Its very difficult to move an entire infrastructure to a new technology base overnight. It is true that if some semblance of modern architecture (messaging, interfaces for data access) is in place today, the movement can be much faster because the underlying implementation can be changed without changing downstream applications.

The two categories are:

  • Move targeted, structured analytical workflows to BigData.
  • Enable new analytical capabilities that were previously not viable.

The first category speaks to the area of BigData that can make a substantial ROI appear fairly quickly. There are many well-undestood workflows today inside healthcare Payers, for example, that simply run too slow, are not robust or are unable to handle the volume. Purchasing another large, hardware based appliance is not the answer. But scaling out to cloudscale (yes using a public cloud for a Payer is considered leading edge but easy to do with the proper security in place) allows a Payer to use BigData technology cheaply. Targeted workflows, that are well understood but underperforming can be moved over to BigData technology. The benefits are substantial ROI for infrastructure and cost avoidance for future updates. The positive ROI that comes from these projects indicates that the transition pays for itself. It can actually occur quite quickly.

The second opportunity is around new analytical capabilities. Today, Payers and others cannot simple perform certain types of analytics easily because of limitations in the information management environments. These areas offer, assuming the business issue being addressed suggests it, substantial cost savings opportunities on the care side. New ways of disease management, outcomes research and network performance management can make substantial returns in under 2 years (it takes a year to cycle through provider network contracts and ensure the new analytics has a change to change the business process). Its these new capabilities that are most exciting.

The largest impediment to these areas of opportunity will be change management. Changing the way analytics are performed is difficult. Today, SAS is used more for data management than statistical analysis and is the defacto standard for the analytical environment. SAS offers grid and other types of larger data processing solutions. To use BigData, plans will have to embrace immature technology and the talent that must be hired to deploy it. But the cost curve could be substantially below that of scaling current environments–again paying for itself fairly quickly. Management and groups used to a certain analytical methodology (e.g. cost allocations) will have to become comfortable seeing that methodology implemented differently. Payers may seek to outsource BigData analytics tools and technologies but the real benefit will be obtained by retaining talent in-house over the long run even if some part of the work is outsourced. Because analytics is a core competency and Payers need to, in my opinion, retain some core versus just becoming a virtual shell, BigData needs to be an in-house capability.

Graph databases, metadata management and social networks

I was speaking to a friend the other day and they mentioned they were working on some metadata analysis. He had built a MS Access database to import the metadata. He found the going quick tricky as the analysis they were performing is called “data lineage” and they were having difficulty. He also wanted to analyze mappings between fields in the database as well as mappings between lists of values (a list of value is like the set of values you see in a dropdown box on a user interface). All of this seemed like social networking to me.

The way to think about is that (and I could use John Seely Brown’s Social Life of Information book to back me up here) the data lineage problem is just like a social network. You want to track something from its start to the next hop. The “friend” in this case, is the place where the data is transported to another system. Hence a “friend” of a piece of metadata must be another metadata item in another system or database table. Data lineage was nothing more than social networking. To me, data lineage would probably generate much simpler networks but I would guess that there are alot of grey areas about figuring out all the places that data is moved to or converted to along the way–that’s probably what makes it a much harder problem.

Naturally, not knowing whether it was possible or not I mentioned how graph databases could capture most of this data fairly easily and you could run very sophisticated queries. I had not really deeply thought about it but I had been reading up on graphs and probability & statistics, etc. So it seemed reasonable to me.

Of course, just doing a simple import of metadata into MS Access is fairly straightforward . You define some tables that capture a “table” concept and it has a bunch of relationships to “fields.” This can be modeled in RDBMS using foreign keys and such. But as you normalize out the other concepts, such as categories of tables, or try to describe different types of tables, such as views or other RDBM’ish structures, the MS Access approach starts getting a bit daunting.

But my friend wanted to deeply analyze the data and have something that could scale to much harder metadata problems. So I dipped into a neo4j manual and read some blogs. I then I ran across alot of blogs that described classification through taxonomies and ontologies and other types of very abstract ways of describing data. This became complicated very quickly and I realized that I wanted to try and do something small but not necessarily simple. I would need a graph model that was highly compact and could change as requirements changed (my friend said metadata requirements change all the time). And I would sacrifice the ease of a dedicated but highly rigid model for one that was general. I was essentially shifting complexity from the model itself to the processing layer that would sit above the model. But that’s fine if it came me something that exceptional room to grow.

So after reading the manual, the blogs and thinking about it for another hour. I realized that I could do most of what he wanted using a few very simple concepts:

  • A DataItem is a description of a data element or a value in a list of values. A DataItem could be part of multiple categories. We will call these categories DataItemSets.
  • A DataItemSet is  collection of DataItems. The sets could have a taxonomy (categories of categories) so that a set could be part of another set. I could not imagine sets of sets of sets, but it seemed that a friend could be a friend of a friend so a set could be a parent of a set.
  • DataItemRelationship will connect a set of “From” DataItems to a set of “To” DataItems. The From and To could be 1 to 1 but we wanted to keep it general. These are the edges of “LIKES” or “KNOWS” in the social network.
  • DataItemRelationshipSet will be the taxonomy for the relationships just like a DataItemSet. Unlike many social networks, you may need to classify a relationship with more information than just “LIKES.” Facebook gives you “likes” but a “like” is fairly general, you do not know how strong that like is for any given pair of nodes. So by having a taxonomy for the relationships, we can have categories of categories or whatever you want to more fully describe the relationship.

That’s it. Just 4 main graph node concepts. We will also need to label our nodes with the concept that it represents and to ensure that it has the right set of properties. So a small amount of “infrastructure” is needed to do this labeling and match a label to a set of properties that should be available on that node. For example, a DataItem that represents metadata will have different properties than a DataItem that represents a value in a list of values.

I thought that with these simple concepts we could construct everything that was needed. Since metadata is just data and list of values are just data, it seemed to me that the graph just conceptually holds data and we can treat both the same in the graph albeit with different node properties.

I’ll give it a whirl and report back by trying a very small experiment to see if this design is totally impractical to implement or if it really shines. I’ll also try to hook it up to cytoscape for visualization. However, its clear, just like with MS Access, if you want a solution quickly, just go buy a Global ID-type product.

The past 10 years of data warehousing has been all wrong…

This is an “ideas” post. One where I am trying to work out some ideas that have been bouncing around my head since this morning.

Essentially, the past 10 years of data warehousing have been wrong. Wrong in the sense that the area of data warehousing has not adapted to newer technologies that would solve fundamental data warehousing issues and increase the opportunity for success. Data warehousing projects come in all shapes an sizes. The smaller they are and the more focused the problem they are trying to solve, typically the more successful they are. This is because data warehousing has many non-technical issues that cause it to fail including issues such as : failure of the business to listen & communicate and the failure of IT to listen & communicate, requirements change as fast as the business changes (and that’s pretty fast for many areas such as sales and marketing) as well as sponsorship, sustainable funding and short-term commitment mentalities. Many of these factors are mitigated by smaller projects.

Hence, the application of “lean” and “agile” methodologies to data warehousing. These approaches are really a learning by doing model where you do a small amount of work, receive feedback, do a small amount of work, receive feedback, do a small amount of work….and so forth. Many tiny cycles with feedback help promote alignment between the data warehouse (and its current iteration) and what the business wants or thinks it wants. These approaches have helped but at the trade-off that its difficult to implement very large scale projects across different location models where developers are spread out around the world. So its helped, but large, complex projects must still be conducted and its clear coordinating a large team is just really hard.

Data warehousing technology has not substantially helped solve these problems. Today, larger databases that run very fast are available, but they are built using the old approach e.g. data models, ETL, etc. So those components just run faster. That helps of course because there is less time spent trying to optimize everything and therefore more time spent on other tasks, such as working with the business. But the current use of technology is not really solving lifecycle issues, it actually makes it worse. You have data modeling teams, ETL teams, architect teams, analyst teams–all of which have to piece together their components and have something large work. It is like building a rocket ship without large government funding.

BigData has stepped in and has made available other tools. But they are often applied and targeted at a very specific workflow–a  specific type of analysis–that can be programmed into what are generally fairly immature tools. So BigData is helping because it helps loosen up an architect’s thinking around how to put together solutions as well as employ non-traditional technologies.

So what would help? Let’s consider a world where compute power for each user is effectively infinite. We are not saying its free, just that its relatively easy to get enough compute power to solve specific types of problems. Lets also assume that the non-technical issues will not change its an invariant in this scenario. And lets assume we want to use some elements of technology to address non-technical issues.

In this scenario, we really need a solution that has a few parts to it.

  • We need better tools & technologies that allow us to deliver solutions but deliver solutions under a rapid pace with significantly more updates than even today’s technologies. Lets assume that the word “update” means both the data updates frequently as well as the structure of the data changes frequently.
  • We need to be able to use one environment so that people creating the solutions do not have change the toolset and make diverse toolsets work together. This is one of the reasons why SAS is so popular–you can stay in one toolset.
  • We also need technologies that allow a lifecycle process to work with small teams who combine their solution components together more easily and whenever they are ready–versus when large, team milestones say that components have to be integrated.
  • We need to support processes that span the globe with people who contribute both technical and domain knowledge. We want to support decoupling the teams.

Let’s imagine a solution then. Let’s assume that every piece of data coming into our technology solution is tagged. This one value is tagged as being part of a healthcare claim and represent a diagnosis code. You tag it as being a diagnosis code, as being part of a claim, as being a number, etc. You can describe that relationship. Let’s tag all the data this way. Essentially, you are expanding the context of the data. Now lets assume that we can establish these tags and hence, relationships, between all the data elements and lets also assume that we have a tool that can change these relationships dynamically so that we can create new relationships (pathways) between the data. Of course, ETL conceptually does not go away, but lets assume that ETL becomes more of a process operating at different scales, the data element level, the relationships level, the aggregate of data tag level, etc.

Now, because we have infinite computing resources, we can start assembling the data the way we would like. If we are technologist, perhaps we assemble the way that is helpful for putting together a production report. If we are an analyst, we might assemble it in a way that helps us determine if an outcome measurement improved based on an intervention (which has its own set of tags). When we assemble, we actual describe how data is grouped together to form a hierarchy of concepts. A DX code is a field that belongs to a claim, or a field that belongs to clinical indicators. Indicators are related to procedures through a probabilistic relationship based on past-seen relationships or programmed relationships.

Given that we can assemble and reassemble, let’s also imagine that at any time we can copy all of the data and all the tags. We can go to a master area and just say, I would like to copy it so I can fiddle with my tags and if everyone says they like my tags, I may contribute them back to the master. And lets assume that if the master dataset is updated with recent data, I can just merge those data into my working set. Essentially, we have checked out the entire dataset, track our changes to it, update it with other changes from other people and can check our changes back in for other to use–very much like a data change management solution. As the tags evolve, other people can assemble and reassemble the data in new ways.

So one solution to help fix data warehousing is to employ BigData technology but in a way that allows us to assemble and analyze the data they each individual wants to. And when that individual creates something useful, to share it with others so they can use it. The NoSQL database conceptually give us this capability especially when the data is represented by something as simple as a key+value. Source code control systems like “git” (large scale, distributed management system) give us a model to shoot for but at the data warehouse level and the current crop of ETL programs inform us of the types of changes that need to be made to be data to improve the quality for use.

Many of the ingredients exist today we just need the innovation to happen.