data lakes: old is new and no free lunch, rinse and repeat

I recently watched a few videos from the dremio sponsored data lake conference: https://www.dremio.com/press-releases/introducing-subsurface-the-cloud-data-lake-conference/.

It’s a good collection of videos about a relatively new topic, data lakes. Data lakes are an architectural focal point for data management.

Some people think data lakes are new, especially vendors selling data lake tools and consulting. The new hotness is “separating compute and storage,” although that has been going on for nearly four decades. And despite the buzz, data lakes are reportedly hard to deliver ROI on. There are many reasons this may be true. We should step back and look at data lakes. Data lakes are nothing new, but their implementations have changed.

Let’s start with a bit of history, around the late 80s and early 90s, when data warehouses roamed the earth. Data warehouses were hot until they weren’t.

Data warehouses were a universal answer to a variety of data management and organizational problems. Today, most people love to make the data warehouse the bogeyman. Data warehouse projects became widow-makers for IT managers. It was always unfair to ask IT managers to smooth over differences in priorities, delivery speeds, and data/analytical needs across divisions. Although my point of view is not widespread, after many years helping companies with their analytics, it’s clear that IT is the wrong place to produce a wide range of analytical products consumed by a wide range of users. Budgets for analytics should be borne by those who need the analytics. A few “data products” can be consolidated for cost efficiency into a shared service group like IT. Where there is a common need or a cost-control mandate, sure, IT may be an OK place to do these things, but in general, it is not and never will be. That’s just the way business works.

At least in my world, a data warehouse’s inputs and outputs were almost always provided to different data consumers–the data warehouse itself was not the only physical data asset. But this approach and point of view was not the standard design approach. Data warehouses became hard-to-use silos almost *by definition*. One client hired me to find out why a data warehouse had no users. The primary user said the IT group had turned off his access, and the warehouse did not have the data he needed. Case closed! Many IT managers wanted to control these files to control “one version of the truth,” but it is not efficient to force IT to own these business issues. You do need one particular place to go for a business measure, but it is not necessarily IT’s job to own and publish it.

By providing inputs and outputs from a data warehouse, a data warehouse became a “cache” of pre-computed values. Whether it was a database table, a cube, or another proprietary data structure, there was always a cache. It is usually too expensive to recompute a result from raw source data every time. Storage and compute may be cheap, but they are not free. Caching is not a technical issue; think economics. The caches are more convenient and less costly to access. Even in a cloud environment, there is a cost to recompute from the raw data. To build a cache, you have to specify what you want before you need it. Even with automatic caching, you need to be thoughtful. And in the cloud, incremental work is often not capitalizable.
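To make the cache economics concrete, here is a minimal Python sketch with made-up sales data. It shows the trade every cache makes: pay to scan the raw data on the first request, then serve repeat requests cheaply from the stored result.

```python
import functools

# Toy "raw source data": a list of (product, amount) sale records.
RAW_SALES = [("widget", 10.0), ("gadget", 25.0), ("widget", 5.0)] * 1000

@functools.lru_cache(maxsize=None)
def total_sales(product: str) -> float:
    """Recompute from raw data on a cache miss; serve from cache afterwards."""
    # Simulate the cost of recomputing: a full scan of the raw records.
    return sum(amount for name, amount in RAW_SALES if name == product)

# The first call scans 3,000 records; repeat calls are dictionary lookups.
print(total_sales("widget"))  # 15000.0
```

The same trade-off applies whether the “cache” is a warehouse table, a cube, or a cloud materialized view: you decide up front which results are worth precomputing.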

Data virtualization, mostly on-premise, came later, in the late ’90s and early 2000s. You could combine data from any source (raw source data, data warehouses, downstream extracts, Excel files on your desktop) and query it without preparing the data beforehand. Of course, to get anything useful, you would have to reproduce many of the same business data processing steps you need regardless of your data management approach. In some scenarios, this was a huge step forward. The pharmaceutical industry, with vast amounts of unintegrated data and complex formats such as those found in clinical trials, really benefits from this approach. Interestingly enough, to get good and fast results, data virtualization tools always had a giant cache in the middle, along with a query planner and execution engine.

Enter the cloud and data lakes.

A data lake is a set of inputs and outputs. For some, it is a cache of intermediate computations. For others, it is a source of raw information. It usually has data in several formats for tool convenience. It often has some metadata, lineage, and other “management” features to help navigate and understand what is available. Typically, a wide variety of tools are available that work with a number of data formats, although not an infinite number. When these types of features are essential to your users, a data lake makes sense.

Today’s data lake companies are trying to convince you that data warehouses are evil. In many ways, I agree with them, because most data warehouses were designed wrong. However, the thinking and effort that go into a data warehouse never really go away. Even in a cloud environment, you still pretty much have to do the same things, because you are still building a “thing with a cache in the middle.” At some point, you have to specify what you want to do to the data to make it ready for use. Business intent and processing are inevitable. There is no free lunch.

Fortunately, newer tools, like dremio’s and those from AWS, Azure, and many others, make this more accessible than before. Most modern tools recognize that there are many formats, access patterns, and data access needs–one size does not fit all. This point of view alone makes these tools better than the traditional “single ETL tool” and “single DW database” approach of the prior decade.

Data lake companies provide tools and patterns that *are* more useful in a highly complex and distributed environment (organizationally and technically).

Look at dremio.

Dremio has a great product. I like it. It is cast as a data lake engine because data lakes are still kind of hot in the market. It is really a data virtualization tool well suited to a cloud environment, highly useful when you want to provide access to data in a wide variety of formats through a wide variety of access technologies and tools. Yes, there is a finite list of “connectors.” At least part of dremio, such as Apache Arrow and Arrow Flight, is open-source, so you can add your own.

dremio has to implement patterns that have been used for decades, even if dremio describes them differently. To make it fast enough and to lower costs, it has a cache in the middle, although it is optional. It has a C++ core instead of something less efficient, and it targets zero-copy transfers through the networking and application stack. It uses code generation to push computation to different locations.
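Zero-copy is easier to picture with a small illustration. This is not dremio or Arrow code, just Python’s built-in `memoryview` showing the core idea: a consumer reads the producer’s buffer directly instead of receiving a copy of the bytes.

```python
# Zero-copy in miniature: a memoryview slice shares the underlying
# buffer instead of copying it -- the same idea Arrow applies across
# processes and the network at much larger scale.
buf = bytearray(b"columnar data" * 1000)
view = memoryview(buf)[0:8]      # no bytes are copied here
buf[0:8] = b"COLUMNAR"           # mutate the original buffer in place
print(view.tobytes())            # the view sees the change: b'COLUMNAR'
```

Avoiding the copy matters at scale: serializing and deserializing data between every tool in a stack often costs more than the computation itself.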

Many, if not most, of these features were implemented four decades ago for MPP systems and were present in the Ab Initio and Torrent data processing products, if anyone remembers them. Columnar databases with compression were available three decades ago–I used them. Separate compute and storage, break apart the RDBMS into pieces, and retarget them. Check! To be fair, not everyone is claiming these are completely new concepts that have never been done before.

However, newer products like dremio’s are better than yesterday’s tools. Their mindset and development approach are entirely different. Sure, they are not doing anything new architecturally, but that makes them easy to figure out and use. Under the hood, they must build out the same building blocks needed to process data as any product–you cannot escape gravity. But they are doing new things design-wise, and they are making better products. Recognizing these basic ideas should help large enterprises adopt and integrate products like dremio’s.

The sins of data warehousing and proprietary tools, in general, are many. Proprietary tools probably still make more money daily than open-source tools, even though open-source tools may have higher valuations. Perhaps this reflects open-source tools’ ability to be used by more companies in the long run. For the moment, open-source tools are cheaper, and there are more product choices.

In the long run, no market can sustain a large number of products, so when the Fed finally stops supporting companies and capitalism returns, you may see a shrinking of funds around open-source data management tools.

All is not perfect, but it is better than before. Data lakes can be useful because they were useful 20 years ago, when they existed at companies under different names, e.g., the “input layer” or the “extract layer.” Insurance companies loved the extract layer because their source systems were many and complex, and if you could find the right extract, life was easier. I’m hoping tools like dremio’s get situated and last in the long run, because they are better.

Companies are building non-open parts of their products to monetize them. They still need income. Like the tools they displaced, these newer tools will be displaced by others unless they get embedded deeply enough at a client or another software company and create a sustainable source of income. Look at Palantir, for example. They have a little open-source, but their core product is behind the firewall. Many of these companies use open-source as a cover for coolness, but their intent is monetized proprietary software. I’m not against that, but we should recognize the situation so we are smarter about our decisions about what to use.

The cycle will continue. Rinse and repeat.

yes, yet another bigdata summary post…now it’s a party

Since I am a “recovering” data scientist, I thought that once in a while it would be good to deviate from my more management-consulting-oriented articles and eyeball the bigdata landscape to see if something interesting has happened.

What!?! It seems like you cannot read an article without encountering yet another treatise on bigdata or at the very least, descriptions of the “internet of things.”

That’s true, but if you look under the hood, the most important benefits of the bigdata revolution have really been on two fronts. First, recent bigdata technologies have decreased the cost of analytics, which makes analytics more easily available to smaller companies. Second, the bigdata bandwagon has increased awareness that analytics are needed to run the business. Large companies could long afford the investments in analytics, which made corporate size an important competitive attribute. The benefits from analytics should not lead to a blanket, unthoughtful endorsement of analytics. Not every business process, product, or channel needs overwhelming analytics. You want, however, analytics to be part of the standard toolkit for managing the value chain and decision making.

The ability to process large amounts of data, beyond what mainframes could do, has been with us for twenty to thirty years. The algorithms developed decades ago are similar to the algorithms and processing schemes pushed in the bigdata world today. Teradata helped create the MPP database and SQL world. Ab Initio (still available) and Torrent (whose Orchestrate product was eventually sold to IBM) defined the pipeline-parallel and data-parallel data processing toolchain world. Many of the engineers at these two ETL companies came from Thinking Machines. The MPI API defined parallel processing for the scientific world (and before that PVM, and before that…).

All of these technologies were available decades ago. Mapreduce is really the old Lisp concept of map and fold, which was available in parallel form from Thinking Machines even earlier. Today’s tools build on the paradigms that these companies created in the first pass of commercialization. As you would expect, those companies built on what had occurred before them. For example, parallel filesystems have been around for a long time and were present on day one in the processing tools mentioned above.
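The map-and-fold lineage is easy to see in plain Python, no cluster required. A toy keyed count, the classic mapreduce example, is just `map` plus `functools.reduce`:

```python
from functools import reduce

words = ["map", "and", "fold", "are", "old", "ideas"]

# "Map" phase: emit (key, 1) pairs -- embarrassingly parallel.
pairs = map(lambda w: (w[0], 1), words)   # key each word by its first letter

# "Fold"/reduce phase: combine all pairs that share a key.
def fold(acc, pair):
    key, count = pair
    acc[key] = acc.get(key, 0) + count
    return acc

counts = reduce(fold, pairs, {})
print(counts)  # {'m': 1, 'a': 2, 'f': 1, 'o': 1, 'i': 1}
```

A mapreduce framework adds the distribution: the map runs on many nodes at once, and a shuffle routes pairs with the same key to the same reducer. The functional skeleton is unchanged.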

Now that the hype around mapreduce is declining and its limitations are finally becoming widely understood, people recognize that mapreduce is just one of several parallel processing approaches. Free from mapreduce-like thinking, bigdata toolchains can finally get down to business. The bigdata toolchains have realized that SQL query expressions are a good way to express computations, and SQL query capabilities are now solidly available in most bigdata environments. Technically, many of the bigdata tools provide the infrastructure to build the equivalent of SQL commands “manually”: they provide the parsing, planning, and distribution of queries to independent processing nodes.
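As a single-node stand-in for those distributed SQL layers, the stdlib `sqlite3` module shows the division of labor: you declare the computation, and an engine parses, plans, and executes it. The distributed tools do the same thing, just with the execution fanned out across nodes.

```python
import sqlite3

# An in-memory SQL engine standing in for a distributed query planner.
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE events (user TEXT, amount REAL)")
con.executemany("INSERT INTO events VALUES (?, ?)",
                [("a", 10.0), ("b", 5.0), ("a", 2.5)])

# The computation is expressed declaratively; the engine handles
# parsing, planning, and execution -- the same split of responsibility
# that bigdata SQL layers provide over many processing nodes.
rows = con.execute(
    "SELECT user, SUM(amount) FROM events GROUP BY user ORDER BY user"
).fetchall()
print(rows)  # [('a', 12.5), ('b', 5.0)]
```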

I consider the current bigdata “spin” that started about 1-2 years ago healthy because it increased the value of other processing schemes such as streaming, real-time query interaction, and graphs. To accommodate these processing approaches, the bigdata toolchains have changed significantly. Think SIMD, MIMD, SPMD, and all the different variations.

I think the framework developers have realized that these other processing approaches require a general-purpose parallel execution engine–an engine that Ab Initio and others have had for decades. You need to be able to execute programs using a variety of processing algorithms, where you think of the “nodes” as running different types of computations and not just a single mapreduce job. You need general-purpose pipeline and data parallelism.
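A rough sketch of the two kinds of parallelism in plain Python, with threads standing in for cluster nodes. This is illustrative only; real engines schedule these stages across machines.

```python
from concurrent.futures import ThreadPoolExecutor

records = list(range(10))

# Pipeline parallelism: each stage consumes records as soon as the
# previous stage yields them (here, cooperatively via generators).
def parse(stream):
    for r in stream:
        yield r * 2

def enrich(stream):
    for r in stream:
        yield r + 1

# Data parallelism: the same combined stage applied to partitions
# of the data concurrently.
def stage(partition):
    return [r * 2 + 1 for r in partition]

with ThreadPoolExecutor(max_workers=2) as pool:
    parts = [records[:5], records[5:]]
    parallel = [r for part in pool.map(stage, parts) for r in part]

pipelined = list(enrich(parse(records)))
print(pipelined == parallel)  # True -- same computation, two schedules
```

A general-purpose engine lets you mix both freely in one job graph instead of forcing everything into a single map-then-reduce shape.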

We see this in the following open-source’ish projects:

  • Hadoop now has a real resource and job management subsystem that is a more general parallel job scheduling tool. It is now useful for more general parallel programming.
  • Apache Tez helps you build general jobs (for hadoop).
  • Apache Flink builds pipeline- and data-parallel jobs. It’s also a general-purpose engine, e.g. for streaming.
  • Apache Spark builds pipeline- and data-parallel jobs. It’s also a general-purpose engine, e.g. for streaming.
  • Apache Cascading/Scalding builds pipeline and data parallel jobs, etc.
  • DataTorrent: streaming and more.
  • Storm: Streaming
  • Kafka: Messaging (with persistency)
  • Scrunch: Based on apache crunch, builds processing pipelines
  • …many of the above available as PaaS on AWS or Azure…

I skipped many others, of course. I am completely skipping some of the early SQL-ish systems such as Hive, and I have skipped visualization, which I’ll hit in another article. Some of these have been around for a few years in various stages of maturity. Most of them implement pipeline parallelism and data parallelism for creating general processing graphs, and some provide SQL support where that processing approach makes sense.

In addition to the underlying engines, what’s new? I think one very important element: usability. The tools are a heck of a lot easier to use now. Here’s why.

What made the early-stage (20-30 years ago) parallel processing tools easier to use was that their builders recognized, from their experience in the parallel world, that usability for programmers was key. While it is actually fairly easy to find inexpensive scientific and programming talent, programming parallel systems has always been hard. It needs to be easier.

New languages are always being created to help make parallel programming easier. Long ago, HPF and C*, among many others, were commercial variations of the same idea. Programmers today want to stay within their toolchains because switching toolchains to run a data workflow is hard work and time-consuming. Many of today’s bigdata tools allow multiple languages to be used: Java, Python, R, Scala, javascript, and more. The raw mapreduce system was very difficult to program, so user-facing interfaces were provided, for example, Cascading. Usability is one of the reasons that SAS is so important to the industry. It is also why Microsoft Research’s Dryad project was popular. Despite SAS’s quirks, it’s a lot easier to use than many other environments, and it’s more accessible to the users who need to create the analytics.

In the original toolsets from the vendors mentioned earlier in this article, you would program in C++ or a special-purpose data management language. That worked fine for companies that could afford talent that could master that model. In contrast, today you can use languages like Python or Scala to run the workflows and use the language itself to express the computations. The language is expressive enough that you are not treating the programming environment as a “library” that you make calls into; the language constructs are translated into parallel constructs transparently. The newer languages, like the Lisp of yore, are more functionally oriented, and functional programming languages come with a variety of capabilities that make this possible. This was the prize that HPF and C* were trying to win. Specialized languages are still being developed that help specify parallelism and data locality without being embedded in other modern languages, and they too can make it easier to use the new bigdata capabilities.

The runtimes of these embedded parallel capabilities are still fairly immature in a variety of ways. Still, using embedded expressions, data scientists can use familiar toolchains, languages, and other components to create their analytical workflows more easily. Since the new runtimes allow more than just mapreduce, streaming, machine learning, and other data mining approaches suddenly become much more accessible at large scale, in more ways than just using tools like R.

This is actually extremely important. Today’s compute infrastructure should not be built with rigid assumptions about tools; it should be “floatable” to new environments where the pace of innovation is strong. New execution engines are being deployed at a fantastic rate, and you want to be able to use them to obtain processing advantages. You can only do that if you are using well-known tools and technologies and if you have engineered your data (through data governance) to be portable to these environments, which often live in the cloud. It is through this approach that you can obtain flexibility.

I won’t provide any examples here, but look up the web pages for Storm and Flink for examples. Since SQL-like query engines are now available in these environments, that also contributes to the user-friendliness.

Three critical elements are now in play: cost effectiveness, usability, and generality.

Now it’s a party.

Oso Mudslides and BigData

There was much ado in the news recently about google’s bad bigdata flu forecasts. google had tried to forecast flu rates in the US based on search data. That’s a hard thing to forecast well, but doing better would have public benefits, giving public officials and others information to identify pro-active actions.

Let’s also think about other places where bigdata, in a non-corporate, non-figure-out-what-customers-will-buy-next way, could help.

Let’s think about Oso, Washington (see the Oso landslide area on google maps).

Given my background in geophysics (and a bit of geology), you can look at Oso, Washington and think…yeah…that was a candidate for a mudslide. Using google earth, it’s easy to look at the pictures and see the line in the forest where the earth has given way over the years. It looks like the geology of the area is mostly sand, and it was reportedly glacier-related. All this makes sense.

We also know that homeowner’s insurance tries to estimate the risk of a policy before it’s issued, and it’s safe to assume that the policies either did not cover mudslides or catastrophes of this nature for exactly this reason.

All of this is good hind-sight. How do we do better?

It’s pretty clear from the aerial photography that the land across the river was ripe for a slide. The thick sandy line, the sparse vegetation, and other visual aspects in google earth/maps show that detail. It’s a classic geological situation. I’ll also bet the lithology of the area is sand, a lot of sand, and more sand, possibly on top of hard rock at the base.

So let’s propose that bigdata should help give homeowners a risk assessment of their house, which they can monitor over time and use to evaluate the potential devastation that could come from a future house purchase. Insurance costs alone should not prevent homeowners from assessing their risks. Even “alerts” from local government officials sometimes fall on deaf ears.

Here’s the setup:

  • Use google earth maps to interpret the images along rivers, lakes and ocean fronts
  • Use geological studies. It’s little known that universities and the government have conducted extensive studies in most areas of the US, and we could, in theory, make that information more accessible and usable
  • Use aerial photography analysis to evaluate vegetation density and surface features
  • Use land data to understand the terrain e.g. gradients and funnels
  • Align the data with fault lines, historical analysis of events and other factors.
  • Calculate risk scores for each home or identify homes in an area of heightened risk.

Do this and repeat monthly for every home in the US at risk and create a report for homeowners to read.
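As a purely hypothetical illustration of the final scoring step, the feature names and weights below are invented, not from any real landslide model. The point is only that the fused data sources reduce to per-home features that can be combined into a monitorable score:

```python
# Hypothetical feature weights -- illustrative only, not a real model.
WEIGHTS = {
    "slope_gradient": 0.4,       # from terrain data
    "sparse_vegetation": 0.25,   # from aerial imagery analysis
    "sandy_lithology": 0.25,     # from geological surveys
    "near_fault_or_river": 0.1,  # from alignment with known features
}

def risk_score(features: dict) -> float:
    """Weighted sum of features scaled 0..1, yielding a 0..1 risk score."""
    return sum(WEIGHTS[k] * features.get(k, 0.0) for k in WEIGHTS)

home = {"slope_gradient": 0.9, "sparse_vegetation": 0.8,
        "sandy_lithology": 1.0, "near_fault_or_river": 0.5}
print(round(risk_score(home), 2))  # 0.86 -- high risk, flag for the report
```

The hard part, of course, is not this arithmetic but producing trustworthy features from the disparate sources listed above.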

Now that would be bigdata in action!

This is a really hard problem to solve, but if the bigdata “industry” wants to prove that it’s good at data fusion on a really hard problem–one that mixes an extremely complex and large amount of disparate data and has public benefit–this would be it.

Ranking information, “winner take all” and the Heisenberg Uncertainty Principle

Does ranking information, who likes what, top-10 rankings, produce “winner take all” situations?

There is an old rule in strategy: there are no rules. While it is always nice to try to create underlying theories or rules of how the world works, science and mathematics are still too young to describe something this complex. Trying to apply rules like the one in the title is probably some form of confirmation bias.

Having said that, there is evidence that this effect can happen, not as a rule to be followed, but something that does occur. How could this happen?

Ranking information does allow us, as people, to see what other people are doing. That’s always interesting–to see what others are doing, looking at or thinking about. And by looking at what other people are looking at, there is a natural increase in “viewership” of that item. So the top-10 ranking, always entertaining of course, does create healthy follow-on “views.”

But “views” do not mean involvement or agreement. In other words, while ranking information and today’s internet make it easy to see what others are seeing, our act of observation actually contributes to the appearance of popularity. That popularity appears to drive others toward “winner take all.”

“Winner take all” can take many forms. It can mean that once the pile-on starts, a web site becomes very popular. This is often confused with the network effect. It can also mean that a song becomes popular because it’s played a lot, so more people like it, so it’s played even more, and so on. Of course, this does not describe how the song became popular to begin with–perhaps people actually liked the song and it had favorable corporate support–and there is nothing wrong with that.

And this leads us to the uncertainty principle: the act of observation disturbs the thing we are trying to measure. The more scientific formulation has to do with the limits of jointly observing position and momentum at the atomic level, but we’ll gloss over that more formal definition.

The act of observing a top-10 list on the internet causes the top-10 list to become more popular. The act of listening to a song, through internet communication channels, changes the popularity of the song. So it’s clear that, given internet technology, there is a potential feedback loop that resembles the uncertainty principle.
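A toy simulation makes the feedback loop visible. Each viewer picks an item with probability proportional to its current view count, so observing popularity creates popularity (this is preferential attachment, offered here only as an illustration of the loop, not as a model of any real platform):

```python
import random

random.seed(42)  # fixed seed so the run is repeatable

# Ten items start with one "view" each.  Each new viewer picks an item
# with probability proportional to its current view count, so the act
# of viewing feeds back into what the next viewer is likely to pick.
views = [1] * 10
for _ in range(10_000):
    pick = random.choices(range(10), weights=views)[0]
    views[pick] += 1

print(sorted(views, reverse=True))  # a few items absorb most of the views
```

Items that start identical end up wildly unequal, purely from the observation loop, with no difference in underlying quality.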

Alright, that makes sense. But the world is probably a little more complex than this simple thought.

While the act of observing could make something more popular, that does not mean the act of observing turns something unpopular into something popular. In other words, people are not fools. They like what they like. If something is on a top-10 list or comes from a band that has good corporate airtime support, that does not mean it is a bad song or a bad list. Nor does it mean that people would not like it if it did not play in that venue.

The internet is a powerful tool to help people quickly find what they want. The cost of finding a new web site or a new source of top-10 lists (or whatever) is fairly low, so there is no real inherent lock-in. Given the internet’s reach, the ability to rapidly escalate and de-escalate from “winner take all” to “has been” is fairly robust. It’s quite possible, in the spirit of making business rules for fun, that the internet produces a steady stream of “winner take all” events, and if there is a steady stream of them, then they are really just average events after all (regression to the mean). So with my fancy new rule, there are no “winner take all” events any more, just a large number of rapidly escalating/de-escalating average events–the frequency has just been bumped up.

That’s okay as well.

Tempering our expectations for bigdata in healthcare

Expectations around bigdata’s impact on healthcare are leaping ahead of reality, even as some good thoughts are being expressed. Healthcare has already had significant amounts of analytics applied to it. The issue is not that larger sets of data are critical, but that the sharing and integration of data are the critical parts of better analysis. Bigdata does not necessarily solve these problems, although bigdata fever may help smash through these barriers. Over 15 Blues and most of the major nationals have already purchased data warehouse appliances and advanced systems to speed up analysis, so it’s not necessarily performance or scalability that is constraining data-driven advances. And just using unstructured text in analytics will not create a leapfrog in outcomes.

We really need to think integration and access. More people performing analysis in clever ways will make a difference. And this means more people than just the few who can access detailed healthcare data, most of which is proprietary and will stay proprietary to the companies that collect it. Privacy and other issues prevent widespread sharing of the granular data needed to truly perform analysis and get great results…it’s a journey.

This makes the PCORI announcements about yet another national data infrastructure (based on a distributed data model concept) and Obama’s directive to get more Medicare data into the world for innovation (see the 2013 Healthcare Datapalooza that just completed in Washington DC) that much more interesting. PCORI is really building a closed network of detailed data using a common data model and distributed analysis, while CMS is being pushed to make datasets more available to entrepreneurs and innovators–a bit of the opposite in terms of “access.”

There are innovative ideas out there; in fact, there is no end to them. Bigdata is actually a set of fairly old ideas that are suddenly becoming economic to implement. And there is a serious lack of useful datasets that are widely available. The CMS datasets are often heavily massaged prior to release in order to conform to HIPAA rules–you essentially cannot get detailed data at an individual level, despite what you think you are getting, because just stripping the name and address off a claim form is not sufficient to satisfy HIPAA rules.

So it’s clear that to get great results, you probably have to follow the PCORI model, but then analysis is really restricted to just the few people who can access those datasets.

That’s not to say that bigdata does not have a lot to offer if patients are willing to opt in to programs that get their healthcare data out there. Companies using bigdata technology on their proprietary datasets can make a difference, and there are many useful ideas to economically go after using bigdata–many of which are fairly obvious and easy to prioritize. But there is not suddenly going to be a large community of people with new access to the granular data that could be, and often is, the source of innovation. Let’s face it: many healthcare companies have had advanced analytics and effectively no real budget constraints for many years, and they will continue to. So the reason that analytics have not been created and deployed more than they are today is unrelated to technology.

If bigdata hype can help executives get moving and actually innovate (it’s difficult for executives in healthcare to innovate versus just react), then that’s a good thing, and building momentum will most likely be the largest stimulus to innovation overall. That’s why change management is key when using analytics in healthcare.

Anti-Money Laundering (AML) and Combating Terrorist Funding (CTF) analytics review

In my last blog I reviewed some recent patents in the AML/CTF space. They describe what I consider some very rudimentary analytics workflows–fairly simple scoring and weighting using various a-priori measures. Why are such simple approaches patentable? To give you a sense of why I would ask this question: there was a great trumpeting of news around the closing of a $6b money laundering operation at Liberty Reserve, but money laundering (including terrorism funding) is estimated at $500 billion to $1 trillion per year. That’s a lot of badness that needs to be stopped. Hopefully smarter is better.

There are predictive analytical solutions to various parts of the AML problem, and there is a movement away from rules-only systems (rules are here to stay, however, since policies must still be applied to predictive results). But the use of predictive analytics is slowed because AML analytics boils down to an unsupervised learning problem. Real-world test cases are hard to find (or create!), and the data is exceptionally noisy and incomplete. The short message is that it’s a really hard problem to solve, and sometimes simpler approaches just work better than others. In this note, I’ll describe the issues a bit more and talk about where more advanced analytics have come into play. Oh, and do not forget: on the other side of the law, criminals are actively and cleverly trying to hide their activity, and they know how banks operate.

The use of algorithms for AML analytics is advancing. Since AML analytics can occur at two different levels, the network level and the individual level, it’s pretty clear that graph theory and other techniques that operate on the data in various ways are applicable. AML analytics is not simply about a prediction that a particular transaction, legal entity, or group of LEs is conducting money laundering operations. It’s best to view AML analytics as a collection of techniques, from probabilistic matching to graph theory to predictive analytics, combined to identify suspicious transactions or LEs.

If AML analytics is still maturing, what is the current state? Rather simple, actually. Previous systems, including home-grown systems, focused on the case management and reporting aspects (that’s reporting as in reporting on the data to help an analyst analyze some flows, as well as regulatory reporting). AML analytics was also typically based on sampling!

Today, bigdata can help to avoid sampling issues. But current investments are focused around the data management aspects because poor data management capabilities have greatly exacerbated the cost of implementing AML solutions. FS institutions desperately need to reduce these costs and comply with what will be an ever-changing area of regulation. “First things first” seems to be the general thrust around AML investments.

Since AML analysis is based on legal entities (people and companies) as well as products, it’s pretty clear that the unique identification of LEs and the hierarchies/taxonomies/classifications of financial instruments are important data management capabilities. Results from AML analytics can be greatly degraded if the core data is noisy. When you combine the noisy-data problem with today’s reality of highly siloed data systems inside banks and FS institutions, the scope of trying to implement AML analytics is quite daunting. Of course, start simple and grow it.
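Probabilistic matching is usually the first tool brought to bear on noisy LE data. As a toy sketch (not any particular product’s matcher), normalized string similarity from Python’s standard library can catch near-duplicate entity names; the registry entries below are invented:

```python
from difflib import SequenceMatcher

def name_similarity(a: str, b: str) -> float:
    """Similarity between two legal-entity names, 0..1, after crude normalization."""
    norm = lambda s: " ".join(s.lower().replace(",", " ").replace(".", " ").split())
    return SequenceMatcher(None, norm(a), norm(b)).ratio()

def match_entities(candidates, query, threshold=0.85):
    """Return candidate names that probabilistically match `query`."""
    return [c for c in candidates if name_similarity(c, query) >= threshold]

registry = ["Acme Holdings, Inc.", "ACME Holdings Inc", "Apex Trading LLC"]
print(match_entities(registry, "Acme Holdings Inc."))
# → ['Acme Holdings, Inc.', 'ACME Holdings Inc']
```

In practice you would block on other attributes (country, date of birth, tax ID) before comparing names, since pairwise comparison across millions of LEs does not scale.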

I mentioned above that there are not a lot of identifiable cases for training algorithms. While it is possible to flag some transactions and confirm them, companies must file Suspicious Activity Reports (SARs) with the government. Unfortunately, the government does not provide a list of “identified” data back. So it is difficult to formulate a solution using supervised learning approaches. That’s why it is also important to attack the problem from multiple analytical approaches: no one method dominates, and you need multiple angles of attack to help tune your false positive rates and manage your workload.
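With no labeled cases to train on, scoring tends to fall back on unsupervised methods. Here is a minimal sketch of one such angle of attack: a robust outlier flag over an account’s transaction amounts using the median and MAD, which, unlike a plain z-score, is not inflated by the very outliers you are hunting (the amounts and cutoff are invented):

```python
import statistics

def flag_outliers(amounts, cutoff=3.5):
    """Indices of amounts far from the account's norm, by robust (median/MAD) z-score."""
    med = statistics.median(amounts)
    mad = statistics.median(abs(a - med) for a in amounts) or 1.0
    return [i for i, a in enumerate(amounts)
            if 0.6745 * abs(a - med) / mad > cutoff]

history = [120, 95, 130, 110, 105, 9800]  # one wildly atypical amount
print(flag_outliers(history))  # → [5]
```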

When we look at the underlying data, it’s important to look at not only the data but also the business rules currently in use (or proposed). The business rules help identify how the data is to be used per the policies set by the Compliance Officer. The rules also help orient you on the objectives of the AML program at a specific institution. Since not all institutions transact all types of financial products, the “objectives” of an AML system can be very different. Since the objectives are different, the set of analytics used is also different. For example, smaller companies may wish to use highly iterative what-if scenario analysis to refine their policies and false positive rates by adjusting parameters and thresholds (which feels very univariate). Larger banks need more sophisticated analysis based on more advanced techniques (very multivariate).
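That univariate what-if loop can be sketched in a few lines. The rule, account names, and thresholds below are invented for illustration; the point is watching how alert volume (and hence analyst workload) moves as a parameter is tuned:

```python
def alerts(accounts, amount_threshold, count_threshold):
    """Flag accounts with at least `count_threshold` transactions at or above
    `amount_threshold` -- one hypothetical a-priori rule."""
    return [acct for acct, amounts in accounts.items()
            if sum(1 for a in amounts if a >= amount_threshold) >= count_threshold]

txns = {"A-1": [9500, 9900, 9800], "A-2": [9100, 9200], "A-3": [200, 90]}
# What-if sweep: tightening the amount threshold shrinks the alert list.
print(alerts(txns, 9000, 2))  # → ['A-1', 'A-2']
print(alerts(txns, 9500, 2))  # → ['A-1']
```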

We’ve mentioned rules (a-priori knowledge, etc.), predictive/data mining models (of all kinds, since you can test deviations from peer groups using data mining methods, predicted-versus-actual patterns, etc.), and graph theory (link analysis). We’ve also mentioned master data management for LEs (don’t forget identity theft) and products, as well as taxonomies, classifications, and ontologies. But we also cannot forget time series analysis for analyzing sequential events. That’s a good bag of data mining tricks to draw from, and the list is much longer. I am often reminded of a really great statistics paper, “Bump Hunting in High-Dimensional Data” by Jerome Friedman and Nick Fisher, because that’s conceptually what we are really doing. Naturally, criminals wish to hide their bumps and make their transactions look like normal data.
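To make the link-analysis idea concrete, here is a bare-bones sketch (no graph library, invented data): grouping LEs into connected components of the transfer graph, which is the simplest way to surface a potentially coordinated ring for an analyst to inspect:

```python
from collections import defaultdict

def transfer_rings(transfers):
    """Connected components of the (undirected) transfer graph between LEs."""
    adj = defaultdict(set)
    for src, dst in transfers:
        adj[src].add(dst)
        adj[dst].add(src)
    seen, rings = set(), []
    for node in adj:
        if node in seen:
            continue
        stack, ring = [node], set()
        while stack:  # depth-first walk of one component
            n = stack.pop()
            if n not in ring:
                ring.add(n)
                stack.extend(adj[n] - ring)
        seen |= ring
        rings.append(ring)
    return rings

edges = [("A", "B"), ("B", "C"), ("X", "Y")]
print(transfer_rings(edges))  # two components: {A, B, C} and {X, Y}
```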

On the data side, we have mentioned a variety of data types. The list below is a good first cut, but you also need to recognize that synthesized data, such as aggregations (both time-based aggregations and LE-based aggregations such as transaction->account->person LE->group LE), is also important for the types of analytics mentioned above:

  • LE data (Know Your Customer – KYC)
  • General Ledger
  • Detailed Transaction data
  • Product Data
  • External sources: watch lists, passport lists, identity lists
  • Supplemental: Reference data, classifications, hierarchies, etc.
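To illustrate the synthesis point, here is a toy roll-up of raw transactions to the person-LE and day levels; the row layout and values are invented for the example:

```python
from collections import defaultdict
from datetime import date

# Hypothetical raw rows: (txn_date, account, person-level LE, amount)
txns = [
    (date(2013, 1, 2), "acct-1", "P-1", 500.0),
    (date(2013, 1, 2), "acct-2", "P-1", 700.0),
    (date(2013, 1, 3), "acct-3", "P-2", 50.0),
]

def roll_up(rows):
    """LE-based and time-based aggregates synthesized from raw transactions."""
    by_person, by_day = defaultdict(float), defaultdict(float)
    for day, _account, person, amount in rows:
        by_person[person] += amount
        by_day[day] += amount
    return dict(by_person), dict(by_day)

per_person, per_day = roll_up(txns)
print(per_person)  # → {'P-1': 1200.0, 'P-2': 50.0}
```

Note how P-1’s activity only looks large once its two accounts are aggregated to the person level, which is exactly why these synthesized views matter.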

Clearly, since there are regulatory requirements around SARs (suspicious activity), CTRs (currency transactions), and KYC, it is important that data quality enhancements first focus on those areas.

Anti-Money Laundering patent review

I was recently reviewing some anti-money laundering (AML) patents to see if any had been published recently (published does not mean granted).

Here are a few links to some patents, some granted, some applied for:

All of the patents describe a general-purpose system for calculating a risk score. The risk score is based on several factors.

In AML, the key data include:

  • A legal entity (name, location, type)
  • A “location” (typically country) that determines the set of rules and “data lists” to be applied. This could be the LE’s country or the financial instrument’s country, but generally it embodies a jurisdiction area that applies to the AML effort. A “data list” from a country or location is the list of legal entities that are being watched or have been determined to engage in money laundering operations. So we have a mix of suspected and validated data.
  • A financial instrument / product and its set of attributes such as transactions, amounts, etc.
  • A jurisdiction: the risk assessor’s set of rules. Typically these are rules created by a company or a line of business. These rules help identify an event and should be relatively consistent across an entire enterprise but also vary based on the set of locations where a company may operate. A bank’s Compliance Officer is especially concerned about this area as it also contains policies. The policies represent who needs to do what in which situation.

I have not tried to capture the nature of time in the above list, since all of these components can change over time. Likewise, I did not try to capture all of the functions an AML system must perform, such as regulatory reporting. We have also ignored whether all of these components are used in batch or real time to perform a function, or whether rules engines and workflow are powering some incredibly wonderful AML “cockpit” for an AML analyst at a company.

We assume that the ultimate goal of an AML system is to identify LEs potentially engaging in money laundering. I write “potentially” because you need to report “suspicious” activities to the Financial Crimes Enforcement Network (FinCEN). We can never know for certain whether all of the data is accurate or whether an individual transaction is actually fraudulent. We can, however, use rules, either a-priori or predictive, to identify potential AML events.

The patents describe a method of combining information, using a “computer system,” to calculate an AML risk score. The higher the score, the more probable it is that an LE-FinancialProduct combination is being used for money laundering. Inherently, this is probabilistic. It’s also no different from any other risk scoring system. You have a bunch of inputs, there is a formula or a predictive model, and there is an output score. If something scores above a threshold, you take action, such as reporting it to the government. As a note, there are also strict guidelines about what needs to be reported to the government, as well as areas where there is latitude.
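As a sketch of that generic workflow (emphatically not the patents’ actual formulas), a linear risk score with a reporting threshold might look like this; the factor names, weights, and cutoff are all made up:

```python
def risk_score(factors, weights):
    """Toy linear score: weighted sum of normalized (0..1) risk factors."""
    return sum(weights[name] * value for name, value in factors.items())

# Illustrative only -- these factor names and weights are not from the patents.
weights = {"country_risk": 0.40, "velocity": 0.35, "watchlist_hit": 0.25}
factors = {"country_risk": 0.9, "velocity": 0.6, "watchlist_hit": 1.0}

REPORT_THRESHOLD = 0.7  # hypothetical policy cutoff
score = risk_score(factors, weights)
print(round(score, 2), score >= REPORT_THRESHOLD)  # → 0.82 True
```

Swap the linear formula for a predictive model and the shape of the system is unchanged, which is the point made above about its generality.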

The trick in such a system is to minimize false positives: LE-FinancialProduct combinations identified as money laundering that in reality are not. False positives waste time. So the system tries to create the best possible discrimination.

So now look at the patents using the background I just laid out. They are fairly broad; they describe this basic analysis workflow. It’s the same workflow, using the same concepts, as credit scoring for FICO scores, credit scoring for many types of loans, or marketing scoring for lifetime value or next-logical-product purchasing. In other words, the approach is the same, and these are like many existing patents out there. My reaction is the same: I am incredulous that such general patents are issued like they are.

If you look past whether patents should be granted for general concepts, it is useful to note that many of these came out around 2005-2006, a few years after many regulations changed with the Patriot Act and other changes in financial regulation.

So the key thought is: yes, patents are being submitted in this area, but I think the relatively low number of patent applications reflects that the general workflow is, well, pretty general. Granted, the 2011 patent has some cool “graph/link analysis,” but that type of analysis is also a bit 1980s.

Note: I selected a few data concepts from the real-time AML risk scoring patent to give you a feel for the type of data used in AML around the transaction:

  • transaction amount,
  • source of funds such as bank or credit cards,
  • channel used for loading funds such as POS or ATM,
  • velocity such as count and amount sent in the past x days,
  • location information such as number of pre-paid cards purchased from the same zip code, same country, same IP address within x hours,
  • external data sources (e.g., an Interpol list) or internal data sources
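The velocity-style fields above reduce to counts and sums over a trailing time window. A toy version, with invented field names and values:

```python
from datetime import datetime, timedelta

def velocity(transactions, now, window_days=7):
    """Count and total amount loaded in the trailing window (field names invented)."""
    cutoff = now - timedelta(days=window_days)
    recent = [t for t in transactions if t["ts"] >= cutoff]
    return len(recent), sum(t["amount"] for t in recent)

loads = [
    {"ts": datetime(2013, 5, 30), "amount": 400.0},
    {"ts": datetime(2013, 5, 28), "amount": 900.0},
    {"ts": datetime(2013, 4, 1), "amount": 100.0},  # falls outside the window
]
print(velocity(loads, now=datetime(2013, 6, 1)))  # → (2, 1300.0)
```

The same pattern, grouped by zip code, country, or IP address instead of by time alone, yields the location-based counts in the list above.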

Opportunities for BigData and Healthcare: Need a little change management here

What are the BigData opportunities in healthcare? Today, BigData techniques are already employed by startups because the technology can be used very cost-effectively to perform analytics, giving startups an edge on the cost and capabilities front.

But what are the opportunities in healthcare for established companies? I’ll offer the thought that they can be broken into two main categories. The categories reflect the fact that there are in-place data assets that will remain in place for quite a while; it’s very difficult to move an entire infrastructure to a new technology base overnight. It is true that if some semblance of a modern architecture (messaging, interfaces for data access) is in place today, the movement can be much faster, because the underlying implementation can be changed without changing downstream applications.

The two categories are:

  • Move targeted, structured analytical workflows to BigData.
  • Enable new analytical capabilities that were previously not viable.

The first category speaks to the area where BigData can make a substantial ROI appear fairly quickly. There are many well-understood workflows today inside healthcare Payers, for example, that simply run too slowly, are not robust, or are unable to handle the volume. Purchasing another large, hardware-based appliance is not the answer. But scaling out to cloud scale (yes, using a public cloud is considered leading edge for a Payer, but it is easy to do with the proper security in place) allows a Payer to use BigData technology cheaply. Targeted workflows that are well understood but underperforming can be moved over to BigData technology. The benefits are substantial infrastructure ROI and cost avoidance for future upgrades. The positive ROI from these projects means the transition pays for itself, and it can occur quite quickly.

The second opportunity is around new analytical capabilities. Today, Payers and others simply cannot perform certain types of analytics easily because of limitations in their information management environments. These areas offer, assuming the business issue being addressed suggests it, substantial cost-savings opportunities on the care side. New approaches to disease management, outcomes research, and network performance management can produce substantial returns in under two years (it takes a year to cycle through provider network contracts and ensure the new analytics has a chance to change the business process). It’s these new capabilities that are most exciting.

The largest impediment to these areas of opportunity will be change management. Changing the way analytics are performed is difficult. Today, SAS is used more for data management than statistical analysis and is the de facto standard for the analytical environment. SAS offers grid and other types of larger data processing solutions. To use BigData, plans will have to embrace immature technology and hire the talent needed to deploy it. But the cost curve could be substantially below that of scaling current environments, again paying for itself fairly quickly. Management and groups used to a certain analytical methodology (e.g. cost allocations) will have to become comfortable seeing that methodology implemented differently. Payers may seek to outsource BigData analytics tools and technologies, but the real benefit will come from retaining talent in-house over the long run, even if some of the work is outsourced. Because analytics is a core competency, and Payers need, in my opinion, to retain some core rather than becoming a virtual shell, BigData needs to be an in-house capability.

Social media, BigData and Effort-Return

The classic question we ask about marketing, or really any form of outreach, is: given the effort I expend, what is my return? This Effort-Return question is at the heart of ROI, the value proposition, and the general, down-to-earth question of “Was it worth it?”

That’s the essential question my clients have always asked me, and I think it’s the big question forming around the entire area of social media and BigData. It’s clear that social media is here to stay. The idea that “people like us” create our own content is very powerful. It gives us voice, it gives us a communication platform, and it gives us, essentially, power. The power of the consumer voice is amplified.

Instead of 10 people hearing us when we are upset about something, we can have 1,000,000 hear us. That’s power. And the cost of “hearing” the content, of finding it, is dropping dramatically. It is still not free to access content; you still have enormous search costs (we really need contextual search: searching the resources most relevant to me instead of searching the world). But search costs are dropping, and navigation costs are dropping. Every year, those 1,000,000 can listen to and filter another 1,000,000 messages.

BigData has come to our rescue…in a way. It gives more tools to the programmers who want to shape that search data, who want to help us listen and reach out. There’s a lot of hype out there, and the technology is moving very fast, so fast that new projects, new frameworks, and new approaches pop up every day.

But is it worth the effort? If so, just how much? That’s still a key question. The amount of innovation in this area is tremendous, and it’s not unlike the innovation I see occurring in the healthcare space. Everyone is trying something, which means that everything is being tried somewhere.

That’s good. But it’s already pretty clear that while we can now communicate through many channels unavailable 5 years ago, more easily and more frequently, and find things that interest us, the question remains: does it really pay off? Do companies benefit by investing now and trying to get ahead, or do they just try to keep pace and meet some, but not all, customer expectations? Will entire companies fall because of social media and BigData?

Those are hard questions to answer, but I think we can look at other patterns out there and see that, even with all the hype today, it will be worth it. Companies should invest, but many should not over-invest. When the internet first came into wide use, it was thought that it would change the world. Well, it did. It just took two decades to do so instead of the few years that futurists predicted. But with a higher density of connectivity today, innovations can roll out faster.

But I think that’s where the answer is. If you are in the business of social, then being on that wave and pushing the edge is good. If your business is BigData tools and technologies, then yes, you need to invest and recognize that it will be worth it in the long run if you survive. But many brands (companies) can just strive to keep pace and do okay. There are exceptions, of course. Product quality and features still dominate purchase decisions. Yes, people are moved by viral videos, bad customer service, and bad products, but companies with long-standing brands whose products are core can afford to spend to meet expectations rather than having to over-invest. They will do just fine keeping pace and continuing to focus on the product as well as marketing. For example, does Coca-Cola expect to double its market share because it is better at social media than others? Will it grow significantly because of it? It’s not clear, but for some segments of products, the spending pattern does not have to be extraordinary. It just needs to keep pace and be reasonable.

This gets us back to the question of social media, BigData, and Effort-Return. Effort-Return is important to calculate because brands should not over-invest; they need to manage their investments. Is social media and BigData worth the investment? Absolutely; it’s really a question of degree.

Largescale Healthcare sensor networks and BigData

Lately, there have been announcements that could make large-scale, healthcare-focused sensor networks much more of a reality. A healthcare monitoring network could drive substantial improvements in care and reductions in cost. Today, if you are in a hospital, you are plugged into a sensor network that is relatively stationary and highly controlled (for obvious reasons). But there are many more healthcare, consumer-level networks that could be created. Here’s a mention of the world’s smallest blood monitoring implant, and other heart rate monitoring capabilities based on visual monitoring techniques:

Putting together a BigData solution here means a solution that can scale out. Batch technologies are not the answer, so frameworks like Hadoop are not the primary component. Other analytical frameworks, like Storm, Dempsy, Apache S4, Esper, OpenMDAO, Stormkeeper, or the Eclipse M2M framework, are needed.
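Whatever framework hosts it, each node in such a streaming topology is a small stateful operator. Here is a framework-free sketch of one such node: a sliding-window alert that fires when a sensor reading departs from its recent moving average (the window size, tolerance, and readings are all invented for illustration):

```python
from collections import deque

class SlidingWindowAlert:
    """One stream-processing node: alert when a reading departs from the
    moving average of the last `size` readings (parameters are arbitrary)."""
    def __init__(self, size=10, tolerance=0.25):
        self.window = deque(maxlen=size)
        self.tolerance = tolerance

    def push(self, reading):
        alert = False
        if len(self.window) == self.window.maxlen:  # only alert once warmed up
            avg = sum(self.window) / len(self.window)
            alert = abs(reading - avg) > self.tolerance * avg
        self.window.append(reading)
        return alert

node = SlidingWindowAlert(size=5)
stream = [72, 70, 74, 71, 73, 72, 71, 120]  # heart-rate samples; the last one spikes
print([node.push(r) for r in stream])  # alert fires only on the final reading
```

In Storm or S4 the same logic would live in a bolt or processing element, with the framework handling partitioning, delivery, and scale-out across nodes.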

In this case, BigData is about scaling out solutions for sensor networks and piecing together analytical processing nodes to create a workflow that accomplishes the analysis.

But healthcare sensor networks are not without their challenges. Here are some links that describe the issues in more detail and the research going on in this area: