Since I am “recovering” data scientist, I thought that once in awhile, it would be good deviate from my more management consulting articles and eyeball the bigdata landscape to see if something interesting has happened.
What!?! It seems like you cannot read an article without encountering yet another treatise on bigdata or at the very least, descriptions of the “internet of things.”
That’s true, but if you look under the hood, the most important benefits of the bigdata revolution have really been on two fronts. First, recent bigdata technologies have decreased the cost of analytics and this makes analytics more easily available to smaller companies. Second, the bigadata bandwagon has increased awareness that analytics are needed to run the business. Large companies could long afford the investments in analytics which made corporate size an important competitive attribute. The benefits from analytics should not lead to a blanket and unthoughtful endorsement of analytics. Not every business process, product or channel needs overwhelming analytics. You want, however, analytics to be part of the standard toolkit for managing the value chain process and decision making.
The ability to process large amounts of data, beyond what mainframes could do, has been with us for years-twenty to thirty years The algorithms developed decades ago are similar to the algorithms and processing schemes pushed in the bigdata world today. Teradata helped created the MPP database and SQL world. AbInitio (still available) and Torrent (with their Orchestrate product sold to IBM eventually) defined the pipeline parallelism and data parallelism data processing toolchain world. Many of the engineers at these two ETL companies came from Thinking Machines. The MPI API defined parallel processing for the scientific world (and before that PVM and before that…).
All of these technologies were available decades ago. Mapreduce is really an old lisp concept of map and fold which was available in parallel from Thinking Machines even earlier. Today’s tools build on the paradigms that these companies created in the first pass of commercialization. As you would expect, these companies built on what had occurred before them. For example, parallel filesystems have been around for a long time and were present on day one in those processing tools mentioned above.
Now that the hype around mapreduce is declining and its limitations are finally becoming widely understood, people recognize that mapreduce is just one of several parallel processing approaches. Free from the mapreduce-like thinking, bigdata toolchains can finally get down to business. The bigdata toolchains realize that sql query expressions are a good way to express computations. Sql query capabilities are solidly available in most bigdata environments. Technically, many of the bigdata tools provide “manual” infrastructure to build the equivalent sql commands. That is, they provide the parsing, planning and distribution of the queries to independent processing nodes.
I consider the current bigdata “spin” that started a about 1-2 years ago healthy because it increased the value of other processing schemes such as streaming, real-time query interaction and graphs. To accommodate these processing approaches, the bigdata toolchains have changed significantly. Think SIMD, MIMD, SIPD and all the different variations.
I think the framework developers have realized that these other processing approaches require a general purpose parallel execution engine. An engine that AbInitio and others have had for decades. You need to be able to execute programs using a variety of processing algorithms where you think of the “nodes” as running different types of computations and not just a single mapreduce job. You need general purpose pipeline and data parallelism.
We see this in the following open-source’ish projects:
- Hadoop now as a real resource and job management subsystem that is a more general parallel job scheduling tool. It is now useful for more genera parallel programming.
- Apache Tez helps you build general jobs (for hadoop).
- Apache Flink builds pipeline and data parallel jobs. Its also a general purpose engine e.g. streaming, …
- Apache Spark builds pipeline and data parallel jobs. Its also a general purpose engine e.g. streaming, ..
- Apache Cascading/Scalding builds pipeline and data parallel jobs, etc.
- DataTorrent: streaming and more.
- Storm: Streaming
- Kafka: Messaging (with persistency)
- Scrunch: Based on apache crunch, builds processing pipelines
- …many of the above available as PaaS on AWS or Azure…
I skipped many others of course and I am completely skipping some of the early sql-ish systems such as hive and I have skipped visualization, which I’ll hit in another article. Some of these have been around for a few years in various stages of maturity. Most of these implement pipeline parallelism and data parallelism for creating general processing graphs and some provide sql support where that processing approach makes sense.
In addition the underlying engines, what’s new? I think some very important elements: usability. The tools are a heck-of-alot easier to use now. Here’s why.
What made the early-stage (20-30 years ago) parallel processing tools easier to use was that they recognized, due to their experience in the parallel world, that usability by programmers was key. While it is actually fairly easy to get inexpensive scientific and programming talent, programming parallel systems has always been hard. It needs to be easier.
In the original toolsets from the vendors mentioned earlier in this article, you would program in C++ or a special purpose data management language. It worked fine for those companies who could afford the talent that could master that model. In contrast to today, you can use languages like python or scala to run the workflows and use the language itself to express the computations. The language itself is expressive enough that you are not using the programming environment as a “library” that you make programming calls to. The language constructs are translated into the parallel constructs transparently. The newer languages, like lisp of yore, are more functionally oriented. Functional programming languages come with a variety of capabilities that makes this possible. This was the prize that HPF and C* were trying to win. Specialized languages are still being developed that help specify parallelism and data locality without being “embedded” in other modern languages and they to can make it easier to use the new bigdata capabilities.
While the runtimes of these embedded parallel capabilities are still fairly immature in a variety of ways. Using embedded expressions, data scientists can use familiar toolchains, languages and other components to create their analytical workflows easier. Since the new runtimes allow more than just mapreduce, streaming, machine learning and other data mining approaches suddenly becomes much more accessible at large scale in more ways than just using other tools like R and others.
This is actually extremely important. Today’s compute infrastructure should not be built with rigid assumptions about tools, but be “floatable” to new environments where the pace of innovation is strong. New execution engines are being deployed at a fantastic rate and you want to be able to use them to obtain processing advantages. You can only do that if you are using well known tools and technologies and if you have engineered your data (through data governance) to be portable to these environments that often live in the cloud. It is through this approach that you can obtain flexibility.
I won’t provide any examples here, but lookup the web pages for storm and flink for examples. Since sql-like query engines are now available in these environments, this also contributes to the user-friendliness.
Three critical elements are now in play: cost effectiveness, usability and generalness.
Now its a party.