[FoRK] Big Data

J. Andrew Rogers andrew at jarbox.org
Sat Feb 4 22:33:28 PST 2012

On Feb 4, 2012, at 4:19 PM, Noon Silk wrote:
> On Sun, Feb 5, 2012 at 7:28 AM, J. Andrew Rogers <andrew at jarbox.org> wrote:
>> No, Tableau is a visualization system; some other system would have to crunch the data model
>> first.  These types of analytics can really only be done in a massively parallel analytical database
>> supporting the required operations. None of the usual NoSQL and Hadoop clones can do this.
> Exactly what types of operations are you referring to?

There are two key families of algorithms that do not parallelize well on classic BigTable-like architectures such as Hadoop. We isolated these as the primary root causes of application non-scalability on these types of architectures in 2006. The theoretical explanation for why these algorithm families do not parallelize is straightforward. For advanced analytics, such as what you might want to do for predictive behavioral modeling, operations that depend on massively parallel versions of these algorithms are ubiquitous.

The two computer science problems:

Generalized Interval Indexing. Basically, indexing data types that are not point-like in a metric space so that they are content-addressable, both for storage and for qualified retrieval (e.g., all polygons that intersect a given polygon). The well-known use cases are geospatial, temporal, and complex event processing. How do you partition data sets that can be neither partitioned by value nor hashed?
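To make the partitioning question concrete, here is a minimal one-dimensional sketch in Python. It is an illustration only, not any production design: the shard width, the function names, and the sample intervals are all hypothetical. Intervals are assigned to shards by their start value, the way point data would be. A query that looks only in the shards covering its own range misses a long interval that starts in a distant shard, so a correct answer requires scanning every shard.

```python
from collections import defaultdict

SHARD_WIDTH = 10  # hypothetical: each shard owns a range of 10 start values


def shard_of(value):
    """Range-partition a point value into a shard id."""
    return value // SHARD_WIDTH


def build_shards(intervals):
    """Place each (lo, hi) interval in the shard that owns its start value."""
    shards = defaultdict(list)
    for lo, hi in intervals:
        shards[shard_of(lo)].append((lo, hi))
    return shards


def intersects(a, b):
    """True if closed intervals a and b share at least one point."""
    return a[0] <= b[1] and b[0] <= a[1]


def query_local(shards, q):
    """Check only the shards whose value range overlaps the query.

    This is sufficient for point data but not for intervals.
    """
    hits = []
    for s in range(shard_of(q[0]), shard_of(q[1]) + 1):
        hits.extend(iv for iv in shards.get(s, []) if intersects(iv, q))
    return sorted(hits)


def query_all(shards, q):
    """Correct answer: scan every shard, which defeats the partitioning."""
    return sorted(iv for ivs in shards.values() for iv in ivs if intersects(iv, q))


intervals = [(1, 3), (2, 38), (12, 14), (31, 33), (35, 36)]
shards = build_shards(intervals)
query = (32, 34)

local = query_local(shards, query)  # misses (2, 38), which lives in shard 0
full = query_all(shards, query)     # finds it, at the cost of touching all shards
print("local:", local)
print("full: ", full)
```

Hashing the whole interval fares no better in this toy model: intersecting intervals hash to unrelated shards, so every query still has to visit all of them.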

Generalized Join Parallelism. Can you scale an ad hoc recursive self-join over a massively distributed data set? The parallelism in practice is far worse than even a casual analysis of the standard algorithms suggests. Unfortunately, anything that looks remotely like graph analysis is inherently dependent on these algorithms, never mind more conventional relational database join operations.
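As a rough illustration of what a recursive self-join involves, the Python sketch below computes graph reachability by repeatedly joining newly derived pairs against an edge table. It is a simplified model under stated assumptions, not a description of any particular system: the hash partitioning scheme, the partition count, and the rule that a derived pair is stored on the partition of its first column are all hypothetical. The code counts how many derived pairs would have to move to another partition in each round in order to meet their join partners.

```python
NUM_PARTITIONS = 4  # hypothetical cluster size


def partition_of(key):
    """Hypothetical hash partitioning of integer keys."""
    return key % NUM_PARTITIONS


def transitive_closure(edges):
    """Semi-naive evaluation of reach(a, c) :- reach(a, b), edge(b, c).

    Returns the reachability set and, per round, the number of pairs that
    would cross partitions before they could be joined.
    """
    # Edge table indexed (and, in this model, partitioned) by source vertex.
    by_src = {}
    for src, dst in edges:
        by_src.setdefault(src, set()).add(dst)

    reach = set(edges)
    delta = set(edges)  # pairs discovered in the previous round
    shuffled_per_round = []

    while delta:
        shuffled = 0
        new = set()
        for a, b in delta:
            # Assumption: pair (a, b) is stored on partition_of(a), but it
            # joins on b, so it must be sent to partition_of(b).
            if partition_of(a) != partition_of(b):
                shuffled += 1
            for c in by_src.get(b, ()):
                if (a, c) not in reach:
                    new.add((a, c))
        shuffled_per_round.append(shuffled)
        reach |= new
        delta = new  # the next round cannot start until this one finishes

    return reach, shuffled_per_round


# A simple chain 0 -> 1 -> 2 -> 3 -> 4 -> 5.
edges = [(i, i + 1) for i in range(5)]
reach, rounds = transitive_closure(edges)
print(len(reach), "reachable pairs in", len(rounds), "rounds")
print("pairs shuffled per round:", rounds)
```

In this toy model the number of rounds grows with the longest path in the data rather than with the number of machines, and each round is a global barrier followed by a reshuffle of intermediate results.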

Unfortunately, in 2012 Hadoop is frequently sold as though these limits exist due to lack of cleverness on the part of the user. Back in 2006 a small number of organizations started doing R&D on massively scalable data structures and algorithms without these limitations. Some of that R&D paid off after years of research; I am aware of at least two independent solutions that are actively being productized. Meanwhile, the big data world is going bonkers over Hadoop.

Hadoop is greatly overrated as a big data platform. It was designed to fight the last war and is hopelessly unprepared to fight the next one. It is still a good tool if your application can fit within its constraints. 

J. Andrew Rogers
Twitter: @jandrewrogers
