Hadoop 2.0 goes beyond MapReduce to create a general framework for distributed data-processing applications


The new Hadoop is nothing less than the Apache Software Foundation’s attempt to create a whole new general framework for the way big data can be stored, mined, and processed.

It should not only encourage more apps to be written for Hadoop, but also allow entirely new data-crunching methodologies within Hadoop that simply weren’t possible under its earlier architectural limitations. In short, it’s good stuff.

What’s been holding Hadoop back all this time? More important, where’s it going from here?

Various criticisms of Hadoop have revolved around its scaling limitations, and the biggest constraint on scale has been its job handling. All jobs in Hadoop are run as batch processes through a single daemon called JobTracker. Because that one daemon handles both cluster resource management and the scheduling and monitoring of every job, it becomes a bottleneck for scalability and processing speed.

With Hadoop 2, the JobTracker approach has been scrapped. Instead, Hadoop uses an entirely new job-processing framework built around two daemons: ResourceManager, which arbitrates resources among all the applications in the system, and NodeManager, which runs on each Hadoop node and keeps the ResourceManager informed about what’s happening on that node. (Each running application also has its own governor, ApplicationMaster.)
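To make that division of labor concrete, here is a toy Python model of the three roles. It is only a sketch under loud assumptions: the class names borrow the daemon names, but every method, field, and the "slot" notion (`heartbeat`, `allocate`, `free_slots`, `slots_needed`) is invented for illustration and is not the YARN API, which is Java and considerably richer.

```python
from dataclasses import dataclass, field


@dataclass
class NodeManager:
    """Toy stand-in for the per-node daemon: tracks free capacity on one node."""
    name: str
    free_slots: int  # hypothetical unit of capacity, not a real YARN concept

    def heartbeat(self):
        # Report this node's current state to the ResourceManager.
        return {"node": self.name, "free_slots": self.free_slots}


@dataclass
class ResourceManager:
    """Toy stand-in for the cluster-wide daemon: hands out capacity to apps."""
    nodes: dict = field(default_factory=dict)

    def register(self, node):
        self.nodes[node.name] = node

    def allocate(self, slots_wanted):
        # Grant up to slots_wanted slots; return one node name per granted slot.
        granted = []
        for node in self.nodes.values():
            while node.heartbeat()["free_slots"] > 0 and len(granted) < slots_wanted:
                node.free_slots -= 1
                granted.append(node.name)
        return granted


@dataclass
class ApplicationMaster:
    """Toy per-application governor: requests what its own app needs."""
    app_name: str
    slots_needed: int

    def run(self, rm):
        return {"app": self.app_name, "placed_on": rm.allocate(self.slots_needed)}


rm = ResourceManager()
rm.register(NodeManager("node-a", free_slots=2))
rm.register(NodeManager("node-b", free_slots=1))

# Two different kinds of application share the same resource layer.
first = ApplicationMaster("mapreduce-job", slots_needed=2).run(rm)
second = ApplicationMaster("graph-job", slots_needed=2).run(rm)
print(first)
print(second)
```

The point of the sketch is that the ResourceManager never needs to know what kind of work an application does. It only hands out capacity, and each application's own master decides what to ask for. Here the second application asks for two slots but gets just one, because that is all the toy cluster has left.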

This setup is so unlike the previous MapReduce that Apache gave it an entirely new name: YARN, or Yet Another Resource Negotiator, with the new MapReduce running as one of many possible components on top of it. In fact, Apache claims that any distributed application can run on YARN, albeit with some porting. To that end, Apache maintains a list of YARN-compatible applications, such as the social-graph analysis system Apache Giraph (which Facebook uses). More are on the way from other parties, too.

As radical as this approach is, Apache wisely decided not to break backward compatibility, so MapReduce 2 still has the same APIs as its predecessor. Existing jobs just need a recompile to work properly.

It’s also hardly a coincidence that YARN makes Hadoop far more cross-compatible with other Apache projects for massaging big data. Use one, and it becomes far easier to use the rest. Such a rising tide for Hadoop would help lift all of Apache’s related boats.

The biggest win of all here is that MapReduce itself becomes just one of many possible ways to mine data through Hadoop. Apache’s own Spark, another candidate for porting to YARN, might be better suited to some kinds of work than MapReduce, so Hadoop 2 gives you more flexibility to choose the engine that’s the best fit.
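For anyone who hasn’t worked with it, the shape MapReduce imposes on a job is easy to see in miniature. The pure-Python sketch below walks a word count through map, shuffle, and reduce phases. It assumes nothing about Hadoop’s Java API: the function names (`map_phase`, `shuffle`, `reduce_phase`, `word_count`) are made up for illustration, and everything runs in one process rather than across a cluster.

```python
from collections import defaultdict


def map_phase(line):
    # Emit a (word, 1) pair for every word in one input line.
    return [(word.lower(), 1) for word in line.split()]


def shuffle(pairs):
    # Group values by key, the step a framework performs between map and reduce.
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key, values):
    # Collapse all values collected for one key into a single result.
    return key, sum(values)


def word_count(lines):
    pairs = [pair for line in lines for pair in map_phase(line)]
    return dict(reduce_phase(k, v) for k, v in shuffle(pairs).items())


counts = word_count(["big data big cluster", "data node"])
print(counts)
```

Work that fits this map-then-reduce shape is a natural match for MapReduce. Work that needs many passes over the same data has to be forced into a chain of such jobs, which is one reason a different engine can be the better fit.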

The two big Hadoop vendors, Cloudera and Hortonworks, have each made their own case for why YARN is crucial, even if they approach Hadoop from markedly different directions. Cloudera’s Impala offers the ability to run low-latency SQL queries against HDFS-stored data, which makes it best suited to live analytics; Hortonworks has chosen to go with Apache’s native Hive technology, which is best for data warehouse operations (like long-running queries with lots of join-type operations).

Porting apps to YARN isn’t a trivial effort, though, so the payoff from reworking Hadoop this radically will depend heavily on how much gets deployed within the new framework. But the fact that both Cloudera and Hortonworks are solidly behind Hadoop 2, and have neither forked the product nor stuck with its earlier iterations, is major evidence that Hadoop 2 isn’t just smoke and mirrors. Or tangled yarn.