From the beginning of the Hadoop’s history, MapReduce has been the complete game changer in town when it comes to deal with data processing. The availability of MapReduce has been the primary reason behind the success of Hadoop and at the meantime is a major factor in limiting further adoption.
MapReduce empowers skilled programmers and coders to write distributed applications without considering the complexities involved in the underlying distributed computing infrastructure. This is a great deal: Hadoop and the MapReduce framework resolves all sorts of complexity that application developers don’t need to manage. For instance, the ability to transparently scale out the cluster by adding nodes and the automated failover mechanism for both data storage and data processing sub-tasks happen with almost zero impact on applications.
The other face of the coin here is that, however MapReduce hides and manages a tremendous amount of complexity, we can’t afford to forget what it is: an interface for parallel programming. This is a very useful skill — and act as a barrier to wider adoption. There are rarely good MapReduce programmers in the industry, and not everyone got the skill to master it.
In Hadoop’s beginning days (Hadoop 1 and before), we could able top only run MapReduce applications on the clusters. But in Hadoop 2, the YARN component changed every aspect by taking over resource management and scheduling from the MapReduce framework, and enables a generic interface to facilitate applications to run on a Hadoop cluster. To conclude, this means MapReduce is now just one of many application frameworks we can use as a tool to develop and run applications on Hadoop. Although it’s definitely possible to run applications using other frameworks on Hadoop, but it doesn’t mean that we can start forgetting about MapReduce. At the time, MapReduce is still the only production-ready data processing framework available for Hadoop.
Although other useful frameworks will eventually become available, MapReduce is comprised of almost a decade of maturity under its belt (with almost 4,000 JIRA issues completed, involving hundreds of developers, if we still keeping track). There’s no argument: MapReduce is Hadoop’s most mature and powerful framework for data processing. In addition, a significant amount of MapReduce code is now in use and is unlikely to go anywhere soon. Long story short: MapReduce is an important part of the Hadoop story.
Developing the parallel strategies requires an altogether different approach. What’s more, as the computation problems get even a little more difficult, they become even harder when they need to be parallelized. For wider-ranging complex jobs such as statistical processing or text extraction, and especially for processing unstructured data, we need to use MapReduce.
Allen Scott
28-Apr-2017