The Mapper class is the base class for the map phase of a MapReduce job. To implement our own mapping logic, we extend this class. Here we look at the important methods of the Mapper class.
map method:
This is a cut-down view of the base Mapper class provided by Hadoop. For our own mapper implementations, we will subclass this base class and override the specified method as follows:
class Mapper<K1, V1, K2, V2>
{
    void map(K1 key, V1 value, Mapper.Context context)
            throws IOException, InterruptedException
    {
        // our code goes here ...
    }
}
Although the use of Java generics can make this look a little opaque at first, there is actually not that much going on. The class is defined in terms of the key/value input types (K1, V1) and output types (K2, V2), and the map method takes a single input key/value pair as its parameters. The other parameter is an instance of the Context class, which provides various mechanisms for communicating with the Hadoop framework, one of which is to output the results of a map or reduce method.
Apart from the map method, there are three additional methods that may sometimes need to be overridden.
setup method:
protected void setup(Mapper.Context context)
throws IOException, InterruptedException
This method is called once before any key/value pairs are presented to the map method. The default implementation does nothing.
cleanup method:
protected void cleanup(Mapper.Context context)
throws IOException, InterruptedException
This method is called once after all key/value pairs have been presented to the map method. The default implementation does nothing.
run method:
protected void run(Mapper.Context context)
throws IOException, InterruptedException
This method controls the overall flow of task processing within a JVM. The default implementation calls the setup method once, then calls the map method repeatedly for each key/value pair in the split, and finally calls the cleanup method.
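The call order described for the default run method can be illustrated with a small stand-alone sketch. Note that SketchMapper, its no-argument setup/map/cleanup methods, and the Map-based "split" are hypothetical stand-ins invented for this illustration, not Hadoop's own classes. The sketch models only the order of the calls and leaves out the Context parameter and the exceptions.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class MapperLifecycleSketch {

    /** Hypothetical stand-in for Hadoop's Mapper; only the call order is modeled. */
    public static class SketchMapper<K1, V1> {
        // Records every lifecycle call so the order can be inspected afterwards.
        public final List<String> calls = new ArrayList<>();

        protected void setup() { calls.add("setup"); }

        protected void map(K1 key, V1 value) { calls.add("map " + key + "=" + value); }

        protected void cleanup() { calls.add("cleanup"); }

        // setup once, map once per record in the split, cleanup once.
        public void run(Map<K1, V1> split) {
            setup();
            for (Map.Entry<K1, V1> record : split.entrySet()) {
                map(record.getKey(), record.getValue());
            }
            cleanup();
        }
    }

    /** Feeds two (offset, line) records through the sketch and returns the call log. */
    public static List<String> runDemo() {
        Map<Long, String> split = new LinkedHashMap<>(); // keeps insertion order
        split.put(0L, "first line");
        split.put(11L, "second line");
        SketchMapper<Long, String> mapper = new SketchMapper<>();
        mapper.run(split);
        return mapper.calls;
    }

    public static void main(String[] args) {
        runDemo().forEach(System.out::println);
    }
}
```

Running main prints setup first, then one map line per record, then cleanup. A subclass that overrides setup or cleanup in this sketch would see its code run exactly once per run call, which is why those two methods suit one-time work such as opening and closing a resource.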
Notice that the map method only refers to a single K1/V1 key/value pair. This is a critical aspect of the MapReduce paradigm: we write classes that process single records, and the framework is responsible for all the work required to turn an enormous data set into a stream of key/value pairs. We never have to write map or reduce classes that try to deal with the full data set. Hadoop also provides, through its InputFormat and OutputFormat classes, implementations of common file formats, which removes the need to write file parsers for anything but custom file types.
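To make the single-record idea concrete, here is a minimal word-count style sketch in plain Java. It assumes each record is a (byte offset, line of text) pair. SketchContext and its write method are hypothetical stand-ins for the Context parameter, and runDemo plays the role of the framework by handing the map logic one record at a time. None of this is Hadoop's actual API.

```java
import java.util.ArrayList;
import java.util.List;

public class SingleRecordMapSketch {

    /** Hypothetical stand-in for the Context parameter: it only collects output pairs. */
    public static class SketchContext {
        public final List<String> output = new ArrayList<>();

        void write(String key, int value) { output.add(key + "\t" + value); }
    }

    /** The map logic sees exactly one record (offset, line) per call. */
    static void map(long offset, String line, SketchContext context) {
        for (String word : line.trim().split("\\s+")) {
            if (!word.isEmpty()) {
                context.write(word.toLowerCase(), 1); // emit (word, 1)
            }
        }
    }

    /** Plays the framework's role: turns the input into a stream of records. */
    public static List<String> runDemo() {
        String[] lines = {"Hello Hadoop", "hello mapper"};
        SketchContext context = new SketchContext();
        long offset = 0;
        for (String line : lines) {
            map(offset, line, context);
            offset += line.length() + 1; // +1 for the newline between lines
        }
        return context.output;
    }

    public static void main(String[] args) {
        runDemo().forEach(System.out::println);
    }
}
```

The map method here never sees the whole input. It receives one line, emits one pair per word, and returns. Adding up the counts per word would be the job of a reduce step, which this sketch does not cover.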
Royce Roy
28-Apr-2017