Input Splits:
As we already know that in Hadoop, files are composed of individual records, which are ultimately processed one-by-one by mapper tasks. Now, remember that the block size for the Hadoop cluster is 64 megabytes, which means that the light data files are broken down into chunks of exactly 64 megabytes.
Are you able to see the problem? If every map job processes all records in a particular data block, what will be going to happens to those records that span block boundaries? File blocks are exactly 64 megabytes (or whatever we set the block size to be), and since HDFS has no conception of what’s inside the file data blocks, it can’t gauge when a record might spill over into next block. To get rid of this problem, Hadoop uses a logical representation of the data stored in file blocks, called as input splits. When a MapReduce task client calculates the input splits, it then identifies where the first whole record in a block begins and where the last record in the block ends. In cases where the ending record in a block is incomplete, the input split provides location information for the next block and the byte offset of the data required to complete the record.
MapReduce data processing is led by this concept of input splits. The number of input splits that are calculated for a particular application identifies the number of mapper tasks. Each of these mapper jobs is assigned, where possible, to a slave node where the input split is stored. The Resource Manager (or JobTracker in Hadoop 1) does its best to ensure that input splits are processed locally.
Map-Reduce as a series of key-value transformations:
We may have seen MapReduce expressed in terms of key/value transformations, in particular the intimidating one look exactly like this:
{K1, V1} -> {K2, List<V2>} -> {K3, V3}
Now we need to understand what these transformations means:
· The input to the map method of a MapReduce task is a series of key/value pairs (input splits) that we'll call K1 and V1.
· The output of this map method (and which intern is an input to the reduce method) is a series of keys and an associated list of values that we are going to call K2 and V2. Note that each mapper simply outputs a series of individual key/value results; these are combined togehter into a key and list of values in the shuffle method.
· The final result of the MapReduce job is another series of key/value pairs, which we call K3 and V3.
These data sets of key/value pairs don't have to be distinctive; it would be quite possible to input, for example, names and contact details and output the same, with perhaps some intermediary format used in collating the information.
Jonas Stuart
28-Apr-2017