The massive data volumes that are common in a typical Hadoop deployment make compression a necessity. Data compression saves a great deal of storage space and also accelerates the movement of data throughout the cluster. It's no surprise, then, that numerous compression schemes, called codecs, are available for us to consider.
In a Hadoop deployment, we are potentially dealing with a large number of individual slave nodes, each consisting of a number of large disk drives. It's not unusual for a single slave node to have more than 45 terabytes of raw storage space available for HDFS. Although Hadoop slave nodes are designed to be inexpensive, they're not free, and with massive amounts of data that tend to grow at increasing rates, compression is an essential tool for keeping extreme data volumes under control.
Let’s learn some basic compression terminology:
Codecs:
A codec, which is short for compressor/decompressor, is a technology (software or hardware, or both) for compressing and decompressing data; it’s basically an implementation of a compression/decompression algorithm. We need to know that some codecs support something known as splittable compression, and that codecs vary in both the speed with which they can compress and decompress data and the degree to which they can compress it.
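To make this a bit more concrete, here is a minimal Java sketch of how a codec is typically obtained and used in Hadoop code. It relies only on the standard org.apache.hadoop.io.compress classes (CompressionCodecFactory maps a file extension such as .gz to its codec class); the HDFS paths are hypothetical and just for illustration:

import java.io.InputStream;
import java.io.OutputStream;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;
import org.apache.hadoop.io.compress.CompressionCodec;
import org.apache.hadoop.io.compress.CompressionCodecFactory;

public class DecompressFile {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);

        Path compressed = new Path("/data/input/logs.gz");      // hypothetical path
        Path uncompressed = new Path("/data/output/logs.txt");  // hypothetical path

        // The factory infers the codec from the file extension (.gz, .bz2, ...).
        CompressionCodecFactory factory = new CompressionCodecFactory(conf);
        CompressionCodec codec = factory.getCodec(compressed);
        if (codec == null) {
            System.err.println("No codec found for " + compressed);
            return;
        }

        // Wrap the raw HDFS stream with the codec's decompressing stream and copy.
        try (InputStream in = codec.createInputStream(fs.open(compressed));
             OutputStream out = fs.create(uncompressed)) {
            IOUtils.copyBytes(in, out, 4096, false);
        }
    }
}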
Splittable compression:
Splittable compression is a key concept in a Hadoop context. The way Hadoop works is that files are split if they are larger than the file’s block size setting, and individual file splits can be processed in parallel by different mappers. With most codecs, text file splits cannot be decompressed independently of the other splits from the same file, so those codecs are said to be nonsplittable, and MapReduce processing is limited to a single mapper. Because the file can be decompressed only as a whole, and not as individual parts based on splits, there can be no parallel processing of such a file, and performance may take a huge hit as the job waits for a single mapper to process multiple data blocks that can’t be decompressed independently.
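As a quick illustration, the small sketch below (again assuming only the standard Hadoop compression classes) checks whether a codec is splittable: gzip is not, while bzip2 implements the SplittableCompressionCodec interface and therefore is. This is why a very large .bz2 text file can be processed by many mappers in parallel, while an equally large .gz file is handled by just one.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.io.compress.BZip2Codec;
import org.apache.hadoop.io.compress.CompressionCodec;
import org.apache.hadoop.io.compress.GzipCodec;
import org.apache.hadoop.io.compress.SplittableCompressionCodec;
import org.apache.hadoop.util.ReflectionUtils;

public class SplittableCheck {
    public static void main(String[] args) {
        Configuration conf = new Configuration();

        // ReflectionUtils.newInstance also injects the Configuration into the codec.
        CompressionCodec gzip = ReflectionUtils.newInstance(GzipCodec.class, conf);
        CompressionCodec bzip2 = ReflectionUtils.newInstance(BZip2Codec.class, conf);

        // Only codecs implementing SplittableCompressionCodec can be split by MapReduce.
        System.out.println("gzip splittable:  " + (gzip instanceof SplittableCompressionCodec));   // false
        System.out.println("bzip2 splittable: " + (bzip2 instanceof SplittableCompressionCodec));  // true
    }
}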
Splittable compression is only a factor when we are dealing with text files. For binary files, Hadoop compression codecs compress data within a binary-encoded container, depending on the file type (for example, a SequenceFile, Avro, or Protocol Buffer).
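For example, here is a rough sketch of writing a block-compressed SequenceFile, one of the container formats mentioned above. The output path and records are made up for illustration, and DefaultCodec is used simply to keep the example self-contained:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.io.compress.DefaultCodec;

public class WriteCompressedSequenceFile {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Path out = new Path("/data/output/events.seq");  // hypothetical path

        DefaultCodec codec = new DefaultCodec();
        codec.setConf(conf);

        // BLOCK compression compresses batches of records together inside the
        // container, and the SequenceFile remains splittable regardless of codec.
        try (SequenceFile.Writer writer = SequenceFile.createWriter(conf,
                SequenceFile.Writer.file(out),
                SequenceFile.Writer.keyClass(IntWritable.class),
                SequenceFile.Writer.valueClass(Text.class),
                SequenceFile.Writer.compression(
                        SequenceFile.CompressionType.BLOCK, codec))) {
            for (int i = 0; i < 100; i++) {
                writer.append(new IntWritable(i), new Text("record-" + i));
            }
        }
    }
}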