Pig Latin has a simple syntax with powerful semantics we will use to carry out two primary operations:
· Access data and
· Transform data.
Compared with an equivalent Java MapReduce implementation, a Pig implementation produces the same result with far less code and is easier to understand.
In a Hadoop context,
· Accessing data means allowing developers to load, store, and stream data, whereas
· Transforming data means taking advantage of Pig’s ability to group, join, combine, split, filter, and sort data.
Pig Latin operators fall into three categories: data access, data transformations, and debugging.
For data access operations (loading, storing, and streaming data), Pig has three operators:
1. LOAD/STORE: Read data from and write data to the file system
2. DUMP: Write output to standard output (stdout)
3. STREAM: Send all records through an external binary
Data transformation operations (grouping, joining, combining, splitting, filtering, and sorting data) are handled by the following ten operators:
1. FOREACH: Apply an expression to each record and output one or more records
2. FILTER: Apply a predicate and remove records that don’t meet the condition
3. GROUP/COGROUP: Aggregate records with the same key from one or more inputs
4. JOIN: Join records from two or more inputs based on a condition
5. CROSS: Compute the Cartesian product of two or more inputs
6. ORDER: Sort records based on a key
7. DISTINCT: Remove duplicate records
8. UNION: Merge two data sets
9. SPLIT: Divide data into two or more bags based on a predicate
10. LIMIT: Restrict the number of records returned
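As a rough illustration of how these operators combine, the following sketch loads a comma-delimited file, filters it, groups it, and sorts the result. The file path data/students.txt, the relation names, and the fields (name, age, gpa) are all assumptions made up for this example.

```pig
-- Hypothetical input: data/students.txt with lines such as "alice,21,3.7"
students = LOAD 'data/students.txt' USING PigStorage(',')
           AS (name:chararray, age:int, gpa:float);

-- FILTER: keep only the records that satisfy the predicate
adults = FILTER students BY age >= 18;

-- GROUP: collect records that share the same key (age) into a bag
by_age = GROUP adults BY age;

-- FOREACH: produce one output record per group
counts = FOREACH by_age GENERATE group AS age, COUNT(adults) AS n;

-- ORDER and LIMIT: sort by count, then keep the first five records
sorted = ORDER counts BY n DESC;
top5   = LIMIT sorted 5;

-- DUMP: write the result to standard output
DUMP top5;
```

Saved to a file (say students.pig, a hypothetical name), a script like this could be tried in local mode with pig -x local students.pig.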
Apart from data access and transformations, Pig also provides operators that are
helpful for debugging and troubleshooting:
1. DESCRIBE: Return the schema of a relation.
2. DUMP: Dump the contents of a relation to the screen.
3. EXPLAIN: Display the logical, physical, and MapReduce execution plans.
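A minimal sketch of these debugging operators in use is shown here. The input path and field names are hypothetical, chosen only for illustration.

```pig
-- Hypothetical input file and schema
students = LOAD 'data/students.txt' USING PigStorage(',')
           AS (name:chararray, age:int, gpa:float);
by_age   = GROUP students BY age;

-- Print the schema of the grouped relation (no job is run)
DESCRIBE by_age;

-- Print the execution plans Pig would use to compute by_age
EXPLAIN by_age;

-- Run the pipeline and print the contents of by_age to the screen
DUMP by_age;
```

DESCRIBE is the cheapest of the three because it only inspects the schema, so it is a reasonable first check before running anything with DUMP.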
Part of the paradigm shift of Hadoop is that we apply our schema when the data is read rather than when it is written to storage (schema on read instead of schema on write). According to the old way of doing things, the RDBMS way, when we load data into our database system, we must load it into a well-defined set of tables. Hadoop allows us to store all that raw data up front and apply the schema at read time. With Pig, we do this as the data is read into a relation, with the help of the LOAD operator.
The optional USING clause defines how to map the data structure within the file to the Pig data model. A common choice is the PigStorage() function, which parses delimited text files. (The function named in the USING clause is often referred to as a LOAD Func and works in a fashion similar to a custom deserializer.)
The optional AS clause defines a schema for the data that is being mapped. If we don’t use a USING clause, we are basically telling Pig to fall back on the default LOAD Func, PigStorage, which expects a plain text file that is tab delimited. If we don’t use an AS clause, no schema is provided, and the fields must be referenced by position ($0, $1, and so on) because no names are defined.
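The two optional clauses can be sketched as follows. The file paths and field names are hypothetical. The first LOAD names both a load function and a schema. The second omits both, so Pig falls back on PigStorage with a tab delimiter and the fields have no names.

```pig
-- With USING and AS: comma-delimited input, fields referenced by name
named = LOAD 'data/students.txt' USING PigStorage(',')
        AS (name:chararray, age:int, gpa:float);
names = FOREACH named GENERATE name, gpa;

-- Without USING or AS: tab-delimited input is assumed and no schema
-- is defined, so fields are referenced by position ($0 is the first)
raw  = LOAD 'data/students.tsv';
cols = FOREACH raw GENERATE $0, $2;
```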
Using an AS clause means that we have a schema in place at read time for our text files. This lets users get started quickly and keeps schema modeling agile and flexible, so that we can add more data to our analytics.
The LOAD operator operates on the principle of lazy evaluation, also referred to as call-by-need. Now lazy doesn’t sound particularly praiseworthy, but all it means is that we delay the evaluation of an expression until we really need it. In the context of Pig, that means that after a LOAD statement is processed, no data is moved, and nothing gets shunted around, until a statement to write data is encountered. We can have a Pig script that is a page long and filled with complex transformations, but nothing gets executed until a DUMP or STORE statement is encountered.
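The following sketch marks where execution would actually be triggered. The paths, relation names, and the gpa threshold are assumptions made up for this example.

```pig
-- Nothing is read here: Pig only records these steps in its plan
students = LOAD 'data/students.txt' USING PigStorage(',')
           AS (name:chararray, age:int, gpa:float);
honors   = FILTER students BY gpa >= 3.5;
ranked   = ORDER honors BY gpa DESC;

-- Execution starts only at this statement, which writes data out
STORE ranked INTO 'output/honors' USING PigStorage(',');
```

If the STORE line were removed, the script would still be parsed and checked, but no data would be loaded, filtered, or sorted.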
Allen Scott
15-Apr-2017