
HBase Data Model: Column Qualifiers and Versioning (Part – 2)

Jayden Bell 10-May-2016

We examined the dos and don’ts of row keys and column families in detail in our previous post. Now let’s move on and see what column qualifiers are and how versioning works in HBase. (Of course, we will continue with the same data set as in the previous post.)

 

Row Key    Column Family: {Column Qualifier: Version: Value}

001        CustomerName: {‘FN’: 1383859182496: ‘Sheldon’,
                          ‘LN’: 1383859182858: ‘Cooper’,
                          ‘MN’: 1383859183001: ‘Wills’,
                          ‘MN’: 1383859182915: ‘W’}
           ContactInfo:  {‘EA’: 1383859183030: ‘sh.cooper@mindstick.com’,
                          ‘SA’: 1383859183073: ‘45 LT NY’}

002        CustomerName: {‘FN’: 1383859183103: ‘Hank’,
                          ‘LN’: 1383859183163: ‘Moody’}
           ContactInfo:  {‘SA’: 1383859185577: ‘16 TL CA’}

 Column Qualifiers

Column qualifiers are the names we assign to our data values so that we can identify them accurately. Unlike column families, column qualifiers are virtually unlimited in content, length and number. If we omit the column qualifier, HBase does not reject the write; it stores the value under an empty qualifier. Qualifiers do not have to be printable characters, so any sequence of bytes can be used. Since the set of column qualifiers is not fixed in advance, new columns can be added to a column family on the fly, which makes HBase flexible when the shape of the data changes. However, there is a cost to consider: HBase stores the column qualifier with every value (because it is actually part of the key), and because HBase does not limit the number of column qualifiers we can have, long column qualifiers can be quite costly in terms of storage. That is why we abbreviate the column qualifiers in the table above (for instance, ‘LN’ is used instead of ‘LastName’, and ‘SA’ stands for street address).
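To make the storage cost concrete, here is a rough sketch that estimates the size of the key stored with each cell. It assumes the classic HBase KeyValue key layout (row length, row, family length, family, qualifier, timestamp, key type) and ignores compression and block encoding, so treat the numbers as an upper-bound illustration rather than a measurement. The function name approx_key_bytes is made up for this example.

```python
def approx_key_bytes(row: bytes, family: bytes, qualifier: bytes) -> int:
    """Approximate size in bytes of the key part of one HBase cell.

    Assumes the classic KeyValue key layout; compression and
    data block encoding are ignored.
    """
    row_len_field = 2     # 2-byte field holding the row key length
    family_len_field = 1  # 1-byte field holding the column family length
    timestamp = 8         # 8-byte version timestamp
    key_type = 1          # 1-byte marker (put, delete, ...)
    return (row_len_field + len(row)
            + family_len_field + len(family)
            + len(qualifier)
            + timestamp + key_type)


short_key = approx_key_bytes(b"001", b"CustomerName", b"LN")
long_key = approx_key_bytes(b"001", b"CustomerName", b"LastName")

# The extra bytes are paid again for every stored cell and every version.
extra_per_cell = long_key - short_key
print(short_key, long_key, extra_per_cell)
```

The difference per cell is small, but it is repeated for every value and every version in the table, which is why short qualifiers (and short column family names) pay off on large tables.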

Notice that our logical representation of the customer contact information takes advantage of HBase’s support for sparse data in the case of Hank Moody: columns with no value are simply not stored. Assuming this table represents customer contact information for a software service company like MindStick, the company isn’t too worried about Hank’s middle name (abbreviated ‘MN’) and e-mail address (abbreviated ‘EA’) right now, but hopes to gather that information progressively over time.
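One way to picture the sparse layout is as nested maps in which a missing column is just a missing key. The sketch below is only an in-memory model of row 002 from the table, not HBase client code; the helper names stored_cells and has_column are invented for illustration.

```python
# Row 002 modelled as: column family -> column qualifier -> {version: value}
row_002 = {
    "CustomerName": {
        "FN": {1383859183103: "Hank"},
        "LN": {1383859183163: "Moody"},
    },
    "ContactInfo": {
        "SA": {1383859185577: "16 TL CA"},
    },
}


def stored_cells(row: dict) -> int:
    """Count the cells that actually exist; absent columns cost nothing."""
    return sum(len(versions)
               for family in row.values()
               for versions in family.values())


def has_column(row: dict, family: str, qualifier: str) -> bool:
    """True if at least one version is stored for family:qualifier."""
    return bool(row.get(family, {}).get(qualifier))


print(stored_cells(row_002))                      # only FN, LN and SA exist
print(has_column(row_002, "CustomerName", "MN"))  # no middle name stored yet
print(has_column(row_002, "ContactInfo", "EA"))   # no e-mail address stored yet
```

When the middle name or e-mail address arrives later, it becomes one more key in the relevant inner map; nothing about the existing cells has to change.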

Versions

Looking back at the table above, we can see a number between the column qualifier and the value (‘FN’: 1383859182496: ‘Sheldon’, for example). That number is the version of that value. Values stored in HBase are timestamped by default, which means we have a way to identify different versions of our data right out of the box. It is possible to supply a custom version number, but developers usually go with the default timestamp taken from the current time, expressed as the number of milliseconds since the Unix epoch (midnight, January 1, 1970 UTC). Versions are stored in decreasing timestamp order, so the most recent value is returned by default unless a query specifies a particular timestamp. We can see in the table above that our service provider, MindStick, at first had only an initial for Sheldon Cooper’s middle name and later learned that the ‘W’ stood for ‘Wills’. The most recent value for the ‘MN’ column is stored first in the table.
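The read behaviour described above can be sketched with a small in-memory model of the ‘MN’ column for row 001. This is an illustration of the idea only, not the HBase client API; read_cell and newest_first are hypothetical helpers, and a request for a specific timestamp is modelled here as an exact match.

```python
from datetime import datetime, timezone

# Versions of CustomerName:MN for row 001, keyed by timestamp in milliseconds.
mn_versions = {
    1383859182915: "W",
    1383859183001: "Wills",
}


def newest_first(versions: dict) -> list:
    """Return (timestamp, value) pairs in decreasing timestamp order."""
    return sorted(versions.items(), reverse=True)


def read_cell(versions: dict, timestamp=None):
    """Return the newest value, or the value at an exact timestamp if given."""
    if timestamp is not None:
        return versions.get(timestamp)
    if not versions:
        return None
    return versions[max(versions)]


def as_utc(timestamp_ms: int) -> datetime:
    """Convert a millisecond timestamp to a timezone-aware UTC datetime."""
    return datetime.fromtimestamp(timestamp_ms / 1000, tz=timezone.utc)


print(read_cell(mn_versions))                 # default read: latest version
print(read_cell(mn_versions, 1383859182915))  # read of one specific version
print(newest_first(mn_versions))
print(as_utc(1383859182496))                  # the version of 'FN' for row 001
```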

We can also limit how long data stays in HBase by using a setting known as time to live (TTL). In addition, we can cap the number of versions that HBase keeps for each value. Both settings are configured per column family.
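The combined effect of the two settings can be sketched as a pruning step over the versions of a single cell. The function below is a simplified model under stated assumptions: TTL is given in seconds, age is measured from the cell’s timestamp, and the name prune and its parameters are invented for this example. In HBase itself the limits are column family attributes (VERSIONS and TTL in the HBase shell), and expired data is physically removed by the server during compactions rather than by client code.

```python
def prune(versions: dict, max_versions: int, ttl_seconds: int, now_ms: int) -> dict:
    """Keep only versions younger than the TTL, then at most max_versions newest."""
    cutoff_ms = now_ms - ttl_seconds * 1000
    live = [(ts, value) for ts, value in versions.items() if ts >= cutoff_ms]
    live.sort(reverse=True)  # newest first, matching HBase's ordering
    return dict(live[:max_versions])


mn_versions = {
    1383859182915: "W",
    1383859183001: "Wills",
}
now_ms = 1383859184000  # an arbitrary "current time" chosen for the example

# Generous TTL, but only one version allowed: the older value is dropped.
print(prune(mn_versions, max_versions=1, ttl_seconds=3600, now_ms=now_ms))

# Two versions allowed, but a one-second TTL: only the newer value is young enough.
print(prune(mn_versions, max_versions=2, ttl_seconds=1, now_ms=now_ms))
```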


Updated 31-Mar-2019
