Thursday, December 29, 2011

Introduction to HBase

Sometimes people ask me: "What is HBase?" It's hard to give a concise answer.

There is a lot of information about HBase, but I have not yet been able to find a good, short introduction to it. The ones I found are either too short or too long.

So here's my attempt at YAITH (yet another introduction to HBase):

Definition
HBase is a key/value store. Specifically it is a
Sparse, Consistent, Distributed, Multidimensional, Sorted map.

Sounds extremely fancy... Let's explore what this means.

Map
HBase maintains maps of Keys to Values (key -> value). Each of these mappings is called a "KeyValue" or a "Cell". You can find a value by its key... That's it.

Sorted
These cells are sorted by the key. This is a very important property as it allows for searching ("give me all values for which the key is between X and Y"), rather than just retrieving a value for a known key.

Multidimensional
The key itself has structure. Each key consists of the following parts:
row-key, column family, column, and time-stamp.
So the mapping is actually:
(rowkey, column family, column, timestamp) -> value
rowkey and value are just bytes (column family needs to be printable), so you can store anything that you can serialize into a byte[] into a cell.
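To make the shape of this mapping concrete, here is a toy model in plain Java - deliberately not the HBase client API, and with String keys instead of bytes: a sorted map from a composite (rowkey, column family, column, timestamp) key to a value. (Note that real HBase sorts timestamps descending, newest first; this sketch ignores that.)

```java
import java.util.TreeMap;

public class ToyCellMap {
    // Composite key: rowkey / family : column / timestamp, joined so that
    // lexicographic order on the string mimics the conceptual sort order.
    static String key(String row, String family, String column, long ts) {
        return row + "/" + family + ":" + column + "/" + ts;
    }

    public static void main(String[] args) {
        TreeMap<String, String> cells = new TreeMap<>();
        cells.put(key("r1", "f", "c1", 1L), "v1");
        cells.put(key("r1", "f", "c2", 1L), "v2");
        cells.put(key("r2", "f", "c1", 1L), "v3");

        // Point lookup: find a value by its full coordinates.
        System.out.println(cells.get(key("r1", "f", "c2", 1L))); // prints v2
    }
}
```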

Sparse
This follows from the fact that HBase stores key -> value mappings and that a "row" is nothing more than a grouping of these mappings (identified by the rowkey mentioned above).
Unlike NULL values in most relational databases, absent information needs no storage; there is simply no cell for a column that has no value.
It also means that every value carries all its coordinates with it.

Distributed
One key feature of HBase is that the data can be spread over 100s or 1000s of machines and reach billions of cells. HBase manages the load balancing automatically.

Consistent
HBase makes two guarantees:
  1. All changes with the same rowkey (see Multidimensional above) are atomic.
  2. A reader will always see the last written (and committed) values.

Storage
HBase partitions the key space. Each partition is called a Table. Each table declares one or more column families. Column families define the storage properties for an arbitrary set of columns. Columns are not declared, they are essentially just an additional label for the value.

The principal operations supported by HBase are Put (add some data), Delete ("delete" some data), Scan (retrieve a range of cells), and Get (which is just a special case of Scan). No queries, no secondary indexes...

That's it in a nutshell. If you looked for a quick summary you can stop here.
Below I will go into a bit more detail and list some implications stemming from the architecture.

Details
Let's drill deeper into the multidimensional aspect of HBase. As mentioned above HBase maps (rowkey, column family, column, timestamp) to a value.

The rowkey is defined by the application. As the combined key is prefixed by the rowkey, this allows the application to define the desired sort order. Defining the right sort order is extremely important, as scanning is the only way to retrieve any value for which the key is not known a priori.
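As a plain-Java sketch of why this matters (a TreeMap stands in for the sorted cell store; the keys and names here are made up), a scan is just a range query over the sorted keys, and the rowkey prefix determines which cells end up in the range:

```java
import java.util.SortedMap;
import java.util.TreeMap;

public class ToyScan {
    public static void main(String[] args) {
        TreeMap<String, String> cells = new TreeMap<>();
        cells.put("user123/f:email/1", "a@example.com");
        cells.put("user123/f:name/1", "alice");
        cells.put("user456/f:name/1", "bob");
        cells.put("user789/f:name/1", "carol");

        // "Scan" with start-rowkey user123 (inclusive) and
        // stop-rowkey user456 (exclusive): a simple range query.
        SortedMap<String, String> result = cells.subMap("user123", "user456");
        result.forEach((k, v) -> System.out.println(k + " -> " + v));
    }
}
```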

The rowkey also provides a logical grouping of cells; and HBase ensures that all cells with the same rowkey are co-located on the same server (called a RegionServer in HBase), which allows for ACID guarantees for updates with the same rowkey without complicated and slow two-phase commit or Paxos.

Column families are declared when a table is created. They define storage attributes such as compression, number of versions to maintain, time to live, and minimum number of versions - among others.

Columns are arbitrary names (or labels) assigned by the application.

The timestamp is a long identifying (by default) the creation time of the cell. Each cell (as opposed to each row) is versioned, which makes it interesting to reason about consistency and ACID guarantees (more on that later). No data is ever overwritten or changed in place; instead, every "update" creates a new version of the affected set of cells.

You should not think of a row like this:
(key, column value 1, column value 2, NULL, column value 4, ...)
But rather like this:
(rowkey, column family, column1, timestamp) -> column value 1
(rowkey, column family, column2, timestamp) -> column value 2
(rowkey, column family, column4, timestamp) -> column value 4
...

So what about consistency?

HBase ensures that all new versions created by a single Put operation for a particular rowkey are either all seen by other clients or seen by none.
Furthermore a Get or Scan will only return a combination of versions of cells for a row that existed together at some point. This ensures that no client will ever see a partially completed update or delete.
HBase achieves this by using a variation of Multiversion Concurrency Control (MVCC).

What about storage size?

As said above, HBase stores cells. Cells are grouped by a rowkey into something that looks like a row. Since cells are stored individually the storage is sparse.

If your data is naturally sparse (i.e. your model has many columns, but each row only has values for a few of those), this works great; no space is wasted for the columns for which a row has no value.

If your model is dense, on the other hand (i.e. all columns have a value in each row), a lot of space is wasted, as each column stores its full coordinates (which may be many times larger than the actual column value).

Since coordinates are very repetitive, the supported compression mitigates the wasted space to some extent.

Of course, since the value of a cell is just an array of bytes, many dense columns can be serialized and stored into a single value of a single cell.
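A sketch of that technique in plain Java, using only the standard library (the column names and layout are made up): several logical columns are serialized into one byte[] that would then be stored as a single cell value.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;

public class PackedValue {
    // Serialize several logical columns into one byte[] for a single cell.
    static byte[] pack(String name, int age, boolean active) throws IOException {
        ByteArrayOutputStream bos = new ByteArrayOutputStream();
        DataOutputStream out = new DataOutputStream(bos);
        out.writeUTF(name);
        out.writeInt(age);
        out.writeBoolean(active);
        return bos.toByteArray();
    }

    public static void main(String[] args) throws IOException {
        byte[] value = pack("alice", 30, true);

        // Deserialize in the same field order to recover the columns.
        DataInputStream in = new DataInputStream(new ByteArrayInputStream(value));
        System.out.println(in.readUTF() + " " + in.readInt() + " " + in.readBoolean());
    }
}
```

The trade-off is that HBase can no longer see the individual fields, so per-column versioning and time range queries no longer apply to them.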

What about querying?

You can only Scan a range of cells or Get a set of cells.
A scan typically defines a start-rowkey and stop-rowkey (otherwise it would be a full table scan, with a runtime proportional to the size of the table).

In stores like HBase the data is typically denormalized at the time of writing, rather than sliced and diced at reading time.

How is the data physically stored?

Puts and Deletes are collected into an in-memory structure called the MemStore. Before the MemStore is updated, the changes are written to a Write Ahead Log (WAL) to enable recovery in case a server crashes.
When it reaches a certain size, the MemStore is flushed to disk into a StoreFile.

Periodically StoreFiles are compacted into fewer StoreFiles.

For reading and writing, HBase employs Log-Structured Merge-Trees, which is just a fancy way of saying that reading and compacting in HBase perform a merge sort: a scan looks at the heads of all StoreFiles and the MemStore and picks the smallest element first; in the case of a Scan it is returned to the client, in the case of a compaction it is written to the new StoreFile.
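The merge itself can be sketched in a few lines of plain Java. This is an illustration of the k-way merge idea, not HBase's actual scanner code; each "store file" is just a sorted list of keys:

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.Iterator;
import java.util.List;
import java.util.PriorityQueue;

public class MergeRead {
    // A scanner over one sorted "store file".
    static class FileScanner {
        final List<String> keys;
        int pos = 0;
        FileScanner(List<String> keys) { this.keys = keys; }
        String peek() { return keys.get(pos); }
        boolean done() { return pos >= keys.size(); }
    }

    // k-way merge: always take the scanner with the smallest head key.
    static List<String> mergeScan(List<List<String>> storeFiles) {
        PriorityQueue<FileScanner> heap =
            new PriorityQueue<>(Comparator.comparing(FileScanner::peek));
        for (List<String> f : storeFiles) {
            if (!f.isEmpty()) heap.add(new FileScanner(f));
        }
        List<String> out = new ArrayList<>();
        while (!heap.isEmpty()) {
            FileScanner s = heap.poll();
            out.add(s.keys.get(s.pos++));        // emit the smallest head
            if (!s.done()) heap.add(s);          // re-insert with its new head
        }
        return out;
    }

    public static void main(String[] args) {
        List<String> merged = mergeScan(List.of(
            List.of("a", "d", "f"),
            List.of("b", "c"),
            List.of("e")));
        System.out.println(merged); // [a, b, c, d, e, f]
    }
}
```

Whether the merged stream goes back to a client (Scan) or into a new file (compaction) is the only difference between the two code paths.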

There is a lot more to say, but I set out saying that this will be short.


More information can be found in the HBase Book.

Thursday, December 22, 2011

HBase "Raw" scans

In HBase Scans (and by extension Gets) do not retrieve deleted cells or the tombstone markers that mark them as deleted.
Sometimes it is useful for troubleshooting (or backup - there will be a separate blog post about that soon) to see all cells, including deleted cells and the tombstone markers.

HBASE-4536 introduces "raw" Scans (only available in HBase trunk - not the upcoming 0.92). In the Java client these are enabled by Scan.setRaw(true).

The HBase shell also supports this by adding RAW=>true to a scan.

Once raw mode is enabled the returned result contains not only the standard KeyValues, but also KeyValues for deleted cells and for tombstone markers (which are just special types of KeyValues, more on delete markers can be found here).

Here's an example of what it would look like in the shell:
hbase(main):001:0> scan 'x2', {RAW=>true, VERSIONS=>10}
ROW                   COLUMN+CELL                                               
 r1                   column=f:c, timestamp=1323323611106, value=v3             
 r1                   column=f:c, timestamp=1323323609988, type=DeleteColumn    
 r1                   column=f:c, timestamp=1323323609988, value=v2             
 r1                   column=f:c, timestamp=1323323608554, value=v1             
 r2                   column=f:c, timestamp=1323323617759, value=v3             
 r2                   column=f:c, timestamp=1323323616226, value=v2             
 r2                   column=f:c, timestamp=1323323614496, value=v1             
2 row(s) in 0.6380 seconds
In the above example, values 'v2' and 'v1' for row key 'r1' have been deleted with a column delete marker.
hbase(main):005:0> scan 'x1', {RAW=>true, VERSIONS=>10}
ROW                   COLUMN+CELL                                               
 r2                   column=f:, timestamp=1323323616226, type=DeleteFamily     
 r2                   column=f:c, timestamp=1323323617759, value=v3             
 r2                   column=f:c, timestamp=1323323616226, value=v2             
 r2                   column=f:c, timestamp=1323323614496, value=v1             
2 row(s) in 0.0500 seconds

Here 'v2' and 'v1' of row key 'r2' have been deleted with a family delete marker.

Notice how the column marker is sorted in line with the cells it affects (it is sorted after the cell for value 'v3'), but the family marker is sorted before all cells of the affected row key.
The sort order was carefully designed to allow HBase to identify all cells affected by a delete marker in a single forward scan through the store file(s).

Wednesday, December 21, 2011

HBase data retention options

Every key value (or cell) in HBase is versioned. Data is never changed in place, but a new version is created for every change.

HBase periodically cleans up old or expired versions when the memstore is flushed to disk (see HBASE-4241) and during periodic minor and major compactions. I will refer to all three events as "compaction" below.

HBase has two principal knobs to declare how much data you would like to retain.
  1. Number of versions
    when this number of versions (for a cell!) is reached older versions will be deleted during the next compaction.
  2. Time To Live (TTL)
    when cells are older than the TTL they will be removed during the next compactions.
Using the setMaxVersions(...) and setTimeRange(...) methods on the Get and Scan objects allows an application to decide what version it would like to see.
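As a plain-Java illustration (not the HBase API; the timestamps are made up) of how a time range and a version limit combine when reading: versions are kept newest first, the time range filter is applied, and then at most maxVersions of the remaining versions are returned.

```java
import java.util.ArrayList;
import java.util.List;

public class VersionFilter {
    // Timestamps of one cell's versions, newest first; the end of the
    // time range is exclusive, as in HBase.
    static List<Long> visibleVersions(List<Long> timestampsNewestFirst,
                                      int maxVersions,
                                      long minTs, long maxTsExclusive) {
        List<Long> out = new ArrayList<>();
        for (long ts : timestampsNewestFirst) {
            if (ts >= minTs && ts < maxTsExclusive) {
                out.add(ts);
                if (out.size() == maxVersions) break; // version limit reached
            }
        }
        return out;
    }

    public static void main(String[] args) {
        List<Long> versions = List.of(400L, 300L, 200L, 100L);
        // "As of time 349", keeping at most 2 versions: 400 is too new,
        // so the two newest versions inside the range win.
        System.out.println(visibleVersions(versions, 2, 0L, 350L));
    }
}
```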

Now, what happens to cells that are in principle expired or beyond the maximum number of versions before HBase had a chance to collect them?

The answer is that in that case any Scan or Get issued by a client will automatically filter these cells. For all practical purposes they are gone.

Another interesting question is: What happens when TTL is enabled for a column family and the last version of a cell expires?

In that case that last version is also deleted, leaving no version of the cell in question.

For certain backup scenarios it would be useful to set a TTL, but still keep at least a certain number of versions around. Then, for cells within the TTL range, one can fully restore any previous state (provided enough versions are stored), and at the same time it is always possible to restore the last N versions.

HBASE-4071 provides such a feature. It adds the ability to declare a minimum number of versions to keep.
Together with HBASE-4536 - described here, it is possible to design a fairly elaborate data retention policy for primary and replicated HBase stores.

For example it is possible to say:
  • expire cells after one week
  • keep at least two versions around
  • but not more than 100 versions
  • (with HBASE-4536) also keep deleted cells until they expire
Together with HBase replication this can be used as an effective way to keep backups of historical data.
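In the HBase shell such a policy could look roughly like this (the attribute names follow the shell's conventions; the table and family names are made up, and MIN_VERSIONS and KEEP_DELETED_CELLS only exist on builds that include HBASE-4071 and HBASE-4536 respectively):

```ruby
create 'backup_table', {NAME => 'f',
  TTL => 604800,              # expire cells after one week (seconds)
  MIN_VERSIONS => 2,          # but always keep at least two versions (HBASE-4071)
  VERSIONS => 100,            # and never more than 100
  KEEP_DELETED_CELLS => true} # retain deleted cells until they expire (HBASE-4536)
```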

Deletion in HBase

When a Delete command is issued through the HBase client, no data is actually deleted. Instead a tombstone marker is set, making the deleted cells effectively invisible.

User Scans and Gets automatically filter deleted cells until they get removed.
HBase periodically removes deleted cells during compactions.

The tombstone markers are only deleted during major compactions (which compacts all store files to a single one), because in order to prove that a tombstone marker has no effect HBase needs to look at all cells.

There are three types of tombstone markers:
  1. version delete marker
    Marks a single version of a column for deletion
  2. column delete marker
    Marks all versions of a column for deletion
  3. family delete marker
    Marks all versions of all columns for a column family for deletion
It is also possible to add a maximum time stamp to column and family delete markers, in which case only versions with a lower timestamp are affected by the delete marker.
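The effect of the three marker types can be sketched with a plain-Java toy (column names and timestamps are made up): a version delete marker targets one exact timestamp, while column and family markers affect every version at or below their timestamp.

```java
import java.util.List;
import java.util.function.Predicate;

public class DeleteMarkers {
    record Cell(String column, long ts) {}

    // The three marker types, expressed as predicates over the cells of
    // one row and one column family.
    static Predicate<Cell> versionMarker(String col, long ts) {
        return c -> c.column().equals(col) && c.ts() == ts;  // one exact version
    }
    static Predicate<Cell> columnMarker(String col, long ts) {
        return c -> c.column().equals(col) && c.ts() <= ts;  // all older versions
    }
    static Predicate<Cell> familyMarker(long ts) {
        return c -> c.ts() <= ts;                            // all columns, all older versions
    }

    public static void main(String[] args) {
        List<Cell> cells = List.of(
            new Cell("c1", 300), new Cell("c1", 200), new Cell("c2", 100));

        // A column delete marker on c1 at ts=200 hides c1@200 but not
        // the newer c1@300, and leaves other columns untouched.
        Predicate<Cell> del = columnMarker("c1", 200);
        cells.stream().filter(del.negate())
             .forEach(c -> System.out.println(c.column() + "@" + c.ts()));
    }
}
```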

HBase allows clients to perform time range queries in order to see only the versions in a specified range of time. For example, to see the data "as of time T" the range would be set to [0, T+1) (T+1, because in HBase the end time is exclusive).

There is one snag, though. Once a delete marker is set, all cells affected by that marker are no longer visible. If a Put for a column C was issued at time T and is followed by a column delete at time T+X, issuing a time range scan for [0, T+1) will return no data, as deleted cells are never shown.

HBASE-4536 addresses that issue. It is now possible to instruct a column family to retain deleted cells and treat them exactly like ordinary, undeleted cells (which means they still contribute to version counts and can expire if a TTL was set for the column family). This can be done in the Java client by calling HColumnDescriptor.setKeepDeletedCells(true) or through the HBase shell by setting KEEP_DELETED_CELLS=>true for a column family.

When this setting is enabled for a column family, deleted cells are visible to time range scans and gets as long as the requested range does not include the delete marker.

So in the case above a Scan or Get for [0, T+1) will return the Put that was marked as deleted. A Scan or Get for the range [0, T+X+1) will not return the Put as the range does include the delete marker.

This is very useful to provide full "as-of time" queries, for example on back up replicas for production data in case a user accidentally deleted some data.

Friday, December 16, 2011

Long running HBase clients

The default HBase client was not designed to be used in long running, multithreaded Application Servers.

An HTable object is not thread safe, so the application either has to cache HTables (possibly in an HTablePool) or create a new one for each use.

HTables are not lightweight objects. Under the hood an ExecutorService is created for each HTable to parallelize requests to multiple regionservers.

To make matters even more complicated, the connection to an HBase cluster is managed by an HConnectionImplementation, which maintains a pool of TCP connections per regionserver. Each TCP connection managed by an HConnectionImplementation has its own thread.
HConnections are created and cached on demand on behalf of HTables.

So when you use HTablePool you get a pool of HTables, each maintaining its own pool of threads along with a (potentially shared) HConnection, which in turn maintains a pool of TCP connections, each with its own thread.

This setup becomes inscrutable quickly.

In HBASE-4805 I propose a different way of looking at this setup.

Instead of creating HTables - and with them, implicitly, thread pools and HConnections - you can now (optionally) create and manage your HConnections to the HBase cluster directly, and create very lightweight HTable objects when needed by a thread for a few operations:

So the Application Server would create HConnection(s) and ExecutorService(s) ahead of time and reuse them with many HTables.

Configuration conf = HBaseConfiguration.create();
HConnection conn = HConnectionManager.createConnection(conf);
ExecutorService threadPool = ...

...

HTable t = new HTable("<tablename>", conn, threadPool);
t.put(...);

A quick test I performed suggests that it is actually cheaper to create an HTable this way than it is to retrieve one from an HTablePool.

Eventually the ExecutorService could be managed by the HConnection as well, and HTables would then simply be retrieved with HConnection.getHTable(<tableName>)... But it is a start for those of us who use an HBase client inside an Application Server.