Wednesday, August 1, 2012

Pluggable compaction and scanning policies

As I described here, HBase maintains multiple versions of all key-values (KVs) stored and has two essential knobs to control the collection of old versions: TimeToLive (TTL) and MaximumNumberOfVersion (versions).

These two attributes are "statically" defined for each column family. To change them, the table needs to be disabled, the column families changed, and the table enabled again. That is a very heavyweight operation.
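To make the two knobs concrete, here is a minimal plain-Java model of how a TTL and a version limit together decide which versions of one cell survive. It is only a sketch: `RetentionSketch` and `retain` are hypothetical names, nothing here is an HBase class, and the exact expiry boundary is an assumption of the model.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Plain-Java model, not HBase code: all names here are hypothetical.
final class RetentionSketch {

    /**
     * Returns the version timestamps that would be kept for one cell.
     * Input must be sorted newest first. A version is treated as expired
     * once its age (now - timestamp) exceeds ttlMs.
     */
    static List<Long> retain(List<Long> newestFirst, long now, long ttlMs, int maxVersions) {
        List<Long> kept = new ArrayList<>();
        for (long ts : newestFirst) {
            if (kept.size() >= maxVersions) break; // version limit reached
            if (now - ts > ttlMs) break;           // this and all older versions have expired
            kept.add(ts);
        }
        return kept;
    }

    public static void main(String[] args) {
        List<Long> versions = Arrays.asList(100L, 90L, 50L, 10L);
        // Version limit is the tighter constraint here.
        System.out.println(retain(versions, 100L, 60L, 2));
        // TTL is the tighter constraint here.
        System.out.println(retain(versions, 100L, 60L, 10));
    }
}
```

Both limits are fixed arguments in this model, which mirrors the static per-column-family setting: nothing can loosen or tighten them while the data is being rewritten.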

An area where this is problematic is M/R-based, timestamp-consistent, incremental backups (a scheme that I outlined here). Another scenario might be an MVCC-based transaction engine on top of HBase that uses the HBase timestamps to maintain the version.

The problem in both cases is that it is hard to know a priori how long HBase needs to keep older versions around.

In HBASE-6427 I suggest some extensions to the coprocessor framework that would allow a RegionObserver to finely control what KVs are targeted for compaction, by allowing it to create the scanner that is used to scan the incoming KVs.

RegionObserver has three new hooks:
  • preFlushScannerOpen - called before a scanner iterating over the MemStore being flushed is created
  • preCompactScannerOpen - called before a scanner iterating over all StoreFiles to be compacted is created
  • preStoreScannerOpen - called before a user-initiated scan is started
These hooks can return a custom scanner to define the set of KVs that should be copied over to the compacted/flushed files.
(The various internal scanner interfaces in HBase are in need of some consolidation work - pre{Flush|Compact}ScannerOpen return an InternalScanner, whereas preStoreScannerOpen returns a KeyValueScanner - but that is a different story).
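The idea behind these hooks can be modeled in a few lines of plain Java: whoever supplies the scanner decides which cells end up in the rewritten file. This is a sketch only. `ModelScanner`, `ModelHook`, `filtering` and `compact` are hypothetical stand-ins, not the HBase interfaces or the actual hook signatures.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.function.Predicate;

// Plain-Java model of the hook idea; none of these types are HBase classes.
final class ScannerHookSketch {

    /** Stand-in for a scanner: yields the cells that will be written out. */
    interface ModelScanner {
        List<String> scanAll();
    }

    /** Stand-in for an observer hook: builds the scanner a compaction will use. */
    interface ModelHook {
        ModelScanner open(List<String> inputCells);
    }

    /** A hook whose scanner only lets through cells accepted by the predicate. */
    static ModelHook filtering(Predicate<String> keep) {
        return input -> () -> {
            List<String> out = new ArrayList<>();
            for (String cell : input) {
                if (keep.test(cell)) out.add(cell);
            }
            return out;
        };
    }

    /** Models a compaction: whatever the hook's scanner returns is what survives. */
    static List<String> compact(List<String> inputCells, ModelHook hook) {
        return hook.open(inputCells).scanAll();
    }

    public static void main(String[] args) {
        List<String> cells = Arrays.asList("row1/v3", "row1/v2", "tmp/v1");
        System.out.println(compact(cells, filtering(c -> !c.startsWith("tmp/"))));
    }
}
```

The compaction code in this model never inspects the cells itself, so swapping the hook is enough to change what is retained.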
 
HBASE-6427 also makes some HBase internal data structures accessible to coprocessors, so that a simple RegionObserver implementation could return a slightly modified StoreScanner (the public API site is not yet updated, so a link to the Javadoc is not available) from any of these hooks.

Looking at the test classes introduced by the change might be interesting, as they have a RegionObserver implement the default logic and then verify that the behavior of various operations is indeed unchanged.

For example, it would now be possible for a RegionObserver implementation to listen to a ZooKeeper node, which could indicate what data can safely be eliminated (i.e. not be visible to any of the scanners returned from these hooks). An incremental update process can use this to indicate the last time a successful backup was executed.
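A minimal plain-Java sketch of that backup scenario follows. It assumes a rule that is not spelled out above: versions written since the last successful backup are always kept, and older ones fall back to a version limit. `BackupAwarePolicySketch` and its methods are hypothetical names, and `onBackupCompleted` merely stands in for whatever callback a real ZooKeeper watcher would trigger.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.atomic.AtomicLong;

// Plain-Java model; no ZooKeeper or HBase classes are used, all names are hypothetical.
final class BackupAwarePolicySketch {

    // Timestamp of the last successful backup. Until one is reported,
    // nothing counts as backed up, so nothing may be purged.
    private final AtomicLong lastBackupTs = new AtomicLong(Long.MIN_VALUE);

    /** Stand-in for the callback fired when the watched node changes. */
    void onBackupCompleted(long backupTs) {
        lastBackupTs.set(backupTs);
    }

    /**
     * Returns the version timestamps to keep (input sorted newest first).
     * Versions not yet covered by a backup are always kept; versions that
     * are covered are kept only while the version limit has room.
     */
    List<Long> retain(List<Long> newestFirst, int maxVersions) {
        long cutoff = lastBackupTs.get(); // read once so a scan sees one consistent policy
        List<Long> kept = new ArrayList<>();
        for (long ts : newestFirst) {
            boolean notYetBackedUp = ts >= cutoff;
            if (notYetBackedUp || kept.size() < maxVersions) kept.add(ts);
        }
        return kept;
    }

    public static void main(String[] args) {
        BackupAwarePolicySketch policy = new BackupAwarePolicySketch();
        List<Long> versions = Arrays.asList(40L, 30L, 20L, 10L);
        System.out.println(policy.retain(versions, 1)); // before any backup
        policy.onBackupCompleted(25L);
        System.out.println(policy.retain(versions, 1)); // after a backup at time 25
    }
}
```

The point of the sketch is that the cutoff changes at runtime without touching any table definition; only the component that builds the scanner needs to know about it.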

I intend to provide an implementation of a sample ZK based coprocessor to listen for scan policy changes (HBASE-6496).

Note: Coprocessors are developer tools and not for the casual user. The intent of these changes was to make this form of control possible, not necessarily easy. The very nature of the flush/compaction process makes interfering with it a difficult proposition.