HBase Replication

HBase supports cluster replication, which is a way to copy data between HBase clusters. For example, it can be used to easily ship edits from a real-time frontend cluster to a batch-oriented cluster on the backend.

The basic architecture of HBase replication is straightforward. The master cluster captures edits from its write-ahead log (WAL) and puts replicable Key/Values (edits of column families with replication support) from the log into the replication queue. The replication messages are then sent to the peer cluster and replayed there using its normal HBase client API. The master cluster also keeps the current position of the WAL being replicated in ZooKeeper, for failure recovery.
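
Because this position tracking lives in ZooKeeper, you can peek at the replication state directly with the ZooKeeper CLI. This is only an illustrative sketch: the znode layout shown (/hbase/replication with peers, rs, and state children) is the default for the HBase version used here, and $ZOOKEEPER_HOME is assumed to point to your ZooKeeper installation:

hadoop@master1$ $ZOOKEEPER_HOME/bin/zkCli.sh -server master1:2181
zk> ls /hbase/replication
[peers, rs, state]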

Because HBase replication is done asynchronously, the clusters participating in the replication can be geographically distant. It is not a problem if the connection between them is offline for some time, as the master cluster will track the replication and catch up after the connection comes back online. This means that HBase replication can serve as a disaster recovery solution at the HBase layer.

In this recipe, we will look at how to enable replication of a table between two clusters.

You will need two HBase clusters: one is the master, and the other is the replication peer (slave) cluster. Here, let's say the master is master1:2181/hbase and the peer is l-master1:2181/hbase; the two clusters do not need to be of the same size.

ZooKeeper should be handled independently, and not by HBase. Check the HBASE_MANAGES_ZK setting in your hbase-env.sh file, and make sure it is set to false.
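
For example, hbase-env.sh on both clusters should contain a line like the following:

hadoop@master1$ vi $HBASE_HOME/conf/hbase-env.sh
export HBASE_MANAGES_ZK=false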

All machines, including the ZooKeeper clusters and HBase clusters, need to be able to reach each other. Make sure both clusters have the same HBase and Hadoop major version. For example, having 0.92.1 on the master and 0.92.0 on the peer is correct, but 0.90 is not.
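
You can confirm the versions on each cluster with the standard version commands:

hadoop@master1$ $HBASE_HOME/bin/hbase version
hadoop@master1$ $HADOOP_HOME/bin/hadoop version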

Follow these instructions to replicate data between HBase clusters:

  1. Add the following code to HBase's configuration file (hbase-site.xml) to enable replication on the master cluster:

hadoop@master1$ vi $HBASE_HOME/conf/hbase-site.xml
<property>
  <name>hbase.replication</name>
  <value>true</value>
</property>

  2. Sync the change to all the servers, including the client nodes in the cluster, and restart HBase.
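
A minimal sketch of this step, assuming passwordless SSH and a hypothetical slave1 node (repeat the rsync for every node in the cluster):

hadoop@master1$ rsync -av $HBASE_HOME/conf/hbase-site.xml slave1:$HBASE_HOME/conf/
hadoop@master1$ $HBASE_HOME/bin/stop-hbase.sh
hadoop@master1$ $HBASE_HOME/bin/start-hbase.sh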

  3. Connect to HBase Shell on the master cluster and enable replication on the table you want to replicate:

hac@client1$ $HBASE_HOME/bin/hbase shell
hbase> create 'reptable1', { NAME => 'cf1', REPLICATION_SCOPE => 1 }

If you are using an existing table, alter it to support replication:

hbase> disable 'reptable1'
hbase> alter 'reptable1', NAME => 'cf1', REPLICATION_SCOPE => '1'
hbase> enable 'reptable1'

  4. Execute steps 1 to 3 on the peer (slave) cluster as well. This includes enabling replication, restarting HBase, and creating an identical copy of the table.

  5. Add a peer replication cluster via HBase Shell from the master cluster:

hbase> add_peer '1', 'l-master1:2181:/hbase'

  6. Start replication on the master cluster by running the following command:

hbase> start_replication

  7. Add some data into the master cluster:

hbase> put 'reptable1', 'row1', 'cf1:v1', 'foo'
hbase> put 'reptable1', 'row1', 'cf1:v2', 'bar'
hbase> put 'reptable1', 'row2', 'cf1:v1', 'foobar'

You should be able to see the data appear in the peer cluster table in a short while.

  8. Connect to HBase Shell on the peer cluster and do a scan on the table to see if the data has been replicated:

hac@l-client1$ $HBASE_HOME/bin/hbase shell
hbase> scan 'reptable1'

ROW     COLUMN+CELL
 row1   column=cf1:v1, timestamp=1326095294209, value=foo
 row1   column=cf1:v2, timestamp=1326095300633, value=bar
 row2   column=cf1:v1, timestamp=1326095307619, value=foobar
2 row(s) in 0.0280 seconds

  9. Verify the replicated data on the two clusters by invoking the verifyrep command on the master cluster:

hac@client1$ $HADOOP_HOME/bin/hadoop jar $HBASE_HOME/hbase-0.92.1.jar verifyrep 1 reptable1

12/01/09 16:50:22 INFO replication.ReplicationZookeeper: Replication is now started
12/01/09 16:50:24 INFO mapred.JobClient: Running job: job_201201091517_0005
12/01/09 16:50:25 INFO mapred.JobClient:  map 0% reduce 0%
12/01/09 16:50:46 INFO mapred.JobClient:  map 100% reduce 0%
12/01/09 16:50:51 INFO mapred.JobClient: Job complete: job_201201091517_0005
12/01/09 16:50:51 INFO mapred.JobClient: Counters: 19
12/01/09 16:50:51 INFO mapred.JobClient:   File Output Format Counters
12/01/09 16:50:51 INFO mapred.JobClient:     Bytes Written=0
12/01/09 16:50:51 INFO mapred.JobClient:   org.apache.hadoop.hbase.mapreduce.replication.VerifyReplication$Verifier$Counters
12/01/09 16:50:51 INFO mapred.JobClient:     GOODROWS=2

We have skipped some of the verifyrep command's output above for clarity.

 10. Remove the replication peer from the master cluster by using the following command:

hbase> remove_peer '1'

 

Replication is still considered an experimental feature, and it is disabled by default. In order to enable it, we added the hbase.replication property to HBase's configuration file (hbase-site.xml) and set it to true. To apply the change, we synced it to all nodes, including the client nodes in the cluster, and then restarted HBase in step 2. Data replication is configured at the column family level. Setting a column family with the REPLICATION_SCOPE => '1' property enables that column family to support replication. We did this in step 3, by either altering an existing table or creating a new one with the replication scope set to 1.
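
To double-check that the scope change took effect, describe the table in HBase Shell; the cf1 family should show REPLICATION_SCOPE => '1' among its attributes:

hbase> describe 'reptable1'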

For the peer cluster, we performed the same procedure in step 4: enabling replication support and creating an identical table, with the exact same name, for the replicated column families.

With the preparation done between steps 1 and 4, we add the replication peer cluster to the master cluster in step 5, so that edits can be shipped to it subsequently. A replication peer is identified by an ID (1 in our case) and a full description of the peer cluster's ZooKeeper quorum, in the format hbase.zookeeper.quorum:hbase.zookeeper.client.port:zookeeper.znode.parent, such as server1,server2,server3:2181:/hbase.
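
So, for instance, if the peer's ZooKeeper quorum ran on three hypothetical hosts zk1, zk2, and zk3, the command in step 5 would look like this:

hbase> add_peer '2', 'zk1,zk2,zk3:2181:/hbase'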

After the peer is added, we start the actual shipping of edit records to the peer cluster in step 6.

To test our replication setup, we put some data into the table; after a while, as you can see from the output of the scan command on the peer cluster, the data has been shipped to the peer cluster correctly. While this is easy to verify when looking at only a few rows, the better way is to use the verifyrep command to compare the two tables. The following is the help description of the verifyrep command:

hac@client1$ $HADOOP_HOME/bin/hadoop jar $HBASE_HOME/hbase-0.92.1.jar verifyrep
Usage: verifyrep [--starttime=X] [--stoptime=Y] [--families=A] <peerid> <tablename>

Options:
 starttime    beginning of the time range
              without endtime means from starttime to forever
 stoptime     end of the time range
 families     comma-separated list of families to copy

Args:
 peerid       Id of the peer used for verification, must match the one given for replication
 tablename    Name of the table to verify

Examples:
 To verify the data replicated from TestTable for a 1 hour window with peer #5
 $ bin/hbase org.apache.hadoop.hbase.mapreduce.replication.VerifyReplication --starttime=1265875194289 --stoptime=1265878794289 5 TestTable

Running verifyrep from the hadoop jar command, with the peer ID (the one used to establish a replication stream in step 5) and the table name as parameters, starts a MapReduce job to compare each cell in the original and replicated tables. Two counters are provided by the verifyrep command: Verifier.Counters.GOODROWS and Verifier.Counters.BADROWS. Good rows are rows that were an exact match between the two tables, while bad rows are rows that did not match. As our data was replicated successfully, we got the following output:

12/01/09 16:50:51 INFO mapred.JobClient: GOODROWS=2

If you got some bad rows, check the MapReduce job’s map log to see the reason.
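
The options shown in the help above can also be combined to verify only a subset of the data; for example, to check just the cf1 family of our table (an illustrative invocation, reusing the same jar and peer ID as before):

hac@client1$ $HADOOP_HOME/bin/hadoop jar $HBASE_HOME/hbase-0.92.1.jar verifyrep --families=cf1 1 reptable1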

Finally, we stop the replication and remove the peer from the master cluster. Stopping the replication still completes shipping all the queued edits to the peer, but no further edits are accepted for replication.
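
In the HBase version used here, the shipping itself is toggled with the stop_replication command, the counterpart of start_replication from step 6:

hbase> stop_replication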

The replication-related commands available in the HBase shell (as of HBase version 1.0) are:

add_peer, append_peer_tableCFs, disable_peer, enable_peer, list_peers, list_replicated_tables, remove_peer, remove_peer_tableCFs, set_peer_tableCFs, show_peer_tableCFs
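
As a brief sketch of how some of these are used (syntax as of the HBase 1.0 shell, reusing our table and peer ID for illustration): disable_peer pauses shipping to a peer while its WALs continue to queue up, enable_peer resumes it, and set_peer_tableCFs limits a peer to specific tables and column families:

hbase> disable_peer '1'
hbase> enable_peer '1'
hbase> set_peer_tableCFs '1', 'reptable1:cf1'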

