Backing up Hadoop data can be challenging because of the size and scope of the data set. This post illustrates how the Cloudera Manager Backup and Recovery tools can be used with an Isilon cluster to back up native HDFS data as well as Hive/Impala data and metadata. Once the data has landed on Isilon, additional features native to OneFS can be used to back up and protect it further. See the Isilon Hadoop Enterprise Features whitepaper for additional details.
Additional information on Cloudera Backup and Disaster Recovery can be found here on Cloudera’s site: BDR introduction
In our scenario, we have a DAS cluster and an Isilon-integrated Hadoop cluster, and we will back up from our active DAS cluster to an archive Isilon cluster.
It is also possible to back up data in the other direction, from Isilon to DAS. If this approach is used, additional considerations around active data need to be addressed. Since Cloudera BDR utilizes distcp, active data can present challenges that must be accounted for in the design of the replication jobs.
CDH 5.8 and greater
UID/GID parity – through local accounts or LDAP, parity in uid and gid is important to maintain consistent access across storage
DAS Cloudera cluster setup complete
Isilon Cloudera cluster setup complete
DNS name resolution fully functional for all hosts, forward and reverse
Both the source and destination clusters must have a Cloudera Enterprise license
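The UID/GID parity prerequisite can be spot-checked from the command line. The following is a minimal sketch, not part of the original procedure: the helper names account_ids and ids_match are hypothetical, and the Isilon-side value is assumed to have been noted separately from the Isilon cluster.

```shell
# Print "uid:gid" for an account as seen by this host (local or LDAP-backed).
account_ids() {
    printf '%s:%s\n' "$(id -u "$1")" "$(id -g "$1")"
}

# Succeed only when two "uid:gid" strings are identical.
ids_match() {
    [ "$1" = "$2" ]
}

# Example, run on a DAS cluster node:
#   das_ids=$(account_ids test1)
#   isilon_ids="1001:1001"   # hypothetical value recorded from the Isilon side
#   ids_match "$das_ids" "$isilon_ids" && echo "parity OK" || echo "MISMATCH"
```

Comparing plain strings keeps the check independent of how the identities are looked up on each side.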
Note the following when using replication jobs for clusters with Isilon:
• The hdfs user is mapped to root on Isilon. If you specify alternate users with the Run As option when creating replication schedules, those users must also be superusers.
• Always select the ‘Skip Checksum Checks’ property when creating replication schedules.
• Kerberos authentication is fully supported in CDH 5.8 and higher. The account used to replicate data needs a principal and keytab to authenticate against the target; see the Cloudera documentation for additional information on configuring this.
• Data replication can fail if the source data is modified during replication, so it is recommended to use snapshots as the source of data replication. If snapshots are enabled on the source directory, replication automatically uses them to prevent this issue. For more details, see the Cloudera documentation: Using Snapshots with Replication
• Source clusters that use Isilon storage do not support HDFS snapshots, which are what ensure data consistency during replication when source files are being modified. Therefore, when replicating from an Isilon source cluster, it is recommended that you do not replicate Hive tables or HDFS files that could be modified before the replication completes, unless you take additional steps to ensure the replication succeeds. Alternatives are to use SyncIQ to replicate data between Isilon clusters, or to use Isilon native snapshots in conjunction with metastore replication.
In our example, the source directory /user/test1 is on native HDFS, so we can enable snapshots on the directory to be replicated. Cloudera can then automatically detect that the directory is snapshot-enabled and use a snapshot as the source of replication.
Review the directory with the HDFS file browser in Cloudera Manager
Enable snapshots either via the HDFS file browser or via the CLI.
The directory /user/test1 is now snapshot enabled
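For the CLI route, a minimal sketch using the standard HDFS snapshot commands might look like the following. The run wrapper, the DRY_RUN switch and the snapshot name are assumptions added for illustration; the commands need HDFS superuser rights on the DAS cluster.

```shell
# Set DRY_RUN=1 to print the commands instead of running them.
run() {
    if [ "${DRY_RUN:-0}" = "1" ]; then
        echo "$*"
    else
        "$@"
    fi
}

# Mark a directory as snapshottable.
enable_snapshots() {
    run hdfs dfsadmin -allowSnapshot "$1"
}

# Take a named snapshot; it appears under <dir>/.snapshot/<name>.
create_snapshot() {
    run hdfs dfs -createSnapshot "$1" "$2"
}

# Example (on a DAS cluster node, as the hdfs user):
#   enable_snapshots /user/test1
#   hdfs lsSnapshottableDir                      # confirm the directory is listed
#   create_snapshot /user/test1 manual-check     # snapshot name is hypothetical
```

Taking a manual snapshot is optional here; once the directory is snapshottable, the replication job is expected to create and use its own snapshots.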
In our example, we use a local user to generate some test data; a corresponding user exists on Isilon with the same uid and gid membership (this could also be an LDAP user).
$ su - test1
$ cd /opt/cloudera/parcels/CDH/jars
$ yarn jar ./hadoop-mapreduce-examples-2.6.0-cdh5.11.1.jar teragen 1000000 /user/test1/gen1
$ yarn jar ./hadoop-mapreduce-examples-2.6.0-cdh5.11.1.jar terasort /user/test1/gen1 /user/test1/sort1
$ yarn jar ./hadoop-mapreduce-examples-2.6.0-cdh5.11.1.jar teravalidate /user/test1/sort1 /user/test1/validate1
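As a quick sanity check on the generated data, the expected size of the teragen output can be worked out up front, since TeraGen writes fixed-size 100-byte rows. The helper name teragen_bytes is hypothetical.

```shell
# Expected size in bytes of a TeraGen data set: rows x 100 bytes per row.
teragen_bytes() {
    echo $(( $1 * 100 ))
}

teragen_bytes 1000000   # prints 100000000 (about 100 MB)

# To compare against what landed in HDFS, on a cluster node:
#   hdfs dfs -du -s /user/test1/gen1
```

The size reported by hdfs dfs -du -s for /user/test1/gen1 should match this figure before replication factor is taken into account.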
Now let's set up replication of this data from the DAS cluster to Isilon:
1. Configure a replication peer on the target (Isilon cluster): select Peers from the Backup tab in the Isilon cluster's Cloudera Manager
2. Add a Peer
3. Name the peer (in this example we use ‘DAS’ to make it easy), then add the peer URL and the credentials used to log on to the source (DAS) Cloudera Manager
4. The Peer is validated as connected
5. Now, let's create an HDFS Replication Schedule from the Backup menu
6. From the drop-down lists, select the source (the ‘DAS’ cluster), the source path, the destination (the ‘Isilon’ cluster) and the destination path to replicate to:
A schedule can be set as needed; we select daily at 00:00 PDT
We run this job as hdfs. Since we wish to replicate the source permissions, the Run As user must have superuser privileges on the target cluster. If Kerberos is in use, additional steps are needed to enable the Run As user to authenticate successfully against the target cluster.
Select the Advanced Tab
Select ‘Skip Checksum Checks’; this must be done, otherwise replication will fail
Additional settings specific to your environment and requirements can also be used
7. The replication policy is now available
8. Before executing a data copy, we can execute a dry run to validate and evaluate the replication policy.
9. After a successful dry run, the job can be run manually, or you can wait for the scheduled run to copy the data
Review the job on completion; the details of the distcp command and its options can be seen along with other information about the job
10. Compare the source and target directories; we see the data has been replicated with permissions maintained.
Source DAS cluster – /user/test1
Target Isilon cluster – /DAS/user/test1
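The comparison in step 10 can also be scripted. The sketch below is an assumption-laden illustration rather than part of the BDR workflow: normalize_listing is a hypothetical helper that reduces `hdfs dfs -ls -R` output to permissions, owner, group, size and path, dropping the replication factor and timestamps (which may legitimately differ between clusters) and stripping the /DAS prefix used on the target. It assumes paths contain no spaces.

```shell
# Reduce `hdfs dfs -ls -R` lines to "perms owner group size path".
# $1 is a path prefix to strip (pass "" to strip nothing).
normalize_listing() {
    awk -v p="$1" '{
        path = $8
        if (p != "" && index(path, p) == 1) {
            path = substr(path, length(p) + 1)
        }
        print $1, $3, $4, $5, path
    }'
}

# Example: run each listing on a node of its own cluster, then compare.
#   hdfs dfs -ls -R /user/test1     | normalize_listing ""   > das.txt
#   hdfs dfs -ls -R /DAS/user/test1 | normalize_listing /DAS > isilon.txt
#   diff das.txt isilon.txt && echo "listings match"
```

An empty diff indicates that file names, sizes, ownership and permissions line up between the source and the replicated copy.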
HDFS replication is incremental-aware.
1. Add new data to DAS – /user/test1 – gen2, sort2, validate2, tpcds
Reviewing the Source DAS cluster data – /user/test1
2. Execute a replication and review the results; only the new data was copied, as expected
As can be seen, HDFS replication is straightforward and can be used to maintain a well-structured, scheduled backup methodology for large HDFS data sets. Now that the data is resident on Isilon, additional backup methodologies can be leveraged: SyncIQ copies to other Isilon clusters, Isilon snapshots, NDMP backups and tiering.
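Because BDR is built on distcp, the incremental behaviour is similar in spirit to running distcp in update mode by hand. The sketch below is only a rough analogue under stated assumptions: it is not the command BDR actually issues (the job details page shows the real options), and the NameNode and Isilon host names are hypothetical. The -skipcrccheck flag mirrors the ‘Skip Checksum Checks’ setting, and -pugp preserves user, group and permissions.

```shell
# Set DRY_RUN=1 to print the command instead of running it.
run() {
    if [ "${DRY_RUN:-0}" = "1" ]; then
        echo "$*"
    else
        "$@"
    fi
}

# Copy only files that are missing or differ in size on the target.
replicate_by_hand() {
    run hadoop distcp -update -skipcrccheck -pugp "$1" "$2"
}

# Example (host names are hypothetical):
#   replicate_by_hand hdfs://das-nn.example.com:8020/user/test1 \
#                     hdfs://isilon.example.com:8020/DAS/user/test1
```

Running with DRY_RUN=1 first shows the exact command line without touching either cluster.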
In the next post we will look at how Hive replication is enabled for integration between two Cloudera clusters.
Using Hadoop with Isilon – Isilon Info Hub