Disaster Recovery Solution for Apache Hadoop

Contributors: Chen Haifeng (), Gangumalla Uma (), Dey Avik (), Li Tianyou (), Purtell, Andrew ()

Contents

Introduction
Targets
Approach
Design
Concept of Region
Cluster Joining
Synchronous Writing
Synchronous Data Writing
Synchronous Namespace Journaling
Handling Journaling Failures
Asynchronous Replication
Asynchronous Data Replication
Asynchronous Namespace Journaling
Mirror Cluster Unavailable
Failover
Data Block Replication
Performance Considerations

Introduction

Apache Hadoop is architected to operate efficiently at scale for normal hardware failures within a datacenter. It is not designed today to handle datacenter failures. Although HDFS is not designed for, nor deployed in, configurations spanning multiple datacenters, replicating data from one location to another is common practice for disaster recovery and global service availability. There are existing solutions for batch replication using data copy/export tools. However, while they provide some backup capability for HDFS data, they do not provide the ability to recover all HDFS data after a datacenter failure and be up and running again with a fully operational Hadoop cluster in another datacenter in a matter of minutes. For disaster recovery from a datacenter failure, we should provide a fully distributed, zero data loss, low latency, high throughput and secure HDFS data replication solution for a multi-datacenter setup.

Targets

The following are the targets of this design:

1. Support both synchronous writing and asynchronous replication for data and namespace.
2. Configuring and managing the disaster recovery feature should be simple.
3. All the core disaster recovery functionality is achieved by using or improving the existing HDFS architecture, with concepts that fit it naturally.
Approach

The basis of this solution is to have one or more mirror Hadoop clusters which are continuously updated with the data from the primary cluster, either synchronously or asynchronously, by utilizing and improving the existing HA feature, data block replication and data block pipelining. In this solution, we support both synchronous writing and asynchronous replication across datacenters, for both the namespace and the data blocks. The following architecture diagram shows the overall architecture of this solution.
The following are the key design points:

1. By improving HDFS to support the concept of a mirror cluster, we can have a single primary cluster and multiple mirror clusters across multiple datacenters. Each cluster still has one Active NameNode and one Standby NameNode. The Active NameNode in each cluster behaves differently according to its cluster role.

2. There are DataNodes in both the primary cluster and the mirror clusters. As usual, DataNodes only heartbeat and report blocks to the NameNodes of their local cluster. That is to say, all the DataNodes of the primary cluster heartbeat and report blocks to the Active NameNode and Standby NameNode of the primary cluster, and all the DataNodes of a mirror cluster heartbeat and report blocks to the Active NameNode and Standby NameNode of that mirror cluster.

3. Writing data directly to a mirror cluster incurs a performance drop, but some users may need data availability more than performance. So we target providing two options to users in a configurable way. By default, we keep asynchronous data replication to the mirror clusters.

4. To achieve synchronous data writing, we can provide a new placement policy in the primary cluster which makes sure that mirror cluster DataNodes are kept in the pipeline along with the primary DataNodes. The mirror cluster DataNodes are always at the end of the pipeline. For this, the primary cluster must know about the available DataNodes in the mirror cluster. The mirror cluster Active NameNode heartbeats to the primary Active NameNode with a special command called MIRROR_DATANODE_AVAILABLE (containing DatanodeInfo with space, load, etc.). The primary Active NameNode keeps these details, and the mirror placement policy uses them while selecting nodes for the pipeline (the payload sketch after these design points illustrates both mirror commands). To satisfy truly synchronous data replication, we make sure at least one DataNode is selected from the mirror cluster. But we do not keep this as a strict requirement unless the user explicitly indicates a strong replication need; otherwise replication happens asynchronously via the replication scheduling mechanism.

5. To achieve asynchronous data replication, no remote DataNodes are selected by the block placement policy when the client is writing data in the primary cluster. Instead, the mirror cluster Active NameNode schedules the block replications from the remote site. The mirror cluster Active NameNode finds blocks that need replication and selects a set of local DataNodes as the transfer targets. Since the mirror cluster is not aware of the block locations in the primary cluster, it simply includes a batch of replication commands (MIRROR_REPLICATION_REQUEST, which contains the blocks with the selected targets) in its heartbeats to the primary cluster Active NameNode. On processing these commands, the Active NameNode of the primary cluster finds the source DataNodes and schedules replication commands to them via the existing BLOCK_TRANSFER command.

6. When the system is configured to do synchronous namespace replication, the Active NameNode is configured with a new shared Journal which writes the edit logs directly to the Active NameNode of the mirror cluster. When the mirror Active NameNode receives the flushed edit logs from the primary cluster, it applies the operations to the in-memory namespace and then writes the edit logs to its local Shared Journal. In this way, the edit logs are guaranteed to be written successfully to the mirror cluster. This avoids edit transaction loss when the primary cluster suffers an unrecoverable crash in synchronous mode. The Standby NameNodes of both the primary and mirror clusters still tail edit logs from their respective local Shared Journals.

7. When the system is configured to do asynchronous namespace replication, the Active NameNode of the primary cluster only writes edit logs to the local Shared Journal. The Active NameNode of the mirror cluster tails edit logs from the Shared Journal of the primary cluster, applies them, and then writes them to its local Shared Journal. The Standby NameNodes of both the primary and mirror clusters still tail edit logs from their respective local Shared Journals.

8. The primary cluster Active NameNode is the same as a normal Active NameNode, with some additional functionality such as handling MIRROR_DATANODE_AVAILABLE, the block placement policy to support synchronous block writing, and handling MIRROR_REPLICATION_REQUEST for replication. The mirror cluster Active NameNode is similar to a normal Active NameNode but with several significant differences and additional functionality. The Active NameNode in the mirror cluster does not allow any write operations to the file system through the client API. Instead, it receives or tails edit logs from the primary cluster and updates its namespace. It still writes edit logs to the local Shared Journal. It also schedules active replications and heartbeats the corresponding commands (MIRROR_DATANODE_AVAILABLE, MIRROR_REPLICATION_REQUEST) to the Active NameNode of the primary cluster.

To simplify things, the current version targets manual switchover to mirror clusters. So the admin explicitly issues a command to the mirror cluster to convert it to the primary cluster role.
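To make the content of these heartbeat commands concrete, the following is a minimal sketch of what the two payloads might carry. All class and field names here (MirrorDatanodeAvailable, MirrorReplicationRequest, and so on) are illustrative assumptions rather than existing Hadoop types; the design only requires that DatanodeInfo-style details and block-plus-target lists flow between the mirror and primary Active NameNodes.

import java.util.List;

// Hypothetical payloads for the two mirror heartbeat commands described above.
// None of these classes exist in Hadoop today; they only illustrate the kind of
// information that must flow between the mirror and primary Active NameNodes.

/** Sent by the mirror Active NameNode: "these mirror DataNodes can accept writes". */
class MirrorDatanodeAvailable {
    String regionId;                        // region of the reporting mirror cluster
    List<MirrorDatanodeReport> datanodes;
}

/** A summary of one mirror DataNode (space, load, ...) for the mirror placement policy. */
class MirrorDatanodeReport {
    String datanodeUuid;
    String ipAddr;
    int xferPort;
    long capacity;                          // total space in bytes
    long remaining;                         // free space in bytes
    int xceiverCount;                       // rough load indicator
}

/** Sent by the mirror Active NameNode: "replicate these blocks to these local targets". */
class MirrorReplicationRequest {
    String regionId;
    List<BlockReplicationItem> blocks;
}

/** One block the mirror cluster wants, plus the mirror DataNodes chosen to receive it. */
class BlockReplicationItem {
    long blockId;
    long generationStamp;
    long numBytes;
    List<String> targetDatanodeUuids;       // mirror-local targets selected by the mirror NameNode
}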
Design

Concept of Region

This design supports interaction between multiple clusters, and these clusters are deployed in different datacenters. The Active NameNode in the primary cluster takes a different role from the Active NameNode in a mirror cluster, and they need to behave differently according to the cluster role. To help a NameNode distinguish the role of its cluster, we use the concept of "region". Each site or datacenter is conceptually considered a region and is identified by a "regionID". A NameNode knows the region it belongs to and also knows which region is currently the primary. Based on this information, the NameNode can properly act and interact with the components in other regions. Also, the configuration of every cluster node setup should include all the available regions and the NameNodes and Journal URIs of those regions, and this part of the information should be consistent across all setups.
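As a rough illustration of such a configuration, the sketch below declares regions, their NameNodes and Journal URIs through the standard Hadoop Configuration API. The property names (dfs.regions, dfs.region.*) and values are assumptions invented for this example; the design does not prescribe concrete keys.

import org.apache.hadoop.conf.Configuration;

public class RegionConfigExample {
    public static void main(String[] args) {
        Configuration conf = new Configuration();

        // Hypothetical keys: every cluster lists all regions, consistently.
        conf.set("dfs.regions", "dc1,dc2");        // all known regions
        conf.set("dfs.region.id", "dc1");          // region this cluster belongs to
        conf.set("dfs.region.primary", "dc1");     // current primary region

        // Hypothetical per-region NameNode and shared Journal settings.
        conf.set("dfs.region.dc1.namenodes", "nn1.dc1.example.com:8020,nn2.dc1.example.com:8020");
        conf.set("dfs.region.dc1.journal.uri", "qjournal://jn1.dc1:8485;jn2.dc1:8485;jn3.dc1:8485/mycluster");
        conf.set("dfs.region.dc2.namenodes", "nn1.dc2.example.com:8020,nn2.dc2.example.com:8020");
        conf.set("dfs.region.dc2.journal.uri", "qjournal://jn1.dc2:8485;jn2.dc2:8485;jn3.dc2:8485/mycluster");

        System.out.println("Local region: " + conf.get("dfs.region.id")
                + ", primary region: " + conf.get("dfs.region.primary"));
    }
}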
Cluster Joining

When a new mirror cluster needs to join an existing configuration, we use a bootstrap process similar to the current implementation of Standby NameNode joining. First, the admin initializes the Shared Journal of the mirror cluster by writing all the edit logs after the last checkpoint from the primary cluster. Secondly, the Active NameNode of the mirror cluster is bootstrapped by downloading and applying the FSImage of the last checkpoint of the primary cluster. Finally, the Standby NameNode of the mirror cluster is bootstrapped by downloading and applying the FSImage of the last checkpoint of the mirror cluster. After the joining process, asynchronous data block replication can be scheduled by the Active NameNode of the mirror cluster to localize all the remote data blocks. The replication process is asynchronous, and how long it lasts depends on the data size of the primary HDFS cluster. During the catch-up period, the mirror cluster may enter SafeMode as it finds that there are too many missing blocks. We need to make sure that even if the mirror cluster is in SafeMode, this does not prevent mirroring functionality such as tailing the edit logs, replicating remote blocks and writing synchronous data blocks.
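The joining procedure can be summarized as the following outline. The MirrorBootstrap class and its helper methods are hypothetical placeholders that only restate the three admin-driven steps; they are not an existing Hadoop API.

/**
 * Hypothetical outline of the mirror cluster joining procedure described above.
 * Every method here is a placeholder; the real work would be done by admin tools
 * analogous to the existing Standby NameNode bootstrap.
 */
public class MirrorBootstrap {

    public void join() {
        // Step 1: seed the mirror cluster's Shared Journal with all edit logs
        // written after the primary cluster's last checkpoint.
        copyEditsSinceLastCheckpointFromPrimary();

        // Step 2: bootstrap the mirror Active NameNode from the primary
        // cluster's last checkpoint image.
        downloadAndApplyFsImageFromPrimary();

        // Step 3: bootstrap the mirror Standby NameNode from the mirror
        // cluster's own (freshly installed) checkpoint.
        downloadAndApplyFsImageFromMirrorActive();

        // Afterwards, the mirror Active NameNode starts scheduling asynchronous
        // block replication to localize remote blocks; SafeMode must not block this.
    }

    private void copyEditsSinceLastCheckpointFromPrimary() { /* placeholder */ }
    private void downloadAndApplyFsImageFromPrimary() { /* placeholder */ }
    private void downloadAndApplyFsImageFromMirrorActive() { /* placeholder */ }
}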
Synchronous Writing

Synchronous writing means two things. First, when the client is writing data to the HDFS of the primary cluster, the data is also written to the mirror cluster, so that when disaster happens in the primary cluster, the data already claimed to be written is not lost. Second, if a namespace operation is performed successfully in the primary cluster, we need it to also be in the mirror cluster eventually, even in the case of a primary cluster disaster.

Writing data directly to the mirror cluster incurs a performance drop. But for users who need data availability more than performance on critical data, synchronous namespace and data writing would be the right choice. Please note that in this design, synchronous data writing and asynchronous data replication can coexist in a single configuration, while for the namespace, a single configuration can support either synchronous journaling or asynchronous journaling, but not both.
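One possible way to expose these choices is sketched below, again with hypothetical property names (dfs.mirror.*) that are not part of Hadoop. It only illustrates that data writing can mix both modes while namespace journaling is a single either/or choice per configuration.

import org.apache.hadoop.conf.Configuration;

public class MirrorModeConfigExample {
    public static void main(String[] args) {
        Configuration conf = new Configuration();

        // Hypothetical keys: asynchronous data replication stays on by default,
        // and synchronous data writing can additionally be enabled for critical data.
        conf.setBoolean("dfs.mirror.data.replication.async.enabled", true);
        conf.setBoolean("dfs.mirror.data.write.sync.enabled", true);

        // Hypothetical key: the namespace journaling mode is a single choice
        // per configuration, either "sync" or "async", never both.
        conf.set("dfs.mirror.namespace.journaling.mode", "sync");

        System.out.println("Namespace journaling mode: "
                + conf.get("dfs.mirror.namespace.journaling.mode"));
    }
}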
Synchronous Data Writing

The solution supports both synchronous and asynchronous data writing. There are requirements where any loss of critical data is not acceptable. For this kind of data, we can configure synchronous data writing to achieve "Zero Loss". When the client is writing block data, the data block is pipelined to both local DataNodes in the primary cluster and remote DataNodes in the mirror clusters. The following diagram shows the basic workflow for synchronous data writing.

1. When a client is writing an HDFS file, after the file is created it starts to request a new block, and the Active NameNode of the primary cluster allocates a new block and selects a list of DataNodes for the client to write to. By using the new mirror block placement policy (see the sketch after this list), the Active NameNode can guarantee that one or more remote DataNodes from the mirror cluster are selected at the end of the pipeline.
2. The primary cluster Active NameNode knows the available DataNodes of the mirror cluster via heartbeats from the mirror cluster's Active NameNode carrying the MIRROR_DATANODE_AVAILABLE command. The latest reported DataNodes are considered for the mirror cluster pipeline, which is appended to the primary cluster pipeline.
3. As usual, upon a successful block allocation, the client writes the block data to the first DataNode in the pipeline, also passing along the remaining DataNodes.
4. As usual, the first DataNode continues to write to the following DataNode in the pipeline.
5. The last local DataNode in the pipeline continues to write to the remote DataNode that follows it.
6. If more than one remote DataNode is selected, the remote DataNode continues to write to the following DataNode, which is local to that remote DataNode.

We provide flexibility to users in that they can even configure the mirror cluster replication factor. Based on the configured replication, mirror nodes are selected.
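A minimal sketch of the mirror-aware pipeline construction is shown below. It does not extend Hadoop's real BlockPlacementPolicy class, whose signatures vary across versions; instead, hypothetical types show how locally chosen DataNodes and the mirror DataNodes reported via MIRROR_DATANODE_AVAILABLE could be combined so that mirror nodes always sit at the end of the pipeline.

import java.util.ArrayList;
import java.util.List;

/**
 * Hypothetical sketch of the mirror-aware pipeline construction described above.
 * DatanodeRef stands in for Hadoop's DatanodeInfo; the real policy would plug
 * into the NameNode's block placement machinery.
 */
public class MirrorBlockPlacementSketch {

    public static class DatanodeRef {
        public final String uuid;
        public final boolean isMirror;
        public DatanodeRef(String uuid, boolean isMirror) {
            this.uuid = uuid;
            this.isMirror = isMirror;
        }
    }

    /**
     * Build the write pipeline: local replicas first, then mirror replicas,
     * so mirror DataNodes are always at the end of the pipeline.
     */
    public List<DatanodeRef> choosePipeline(List<DatanodeRef> localChoices,
                                            List<DatanodeRef> mirrorReported,
                                            int mirrorReplication,
                                            boolean syncWriteRequested) {
        List<DatanodeRef> pipeline = new ArrayList<>(localChoices);
        if (syncWriteRequested) {
            // Append up to mirrorReplication mirror DataNodes reported through
            // MIRROR_DATANODE_AVAILABLE heartbeats; at least one is required
            // for a truly synchronous write.
            int added = 0;
            for (DatanodeRef dn : mirrorReported) {
                if (added >= Math.max(1, mirrorReplication)) {
                    break;
                }
                pipeline.add(dn);
                added++;
            }
        }
        return pipeline;
    }
}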
Synchronous Namespace Journaling

It is sometimes critical that the namespace edit logs are not lost when disaster happens. When the system is configured to do synchronous namespace replication, the Active NameNode is configured with a new shared Journal which writes the edit logs directly to the Active NameNode of the mirror cluster. When the mirror Active NameNode receives the flushed edit logs from the primary cluster, it applies the operations to the in-memory namespace and then writes the edit logs to its local Shared Journal. In this way, the edit logs are guaranteed to be written successfully to the mirror cluster. This avoids edit transaction loss when the primary cluster suffers an unrecoverable crash in synchronous mode. The Standby NameNodes of both the primary and mirror clusters still tail edit logs from their respective local Shared Journals. The following shows the basic journaling workflow, in which a pluggable Journal Manager is used to write edit logs from the primary Active NameNode to the mirror Active NameNode.

1. As usual, the primary cluster Active NameNode writes the edit logs to the Shared Journal of the primary cluster.
2. The primary cluster Active NameNode also writes the edit logs to the mirror cluster Active NameNode by using a new JournalManager (a sketch follows this list).
3. As usual, the primary cluster Standby NameNode tails the edit logs from the Shared Journal of the primary cluster.
4. The mirror cluster Active NameNode writes the edit logs to the Shared Journal of the mirror cluster after applying the edit logs received from the primary cluster.
5. As usual, the mirror cluster Standby NameNode tails the edit logs from the Shared Journal of the mirror cluster.
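The following sketch gives a rough shape for such a mirror journal writer. It deliberately does not implement Hadoop's actual JournalManager or EditLogOutputStream interfaces, whose signatures differ across versions; MirrorJournalWriter and MirrorEditsReceiver are hypothetical names that only illustrate the "write locally, also forward flushed edits to the mirror Active NameNode" step from items 1 and 2 above.

import java.io.IOException;
import java.util.List;

/**
 * Hypothetical sketch of forwarding flushed edit transactions to the mirror
 * cluster Active NameNode. Neither interface exists in Hadoop; they only name
 * the two roles in the workflow above.
 */
public class MirrorJournalSketch {

    /** Implemented on the mirror Active NameNode side (e.g., behind an RPC protocol). */
    public interface MirrorEditsReceiver {
        void journal(long firstTxId, int numTxns, byte[] records) throws IOException;
    }

    /** Used by the primary Active NameNode alongside its local Shared Journal. */
    public static class MirrorJournalWriter {
        private final MirrorEditsReceiver mirror;

        public MirrorJournalWriter(MirrorEditsReceiver mirror) {
            this.mirror = mirror;
        }

        /**
         * Called when a batch of edits has been flushed locally; the same batch
         * is pushed to the mirror Active NameNode, which applies it to its
         * in-memory namespace and then persists it to its own Shared Journal.
         */
        public void flush(long firstTxId, List<byte[]> serializedEdits) throws IOException {
            int numTxns = serializedEdits.size();
            int total = 0;
            for (byte[] e : serializedEdits) {
                total += e.length;
            }
            byte[] records = new byte[total];
            int off = 0;
            for (byte[] e : serializedEdits) {
                System.arraycopy(e, 0, records, off, e.length);
                off += e.length;
            }
            mirror.journal(firstTxId, numTxns, records);
        }
    }
}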
Handling Journaling Failures

When doing synchronous namespace journaling, the primary cluster Active NameNode writes both to the local Shared Journal and to the mirror Active NameNode (the mirror Journal). We need to deal with the complex situation where the sync to one Journal succeeds but the sync to the other Journal fails, leaving the two Journals in an inconsistent state. In such a situation, a gap wil