Huawei Cloud Computing IE interview notes: describe the panorama of Huawei disaster recovery solutions, and explain which layers a dual-active data center design has to consider

Time:2022-4-25

Panorama of disaster recovery

By distance, disaster recovery is divided into local disaster recovery, same-city disaster recovery and remote disaster recovery.

Local disaster recovery includes local high availability and local active-standby (between two equipment rooms or cabinets of the same data center).

To maintain business continuity, the local high availability scheme is considered at two levels:

① The host (server) level. If a virtual machine or a service on one server goes down, it can be restarted automatically on another server to keep the business running. This relies mainly on the cluster features HA, DRS and DPM.

② The storage level, using either the HyperMetro feature or HyperMirror + SmartVirtualization (heterogeneous storage, RTO, RPO > 0). HyperMetro is recommended.

HyperMetro implements dual-active on two storage arrays. The dual-active LUN data on the two arrays is synchronized in real time, and both arrays can process I/O read and write requests from the application servers at the same time, so the application servers get equal, parallel access to both sites. When either array fails, services switch automatically and seamlessly to the peer array, and storage access is not interrupted.

Local high availability: RTO = 0, RPO = 0.

Local active-standby disaster recovery is implemented with synchronous remote replication: RPO = 0, RTO > 0.

Active-standby is mainly remote replication at the storage layer.

Same-city disaster recovery includes same-city active-standby disaster recovery and same-city dual-active disaster recovery (two data centers in the same city, within 300 km).

Same-city active-standby disaster recovery is implemented with synchronous or asynchronous remote replication, which differ in when a write is acknowledged. Synchronous remote replication returns write success only after the peer (disaster recovery) side has also written the I/O. Asynchronous remote replication returns write success as soon as the local side has written the I/O. Active-standby disaster recovery is a very mature technology, but services are interrupted during a switchover, so RTO > 0. With synchronous remote replication RPO = 0; with asynchronous remote replication RPO > 0.

Same-city dual-active disaster recovery is implemented with Huawei's HyperMetro. When services at one end fail, the other end takes over automatically, but dual-active is expensive. In theory, RTO = 0 and RPO = 0.

Remote disaster recovery includes "two places, three centers" and remote active-standby disaster recovery (the distance between the data centers is more than 300 km).

Two places, three centers combines same-city dual-active or same-city active-standby with remote active-standby. It uses parallel or cascaded networking and suits scenarios where services must not be interrupted and data requirements are strict, such as financial institutions. RTO and RPO are flexible.

Remote active-standby disaster recovery differs from same-city active-standby in that it is implemented mainly with asynchronous remote replication: RTO > 0, RPO > 0.
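
The panorama above can be condensed into a small lookup table. The following is only a sketch in Python: the names `DrOption`, `PANORAMA` and `candidates` are made up for illustration, "same-city" means two data centers in one city (within 300 km), and the RPO/RTO flags are the theoretical storage-layer values quoted in these notes, not measured figures.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DrOption:
    scope: str            # "local", "same-city" or "remote"
    scheme: str           # high availability, active-standby, dual-active
    storage_feature: str  # storage feature named in the notes
    rpo_zero: bool        # True if RPO = 0 in theory
    rto_zero: bool        # True if RTO = 0 in theory


# Rows follow the notes; this is an illustrative summary, not a product matrix.
PANORAMA = [
    DrOption("local", "high availability", "HyperMetro", True, True),
    DrOption("local", "active-standby", "synchronous remote replication", True, False),
    DrOption("same-city", "active-standby", "synchronous remote replication", True, False),
    DrOption("same-city", "active-standby", "asynchronous remote replication", False, False),
    DrOption("same-city", "dual-active", "HyperMetro", True, True),
    DrOption("remote", "active-standby", "asynchronous remote replication", False, False),
]


def candidates(scope, need_rpo_zero=False, need_rto_zero=False):
    """Return the options in a scope that satisfy the RPO/RTO requirements."""
    return [
        option
        for option in PANORAMA
        if option.scope == scope
        and (option.rpo_zero or not need_rpo_zero)
        and (option.rto_zero or not need_rto_zero)
    ]
```

For example, `candidates("same-city", need_rpo_zero=True, need_rto_zero=True)` leaves only the HyperMetro dual-active row, while asking for RPO = 0 in the remote scope returns nothing, which matches the statement that remote active-standby relies on asynchronous replication.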

 

Dual active data center:

Dual-active has to be considered comprehensively, across six layers: the security layer, the transport layer, the business (application) layer, the network layer, the computing cluster and the storage layer.

The security layer ensures access security through firewalls and through security policy planning and design.

The transport layer builds a reliable dual-active transmission network through device, link and board redundancy. Wavelength division equipment extends the transmission distance of optical signals and multiplexes links, reducing the two pairs of links between the data centers to one pair.

The business (application) layer achieves dual-active through load balancing (GSLB and SLB) and through application cluster and database cluster technology. C/S applications spread the service load using the cluster capabilities of the application itself (Oracle RAC, IBM DB2); B/S applications spread the service load through load balancers (LB).

At the network layer, the IP and FC networks of the two data centers are interconnected. Dual-active is implemented with the global load balancer (GSLB) and the server load balancer (SLB). When a service request arrives from outside, the SLB checks whether the hosts in its own data center are available and able to serve. If it finds a host failure, it triggers the local GSLB, which notifies the GSLB at the peer end to start working. The peer GSLB notifies the SLB below it, that SLB notifies a host below it to respond to the service request, and the response is returned through the peer end, completing dual-active.

SLB vs. GSLB:

Different scope:

SLB: balances traffic within a data center and handles failover within the data center.

GSLB: balances traffic between the two data centers so that users reach the nearest server, which ensures access quality.

Different principle:

SLB acts as a proxy and selects which web server handles the user's request.

GSLB uses DNS resolution to control which data center the user accesses.
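
The division of labour between the two load balancers can be sketched in a few lines of Python. This is a toy model under stated assumptions, not how any real GSLB or SLB product is implemented: the class names, the round-robin policy and the "client region" key are all hypothetical, and the DNS step is reduced to a function that returns the chosen site.

```python
from itertools import cycle


class SLB:
    """Picks a healthy server inside one data center (proxy role)."""

    def __init__(self, name, servers):
        self.name = name
        self.health = {server: True for server in servers}
        self._round_robin = cycle(servers)
        self._count = len(servers)

    def has_healthy_server(self):
        return any(self.health.values())

    def pick_server(self):
        # Try each server at most once, skipping unhealthy ones.
        for _ in range(self._count):
            server = next(self._round_robin)
            if self.health[server]:
                return server
        return None


class GSLB:
    """Picks a data center, standing in for the DNS answer given to the user."""

    def __init__(self, sites):
        self.sites = sites  # region name -> SLB of the data center in that region

    def resolve(self, client_region):
        nearest = self.sites.get(client_region)
        if nearest is not None and nearest.has_healthy_server():
            return nearest
        # Nearest site cannot serve: fall back to any site that can.
        for slb in self.sites.values():
            if slb.has_healthy_server():
                return slb
        return None
```

A request is first steered to a data center by `GSLB.resolve` and then to a server by `SLB.pick_server`; when every server in the nearest site is marked unhealthy, the same call returns the other site, which is the cross-site failover described above.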

Computing cluster dual-active adds the servers of both data centers to the same large cluster, mainly using HA and DRS.

The dual-active storage layer generally uses Huawei storage with the HyperMetro feature to achieve RTO = 0 and RPO = 0. For writes, when the upper layer issues a write I/O, the arrays at the two ends first go through write mutual exclusion: one end obtains the write permission and starts writing, and write success is returned only after the other end has also finished writing. For reads, the UltraPath multipathing feature selects an optimal path. The design is gateway-free: no VIS is needed, and an optimal path is selected instead, which reduces equipment and overhead. HyperMetro can also be used with heterogeneous storage: those arrays are presented as vdisks and used in a unified way.

*What are the networking structures of two places, three centers? What's the difference?

Figure 1: synchronous + asynchronous (cascaded networking)

Figure 2: asynchronous + asynchronous (cascaded networking)

Figure 3: synchronous + asynchronous (parallel networking)

Figure 4: asynchronous + asynchronous (parallel networking)

Figure 5: multi pair one-level joint network

Figure 6: dual-active + remote replication networking

*What is the difference between local high availability and a dual-active data center? (tested)

1. Usage scenario: local high availability is disaster recovery at the level of a single data center or equipment room, while a dual-active data center spans two sites and provides disaster recovery at the regional level.

2. Local high availability can be implemented with HyperMirror + SmartVirtualization; dual-active cannot.

3. Multipath: when HyperMetro is used for local high availability, the multipath software is set to load balancing; in a dual-active data center it is set to prefer the local site.

4. Dual-active requires GSLB and SLB; local high availability does not.

5. Dual-active has to consider all six layers, including the security layer and the transport layer. Local high availability does not need dual-active at the network, transport and security layers; it needs it at the application, computing and storage layers (the network can be considered, but distance is not the focus).

6. Cost: because the physical equipment of the other three layers also has to be considered, dual-active costs more.

7. Protection scope: local high availability stays within one data center; for dual-active, the distance between the two data centers can be up to 300 km.

*Difference between local high availability and active-standby

1. Protection scope: local high availability can only be used within the local data center; active-standby can be in the same city or in a different location.

2. Storage feature: local high availability uses HyperMetro; active-standby uses remote replication.

3. RTO and RPO: in theory, the RTO and RPO of local high availability can both equal 0; for active-standby this is not necessarily so.

4. Working mode: with local high availability both ends provide services at the same time; with active-standby only the active end provides services.

5. Load balancing: active-standby cannot do load balancing.

*Difference between dual-active and active-standby

1. Working mode: with dual-active both sites provide services at the same time; with active-standby only the active site provides services.

2. Load balancing: dual-active can do load balancing; active-standby cannot.

3. Resource utilization: active-standby reaches at most 50%, because the standby site does not work while the active site is working.

4. RTO and RPO: dual-active can in theory achieve RTO = 0 and RPO = 0.

5. Implementation: at the storage layer, dual-active uses HyperMetro; active-standby uses synchronous or asynchronous remote replication.

6. Scope: remote active-standby can be > 3000 km.

7. Network: dual-active requires the service networks of the two sites to be connected; active-standby does not.

8. Service switching: dual-active switches sites automatically in case of disaster; active-standby has to be brought up manually by the administrator.

*The difference between synchronous remote replication and asynchronous remote replication

Both synchronous and asynchronous remote replication can be used in a storage system's disaster recovery scheme to keep a remote copy of the data. They are implemented differently and suit different business scenarios.

Implementation mode

• Synchronous remote replication sends the write I/O to the secondary LUN at the same time as it writes to the primary LUN. Only when both the primary LUN and the secondary LUN have reported a successful write is the result returned to the host, which keeps the data of the two LUNs tightly synchronized.

• Asynchronous remote replication records the data modified by the write at the primary site while writing to the primary LUN. As soon as the primary LUN reports a successful write, the result is returned to the host. Synchronization is then triggered manually by the user, or automatically according to trigger conditions set by the user, to bring the data of the primary LUN and the secondary LUN into agreement.

Application scenario

• Synchronous remote replication places high demands on bandwidth and latency, so it is mainly used where the primary and secondary devices are close to each other, such as same-city disaster recovery.

• Asynchronous remote replication places low demands on bandwidth and latency, so it suits disaster recovery over long distances or with limited network bandwidth.
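
The difference in when the host is acknowledged can be shown with a minimal in-memory sketch. The classes below are hypothetical and model a LUN as a Python dictionary; they illustrate the ordering of steps only and say nothing about how a real array implements replication.

```python
class Lun:
    """A LUN modelled as a mapping from block address to data."""

    def __init__(self):
        self.blocks = {}

    def write(self, lba, data):
        self.blocks[lba] = data


class SyncReplicationPair:
    def __init__(self):
        self.primary = Lun()
        self.secondary = Lun()

    def host_write(self, lba, data):
        # Both LUNs are written before the host gets its acknowledgement.
        self.primary.write(lba, data)
        self.secondary.write(lba, data)
        return "ok"


class AsyncReplicationPair:
    def __init__(self):
        self.primary = Lun()
        self.secondary = Lun()
        self.changed = set()  # addresses modified since the last synchronization

    def host_write(self, lba, data):
        # Only the primary is written; the change is recorded for later.
        self.primary.write(lba, data)
        self.changed.add(lba)
        return "ok"

    def synchronize(self):
        # Triggered manually or by a schedule: copy the recorded changes.
        for lba in sorted(self.changed):
            self.secondary.write(lba, self.primary.blocks[lba])
        copied = len(self.changed)
        self.changed.clear()
        return copied
```

With the asynchronous pair, anything written after the last `synchronize()` exists only on the primary, which is why RPO > 0; with the synchronous pair the secondary is never behind, at the cost of waiting for the remote write on every I/O.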

*Does a dual-active data center need to be dual-active at every layer?

Not every layer has to be dual-active. A dual-active data center is defined as two data centers that provide services at the same time. A given service can therefore be deployed in active-standby mode or in AA mode, with the active and standby instances placed on different sides.

However, the storage layer must be dual-active: it has to serve real-time data to the business while synchronizing the data at the same time.

*Principle of synchronous replication

 

 

 

 

 

*Asynchronous replication principle

 

 

 

*How HyperMetro works

1. The host sends a write I/O to the dual-active management module.

2. The system records a log.

3. Dual write: the dual-active management module writes the I/O to the local cache and the remote cache at the same time.

4. The local cache and the remote cache return their write results to the dual-active management module.

5. Dual-write result handling: the write result is returned to the host only after the results from the caches at both ends have come back.

6. Check whether the dual write succeeded.

• If both writes succeeded: clear the log.

• If one end failed: the log is converted to a DCL (data change log), which records the data differences between the local LUN and the remote LUN.
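
The six steps can be traced in a short Python sketch. It is a simplified model with hypothetical names (`Cache`, `DualActiveManager`); the two cache writes run one after the other rather than in parallel, and the return value when only one end succeeds is an assumption (the surviving end keeps serving), since the notes do not spell it out.

```python
class Cache:
    """Write cache of one array; `healthy=False` simulates a failed end."""

    def __init__(self, healthy=True):
        self.healthy = healthy
        self.data = {}

    def write(self, lba, payload):
        if not self.healthy:
            return False
        self.data[lba] = payload
        return True


class DualActiveManager:
    def __init__(self, local, remote):
        self.local = local
        self.remote = remote
        self.log = set()  # addresses of writes in flight (step 2)
        self.dcl = set()  # addresses that differ between the two ends

    def host_write(self, lba, payload):
        self.log.add(lba)                            # step 2: record log
        ok_local = self.local.write(lba, payload)    # step 3: dual write
        ok_remote = self.remote.write(lba, payload)
        # Steps 4 and 5: both results are in before the host is answered.
        self.log.discard(lba)                        # step 6: settle the log
        if ok_local and ok_remote:
            return True                              # log simply cleared
        if ok_local or ok_remote:
            self.dcl.add(lba)                        # log converted to DCL
            return True                              # assumption: one end still serves
        return False
```

After the failed end recovers, the addresses kept in `dcl` are exactly the blocks that would need to be resynchronized, which is the purpose of recording differences instead of the full data.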

*Cloud Disaster Tolerant Service (admitted)

 

 

 

CSDR: replication between production storage and disaster recovery storage.

CSHA: mirroring or dual-active.

VHA: mirroring or dual-active.

What is the difference between disaster recovery and backup?

Version 1:

1. Backup is the basis of disaster recovery. Generally, disaster recovery keeps a copy of the data or application system outside the production equipment room, while backup keeps a copy of the data or system locally.

2. Backup protects business data; disaster recovery protects the whole business system (including the supporting hosts, storage, network equipment and so on).

3. Data formats differ. Data processed by backup software changes format (it is usually deduplicated, compressed and archived) and can be used only after it has been restored. Data processed by replication or mirroring software keeps its format and can be used directly once it is attached to a host.

4. The data replication cycle is different, and the data protection time is shorter.

5. Generally, backup is the last line of defense for data protection, and archiving is preferred.

Version 2:

1. Object: backup targets data; disaster recovery targets the IT system (business).

2. Level: backup is the basis of disaster recovery.

3. Distance: the two disaster recovery sites are farther apart than a production site and its backup site.

Introduce Huawei disaster recovery panorama

 

 

 

The way to think about it:

Scope: local, same-city (within 300 km) and remote.

Requirement: RPO and RTO values (the two key indicators for measuring a disaster recovery solution, which should be considered and compared).

The solutions above are not exhaustive; they are only the ones recommended for each disaster recovery scope.

Storage layer implementation of disaster recovery solution

 

 

 

In the PPT, Huawei's disaster recovery solution focuses on implementation with its own products, so the storage layer is the key: it is the most commonly used means of data disaster recovery and the foundation of disaster recovery.

The RPO and RTO values here only consider the storage layer. The RPO and RTO of the overall solution also depend on the upper layers, such as the application, network and computing layers.

Which storage feature to choose depends on the disaster recovery scope and on the required RPO and RTO values.

Because VIS is no longer produced, it is not listed in the table above. VIS can implement asynchronous replication and dual-active.

The two places, three centers solution is in fact a combination of local (same-city) disaster recovery + remote disaster recovery. Its RPO and RTO values depend on the disaster level: for a data-center-level disaster, consider the local part; for a city-level disaster, consider the remote part.

Because remote disaster recovery distances are large, usually more than 100 km or even 200 km, only asynchronous remote replication can be used, mostly over WAN links.

Distance requirements of the storage features

HyperMetro only supports disaster recovery within 100 km; beyond 25 km, wavelength division equipment is required, which reduces optical attenuation and multiplexes links to cut cost.

Synchronous remote replication only supports disaster recovery within 200 km.

Asynchronous remote replication has no distance requirement; an appropriate synchronization cycle can be configured according to the distance.
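
These distance rules translate directly into a small helper. The sketch below uses the thresholds quoted in these notes (25 km, 100 km, 200 km) and assumes the limits are inclusive; the function names are made up, and actual limits depend on product model and version.

```python
def storage_features_for_distance(distance_km):
    """List the storage features usable at a given site-to-site distance."""
    features = []
    if distance_km <= 100:
        features.append("HyperMetro")
    if distance_km <= 200:
        features.append("synchronous remote replication")
    # Asynchronous replication has no distance limit in the notes.
    features.append("asynchronous remote replication")
    return features


def needs_wdm_for_hypermetro(distance_km):
    """HyperMetro links longer than 25 km need wavelength division equipment."""
    return 25 < distance_km <= 100
```

For two sites 150 km apart the helper returns synchronous and asynchronous remote replication but not HyperMetro, and at 500 km only asynchronous remote replication remains.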

RPO, RTO

RPO and RTO are the two key indicators of disaster recovery.

RPO (recovery point objective): the point in time to which the system and data must be recovered after a disaster. RPO marks the maximum amount of data loss the system can tolerate. The less data loss the system can tolerate, the smaller the RPO value.

RTO (recovery time objective): the time allowed from the moment an information system or business function stops after a disaster until it must be recovered. RTO marks the longest service outage the system can tolerate. The more urgent the service, the smaller the RTO value.

RPO is about data loss, while RTO is about service loss. RTO and RPO must be determined per business need, after risk analysis and business impact analysis.
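
A worked example may make the two indicators concrete. The Python sketch below measures what a single incident actually cost and compares it with the objectives; the function names and the timestamps are invented for illustration.

```python
from datetime import datetime, timedelta


def achieved_rpo(last_replicated_at, failure_at):
    """Data written after the last replicated point is lost."""
    return max(failure_at - last_replicated_at, timedelta(0))


def achieved_rto(failure_at, service_restored_at):
    """Time during which the service was unavailable."""
    return service_restored_at - failure_at


def meets_objectives(rpo, rto, rpo_target, rto_target):
    return rpo <= rpo_target and rto <= rto_target


# Hypothetical incident: last asynchronous sync at 10:00, failure at 10:12,
# service back on the disaster recovery site at 10:42.
last_sync = datetime(2022, 4, 25, 10, 0)
failure = datetime(2022, 4, 25, 10, 12)
restored = datetime(2022, 4, 25, 10, 42)

rpo = achieved_rpo(last_sync, failure)   # 12 minutes of data lost
rto = achieved_rto(failure, restored)    # 30 minutes of downtime
```

With targets of RPO ≤ 15 minutes and RTO ≤ 1 hour this incident is within objectives; a business that requires RPO = 0 would need synchronous replication or dual-active instead.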

What are the advantages of two places, three centers?

Compared with local disaster recovery, it can cope with regional disasters and has a higher disaster recovery capability.

Compared with remote disaster recovery, it has a smaller RPO or RTO when the production site fails.

What is a dual active data center?

A dual-active data center solution means that two data centers are running and carrying services at the same time, which improves the overall service capacity and resource utilization of the data centers. The data of the two data centers is kept consistent in real time. When a single device or even a whole data center fails, services switch over automatically, with zero data loss and zero service interruption.

A dual-active data center has the following characteristics:

It is a same-city disaster recovery scheme.

The two data centers provide external services at the same time.

They back each other up and switch over automatically.

Zero data loss, and zero interruption for some services (uninterrupted service is not a mandatory criterion).

How to realize six layer dual active data center

 

 

 

 

 

 

 

Must HyperMetro be used at the storage layer of a dual-active data center?

If the storage layer uses Huawei V3 storage devices, HyperMetro must be used. HyperMetro makes the storage layer dual-active and, combined with automatic switchover at the application or computing layer, ensures data availability and an optimal access path.

What matters is the storage dual-active capability that HyperMetro provides. If the storage layer can achieve dual-active or near-dual-active with other products or methods, that is also acceptable.

Networking and principles of local high availability (HyperMetro dual-active)

It includes local dual-active and local mirroring.

Dual-active principle:

 

 

 

 

 

 

 

 

 

Local mirror:

   

 

 

 

 

 

 

Local high availability networking diagram

 

 

 

The figure above only considers the storage layer networking, and cross networking is used.

SmartVirtualization + HyperMirror uses the same networking.

Besides the array features at the storage layer, clustering is required at the host layer and the application layer.

Networking and principles of active-standby disaster recovery (VRG active-standby)

Only the storage-side networking is introduced here. For the host side, see the implementation of disaster recovery in FusionSphere scenarios.

Note: in active-standby disaster recovery the computing layer consists of two VRMs.

 

 

 

 

 

Network diagram of active and standby disaster recovery

 

 

 

The storage layer is connected directly to the network devices.

The storage networks of the two sites need to be connected; the service networks do not.

FusionSphere active-standby disaster recovery: storage replication networking

 

 

 

 

 

The networking diagram in Figure 1 is not rigorous: it does not show the communication between the two BCManagers. That communication can use a separate network, or be achieved by connecting the two management planes.

The storage devices need to be connected to the management network and managed by BCManager.

The storage devices at the two sites replicate data for disaster recovery over the link between the storage-layer network devices of the two sites.

Local high availability, dual-active, and two places, three centers implemented with storage replication use networking similar to this general disaster recovery networking.

FusionSphere active-standby disaster recovery: host replication networking

 

 

 

 

 

The networking diagram in Figure 1 is not rigorous: it does not show the communication between the two BCManagers. That communication can use a separate network, or be achieved by connecting the two management planes.

The storage planes do not need to be connected.

BCManager does not need to manage the storage devices.

It is recommended to configure a separate disaster recovery service management interface on each CNA node for communication with the VRG. If none is configured, the CNA management interface is used.

A host I/O replication plane needs to be configured for communication between the VRGs.

The VRGs need to be managed by BCManager.

How HyperReplication works

Synchronous remote replication and asynchronous remote replication

Synchronous remote replication takes no snapshots on either the primary or the secondary side, while asynchronous remote replication takes snapshots on both. The primary-side snapshot freezes the data to ensure consistency. The secondary-side snapshot guards against a failed synchronization: the secondary can be rolled back to its state before the synchronization started.
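
The role of the two snapshots in asynchronous replication can be illustrated with a toy model. Everything here is hypothetical: LUNs are dictionaries, a snapshot is a plain copy, and the `link_fails_after` parameter exists only to simulate a link failure partway through a synchronization cycle.

```python
class AsyncReplicationWithSnapshots:
    def __init__(self):
        self.primary = {}
        self.secondary = {}

    def host_write(self, lba, data):
        self.primary[lba] = data

    def synchronize(self, link_fails_after=None):
        # Primary-side snapshot: freeze a consistent view to copy from, so
        # host writes arriving during the copy do not mix into this cycle.
        source_snapshot = dict(self.primary)
        # Secondary-side snapshot: remember the state before the cycle starts.
        rollback_snapshot = dict(self.secondary)

        for copied, (lba, data) in enumerate(sorted(source_snapshot.items())):
            if link_fails_after is not None and copied >= link_fails_after:
                # Synchronization failed midway: roll the secondary back so it
                # never exposes a half-copied, inconsistent image.
                self.secondary = rollback_snapshot
                return False
            self.secondary[lba] = data
        return True
```

If a cycle is interrupted, the secondary still holds the complete image from the previous successful cycle, and the next successful `synchronize()` brings it fully up to date.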


Networking and principles of the dual-active data center

Note: dual-active disaster recovery uses a single VRM.

 

 

 

 

 

The storage layer is connected directly to the network devices, or through wavelength division equipment.

The service network is connected directly to the core switches, or through wavelength division equipment.

Both the storage networks and the service networks of the two sites need to be connected.

Wavelength division equipment is used to reduce optical attenuation and to cut the cost of the DCI links (from 4 pairs of bare optical fibers to 2 pairs).

In a dual-active data center, besides the array features at the storage layer, clustering is required at the host layer and the application layer.

Six-layer dual-active: storage layer, host layer, application layer, network layer, security layer and transport layer.

Networking and principles of two places, three centers. What's the difference?

 

 

 

 

 

 

 

 

 

 

 

 

 

Implementation of disaster recovery in FusionSphere scenarios

Host layer networking

 

 

 

Background: when FusionStorage or non-Huawei storage is used as the storage resource, disaster recovery cannot be implemented by combining storage remote replication with BCManager. The replication function is therefore moved up a layer, and FusionSphere replicates the data between the primary and secondary data centers. For this, a VRG has to be deployed in FusionSphere on the primary side (it is essentially a virtual machine that acts as a virtualization remote replication gateway to implement host-layer remote replication), and a VRG also has to be deployed on the secondary side as the receiver of the replicated data.

Principle: with host replication, the production site and the disaster recovery site are generally built in two different locations. The I/O streaming technology of the FusionCompute virtualization platform captures the virtual machines' I/O on the production side in real time and replicates it asynchronously to the virtual machine volumes on the disaster recovery side.

A VRG is deployed at both the production site and the disaster recovery site. The IOMirror in each host captures the virtual machines' I/O. The production-side VRG aggregates this I/O data and sends it to the disaster-recovery-side VRG, which distributes it to the WriteAgent on the designated host. The WriteAgent writes it to the virtual machine's volume, completing the remote asynchronous replication of the virtual machine's I/O.
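
The data path described above (IOMirror, production VRG, disaster recovery VRG, WriteAgent) can be sketched as a chain of small Python classes. The class and method names are invented for illustration and do not correspond to any FusionSphere API; batching is reduced to an explicit `flush()` call, and the network between the two VRGs is a direct method call.

```python
from collections import defaultdict, deque


class WriteAgent:
    """Runs on a disaster recovery host and writes into VM volumes."""

    def __init__(self):
        self.volumes = defaultdict(dict)  # vm_id -> {lba: data}

    def apply(self, vm_id, lba, data):
        self.volumes[vm_id][lba] = data


class DrVrg:
    """Disaster-recovery-side VRG: routes each I/O to the right WriteAgent."""

    def __init__(self, agents_by_vm):
        self.agents_by_vm = agents_by_vm

    def receive(self, batch):
        for vm_id, lba, data in batch:
            self.agents_by_vm[vm_id].apply(vm_id, lba, data)


class ProductionVrg:
    """Production-side VRG: aggregates captured I/O and ships it in batches."""

    def __init__(self, peer):
        self.peer = peer
        self.pending = deque()

    def collect(self, vm_id, lba, data):
        self.pending.append((vm_id, lba, data))

    def flush(self):
        batch = list(self.pending)
        self.pending.clear()
        self.peer.receive(batch)
        return len(batch)


class IOMirror:
    """Captures VM writes on a production host and mirrors them to the VRG."""

    def __init__(self, vrg):
        self.vrg = vrg
        self.local_volumes = defaultdict(dict)

    def vm_write(self, vm_id, lba, data):
        self.local_volumes[vm_id][lba] = data  # the write completes locally
        self.vrg.collect(vm_id, lba, data)     # and is queued for replication
```

Because the VM write returns before `flush()` runs, the replication is asynchronous: data still sitting in `pending` when the production site fails is lost, which is why this scheme has RPO > 0.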

 array layer disaster recovery:

1. The remote replication feature of the underlying SAN storage is used to replicate and protect the data of disaster recovery virtual machines. Both synchronous and asynchronous remote replication are supported.

2. The HyperMetro dual-active feature of the underlying SAN storage is used to protect the data consistency of disaster recovery virtual machines.

What disaster recovery products does Huawei have?

1. RD: database application disaster recovery

2. UltraVR: virtualization disaster recovery

3. BCManager: integrates disaster recovery scenarios such as database and virtualization applications

4. BCManager eReplication: disaster recovery for OpenStack cloud data centers

Examination questions:

How does a database application at the application layer achieve dual-active?

Through the database software's own cluster technology, such as Oracle RAC or IBM DB2.

What is the difference between synchronous remote replication and dual-active?

Synchronous remote replication is used for active-standby disaster recovery: a write returns success only after the peer has also written the I/O. Dual-active is a dual-active data center implemented with the HyperMetro feature: when one end fails, services switch automatically to the other end and service access is not interrupted.

The two can be compared along these dimensions: RTO and RPO, implementation mode, cost, storage layer implementation, multipath software settings, service switching mode, whether the business network is connected, resource utilization, and operation mode.

Must RTO and RPO equal zero in a dual-active data center?

Not necessarily. They equal 0 in theory; whether they really equal 0 depends on the actual situation.

Does using SmartVirtualization necessarily mean heterogeneous storage?

Yes. SmartVirtualization maps the metadata of other Huawei storage models or of heterogeneous storage to the local storage.

Can the host layer of a dual-active data center be dual-active?

I did not answer yes here. I argued with the examiner for a while, and he said no: with HA across the two sites, RTO will be greater than zero.

The host layer, for example FusionSphere's HA, DRS and DPM cluster technology, provides high availability, but HA only acts after a virtual machine has failed, so the service has already been interrupted and RTO > 0. Whether RTO and RPO equal 0 therefore depends on the actual situation.

How is the host layer configured to form a cluster?

The sites at both ends join one cluster. They can join the same cluster only if the two sites can communicate with each other.

How is the computing layer of a dual-active data center implemented? How is the FusionSphere large-cluster technology you mentioned implemented? What is the difference between a local cluster and a cross-site cluster?

Add the servers of the two data centers to the same large cluster, mainly using HA and DRS. (I am not sure this is the right answer, and I do not know the answers to the last two questions.)

 

How is the network layer of a dual-active data center implemented?

(In his comments, the examiner said the answer he wanted to hear was VXLAN Layer 2 technology.)

The IP and FC networks of the data centers are interconnected. The SLB (local load balancer) balances requests within a data center, and the GSLB balances requests between the data centers.