Linux operation and maintenance — 1. CEPH distributed storage architecture and working principle


CEPH theory

  1. About CEPH

Ceph is an open source project that provides a software-defined, unified storage solution. It is a distributed storage system that offers high performance, high scalability and no single point of failure.

CEPH is a software defined storage solution
CEPH is a unified storage solution
CEPH is a cloud storage solution
CEPH official document:

  2. Ceph architecture: distributed service processes

2.1 Ceph Monitor(MON)

The Ceph monitor service process is referred to as MON. It is responsible for monitoring the health of the cluster. All cluster nodes report their status, and every status change, to the MON nodes. A MON does its job by collecting this information and maintaining the cluster map. The cluster map contains a map for each component: the MON map, OSD map, PG map, CRUSH map and MDS map. A typical Ceph cluster has several MON nodes. They form a quorum, which is a distributed decision mechanism, and use the Paxos algorithm to keep the cluster map consistent. For this reason the number of MON nodes in the cluster should be odd, and more than half of them must be available before the quorum can make a decision. This prevents the split-brain problem that is common in traditional storage systems. An OSD reports its own information only in certain cases (e.g. a new OSD is added, or an OSD finds that it or a peer is abnormal); otherwise it only sends a simple heartbeat. When a MON receives such a report, it updates the cluster map and propagates it.

MON map: maintains end-to-end information between MON nodes. It includes the Ceph cluster ID, MON hostname and IP:port, plus the creation version and last modification information of the current MON map, which help determine which version the cluster should follow.
OSD map: maintains the Ceph cluster ID, OSD count, status, weight, host, and pool-related information (e.g. pool name, pool ID, pool type, number of replicas, PG ID), as well as the creation version and last modification information of the OSD map.
PG map: maintains the PG version, timestamp, capacity ratio, latest OSD map version, PG ID, number of objects, up set of OSDs, acting set of OSDs, clean status and other PG information.
CRUSH map: maintains the storage device information of the cluster, the fault domain hierarchy and the rules that define how data is stored across fault domains.
MDS map: maintains the metadata pool ID, number of MDS daemons, status, creation time, current MDS map version and other information.
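The majority rule behind the MON quorum in 2.1 can be checked with a few lines of arithmetic. The sketch below is only an illustration of "more than half must be available"; the function names are hypothetical and are not part of any Ceph API.

```python
def quorum_size(mon_count: int) -> int:
    """Smallest number of monitors that forms a strict majority."""
    if mon_count < 1:
        raise ValueError("a cluster needs at least one monitor")
    return mon_count // 2 + 1


def tolerated_failures(mon_count: int) -> int:
    """Monitors that can fail while a majority is still available."""
    return mon_count - quorum_size(mon_count)


# Compare odd and even monitor counts.
for n in (3, 4, 5):
    print(n, quorum_size(n), tolerated_failures(n))
```

Under this rule a fourth monitor tolerates no more failures than three do, which is one way to see why an odd number of MON nodes is recommended.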

2.2 Ceph Object Storage Device Daemon(OSD)

The Ceph object storage device service process is referred to as OSD. Each OSD daemon is bound to one physical disk in the cluster. The OSD stores data on that disk in the form of objects and returns the same data when a client requests it. In general, the number of OSD daemons in a Ceph cluster equals the number of physical disks used to store user data. The OSD is also responsible for data replication, data recovery and data rebalancing, and it monitors other OSDs through the heartbeat mechanism and reports their condition to the MON. For any read or write, the client first requests the cluster map from a MON and can then perform I/O directly with the OSD. Because the client talks to the OSD directly, with no additional data processing layer in between, Ceph handles data transactions quickly.

2.3 Ceph metadata server (MDS) [optional]

The Ceph metadata server service process is referred to as MDS. An MDS is needed only in a cluster where Ceph file storage (CephFS) is enabled. It tracks the file hierarchy and stores and manages CephFS metadata. MDS metadata is itself stored on OSDs in the form of objects. In addition, the MDS provides a shared, coherent file system with an intelligent cache layer, which can greatly reduce how often the OSDs are read and written.

  3. Ceph architecture: core components


3.1 Ceph RADOS(Reliable, Autonomic, Distributed Object Store)

RADOS is a reliable, autonomic, distributed object store, also referred to as the Ceph storage cluster. All the notable features of Ceph are provided by RADOS, including data consistency, high availability, high reliability, no single point of failure, self-healing and self-management of the distributed object store. Ceph's data access methods (e.g. RBD, CephFS, radosgw and librados) are all built on RADOS. Everything in Ceph is stored in the form of objects, and RADOS is responsible for them. To keep distributed data consistent, RADOS implements data replication, fault detection and recovery, including data migration and rebalancing among cluster nodes.


3.2 librados

librados is a C library, the basic library of Ceph. It abstracts and encapsulates the functions of RADOS and exposes them as a northbound API. Through librados, applications can access the native functions of RADOS directly, which improves their performance, reliability and efficiency. Because RADOS is essentially an object storage system, the APIs provided by librados all operate on objects. The advantage of the native librados interface is that it integrates directly with application code and is convenient to use, but it does not stripe the uploaded data.

librados supports an object storage interface with asynchronous communication capability:

Storage pool operations
Read and write objects
    Create / delete
    Entire object or byte range
    Append or truncate
Create / set / get / delete xattrs
Create / set / get / delete key/value pairs
Compound operations and dual-ack semantics

3.3 Ceph RADOS Block Device(RBD)

Ceph block storage, RBD for short, is a block storage service interface built on librados. The RBD driver has been integrated into the Linux kernel (2.6.39 or later) and is supported by the QEMU / KVM hypervisor, so both can access Ceph block devices seamlessly. The Linux kernel RBD client (krbd) maps Ceph block devices, and RADOS then stores the data objects of those block devices across the cluster nodes in a distributed way.

3.4 Ceph RADOS Gateway(RGW)

The Ceph object gateway, RGW for short, is an object storage service interface built on librados. In essence it is a proxy that translates HTTP requests into RADOS requests and RADOS responses back into HTTP, providing a RESTful object storage interface that is compatible with S3 and Swift.

3.5 Ceph File System(CephFS)

The Ceph file system is referred to as CephFS. It provides a POSIX-compliant distributed file system of any size on top of the RADOS layer. CephFS relies on the MDS to manage its metadata (the file hierarchy), which separates the metadata from the original data. The MDS does not serve data to the client directly, so it avoids becoming a single point of failure, which helps reduce software complexity and improve reliability.

  4. Ceph architecture: internals

4.1 Ceph Client

A Ceph client is any entity that uses Ceph storage services. It may be a piece of client software, a host or virtual machine, or an application. The client connects to the Ceph storage servers and uses the corresponding storage service through one of the interface types Ceph provides (service interface, native interface, C library).


4.1.1 client data striping (sharding)

When users store data through the RBD, RGW or CephFS client interfaces, the data is transparently transformed into objects that RADOS can process uniformly. This process is called data striping or sharding.

Anyone familiar with storage systems will know striping. It improves storage performance and throughput by dividing sequential data into segments and storing them on multiple storage devices. The most common form is RAID 0, and Ceph's striping is similar. To make full use of Ceph's parallel I/O capability, make full use of client-side striping. Note that the native librados interface does not stripe: if you upload a 1 GB file through librados, it lands on disk as a single 1 GB object. Objects in the storage cluster do not stripe either. The data is striped by the three client types above and then stored in cluster objects.

Striping process:

The data is divided into stripe unit blocks.
The stripe unit blocks are written into an object until the object reaches its maximum capacity (4 MB by default).
A new object is then created for the remaining stripe unit blocks, and writing continues.
These steps repeat until all the data has been written.

Suppose the maximum object size is 4 MB and each stripe unit block is 1 MB, and we save an 8 MB file. The first 4 MB is stored in object0, and object1 is then created to hold the next 4 MB.

As stored files grow, the client stripes their data across multiple objects. Because those objects map to different PGs, and the PGs in turn map to different OSDs, the I/O capacity of the physical disk behind each OSD can be fully used and each parallel write can proceed at the maximum rate. As the stripe count increases, the improvement in write performance is considerable. In the example illustrated for this section, the data is split across two object sets, and the stripe unit blocks are written in the order stripe unit 0 to 31.

Three important Ceph parameters affect striping:

Order: the object size, expressed as a power of two. For example, order = 22 means 2^22 bytes, which is 4 MB. The object should be large enough to hold the stripe unit blocks, and its size should be a multiple of the stripe unit size. The object size recommended by Red Hat is 16 MB.
Stripe_unit: the width of a stripe unit block. The client divides the data written to an object into blocks of equal width (the last block may be narrower). To make full use of object space, the stripe unit should divide the object size evenly. For example, with a 4 MB object and a 1 MB unit block, one object holds four unit blocks.
Stripe_count: the number of stripes. The client writes each batch of stripe_count stripe unit blocks across the objects of an object set.
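How the three parameters interact can be sketched as an address calculation: given a byte offset in the client's data, find the object it lands in and the offset inside that object. This is a minimal sketch that assumes the round-robin layout described above (stripe units dealt across the stripe_count objects of an object set until those objects are full); the function name is hypothetical, not a Ceph API.

```python
MiB = 1024 * 1024


def locate(offset, stripe_unit, stripe_count, object_size):
    """Map a byte offset to (object number, offset inside that object)."""
    if object_size % stripe_unit != 0:
        raise ValueError("object_size must be a multiple of stripe_unit")
    units_per_object = object_size // stripe_unit
    unit_no = offset // stripe_unit              # index of the stripe unit
    stripe_no = unit_no // stripe_count          # row within the layout
    stripe_pos = unit_no % stripe_count          # column: object within the set
    object_set = stripe_no // units_per_object   # which object set
    object_no = object_set * stripe_count + stripe_pos
    unit_in_object = stripe_no % units_per_object
    return object_no, unit_in_object * stripe_unit + offset % stripe_unit


# The 8 MB example: 4 MB objects, 1 MB stripe units, one object per set.
print(locate(0, MiB, 1, 4 * MiB))        # first byte of the file
print(locate(4 * MiB, MiB, 1, 4 * MiB))  # first byte of the second half

# With stripe_count = 4, consecutive stripe units go to different objects.
print([locate(i * MiB, MiB, 4, 4 * MiB)[0] for i in range(5)])
```

With a stripe count of 1 the layout degenerates to filling one object after another, which matches the 8 MB example; larger stripe counts spread neighbouring units over several objects so they can be written in parallel.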

Note: because the client specifies a single pool when writing, all the objects produced by striping are mapped to PGs in that same pool.

4.1.2 client watch and notify on objects

The client can register a persistent watch on an object and keep a session open with the primary OSD. This is the client's watch and notify on the object. It lets clients that watch the same object use the object as a communication channel.

4.1.3 exclusive lock of client

The client's exclusive lock feature provides an exclusive lock for RBD block devices. It helps resolve the conflicts that arise when multiple clients operating on the same RBD block device try to write to the same object at the same time. The feature relies on the client's watch and notify on objects. If one client has already placed an exclusive lock on an object, other clients detect the lock before writing and abandon the write. When this feature is set, only one client can modify the RBD block device at a time. It is typically used for operations that change the internal structure of the block device, such as creating or deleting snapshots. Mandatory exclusive locking is not enabled by default. It must be enabled explicitly with the --image-features option when creating an image.

rbd -p mypool create myimage --size 102400 --image-features 5
# 5 = 4 + 1
# 1: enable the layering feature
# 4: enable the exclusive lock feature

4.1.4 object mapping index of client

As data is written to an RBD image, the client can track which objects exist (building an object map), so that on reads and writes it knows whether the corresponding object exists without the overhead of querying the OSDs. The object map is kept in the memory of the librbd client. This feature benefits several RBD image operations:

Shrink: delete only the tail objects that exist.
Export: export only objects that exist.
Copy: copy only objects that exist.
Flatten: copy only existing parent objects to the clone image.
Delete: delete only objects that exist.
Read: read only objects that exist.

The object map feature is not enabled by default. It also has to be enabled explicitly with the --image-features option when creating an image.

rbd -p mypool create myimage --size 102400 --image-features 13
# 13 = 1 + 4 + 8
# 1: enable the layering feature
# 4: enable the exclusive lock feature
# 8: enable the object map feature
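The numeric argument in the two commands above is a sum of feature bits. The following sketch shows the arithmetic using only the three bit values given in this article; the enum and the helper function are hypothetical names for illustration, not part of the rbd tooling.

```python
from enum import IntFlag


class ImageFeature(IntFlag):
    # Bit values as listed in the command comments above.
    LAYERING = 1
    EXCLUSIVE_LOCK = 4
    OBJECT_MAP = 8


def decode(value: int) -> list:
    """Names of the known feature bits that are set in value."""
    return [f.name for f in ImageFeature if value & f]


print(int(ImageFeature.LAYERING | ImageFeature.EXCLUSIVE_LOCK))
print(decode(5))
print(decode(13))
```

Note that both example values include the exclusive lock bit (4).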

4.2 CRUSH(Controlled Replication Under Scalable Hashing)

CRUSH is a controllable, scalable, distributed algorithm for placing replica data. In essence it is a pseudo-random data distribution algorithm, similar to consistent hashing. It is Ceph's intelligent data distribution mechanism and manages how PGs are distributed across the whole OSD cluster. The purpose of CRUSH is clear: to determine how a PG is related to OSDs. It is the jewel in Ceph's crown.

4.2.1 dynamic calculation metadata

Traditional physical storage systems store both the original data and its metadata, and the metadata records the address at which the original data is stored on the storage nodes and disk arrays. Every time new data is written, the metadata is updated first with the physical address where the data will be stored, and only then is the data written. Ceph discards this mechanism. It uses the CRUSH algorithm to dynamically calculate the PG that stores an object, and to calculate the OSD acting set of that PG (the acting set is the set of active OSDs, and the first OSD in the set is the primary OSD). CRUSH computes metadata on demand instead of storing it, so Ceph avoids the limitations of traditional metadata storage and achieves better capacity, availability and scalability.

4.2.2 fault domain / performance domain division based on CRUSH buckets

In addition, CRUSH has a unique infrastructure awareness: it can identify the physical component topology (fault domains and performance domains) of the entire infrastructure, including disk, node, rack, row, switch, power circuit, room, data center and storage media type. These correspond to the various CRUSH bucket types, and a bucket indicates the physical location of a device. This awareness lets clients write data across fault domains, which keeps the data safe. CRUSH buckets are stored in the CRUSH map, which also holds a set of definable rules that tell CRUSH how to replicate data for different pools. CRUSH enables Ceph to manage and heal itself: when a component in a fault domain fails, CRUSH can detect which component failed and automatically trigger the corresponding data migration and recovery. With CRUSH we can design a highly reliable storage infrastructure with no single point of failure. CRUSH also lets clients write data to specific types of storage media, such as SSD or SATA. CRUSH rules determine the scope of the fault domains and performance domains.

4.3 Object

Object. It is the smallest storage unit of Ceph. Each object contains an identifier that is unique in the cluster, binary data, and metadata made up of a set of key-value pairs. The raw data and the metadata are bound together and carry the RADOS globally unique identifier, the oid. Whatever is used in the upper layer (RBD, CephFS or radosgw), the data is eventually stored on OSDs in the form of objects. When RADOS receives a write request from a client, it converts the received data into objects, and the OSD daemon then writes the data to a file on the OSD's file system.

4.4 Placement Group(PG)

Placement group, referred to as PG. A PG is a logical collection of objects. A Ceph storage pool may hold millions of data objects or more. Because Ceph has to handle data persistence (replicas or erasure-coded chunks), scrubbing, verification, replication, rebalancing and recovery, managing data at the granularity of individual objects would create scalability and performance bottlenecks. Ceph solves this by introducing the PG layer. CRUSH assigns each object to a PG and then maps each PG to a set of OSDs. PGs are essential to Ceph's scalability and performance. When data is written to Ceph, it is first broken down into a group of objects; a hash is then computed and a PG ID is generated from the object name, the replication level and the total number of PGs in the system; finally the objects are distributed to PGs according to their PG IDs. Without PGs it would be very hard to manage and track the replication and propagation of millions of objects across thousands of OSDs. Managing and replicating PGs, each of which contains many objects, greatly reduces the consumption of computing resources. Each PG itself consumes some computing resources (CPU, RAM), so the storage administrator should carefully calculate the number of PGs in the cluster.

4.4.1 calculation of PG number

As mentioned above, the number of PGs affects storage performance to some extent. The number of PGs in a pool can be changed dynamically, but it is still advisable to have a sound estimate at deployment planning time. There are several common ways to calculate the number of PGs:

Calculate the total number of cluster PG

Total PG = (total OSD x 100) / maximum copies

For example, if the cluster has 160 OSDs and the number of copies is 3, the formula gives 5333.3. This value is then rounded up to the next power of 2, so the final result is 8192 PGs.

Calculate the total number of PG subordinates in pool

Total PG = ((total OSD x 100) / maximum number of replicas) / number of pools

This result is also rounded up to the next power of 2.
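Both formulas can be written as one small function. This is a sketch of the arithmetic only, assuming the result is rounded up to the next power of 2 (as the 8192 figure in the example implies); the function and parameter names are hypothetical.

```python
def round_up_pow2(x: float) -> int:
    """Smallest power of two that is >= x."""
    n = 1
    while n < x:
        n *= 2
    return n


def total_pgs(osds: int, replicas: int, pools: int = 1, per_osd: int = 100) -> int:
    """(total OSD x 100) / maximum copies / number of pools, rounded up."""
    return round_up_pow2(osds * per_osd / replicas / pools)


print(total_pgs(160, 3))            # cluster-wide total from the example
print(total_pgs(160, 3, pools=4))   # per-pool value if there were 4 pools
```

The four-pool call is an invented input to show the second formula; only the 160 OSD, 3 copy case comes from the text.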
4.4.2 PG and PGP

PG – Placement Group
PGP – Placement Group for Placement purpose
pg_num – number of placement groups in a pool
When pg_num is increased for a pool, every PG of this pool splits in half, but the resulting PGs all remain mapped to their parent OSD. Up to this point Ceph does not start rebalancing. When you then increase the pgp_num value for the same pool, PGs start to migrate from the parent to other OSDs, and cluster rebalancing starts. This is how PGP plays an important role.

As the name implies, PGP is the set of PGs used for placement. The number of PGPs should match the total number of PGs. For a pool, when you increase the number of PGs you should also change the number of PGPs to match, so that the cluster triggers rebalancing.

When pg_num increases, objects in the old PGs are redistributed into the new PGs, but the original PG-to-OSD mapping is kept, so no data moves between OSDs.
When pgp_num increases, the new PGs are balanced across OSDs, while the old PGs keep their original OSD mapping.
PGP determines the distribution of PGs.

4.4.3 PG peering operation, acting set and up set

The OSD daemons perform a peering operation for each PG replica to make sure the PG copies held by the primary and replica OSDs are consistent. These OSDs are organized as an acting set. The first element of the acting set is the primary OSD, which holds the primary copy of the PG. The rest are replica OSDs, which hold the second and third copies of the PG (assuming three replicas). The primary OSD is responsible for peering the PG with the replica OSDs. When the primary OSD goes down, it is first removed from the up set, and the second OSD is promoted to primary. The PGs on the failed OSD are recovered by synchronizing them to another OSD, and that new OSD is added to the acting set and up set, which keeps the whole cluster highly available.
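The promotion described above can be pictured with a list whose first element is the primary. This is a deliberately simplified toy: in a real cluster the replacement OSD and its position are chosen by CRUSH, whereas here the replacement is passed in by hand and appended at the end. All names are hypothetical.

```python
def fail_osd(acting_set, failed_osd, replacement_osd):
    """Return a new acting set after failed_osd is removed and replaced."""
    survivors = [osd for osd in acting_set if osd != failed_osd]
    # If the old primary failed, survivors[0] (the former second OSD)
    # is now first in the list, i.e. it has been promoted to primary.
    return survivors + [replacement_osd]


acting = [1, 2, 3]              # OSD 1 is the primary
after = fail_osd(acting, 1, 7)  # OSD 1 goes down, OSD 7 joins
print(after, "primary:", after[0])
```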



4.5 Pool

Storage pool. It is an administrator-oriented logical partition used to isolate PGs and objects. In short, a pool is a namespace defined by an administrator. Different pools can handle data in completely different ways, with their own replica size (number of replicas), PG count, CRUSH rules, snapshots, owner and authorization. Storage pools can be created for specific types of data, such as block devices or object gateways, or simply to isolate different users.

Administrators can set the owner and access rights of a pool, and pools also support snapshots. When users store data through the APIs, they must specify the pool in which the objects are stored. Administrators can tune each pool separately, for example the number of PG replicas, the amount of scrubbing, and the size of data blocks and objects. Pools provide an organized way to manage storage. Each pool is spread across the OSDs of the cluster nodes, which gives enough flexibility. Instead of a number of PG replicas, you can also specify an erasure code rule set, which provides the same level of reliability as replication while consuming less space.

When data is written to a pool, the rule set for that pool is first looked up in the CRUSH map; it describes the replica count of the pool. When a Ceph cluster is deployed, some default storage pools (e.g. data, metadata, rbd) are created.

A pool also has a notion of capacity, but a pool's size is only useful for setting a maximum capacity or a QoS limit; it is not real capacity. Because pools overlap on the same OSDs, the sum of all pool capacities is larger than the global size of the Ceph cluster, so the total capacity of the pools is meaningless.


  5. Design ideas of Ceph

Here I would like to summarize the design rationale behind each internal component of Ceph. Taking the object as the smallest storage unit shows Ceph's ambition to become the storage architecture of a new generation (the cloud era). Metadata is information about the original data; it determines where the data is stored and where it is read from. Traditional storage systems track metadata by maintaining a centralized lookup table, which leads directly to low performance and limited capacity, a single point of failure (low reliability), and data consistency problems that limit scalability. To realize its ambition, Ceph had to break with the traditional design and manage metadata more intelligently: store the original data and its metadata in a decentralized way, and locate data by dynamic calculation. This is object + CRUSH.

On this basis, RADOS was introduced to solve data consistency, high availability and self-management in a distributed environment. PGs were introduced to reduce the number of entities RADOS has to manage and the address space that has to be searched for objects, which improves performance and reduces internal implementation complexity. PGs are also the unit of data migration, so with PGs Ceph avoids operating on objects directly wherever it can. Pools were introduced to provide a friendlier resource management and operation model; for example, the number of PGs and the number of PG replicas are set per pool. On top of this reliable, autonomic, distributed object store, several upper-layer interfaces that are more convenient for applications and clients (librados, RBD, RGW, CephFS and so on) are built, yielding a unified storage solution with an abstract lower layer and heterogeneous, compatible upper layers.

Looking back, the reason for Ceph's success is simple. Isn't the "programmability" that comes from its SDS (software defined storage) genes exactly what the cloud era wants?
  6. Data reading and writing principles of Ceph

6.1 three mapping processes of data storage

In CEPH storage system, data storage is divided into three mapping processes:

Step 1. Map (slice) the file data to be stored into objects that RADOS can process (1:n): (ino, ono) -> oid

ino: the file's metadata, the unique identifier of the file
ono: the sequence number of an object produced by slicing the file. By default the slice size is 4 MB
oid: the unique identifier of the object, which records the relationship between the object and the file

Step 2. Map objects to PGs (n:1): hash(oid) & mask -> pgid. Ceph uses a fixed hash function to map the oid to an approximately uniformly distributed pseudo-random value, and then bitwise-ANDs this value with the mask to get the PG ID.

mask: m - 1, where m is the total number of PGs and is a power of 2

Step 3. Map PGs to OSDs (n:m): CRUSH(pgid) -> (osd1, osd2, osd3). The pgid is passed to the CRUSH algorithm, which calculates an array of OSDs (its length equals the number of PG replicas).

The file data is finally stored in these OSDs.

Pseudo code of scheduling algorithm:

locator = object_name
obj_hash = hash(locator)
pg = obj_hash % num_pg
osds_for_pg = crush(pg) # returns a list of osds
primary = osds_for_pg[0]
replicas = osds_for_pg[1:]
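The three mappings can be tried end to end with stand-ins for the parts that only exist inside Ceph. In the sketch below, zlib.crc32 replaces Ceph's own hash function and toy_crush uses rendezvous (highest-random-weight) hashing in place of the real CRUSH algorithm, which also takes the CRUSH map and rules into account. The object naming scheme and all function names are assumptions made for illustration.

```python
import hashlib
import zlib

OBJECT_SIZE = 4 * 1024 * 1024  # default slice size from step 1


def file_to_oids(ino, size, object_size=OBJECT_SIZE):
    """Step 1: (ino, ono) -> oid. The oid format here is illustrative."""
    count = max(1, -(-size // object_size))  # ceiling division
    return [f"{ino}.{ono}" for ono in range(count)]


def oid_to_pg(oid, pg_num):
    """Step 2: hash(oid) & mask -> pgid (crc32 is a stand-in hash)."""
    if pg_num < 1 or (pg_num & (pg_num - 1)) != 0:
        raise ValueError("pg_num must be a power of 2")
    return zlib.crc32(oid.encode()) & (pg_num - 1)


def toy_crush(pgid, osds, replicas=3):
    """Step 3: pgid -> ordered OSD list (rendezvous hashing, NOT real CRUSH)."""
    def score(osd):
        return hashlib.sha256(f"{pgid}:{osd}".encode()).digest()
    return sorted(osds, key=score)[:replicas]


osds = list(range(12))
for oid in file_to_oids(ino=1234, size=8 * 1024 * 1024):
    pg = oid_to_pg(oid, pg_num=128)
    acting = toy_crush(pg, osds)
    print(oid, "-> pg", pg, "-> primary", acting[0], "replicas", acting[1:])
```

Every step is a pure calculation on the object name and cluster description, so any client holding the same inputs arrives at the same OSD list without consulting a lookup table.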

6.2 I / O location of client

When the client reads or writes data, it first connects to a MON node and downloads the latest cluster map. From the cluster map the client reads the CRUSH map (including the CRUSH rules and CRUSH buckets) and the OSD map (including the status of all pools and all OSDs). The CRUSH rules are the data mapping policy: they determine how many copies each object has and how those copies are stored. When data is written to a pool, the pool name is used to match a CRUSH rule. The client uses this information to run the CRUSH algorithm locally and obtain the IP:port of the primary OSD, and can then connect directly to that OSD node to transfer data. The CRUSH calculation takes place on the client side, so no centralized master node is needed for addressing. Because the client takes on this part of the work, the load on the servers is reduced further.


The client enters the pool name and object ID (for example: pool = "pool1", object = "obj 1").
CRUSH takes the object ID and hashes it.
CRUSH computes the PG ID from that hash and the total number of PGs (e.g. if the hash is 186 and the total number of PGs is 128, the remainder is 58, so the object is stored in PG 58).
CRUSH calculates the primary OSD for that PG ID.
The client gets the pool ID from the pool name (e.g. pool1 = 4).
The client prepends the pool ID to the PG ID (e.g. 4.58).
The client communicates directly with the primary OSD in the acting set to perform the object's I/O.
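The numbers in the steps above can be reproduced in a few lines. The pool table and function name are hypothetical, the hash value 186 is taken from the example rather than computed, and the PG number is printed in decimal to match the text.

```python
POOL_IDS = {"pool1": 4}  # pool name -> pool ID, as in the example


def pg_id(pool_name: str, object_hash: int, pg_num: int) -> str:
    """Join the pool ID and the PG number into '<pool>.<pg>'."""
    pg = object_hash % pg_num
    return f"{POOL_IDS[pool_name]}.{pg}"


print(pg_id("pool1", 186, 128))
```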

6.3 replica I / O of master-slave OSD

First, note that Ceph uses a primary-replica (master-slave) model for reading and writing data. A client can send read and write requests only to the primary OSD of the object. Ceph is also a strongly consistent distributed store, which means replication is synchronous rather than asynchronous: a write counts as successful only after the primary and all replica OSDs have written the data. The client uses the CRUSH algorithm to calculate the PG ID that the object maps to and the acting set. The first element of the acting set is the primary OSD, and the rest are replica OSDs.

6.4 log I / O for performance improvement

Strongly consistent synchronization gives high data consistency, but it also has drawbacks, such as lower write performance. To improve write performance, Ceph adopts a common remedy: a journal (log) caching mechanism.

When there is a burst of writes, Ceph first collects scattered, random I/O requests in a cache, merges them, and then submits them to the kernel together. This improves efficiency, but if the OSD node crashes, the data in the cache is lost. For this reason a Ceph OSD consists of two parts: the OSD journal (journal file) and the OSD data. Each write takes two steps: the object is first written to the journal, and then the same object is written from the journal to the data part. When an OSD restarts after a crash, it automatically tries to recover from the journal the cached data lost in the crash. The journal therefore sees very intensive I/O, and because each piece of data is written twice, a large share of the hardware's performance is consumed. For performance reasons, journals are placed on SSDs in production. SSDs can significantly improve throughput by reducing access time and latency.

If SSDs are used for journals, we can create logical partitions on each physical SSD as journal partitions, so that each journal partition maps to one OSD data partition. The default journal size is 5 GB. In this kind of deployment it is important not to put too many journals on the same SSD, so that its limits are not exceeded and overall performance does not suffer. It is recommended to store the journals of no more than 4 OSDs on each SSD. Note also that a single SSD holding several journals becomes a single point of failure, so RAID 1 is recommended to avoid this.
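The two-step write and the replay after a crash described in this section can be modelled with a toy class. This is a conceptual sketch only: both parts live in memory here, whereas the real journal is durable storage, and the class and method names are hypothetical.

```python
class ToyJournaledStore:
    def __init__(self):
        self.journal = []  # stands in for the durable journal partition
        self.data = {}     # stands in for the OSD data partition

    def write(self, name, payload):
        # Step 1: record the write in the journal before touching the data part.
        self.journal.append((name, payload))

    def apply_journal(self):
        # Step 2: copy journaled writes into the data part, then trim the journal.
        for name, payload in self.journal:
            self.data[name] = payload
        self.journal.clear()

    def recover(self):
        # After a crash, replaying the journal restores writes not yet applied.
        self.apply_journal()


store = ToyJournaledStore()
store.write("obj1", b"hello")
pending = "obj1" not in store.data  # a crash here leaves the data part behind
store.recover()
print(pending, store.data["obj1"])
```

The model also makes the cost visible: every payload is handled twice, once for the journal and once for the data part.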
6.5 UML process of data writing to OSD

The client creates the cluster handle.
The client reads the configuration file.
The client connects to the mon to get a copy of the cluster map.
According to the failure domain and data distribution rules recorded in the CRUSH map, the client finds the primary OSD to use for its read and write I/O.
According to the number of pool replicas recorded in the OSD map (assume 3 replicas), the data is first written to the primary OSD. The primary OSD replicates the same data to the two replica OSDs and waits for them to confirm that the write is complete.
When the replica OSDs have completed the write, they send an acknowledgement to the primary OSD.
Finally, the primary OSD returns an acknowledgement to the client to confirm that the whole write operation is complete.
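
The replication and acknowledgement steps above can be sketched as a small simulation. The `OSD` class and `replicated_write` function are hypothetical names used for illustration only. Placement by CRUSH and the network are left out, so this shows only the ordering of the writes and acknowledgements.

```python
class OSD:
    """Toy OSD that stores objects in memory (hypothetical, not CEPH code)."""

    def __init__(self, osd_id):
        self.osd_id = osd_id
        self.objects = {}

    def store(self, name, payload):
        self.objects[name] = payload
        return True  # acknowledgement that the write is complete


def replicated_write(primary, replicas, name, payload):
    """Write to the primary, forward to the replicas, ack only if all confirm."""
    primary.store(name, payload)
    acks = [replica.store(name, payload) for replica in replicas]
    # The primary replies to the client only after every replica has confirmed.
    return all(acks)


osds = [OSD(i) for i in range(3)]  # 3 replicas, as assumed in the text
primary, replicas = osds[0], osds[1:]
client_ack = replicated_write(primary, replicas, "obj-1", b"payload")
```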

6.6 UML process of data writing to new primary OSD

The client connects to the mon to get a copy of the cluster map.
Because the new primary OSD does not hold any PG data yet, it proactively reports this to the mon, and OSD 2 is temporarily promoted to temporary primary OSD.
The temporary primary OSD synchronizes the PG to the new primary OSD.
Client I/O goes directly to the temporary primary OSD.
After receiving the data, the temporary primary OSD copies it to the other two OSDs at the same time and waits for them to confirm that the write is complete.
Once the three OSDs have written the data synchronously, a reply signal is returned to the client to confirm that the whole write operation is complete.
Once the data on the new primary OSD is fully synchronized with the temporary primary OSD, the temporary primary OSD is demoted to a replica OSD again.
  1. Self management of CEPH
7.1 heartbeat mechanism of OSD

The heartbeat is the most common fault detection mechanism. CEPH uses heartbeats to detect whether each OSD is running normally, so that a failed node can be found in time and the corresponding failure handling can start. The up / down status of an OSD reflects whether it is healthy. After joining the cluster, an OSD reports its heartbeat to the mon at regular intervals. However, an OSD that has already failed cannot report its own down status. Therefore, OSDs that are associated with each other also check each other's status, and if one finds that its partner is down, it reports this to the mon on the partner's behalf. The mon also pings the OSDs regularly to determine whether they are running. In short, CEPH can determine that an OSD node has failed in two ways: a partner OSD reports the failed node, or the mon notices missing heartbeats from the OSD.

The OSD service process listens on four ports: public, cluster, front, and back:

Public port: listens for connections from the mon and clients.
Cluster port: listens for connections from OSD peers.
Front port: listens on the network interface that clients use to connect to the cluster.
Back port: listens on the network interface used inside the cluster.


Characteristics of OSD heartbeat mechanism:

Timely: a partner OSD can discover a failed node within seconds and report it to the mon, and the mon takes the failed OSD offline within a few minutes.
Appropriate pressure: because partner OSDs cooperate in reporting, the heartbeat between mon and OSD is more of an insurance measure, so the interval at which an OSD sends heartbeats to the mon can be as long as 600 seconds, and the mon's detection threshold can be as long as 900 seconds. CEPH spreads the load of fault detection from the central node across all OSDs, which improves the reliability of the central mon node and the scalability of the whole cluster.
Tolerance of network jitter: after receiving an OSD's report about its partner OSD, the mon does not take the target OSD offline immediately, but periodically checks until the following conditions are met:
    The failure time of the target OSD is greater than a threshold determined dynamically from the fixed osd_heartbeat_grace value and the historical network conditions.
    The number of reports from different hosts reaches mon_osd_min_down_reports.
    The failure report has not been cancelled by the source OSD before the first two conditions are met.
Diffusion: as the central node, the mon does not try to broadcast a notification to all OSDs and clients after updating the OSD map, but waits lazily for OSDs and clients to fetch it. This reduces the pressure on the mon and simplifies the interaction logic.
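
The three jitter-tolerance conditions can be sketched as a small decision function. This is a simplified illustration, not the mon's actual logic: the `FailureTracker` class is a hypothetical name, the grace period is a fixed number instead of a dynamically adjusted threshold, and the values 20 and 2 are illustrative stand-ins for `osd_heartbeat_grace` and `mon_osd_min_down_reports`.

```python
from dataclasses import dataclass, field


@dataclass
class FailureTracker:
    """Tracks failure reports about one target OSD (hypothetical sketch)."""

    grace: float = 20.0        # stand-in for osd_heartbeat_grace, in seconds
    min_down_reports: int = 2  # stand-in for mon_osd_min_down_reports
    # Maps reporter host -> time since which the target has been failing.
    reports: dict = field(default_factory=dict)

    def report(self, reporter_host, failed_since):
        self.reports[reporter_host] = failed_since

    def cancel(self, reporter_host):
        # A source OSD withdraws its report, e.g. after a late heartbeat reply.
        self.reports.pop(reporter_host, None)

    def should_mark_down(self, now):
        # Count only uncancelled reports whose failure time exceeds the grace.
        ripe = [h for h, since in self.reports.items() if now - since > self.grace]
        return len(ripe) >= self.min_down_reports


tracker = FailureTracker()
tracker.report("host-a", failed_since=100.0)
tracker.report("host-b", failed_since=101.0)
early = tracker.should_mark_down(now=110.0)  # grace not yet exceeded
late = tracker.should_mark_down(now=125.0)   # both reports older than grace
tracker.cancel("host-b")
after_cancel = tracker.should_mark_down(now=125.0)
```

Keying the reports by host rather than by OSD mirrors the requirement that reports come from different hosts, so a single misbehaving machine cannot take a healthy OSD offline on its own.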

7.2 heartbeat detection between OSDs


OSDs in the same PG send PING/PONG messages to each other.
The messages are sent every 6 seconds.
If no heartbeat reply is received within 20 seconds, the peer is added to the failure queue.
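
The timing rule above can be written as a short check. This is a sketch only: the constant and function names are hypothetical, the 6 s and 20 s values are taken from the text, and timestamps are plain numbers in seconds.

```python
HEARTBEAT_INTERVAL = 6  # seconds between PING messages, from the text
HEARTBEAT_GRACE = 20    # seconds without a reply before a peer counts as failed


def failed_peers(last_reply, now, grace=HEARTBEAT_GRACE):
    """Return the peers whose last PONG is older than the grace period."""
    return sorted(peer for peer, seen in last_reply.items() if now - seen > grace)


# Time of the last PONG received from each peer OSD in the same PG.
last_reply = {"osd.1": 100, "osd.2": 82, "osd.3": 79}
failure_queue = failed_peers(last_reply, now=100)
```

With a 6 second interval, a peer has to miss several consecutive replies before it crosses the 20 second grace period, so one lost message does not put it in the failure queue.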

7.3 heartbeat mechanism between OSD and mon
[Figure: heartbeat mechanism between OSD and mon]

7.4 PG migration and rebalancing during OSD expansion

When a new OSD is added to the CEPH cluster, the cluster map is updated. This changes the input parameters of the CRUSH calculation and therefore indirectly changes the location of objects. The CRUSH algorithm is pseudo-random but places data evenly, so CRUSH starts a rebalancing operation that migrates as little data as possible. Generally, the amount of migrated data is roughly the total amount of cluster data divided by the number of OSDs. For example, in a cluster with 50 OSDs, adding one new OSD migrates only about 1/50 (2%) of the data. All the old OSDs move data in parallel, so the migration completes quickly. In production, for heavily used CEPH clusters (large data volume and high write rate), it is recommended to set the weight of the new OSD to 0 and then increase it gradually. This reduces the impact of rebalancing on cluster performance. The rebalancing mechanism ensures that all disks are used evenly, which improves cluster performance and keeps the cluster healthy.
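
The "about 1/50" figure can be checked with a small experiment. The sketch below does not implement CRUSH. It uses highest-random-weight (rendezvous) hashing as a simpler stand-in that shares the relevant property: placement is pseudo-random but even, and adding a node moves only the data that lands on the new node. The function names and the object count are assumptions made for illustration.

```python
import hashlib


def score(obj, osd):
    """Deterministic pseudo-random score for an (object, OSD) pair."""
    digest = hashlib.sha256(f"{obj}:{osd}".encode()).digest()
    return int.from_bytes(digest[:8], "big")


def place(obj, osds):
    # Rendezvous hashing: the OSD with the highest score holds the object.
    return max(osds, key=lambda osd: score(obj, osd))


objects = [f"obj-{i}" for i in range(5000)]
old_osds = list(range(50))   # cluster with 50 OSDs
new_osds = old_osds + [50]   # one new OSD added

before = {obj: place(obj, old_osds) for obj in objects}
after = {obj: place(obj, new_osds) for obj in objects}

moved = [obj for obj in objects if before[obj] != after[obj]]
fraction_moved = len(moved) / len(objects)
print(f"moved {fraction_moved:.2%} of objects")
```

Every object that moves ends up on the new OSD, and the moved fraction should come out close to 2%, in line with the estimate in the text.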

Before expansion:
[Figure: PG placement across OSDs before expansion]

After expansion:
[Figure: PG placement across OSDs after expansion]

As can be seen from the figure above, the PGs drawn with dotted lines migrate automatically to the new OSD.

7.5 data scrubbing (scrub)

Data scrubbing is how CEPH maintains data consistency and cleanliness. The OSD daemons scrub the objects in each PG: they compare the metadata of the objects across the replicas of the PG and catch exceptions or certain file system errors. This light scrubbing is generally scheduled daily. OSDs also perform deep scrubbing, which compares the data itself bit by bit. Deep scrubbing can find bad sectors on a drive and is generally scheduled weekly.
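
The difference between the two levels of checking (a metadata comparison and a bit-by-bit comparison) can be sketched as follows. This is an illustration under assumptions, not CEPH's implementation: replicas are modelled as dictionaries mapping object names to bytes, metadata is reduced to name and size, the content comparison uses SHA-256 digests, and the function names are hypothetical.

```python
import hashlib


def light_scrub(replicas):
    """Compare object metadata (here: name and size) across replicas."""
    views = [{name: len(data) for name, data in r.items()} for r in replicas]
    return all(view == views[0] for view in views)


def deep_scrub(replicas):
    """Compare object contents across replicas; return mismatching names."""
    names = set().union(*replicas)
    mismatched = []
    for name in sorted(names):
        digests = {hashlib.sha256(r.get(name, b"")).hexdigest() for r in replicas}
        if len(digests) > 1:
            mismatched.append(name)
    return mismatched


good = {"obj-1": b"abcd", "obj-2": b"wxyz"}
rotted = {"obj-1": b"abcd", "obj-2": b"wxyZ"}  # same size, one byte differs
replicas = [good, dict(good), rotted]

light_ok = light_scrub(replicas)  # metadata still matches
deep_bad = deep_scrub(replicas)   # content comparison finds the damage
```

In this example the cheap metadata check passes because the damaged object kept its size, and only the content comparison detects it. That is why the more expensive deep check is still worth running, but less often.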