Google File System

Time: 2021-10-27

Design

  1. The system is built from many inexpensive commodity machines, so component failure is the norm; the system must constantly monitor itself and detect, tolerate, and recover from faults
  2. Most files are 100 MB or larger, and multi-GB files are common; these are the main optimization target. Small files are supported but need not be handled efficiently
  3. Reads are either large streaming reads or small random reads. Streaming reads fetch hundreds of KB to 1 MB at a time, while random reads typically touch only a few KB and need not be efficient
  4. Writes are mainly large sequential appends; once written, data is rarely modified. Small random writes are supported but need not be efficient
  5. Sustained high bandwidth matters more than low latency

Architecture

[Figure: GFS architecture]

The system consists of a single master and multiple chunkservers. A file is divided into fixed-size chunks (64 MB by default), and each chunk is replicated (3 replicas by default) on different chunkservers

The master stores the file system metadata: the namespace, access control information, the mapping from files to chunks, and the current locations of chunks

The master exchanges periodic heartbeat messages with each chunkserver, using them to issue instructions and collect chunkserver state

The client asks the master for a file's metadata and caches it, then communicates directly with the chunkservers for reads and writes
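A minimal sketch of this read path, with hypothetical types and stubbed-out RPCs (the real GFS client library and wire protocol are not public):

```go
// Hypothetical sketch of the read path: map (file, offset) to a chunk index,
// ask the master for the chunk handle and replica locations (caching the
// reply), then read the byte range directly from a chunkserver.
package main

import "fmt"

const chunkSize = 64 << 20 // 64 MB

type ChunkInfo struct {
	Handle    uint64   // globally unique chunk id assigned by the master
	Locations []string // chunkservers currently holding a replica
}

var chunkCache = map[string]ChunkInfo{} // (path#index) -> cached master reply

func lookupChunk(path string, index int64) ChunkInfo {
	key := fmt.Sprintf("%s#%d", path, index)
	if info, ok := chunkCache[key]; ok {
		return info // served from the client cache, no master interaction
	}
	info := askMaster(path, index) // one RPC to the master
	chunkCache[key] = info
	return info
}

func Read(path string, offset, length int64) []byte {
	var out []byte
	for length > 0 {
		info := lookupChunk(path, offset/chunkSize)
		inChunk := offset % chunkSize
		n := chunkSize - inChunk // stop at the chunk boundary
		if n > length {
			n = length
		}
		out = append(out, readFromChunkserver(info, inChunk, n)...)
		offset, length = offset+n, length-n
	}
	return out
}

// Stubs standing in for the real RPCs.
func askMaster(path string, index int64) ChunkInfo         { return ChunkInfo{} }
func readFromChunkserver(c ChunkInfo, off, n int64) []byte { return make([]byte, n) }

func main() {
	// A read spanning a chunk boundary needs two lookups and two reads.
	fmt.Println(len(Read("/logs/web.0", 63<<20, 2<<20)))
}
```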

Implementation

Chunk Size

The chunk size is 64 MB, far larger than the block size of common file systems, and chunk space is allocated lazily, only when it is actually needed. A large chunk size has both advantages and disadvantages

Advantages:

  1. Fewer interactions with the master, since reads and writes within the same chunk need only one initial metadata request
  2. A client is more likely to perform many operations on a single chunk, so it can keep a persistent TCP connection to the chunkserver, reducing network overhead
  3. Less metadata on the master, which makes it feasible to keep all metadata in memory

Disadvantages:

  1. A small file consisting of a single chunk can become a hot spot, although files in this system usually span many chunks

Metadata

The master maintains three major types of metadata in memory: the file and chunk namespaces (together with access control information), the mapping from files to chunks, and the current locations of each chunk's replicas

The namespaces and the file-to-chunk mapping are made persistent by recording mutations in an operation log stored on the master's local disk and replicated on remote machines

Chunk locations are not persisted; instead, the master asks each chunkserver for the chunks it holds at startup and whenever a new chunkserver joins the cluster
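A rough sketch of these three kinds of metadata as in-memory maps; the field names are illustrative, not the actual GFS implementation:

```go
// Illustrative in-memory layout of the master's metadata (field names are
// made up): namespace with access control, file -> chunks, chunk -> replicas.
package main

import "fmt"

type FileMeta struct {
	ACL    string   // access control information, simplified to a string
	Chunks []uint64 // ordered chunk handles that make up the file
}

type ChunkMeta struct {
	Version   uint64   // bumped by the master each time it grants a new lease
	Locations []string // replica locations; not persisted, rebuilt by asking
	                   // chunkservers at master startup or when one joins
}

type Master struct {
	namespace map[string]*FileMeta  // full pathname -> file metadata (logged)
	chunks    map[uint64]*ChunkMeta // chunk handle -> chunk metadata
}

func main() {
	m := Master{
		namespace: map[string]*FileMeta{"/logs/web.0": {ACL: "rw", Chunks: []uint64{42}}},
		chunks:    map[uint64]*ChunkMeta{42: {Version: 7, Locations: []string{"cs1", "cs2", "cs3"}}},
	}
	fmt.Println(m.namespace["/logs/web.0"].Chunks, m.chunks[42].Locations)
}
```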

Memory limit

Keeping all metadata strictly in memory makes master operations fast, but the master's memory then bounds the total amount of data the GFS cluster can store

In practice the metadata is stored very compactly: each 64 MB chunk needs less than 64 bytes of metadata, so supporting a larger cluster simply by adding memory to the master is cheap
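A back-of-the-envelope check of that claim, assuming the 64-bytes-per-chunk figure:

```go
package main

import "fmt"

func main() {
	const chunkSize = 64 << 20       // 64 MB per chunk
	const metaPerChunk = 64          // < 64 bytes of master metadata per chunk
	const totalData = int64(1) << 50 // suppose the cluster stores 1 PB

	chunks := totalData / chunkSize
	metaBytes := chunks * metaPerChunk
	// 1 PB / 64 MB = 16,777,216 chunks -> about 1 GB of chunk metadata
	fmt.Printf("%d chunks, ~%d MB of chunk metadata\n", chunks, metaBytes>>20)
}
```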

Chunk Locations

The master polls chunkservers for their chunk locations at startup and whenever a new chunkserver joins, and keeps this information in memory

Because the master controls all chunk placement and monitors chunkservers through regular heartbeat messages, it can keep the chunk location information up to date

Operation Log

The operation log records every metadata mutation and defines a logical timeline that orders concurrent operations. Files, chunks, and their versions are all identified by the logical time at which they were created

The operation log must itself be stored reliably, and a metadata change must not become visible to clients before it is persisted; otherwise a master crash could lose files or recent operations

Therefore the operation log is replicated on multiple machines, and the master responds to a client only after the log record has been written both locally and remotely

When the master goes down, it recovers by replaying the operation log. To keep restart fast, the master checkpoints its state whenever the log grows beyond a certain size, so recovery only needs to load the latest checkpoint and replay the log records written after it

If creating a checkpoint fails, recovery simply falls back to the previous checkpoint; this lengthens recovery but does not compromise reliability
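A hedged sketch of this checkpoint-plus-log recovery idea; the record format and function names are invented for illustration:

```go
// Sketch of master recovery: load the newest complete checkpoint, then
// replay only the operation-log records written after it. Log replication
// to remote machines and the actual record formats are omitted.
package main

import "fmt"

type LogRecord struct {
	Seq uint64 // position on the logical timeline
	Op  string // e.g. "create /foo", "delete /bar"
}

type MasterState struct {
	AppliedSeq uint64
	Files      map[string]bool
}

func (s *MasterState) apply(r LogRecord) {
	// mutations are applied strictly in log order
	s.AppliedSeq = r.Seq
	// ... update namespace, file->chunk mapping, etc.
}

// recoverMaster rebuilds master state from a checkpoint plus the log suffix.
func recoverMaster(checkpoint MasterState, log []LogRecord) MasterState {
	state := checkpoint
	for _, r := range log {
		if r.Seq <= state.AppliedSeq {
			continue // already reflected in the checkpoint
		}
		state.apply(r)
	}
	return state
}

func main() {
	cp := MasterState{AppliedSeq: 100, Files: map[string]bool{}}
	tail := []LogRecord{{Seq: 101, Op: "create /a"}, {Seq: 102, Op: "create /b"}}
	fmt.Println(recoverMaster(cp, tail).AppliedSeq) // 102
}
```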

Consistency model

The master guarantees that namespace mutations are atomic, and the operation log makes them durable and fixes their global order

[Table: file region state after mutation]

Consistent: all clients see the same data, no matter which replica they read from

Defined: consistent, and clients see everything that the mutation wrote

Record append guarantees that the data is appended atomically at least once. GFS may insert padding or duplicate records between defined regions, leaving those regions inconsistent; however, this filler data is managed by GFS and is not exposed to clients

Concurrent mutations are applied in the order established on the operation log's logical timeline, and stale replicas are detected via the chunk version number. A stale replica no longer participates in mutations, the master stops returning its location to clients, and it is reclaimed during the next garbage collection

Because the client caches metadata, it may still read from a replica that holds stale data. The window is bounded by the cache entry's timeout and by the next open of the file, both of which force the client to re-request metadata from the master. In addition, since most operations are appends, a stale replica usually returns a premature end of file, which also causes the client to refresh its metadata from the master

To handle data corruption caused by component failures, GFS detects it with checksums and restores the data from other replicas as soon as possible. Only when all replicas are lost before repair is the data truly gone, and even then clients receive a clear error rather than silently corrupted data
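The paper describes chunkservers checksumming their data in 64 KB blocks; here is a simplified sketch of verifying one block on read (the block size follows the paper, the code and types are illustrative):

```go
// Sketch of chunkserver-side corruption detection: each 64 KB block of a
// chunk carries a checksum that is verified before data is returned.
package main

import (
	"errors"
	"fmt"
	"hash/crc32"
)

const blockSize = 64 << 10 // 64 KB checksum granularity, as in the paper

type Block struct {
	Data []byte
	Sum  uint32
}

// readBlock returns the block data only if its checksum still matches;
// otherwise the caller would report the error to the master and the data
// would be restored from another replica.
func readBlock(b Block) ([]byte, error) {
	if crc32.ChecksumIEEE(b.Data) != b.Sum {
		return nil, errors.New("checksum mismatch: replica is corrupt")
	}
	return b.Data, nil
}

func main() {
	good := Block{Data: []byte("some chunk bytes")}
	good.Sum = crc32.ChecksumIEEE(good.Data)
	if _, err := readBlock(good); err == nil {
		fmt.Println("block verified")
	}

	bad := good
	bad.Data = []byte("some chunk bytes!") // simulated corruption
	if _, err := readBlock(bad); err != nil {
		fmt.Println(err)
	}
}
```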

File content update

Lease

To keep replicas consistent, mutations must be applied in the same order on every replica. To reduce load on the master, the master grants a lease with a limited expiration time to one replica of the chunk, making it the primary. The primary manages all mutations to that chunk, and client mutation requests are delivered to it directly

[Figure: write control and data flow]

In addition, a write that spans chunk boundaries is split into several operations, which may be interleaved with and overwritten by concurrent writes from other clients; the affected region remains consistent, even though it may end up containing fragments from different writes
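A sketch of the client-side control flow under the lease mechanism; the step order follows the write flow described above, but all the functions are stand-ins:

```go
// Sketch of a mutation under the lease mechanism: the client learns which
// replica holds the lease (the primary), pushes data to all replicas, then
// asks the primary to commit; the primary picks the serial order and forwards
// the write to the secondaries. All RPCs are stubs.
package main

import "fmt"

type ChunkLayout struct {
	Primary     string   // replica currently holding the lease
	Secondaries []string // remaining replicas
}

func write(handle uint64, data []byte) error {
	layout := askMasterForLease(handle) // master grants or extends the lease

	// 1. Push the data to every replica (order does not matter here).
	for _, cs := range append([]string{layout.Primary}, layout.Secondaries...) {
		pushData(cs, handle, data)
	}

	// 2. Ask the primary to apply the write; it assigns a serial number and
	//    forwards the request to all secondaries in that same order.
	if err := primaryCommit(layout, handle); err != nil {
		return fmt.Errorf("write to chunk %d failed: %w", handle, err)
	}
	return nil
}

// Stubs standing in for the real RPCs.
func askMasterForLease(handle uint64) ChunkLayout {
	return ChunkLayout{Primary: "cs1", Secondaries: []string{"cs2", "cs3"}}
}
func pushData(cs string, handle uint64, data []byte)    {}
func primaryCommit(l ChunkLayout, handle uint64) error  { return nil }

func main() { fmt.Println(write(42, []byte("payload"))) }
```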

Atomic Record Append

The flow is similar to a normal write, but to guarantee atomicity the primary first checks whether the record fits in the chunk's remaining space. If not, the primary and the secondaries pad the chunk to its full size and tell the client to retry, so the append lands in the new chunk created by the next request
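A sketch of that primary-side decision; the names and the error handling are illustrative:

```go
// Sketch of the primary's record-append check: if the record does not fit in
// the chunk's remaining space, pad the chunk and make the client retry on a
// fresh chunk, so each record lands entirely inside one chunk.
package main

import (
	"errors"
	"fmt"
)

const chunkSize = 64 << 20

var errRetryNewChunk = errors.New("chunk padded to capacity, retry append on next chunk")

// recordAppend returns the offset at which the record was placed.
func recordAppend(used int64, record []byte) (int64, error) {
	if used+int64(len(record)) > chunkSize {
		// Primary (and secondaries) pad the remainder; the client must retry.
		return 0, errRetryNewChunk
	}
	return used, nil // record appended at the current end of the chunk
}

func main() {
	off, err := recordAppend((64<<20)-100, make([]byte, 4096))
	fmt.Println(off, err) // does not fit: retry error
	off, err = recordAppend(1<<20, make([]byte, 4096))
	fmt.Println(off, err) // fits at offset 1 MiB
}
```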

Master operation

Namespace Locking

For the namespace, the master does not use a per-directory data structure or a trie-like prefix tree; instead it represents the namespace as a lookup table from full pathnames to metadata, using prefix compression to keep the table compact in memory. Each node in the namespace tree also has an associated read-write lock

To mutate the namespace, the master first acquires read locks on every directory from the root down to the parent of the target path, and then a write lock on the target path itself. To avoid deadlock, locks are always acquired in a consistent order that follows the namespace hierarchy from the top down
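A minimal sketch of that locking order using per-path read-write locks; the flat pathname table follows the paper, while the code and helper names are made up:

```go
// Sketch of namespace locking: to mutate /d1/d2/leaf, take read locks on
// /d1 and /d1/d2, then a write lock on /d1/d2/leaf. Acquiring locks in
// path order (root first) avoids deadlock between concurrent operations.
package main

import (
	"fmt"
	"strings"
	"sync"
)

var locks sync.Map // full pathname -> *sync.RWMutex

func lockFor(path string) *sync.RWMutex {
	l, _ := locks.LoadOrStore(path, &sync.RWMutex{})
	return l.(*sync.RWMutex)
}

// lockForMutation locks all ancestors for reading and the target for writing,
// and returns a function that releases everything in reverse order.
func lockForMutation(target string) (unlock func()) {
	parts := strings.Split(strings.Trim(target, "/"), "/")
	var undo []func()
	prefix := ""
	for i, p := range parts {
		prefix = prefix + "/" + p
		m := lockFor(prefix)
		if i == len(parts)-1 {
			m.Lock() // write lock on the target path
			undo = append(undo, m.Unlock)
		} else {
			m.RLock() // read locks on every ancestor directory
			undo = append(undo, m.RUnlock)
		}
	}
	return func() {
		for i := len(undo) - 1; i >= 0; i-- {
			undo[i]()
		}
	}
}

func main() {
	unlock := lockForMutation("/home/user/newfile")
	fmt.Println("holding /home(R), /home/user(R), /home/user/newfile(W)")
	unlock()
}
```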

Chunk Creation

Creation

When the master creates a chunk, it chooses where to place the replicas according to the following principles (a rough placement sketch follows the list)

  1. Prefer chunkservers with below-average disk utilization
  2. Limit the number of recent creations on each chunkserver, since a newly created chunk implies imminent heavy write traffic
  3. Spread replicas across racks
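A sketch of how such a placement policy might score candidate chunkservers; the weights and field names are invented, not from the paper:

```go
// Illustrative placement scoring: prefer chunkservers with low disk
// utilization and few recent creations, and pick servers from distinct racks.
package main

import (
	"fmt"
	"sort"
)

type Chunkserver struct {
	Addr            string
	Rack            string
	DiskUtilization float64 // fraction of disk in use, 0..1
	RecentCreations int     // creations in the recent window
}

// pickReplicas chooses up to n servers, one per rack where possible.
func pickReplicas(servers []Chunkserver, n int) []Chunkserver {
	sort.Slice(servers, func(i, j int) bool {
		// lower utilization and fewer recent creations rank earlier;
		// the relative weight of the two factors is a made-up choice
		si := servers[i].DiskUtilization + 0.1*float64(servers[i].RecentCreations)
		sj := servers[j].DiskUtilization + 0.1*float64(servers[j].RecentCreations)
		return si < sj
	})
	var picked []Chunkserver
	usedRacks := map[string]bool{}
	for _, s := range servers {
		if len(picked) == n {
			break
		}
		if usedRacks[s.Rack] {
			continue // spread replicas across racks
		}
		usedRacks[s.Rack] = true
		picked = append(picked, s)
	}
	return picked
}

func main() {
	servers := []Chunkserver{
		{"cs1", "rackA", 0.80, 1}, {"cs2", "rackA", 0.30, 0},
		{"cs3", "rackB", 0.40, 2}, {"cs4", "rackC", 0.35, 0},
	}
	for _, s := range pickReplicas(servers, 3) {
		fmt.Println(s.Addr, s.Rack)
	}
}
```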

Re-replication

When the number of available replicas of a chunk falls below its target, the master re-replicates it. When several chunks need re-replication, they are prioritized by the following factors (see the sketch after this list)

  1. How far the chunk is below its replication target
  2. Chunks of live files are preferred over chunks belonging to recently deleted files
  3. Chunks that are blocking client progress take precedence, to minimize the impact of failures on running applications
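A sketch of turning those factors into a priority for the re-replication queue; the scoring weights are invented:

```go
// Illustrative priority for chunks waiting to be re-replicated: farther from
// the replication target, belonging to a live file, and blocking a client
// all raise the priority. The weights are arbitrary.
package main

import (
	"fmt"
	"sort"
)

type ChunkToRepair struct {
	Handle       uint64
	Target, Live int  // desired vs. currently available replicas
	FileDeleted  bool // chunk belongs to a recently deleted file
	BlocksClient bool // a client is currently blocked on this chunk
}

func priority(c ChunkToRepair) int {
	p := 10 * (c.Target - c.Live) // how far below the target we are
	if !c.FileDeleted {
		p += 5 // live files before recently deleted ones
	}
	if c.BlocksClient {
		p += 20 // unblock running applications first
	}
	return p
}

func main() {
	queue := []ChunkToRepair{
		{Handle: 1, Target: 3, Live: 2},
		{Handle: 2, Target: 3, Live: 1, FileDeleted: true},
		{Handle: 3, Target: 3, Live: 2, BlocksClient: true},
	}
	sort.Slice(queue, func(i, j int) bool { return priority(queue[i]) > priority(queue[j]) })
	for _, c := range queue {
		fmt.Printf("chunk %d priority %d\n", c.Handle, priority(c))
	}
}
```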

When placing the new replica, the master uses criteria similar to those for chunk creation

  1. Equalize disk utilization
  2. Limit the number of clone operations in progress on any single chunkserver
  3. Spread replicas across racks

Rebalancing

The master periodically rebalances replicas: it examines the current replica distribution and moves replicas to achieve better disk space utilization and load balance

When a replica must be removed, the master prefers one on a chunkserver with below-average free space, thereby equalizing disk utilization across the system

Garbage Collection

GFS does not reclaim physical space immediately when a file is deleted; space is reclaimed lazily during regular garbage collection at both the file and chunk levels

When an application deletes a file, the master logs the deletion in the operation log and renames the file to a hidden name that includes the deletion timestamp. During its regular namespace scan, the master removes the metadata of hidden files older than a configurable number of days; until then, the deletion can be undone simply by renaming the file back

Similarly, during its regular scans the master erases the metadata of orphaned chunks that are no longer referenced by any file, and in heartbeat exchanges it tells each chunkserver which of its chunks no longer exist; the chunkserver is then free to delete them on its own
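A sketch of the rename-to-hidden scheme at the file level; the hidden-name format and retention period here are illustrative:

```go
// Sketch of lazy file deletion: "delete" renames the file to a hidden name
// carrying the deletion time; a periodic namespace scan drops hidden entries
// older than the retention window, after which their chunks become orphaned
// and are reclaimed via heartbeats.
package main

import (
	"fmt"
	"strings"
	"time"
)

const retention = 3 * 24 * time.Hour // retention window (illustrative)

type Namespace map[string]time.Time // path -> deletion time (zero for live files)

func (ns Namespace) Delete(path string, now time.Time) {
	hidden := fmt.Sprintf("%s.deleted.%d", path, now.Unix())
	delete(ns, path)
	ns[hidden] = now // still recoverable by renaming back, until the scan
}

func (ns Namespace) Scan(now time.Time) {
	for name, deletedAt := range ns {
		if strings.Contains(name, ".deleted.") && now.Sub(deletedAt) > retention {
			delete(ns, name) // metadata gone; chunks are now unreferenced
		}
	}
}

func main() {
	ns := Namespace{"/tmp/job.out": {}}
	now := time.Now()
	ns.Delete("/tmp/job.out", now)
	ns.Scan(now.Add(4 * 24 * time.Hour)) // well past retention: entry purged
	fmt.Println(len(ns))                 // 0
}
```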

Garbage collection has the following advantages over direct deletion:

  1. It is simple and reliable even when component failures leave orphaned or half-created chunks behind
  2. Reclamation is folded into the master's regular background scans, so the cost is batched and amortized
  3. The delay provides a safety net against accidental, irreversible deletion

The main drawback of delayed garbage collection is that it hinders users' efforts to fine-tune storage usage when space is tight: applications that repeatedly create and delete temporary files cannot reuse the space right away. GFS addresses this by expediting reclamation when an already-deleted file is explicitly deleted again, and by allowing different replication and reclamation policies for different parts of the namespace

Stale Replica Detection

Whenever the master grants a lease on a chunk, the chunk version number is increased and distributed along with it. Clients and chunkservers verify this version number during their operations, ensuring that the data they access comes from an up-to-date replica
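A sketch of the version-number check, mirroring the description above with invented names:

```go
// Sketch of stale replica detection: the master bumps the chunk version when
// it grants a new lease, and a replica reporting an older version is marked
// stale and excluded until garbage collection removes it.
package main

import "fmt"

type ReplicaReport struct {
	Chunkserver string
	Handle      uint64
	Version     uint64
}

// masterVersion is the authoritative version per chunk, as recorded when the
// most recent lease was granted.
var masterVersion = map[uint64]uint64{42: 7}

func isStale(r ReplicaReport) bool {
	return r.Version < masterVersion[r.Handle]
}

func main() {
	fresh := ReplicaReport{"cs1", 42, 7}
	stale := ReplicaReport{"cs2", 42, 6} // missed a mutation while its server was down
	fmt.Println(isStale(fresh), isStale(stale)) // false true
}
```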

Master fault tolerance

If the master process fails, it can be restarted quickly and restore its state from the latest checkpoint plus the operation log

If the master's machine itself goes down, a new master process can be started on another machine, because the operation log is replicated on multiple machines

In addition, shadow masters provide read-only access to the file system while the primary master is down. During normal operation they keep their metadata current by following the master's operation log and communicating with the chunkservers. A shadow master's metadata may lag slightly behind, but file content is read from the chunkservers and therefore does not go stale