Original author: Sijie Guo
Translation: StreamNative, Sijia
Apache BookKeeper is a scalable, fault-tolerant, low-latency log storage service optimized for real-time workloads. Initially developed by Yahoo! Research, BookKeeper was incubated as a subproject of Apache ZooKeeper in 2011 and became an Apache top-level project in January 2015. Since then, companies such as Twitter, Yahoo!, and Salesforce have used BookKeeper widely to store and serve important data in a variety of use cases. In this article, I'll show how BookKeeper ensures durability, consistency, and low latency. I'll also focus on the guarantees and key features of BookKeeper, all of which are open source.
In the previous article, I gave a technical overview of Apache BookKeeper and introduced some related concepts and terms. A BookKeeper cluster includes:
- Bookies: a set of independent storage servers
- Metadata Store System: for service discovery and metadata management
BookKeeper clients can use either the higher-level DistributedLog API (also known as the log stream API) or the lower-level ledger API. The ledger API allows users to interact directly with bookies. The following figure shows a typical BookKeeper installation.
Figure 1: A typical BookKeeper installation (applications connect through multiple APIs)
Streaming storage requirements
As mentioned in the introduction to Apache BookKeeper, a real-time storage platform should meet all of the following requirements at the same time:
- Clients can read and write streams of entries with very low latency (less than 5 ms), even under strong durability guarantees
- Data is stored durably, consistently, and with fault tolerance
- Clients can stream or tail data as it is being written
- Data is stored efficiently, with access to both historical and real-time data
BookKeeper meets all of these requirements at the same time by providing the following guarantees:
|Replication||Data is replicated and stored durably on multiple machines, or across multiple data centers, for fault tolerance.|
|Durability||Data is stored durably once it has been replicated successfully. An fsync is enforced before an acknowledgement is sent to the client.|
|Consistency||A simple, repeatable-read consistency model guarantees consistency among different readers.|
|Availability||Ensemble changes and speculative reads improve read and write availability while preserving consistency and durability.|
|Low latency||I/O isolation protects read and write latency while preserving consistency and durability.|
BookKeeper replicates each data record and stores multiple copies (usually three or five) on multiple machines, either within one data center or across several. Some distributed systems (for example, Apache HDFS, Ceph, and Kafka) use master/slave or pipeline replication algorithms to replicate data between replicas. BookKeeper differs in that it uses a quorum-vote parallel replication algorithm to replicate data, which ensures predictable low latency. Figure 2 illustrates replication within a BookKeeper ensemble.
In Figure 2:
- A set of bookies (bookies 1-5 in the figure) is selected automatically from the BookKeeper cluster. This set is the ensemble that stores the data records of a given ledger.
- The data in a ledger is striped across the bookies of the ensemble. In other words, each record has multiple copies. Users can configure the number of copies at the client level; this number is the write quorum size. In the figure, the write quorum size is 3, so the record is written to bookie 2, bookie 3, and bookie 4.
- When a client writes a data record to the ensemble, it waits until a specified number of replicas have sent an acknowledgement (ack). This number is the ack quorum size. Once the client has received that many acks, it considers the write successful. In the figure, the ack quorum size is 2, so the record is acknowledged to the client once, for example, bookie 3 and bookie 4 have stored it.
- When a bookie fails, the composition of the ensemble changes. Healthy bookies replace the failed ones, possibly only temporarily. For example, if bookie 5 fails, bookie x may replace it.
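The relationship between the ensemble and the write quorum can be sketched in a few lines of Python. This is only an illustration under an assumption: it places copies round-robin by entry id, which is one possible striping scheme, and the function name `write_set` and the bookie names are hypothetical rather than part of any BookKeeper API.

```python
def write_set(entry_id, ensemble, write_quorum):
    """Return the bookies that hold copies of one entry.

    Assumption: copies are placed round-robin, starting at the bookie
    whose index equals entry_id modulo the ensemble size.
    """
    size = len(ensemble)
    if not 1 <= write_quorum <= size:
        raise ValueError("write quorum must be between 1 and the ensemble size")
    return [ensemble[(entry_id + i) % size] for i in range(write_quorum)]


ensemble = ["bookie1", "bookie2", "bookie3", "bookie4", "bookie5"]
for entry_id in range(5):
    # Each entry lands on 3 of the 5 bookies; consecutive entries shift by one.
    print(entry_id, write_set(entry_id, ensemble, write_quorum=3))
```

With an ensemble of 5 and a write quorum of 3, entry 1 lands on bookie 2, bookie 3, and bookie 4, which matches the placement described for the figure. Because consecutive entries start on different bookies, the write load of one ledger is spread over the whole ensemble.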
Replication: the core ideas
BookKeeper replication is built on the following core ideas:
- Log streams are record-oriented rather than byte-oriented. This means that data is always stored as indivisible records (including metadata) rather than as a single array of bytes.
- The order of records in the log (stream) is decoupled from the order in which the copies of those records are actually stored.
These two core ideas allow BookKeeper replication to do the following:
- Offer many choices of bookies to write records to, so that writes can still complete even when several bookies in the cluster fail or run slowly (as long as enough capacity remains to handle the load). This is achieved through ensemble changes.
- Maximize the bandwidth of a single log (stream) by increasing the ensemble size, so that a single log is not limited by one machine or a small group of machines. This is achieved by configuring an ensemble size larger than the write quorum size.
- Improve append latency by tuning the ack quorum size. This is important for keeping BookKeeper latency low while still providing consistency and durability.
- Provide fast re-replication through many-to-many replica recovery. Re-replication creates more copies of records that are under-replicated, that is, records with fewer copies than the write quorum size. Every bookie can act as both a provider and a recipient of record copies.
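To see why the ack quorum size affects append latency, consider a writer that sends a record to all bookies in the write quorum in parallel and waits only for the first few responses. The sketch below is a simplified model, not BookKeeper code; the function name and the latency figures are made up for illustration.

```python
def append_latency_ms(bookie_latencies_ms, ack_quorum):
    """Latency seen by a writer that writes to all bookies in parallel
    and returns as soon as ack_quorum of them have responded."""
    if not 1 <= ack_quorum <= len(bookie_latencies_ms):
        raise ValueError("ack quorum must be between 1 and the write quorum size")
    # The writer is released by the ack_quorum-th fastest response.
    return sorted(bookie_latencies_ms)[ack_quorum - 1]


# Hypothetical response times: the third bookie is stalled (slow disk, GC pause).
latencies = [2.0, 3.0, 250.0]
print(append_latency_ms(latencies, ack_quorum=2))  # 3.0
print(append_latency_ms(latencies, ack_quorum=3))  # 250.0
```

In this model, an ack quorum of 2 out of 3 hides one slow bookie entirely, while an ack quorum equal to the write quorum makes every append as slow as the slowest replica.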
Every data record written to BookKeeper is guaranteed to be replicated and persisted to the specified number of bookies. This is achieved with disk fsync and write acknowledgements.
- On a single bookie, a data record is explicitly written (with fsync) to disk before an acknowledgement is sent to the client, so that the data survives a failure. This guarantees that the data is on durable storage that does not depend on power and can be read back.
- Within a single cluster, data records are replicated to multiple bookies for fault tolerance.
- A data record is acknowledged only when the client has received responses from the specified number of bookies (the ack quorum size).
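The "fsync before ack" rule on a single storage node can be illustrated with plain file I/O. The following is a minimal sketch, assuming a simple length-prefixed journal file; the record layout and the helper names `append_durably` and `read_records` are hypothetical and do not reflect BookKeeper's actual journal format.

```python
import os
import tempfile


def append_durably(journal_path, record):
    """Append one length-prefixed record and fsync before reporting success."""
    with open(journal_path, "ab") as journal:
        journal.write(len(record).to_bytes(4, "big"))
        journal.write(record)
        journal.flush()             # move data from Python's buffer to the OS
        os.fsync(journal.fileno())  # ask the OS to flush its cache to the device
    return True  # only at this point may the node send its acknowledgement


def read_records(journal_path):
    """Read back every complete record from the journal."""
    records = []
    with open(journal_path, "rb") as journal:
        while True:
            header = journal.read(4)
            if len(header) < 4:
                break
            records.append(journal.read(int.from_bytes(header, "big")))
    return records


with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "journal.log")
    if append_durably(path, b"record-0"):
        print("ack sent for record-0")
    print(read_records(path))
```

The important point is the ordering: the acknowledgement is produced only after `os.fsync` returns, so an acknowledged record is not sitting only in volatile memory.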
Many recent NoSQL-type databases, distributed file systems, and messaging systems (for example, Apache Kafka) assume that replicating data into the memory of multiple nodes is the most effective way to achieve durability. The problem is that these systems allow for potential data loss. BookKeeper is designed to provide stronger durability guarantees and to prevent data loss, so that it can meet strict enterprise requirements.
Guaranteeing consistency is a common problem in distributed systems, especially when replication is introduced for durability and high availability. BookKeeper provides a simple but powerful consistency guarantee (repeatable-read consistency) for the data stored in its logs:
- If a record has been acknowledged to the application, it must be readable immediately.
- If a record has been read once, it must always be readable.
- If a record R has been written successfully, all records before R have been successfully committed/persisted and will always be readable.
- The order of stored records must be identical and repeatable across different readers.
This repeatable-read consistency is implemented by the LastAddConfirmed (LAC) protocol in BookKeeper.
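The core idea behind a "last add confirmed" pointer can be shown with a small model. This is a deliberately simplified sketch and not the actual protocol: it assumes entry ids start at 0, that acknowledgements may arrive out of order, and that readers only see entries at or below the pointer. The class and function names are hypothetical.

```python
class LedgerWriter:
    """Tracks the highest entry id such that it and all earlier entries are acked."""

    def __init__(self):
        self.last_add_confirmed = -1  # nothing confirmed yet
        self._acked = set()           # acked entries still waiting for a predecessor

    def on_ack_quorum_reached(self, entry_id):
        self._acked.add(entry_id)
        # Advance only over a gap-free prefix of acknowledged entries.
        while self.last_add_confirmed + 1 in self._acked:
            self.last_add_confirmed += 1
            self._acked.remove(self.last_add_confirmed)
        return self.last_add_confirmed


def readable_entries(entry_ids, last_add_confirmed):
    """Readers may only see entries at or below the confirmed pointer."""
    return [e for e in entry_ids if e <= last_add_confirmed]


writer = LedgerWriter()
for acked in [1, 0, 3, 2]:  # acks arriving out of order
    lac = writer.on_ack_quorum_reached(acked)
    print(f"ack for entry {acked}: LAC = {lac}")
```

Because the pointer never moves past a gap, a reader that sees entry R can rely on every entry before R being confirmed as well, and every reader observes the same order.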
In terms of the CAP theorem (consistency, availability, partition tolerance), BookKeeper is a CP system. In practice, however, Apache BookKeeper still provides high availability in the presence of hardware, network, or other failures. To guarantee high availability for both writes and reads, BookKeeper uses the following mechanisms:
|Availability type||Mechanism||Description|
|Write availability||Ensemble change||When a bookie that is being written to fails, the client reselects where the data is placed. This ensures that writes can always proceed as long as enough bookies remain in the cluster.|
|Read availability||Speculative reads||Some systems, such as Apache Kafka, only read data from the storage node designated as the leader. BookKeeper differs in that the client can read a record from any bookie that holds a copy of it. This helps spread read traffic across the bookies and also reduces tail latency on reads.|
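An ensemble change can be pictured as swapping a failed bookie for a spare one from the cluster. The sketch below is a toy model under stated assumptions: the function name `change_ensemble` is hypothetical, the spare is simply the first cluster member that is not already in the ensemble, and real placement policies are more involved.

```python
def change_ensemble(ensemble, failed, cluster):
    """Return a new ensemble in which the failed bookie is replaced by a spare."""
    if failed not in ensemble:
        raise ValueError(f"{failed} is not part of the ensemble")
    spares = [b for b in cluster if b not in ensemble and b != failed]
    if not spares:
        raise RuntimeError("not enough bookies left in the cluster")
    # Keep the position of every healthy bookie; only the failed slot changes.
    return [spares[0] if b == failed else b for b in ensemble]


cluster = ["bookie1", "bookie2", "bookie3", "bookie4", "bookie5", "bookieX"]
ensemble = cluster[:5]
print(change_ensemble(ensemble, "bookie5", cluster))
```

Writes can continue as long as a spare exists; the model raises an error only when the cluster has no bookie left to take over the failed slot.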
Strong durability and consistency are hard problems in distributed systems, especially when the system must also deliver enterprise-grade low latency. BookKeeper meets these requirements in the following ways:
- On a single bookie, the bookie server is designed for I/O isolation between different workloads (writes, tailing reads, and catch-up/random reads). A group-commit mechanism is used on the journal to balance latency and throughput.
- A quorum-vote parallel replication scheme mitigates latency penalties caused by network failures, JVM garbage-collection pauses, and slow disks. This improves tail latency and ensures predictable low p99 latency.
- A long-poll mechanism notifies tailing readers and dispatches records to them as soon as new records are acknowledged and confirmed.
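Group commit on a journal trades a little latency for throughput by sharing one fsync among several records. The following is a minimal sketch of that idea, assuming a size-based batch trigger only (a real implementation would typically also flush on a timer); the class `GroupCommitJournal` and its parameters are hypothetical.

```python
import os
import tempfile


class GroupCommitJournal:
    """Buffers records and makes a whole batch durable with a single fsync."""

    def __init__(self, path, max_batch=3):
        self._file = open(path, "ab")
        self._max_batch = max_batch
        self._pending = []      # callbacks waiting for the next fsync
        self.fsync_count = 0

    def add(self, record, on_durable):
        self._file.write(record)
        self._pending.append(on_durable)
        if len(self._pending) >= self._max_batch:
            self.flush()

    def flush(self):
        if not self._pending:
            return
        self._file.flush()
        os.fsync(self._file.fileno())  # one fsync covers the whole batch
        self.fsync_count += 1
        callbacks, self._pending = self._pending, []
        for callback in callbacks:
            callback()                 # acknowledge only after the fsync

    def close(self):
        self.flush()
        self._file.close()


with tempfile.TemporaryDirectory() as tmp:
    acked = []
    journal = GroupCommitJournal(os.path.join(tmp, "journal.log"), max_batch=3)
    for i in range(7):
        journal.add(b"record\n", lambda i=i: acked.append(i))
    journal.close()
    print(f"{len(acked)} records acknowledged with {journal.fsync_count} fsyncs")
```

In this run, seven records need only three fsync calls, and no record is acknowledged before the fsync that covers it has completed.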
Finally, it is worth mentioning that the durability provided by explicit fsync and write acknowledgements, together with repeatable-read consistency, is very important for stateful processing, especially for effectively-once processing in streaming applications.
This article explained how BookKeeper ensures durability, consistency, high availability, and low latency. I hope it gives you solid reasons to choose BookKeeper as the storage platform for your real-time workloads. In future articles, I'll take a closer look at how BookKeeper replicates data and at the mechanisms that let it guarantee consistency and durability while keeping latency low.
If you are interested in BookKeeper or DistributedLog, you can join our community through the BookKeeper mailing list or the BookKeeper Slack channel. You can also download the latest version of BookKeeper (version 4.5.0).