DDIA study notes


Chapter I reliability, scalability and maintainability

Reliability: the system continues to work correctly (performing the correct function at the expected level of performance) even in the face of adversity (hardware faults, software faults, human error).

Reliability means that the system keeps working correctly even when faults occur. Faults can be in hardware (usually random and uncorrelated), software (usually systematic bugs that are hard to deal with), and humans (who inevitably make mistakes from time to time). Fault-tolerance techniques can hide certain types of faults from the end user.

Scalability: there are reasonable ways of dealing with the system's growth (in data volume, traffic, or complexity).

Scalability means having strategies for keeping performance good even when load increases. To discuss scalability, we first need ways of describing load and performance quantitatively. We briefly looked at Twitter's home timeline as an example of describing load, and at response time percentiles as a way of measuring performance. In a scalable system, you can add processing capacity in order to remain reliable under high load.
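
As a rough illustration of measuring performance with response time percentiles, the sketch below computes the median (p50) and two tail percentiles with the nearest-rank method. The sample values and the `percentile` helper are made up for this example and are not from the book.

```python
import math

def percentile(samples, p):
    """Nearest-rank percentile: the smallest sample such that at least
    p percent of all samples are less than or equal to it."""
    ordered = sorted(samples)
    rank = max(1, math.ceil(p * len(ordered) / 100))
    return ordered[rank - 1]

# Hypothetical response times in milliseconds; two requests are very slow.
response_times_ms = [12, 15, 11, 14, 250, 13, 16, 12, 900, 14]

mean = sum(response_times_ms) / len(response_times_ms)
p50 = percentile(response_times_ms, 50)
p95 = percentile(response_times_ms, 95)
p99 = percentile(response_times_ms, 99)
```

With these numbers the mean is dragged up by the two outliers, while the median still reflects what a typical request experiences and the high percentiles expose the slow tail.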

Maintainability: many different people (engineering and operations) can work on the system productively over its lifetime (both maintaining its current behavior and adapting it to new use cases).

Maintainability has many facets, but in essence it is about making life better for the engineering and operations teams who work with the system. Good abstractions help reduce complexity and make the system easier to modify and adapt to new use cases. Good operability means having good visibility into the system's health and effective ways of managing it.

Chapter II data models and query languages

Document databases target use cases where data comes in self-contained documents and relationships between one document and another are rare.

Graph databases go in the opposite direction, targeting use cases where anything can be related to anything.

Chapter III storage and retrieval

At a high level, storage engines fall into two broad categories: those optimized for transaction processing (OLTP) and those optimized for analytics (OLAP). There are big differences between the access patterns of these use cases:

  • OLTP systems are typically user-facing, which means they may receive a huge volume of requests. To handle the load, applications usually touch only a small number of records in each query. The application requests records using some kind of key, and the storage engine uses an index to find the data for the requested key. Disk seek time is often the bottleneck here.

  • Data warehouses and similar analytic systems are less well known, because they are primarily used by business analysts rather than end users. They handle a much lower volume of queries than OLTP systems, but each query is typically very demanding, requiring many millions of records to be scanned in a short time. Disk bandwidth (not seek time) is often the bottleneck here, and column-oriented storage is an increasingly popular solution for this kind of workload.

Chapter IV encoding and evolution

JSON / XML text encoding -> evolution toward binary encoding

Binary encoding formats: Apache Thrift / Protocol Buffers (protobuf) /

Dataflow through services: REST and RPC

REST is not a protocol, but rather a design philosophy that builds upon the principles of HTTP. It emphasizes simple data formats, using URLs for identifying resources, and using HTTP features for cache control, authentication, and content type negotiation. REST has been gaining popularity compared to SOAP, at least in the context of cross-organizational service integration, and is often associated with microservices. An API designed according to the principles of REST is called RESTful, and usually involves less code generation and automated tooling.

Differences between RPC calls and local function calls:

  • A local function call is predictable: it either succeeds or fails, depending only on parameters that are under your control. A network request is unpredictable: the request or response may be lost due to a network problem, or the remote machine may be slow or unavailable. Such problems are entirely outside your control. Network problems are common, so you have to anticipate them, for example by retrying a failed request.
  • A local function call either returns a result, throws an exception, or never returns (because it goes into an infinite loop or the process crashes). A network request has another possible outcome: it may return without a result, due to a timeout. In that case, you simply don't know what happened: if you don't get a response from the remote service, you have no way of knowing whether the request got through.
  • If you retry a failed network request, it could happen that the requests are actually getting through and only the responses are being lost. In that case, retrying will cause the action to be performed multiple times, unless you build a deduplication (idempotence) mechanism into the protocol. Local function calls don't have this problem.
  • Every time you call a local function, it normally takes about the same time to execute. A network request is much slower than a function call, and its latency is wildly variable: at good times it may complete in less than a millisecond, but when the network is congested or the remote service is overloaded it may take many seconds to do exactly the same thing.
  • When you call a local function, you can efficiently pass it references (pointers) to objects in local memory. When you make a network request, all those parameters need to be encoded into a sequence of bytes that can be sent over the network. That is fine if the parameters are primitives like numbers or strings, but it quickly becomes a problem with larger objects.
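
The retry problem in the third point can be sketched as follows. This is only a minimal in-process illustration, assuming the client attaches a unique request ID and the server remembers which IDs it has already processed. `PaymentServer`, `deposit`, and `call_with_retry` are hypothetical names, and the lost response is simulated rather than caused by a real network.

```python
import uuid

class PaymentServer:
    """Hypothetical server that deduplicates requests by request ID."""
    def __init__(self):
        self.balance = 0
        self.seen = {}  # request_id -> response returned the first time

    def deposit(self, request_id, amount):
        if request_id in self.seen:
            # Duplicate of an earlier request: do not apply it again.
            return self.seen[request_id]
        self.balance += amount
        self.seen[request_id] = self.balance
        return self.balance

def call_with_retry(server, amount, max_attempts=3):
    request_id = str(uuid.uuid4())  # the same ID is reused on every retry
    for attempt in range(max_attempts):
        response = server.deposit(request_id, amount)
        if attempt == 0:
            # Simulate a lost response: the server executed the request,
            # but the client saw a timeout and therefore retries.
            continue
        return response
    raise TimeoutError("no response after retries")

server = PaymentServer()
result = call_with_retry(server, 100)
```

Although the server receives the request twice, the deposit is applied once. Without the `seen` table the balance would end up at 200.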

Chapter V replication

Replication means keeping a copy of the same data on multiple machines that are connected via a network. Reasons for replication:

  • Keep data geographically close to users (thereby reducing latency)
  • Allow the system to continue working even if some of its parts have failed (thereby increasing availability)
  • Scale out the number of machines that can serve read requests (thereby increasing read throughput)

Replication algorithms: single-leader, multi-leader, and leaderless

How leader-based replication works:

  1. One of the replicas is designated the leader (also known as master or primary). When clients want to write to the database, they must send their requests to the leader, which first writes the new data to its local storage.
  2. The other replicas are known as followers (also called read replicas, slaves, secondaries, or hot standbys). Whenever the leader writes new data to its local storage, it also sends the data change to all of its followers as part of a replication log or change stream. Each follower takes the log from the leader and updates its local copy of the database accordingly, by applying all writes in the same order as they were processed on the leader.
  3. When a client wants to read from the database, it can query either the leader or any of the followers. However, writes are only accepted on the leader (the followers are read-only from the client's point of view).
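
The three steps above can be sketched with in-memory dictionaries standing in for the databases. This is an illustrative toy, not how any particular database implements replication: the `Leader` and `Follower` classes, the pull-based `catch_up` method, and the use of a list index as the log position are all assumptions of the sketch.

```python
class Leader:
    def __init__(self):
        self.data = {}
        self.log = []  # replication log: ordered list of (key, value) changes

    def write(self, key, value):
        # Step 1: the leader applies the write locally and appends it to the log.
        self.data[key] = value
        self.log.append((key, value))

class Follower:
    def __init__(self):
        self.data = {}
        self.applied = 0  # position in the leader's log applied so far

    def catch_up(self, leader):
        # Step 2: apply every change after our position, in leader order.
        for key, value in leader.log[self.applied:]:
            self.data[key] = value
            self.applied += 1

    def read(self, key):
        # Step 3: followers serve reads only; they never accept writes.
        return self.data.get(key)

leader = Leader()
follower = Follower()

leader.write("x", 1)
follower.catch_up(leader)

leader.write("x", 2)   # the follower has not yet received these two changes
leader.write("y", 3)
stale_read = follower.read("x")

follower.catch_up(leader)
fresh_read = follower.read("x")
```

Between the two `catch_up` calls the follower returns the old value of `x`, which is the replication lag that later sections deal with.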


Synchronous and asynchronous replication:

The advantage of synchronous replication is that the follower is guaranteed to have an up-to-date copy of the data that is consistent with the leader. If the leader suddenly fails, we can be sure that the data is still available on the follower. The disadvantage is that if the synchronous follower doesn't respond (because it has crashed, or there is a network fault, or for any other reason), the write cannot be processed. The leader must block all writes and wait until the synchronous replica is available again.

For that reason, it is impractical for all followers to be synchronous: an outage of any one node would bring the whole system to a halt. In practice, if you enable synchronous replication on a database, it usually means that one of the followers is synchronous and the others are asynchronous. If the synchronous follower becomes unavailable or slow, one of the asynchronous followers is made synchronous. This guarantees that you have an up-to-date copy of the data on at least two nodes: the leader and one synchronous follower. This configuration is sometimes called semi-synchronous.

Often, leader-based replication is configured to be completely asynchronous. In this case, if the leader fails and is not recoverable, any writes that have not yet been replicated to followers are lost. This means that a write is not guaranteed to be durable, even if it has been confirmed to the client. However, a fully asynchronous configuration has the advantage that the leader can continue processing writes even if all of its followers have fallen behind.

How do you ensure that a new follower has an accurate copy of the leader's data?

  1. Take a consistent snapshot of the leader's database at some point in time, if possible without taking a lock on the entire database. Most databases have this feature, because it is also required for backups. In some cases, third-party tools are needed, such as innobackupex for MySQL [12].
  2. Copy the snapshot to the new follower node.
  3. The follower connects to the leader and requests all the data changes that have happened since the snapshot was taken. This requires that the snapshot is associated with an exact position in the leader's replication log. That position has various names: for example, PostgreSQL calls it the log sequence number (LSN), and MySQL calls it the binlog coordinates.
  4. When the follower has processed the backlog of data changes since the snapshot, we say it has caught up. It can now continue to process data changes from the leader as they happen.

Handling node outages:

Follower failure: catch-up recovery

On its local disk, each follower keeps a log of the data changes it has received from the leader. If a follower crashes and is restarted, or if the network between the leader and the follower is temporarily interrupted, the follower can recover quite easily: from its log, it knows the last transaction that was processed before the fault occurred. Thus, the follower can connect to the leader and request all the data changes that occurred while it was disconnected. When it has applied these changes, it has caught up to the leader and can continue receiving a stream of data changes as before.

Leader failure: failover

  1. Determining that the leader has failed. There are many things that could go wrong: crashes, power outages, network issues, and more. There is no foolproof way of detecting what has gone wrong, so most systems simply use a timeout: nodes frequently bounce messages back and forth between each other, and if a node doesn't respond for some period of time (say, 30 seconds), it is assumed to be dead. (This does not apply if the leader is deliberately taken down for planned maintenance.)
  2. Choosing a new leader. This could be done through an election process (where the leader is chosen by a majority of the remaining replicas), or a new leader could be appointed by a previously elected controller node. The best candidate for leadership is usually the replica with the most up-to-date data changes from the old leader (to minimize any data loss). Getting all the nodes to agree on a new leader is a consensus problem, discussed in detail in Chapter IX.
  3. Reconfiguring the system to use the new leader. Clients now need to send their write requests to the new leader. If the old leader comes back, it might still believe that it is the leader, not realizing that the other replicas have forced it to step down. The system needs to ensure that the old leader becomes a follower and recognizes the new leader.
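
A much simplified sketch of the first two failover steps is shown below: timeout-based failure detection, followed by picking the surviving replica with the most up-to-date log. It deliberately skips the hard part (getting all nodes to agree, which is a consensus problem). The node names, timestamps, and log positions are invented for illustration.

```python
def detect_failed(last_heartbeat, now, timeout=30.0):
    """Return the nodes whose last heartbeat is older than the timeout."""
    return {node for node, seen in last_heartbeat.items() if now - seen > timeout}

def choose_new_leader(log_positions, failed):
    """Pick the surviving replica with the highest replication log position,
    which minimizes the number of writes lost in the failover."""
    candidates = {n: pos for n, pos in log_positions.items() if n not in failed}
    return max(candidates, key=candidates.get)

# Hypothetical cluster state: time of last heartbeat (seconds) and log position.
last_heartbeat = {"leader": 100.0, "f1": 139.0, "f2": 140.0}
log_positions = {"leader": 57, "f1": 55, "f2": 52}

failed = detect_failed(last_heartbeat, now=141.0)
new_leader = choose_new_leader(log_positions, failed)
lost_writes = log_positions["leader"] - log_positions[new_leader]
```

In this example the old leader had two log entries that `f1` never received. With asynchronous replication those writes are lost unless the old leader's data can be recovered.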

Things that can go wrong with failover:

If asynchronous replication is used, the new leader may not have received all the writes from the old leader before it failed.

Discarding writes is especially dangerous if other storage systems outside of the database need to be coordinated with the database contents.

In certain fault scenarios, two nodes may both believe that they are the leader. This situation is called split brain. As a safety catch, some systems shut down one of the nodes if two leaders are detected. However, if this mechanism is not carefully designed, you can end up with both nodes being shut down.

Implementation of replication logs:

1. Statement-based replication

2. Write-ahead log (WAL) shipping

3. Logical (row-based) log replication

An alternative is to use different log formats for replication and for the storage engine, which allows the replication log to be decoupled from the storage engine internals. This kind of replication log is called a logical log, to distinguish it from the storage engine's (physical) data representation.

4. Trigger-based replication

Problems with replication lag

1. Reading your own writes

Figure: a user makes a write, followed by a read from a stale replica. Read-after-write consistency is needed to prevent this anomaly.

2. Monotonic reads

Figure: a user first reads from a fresh replica, then from a stale replica. Time appears to go backward. To prevent this anomaly, we need monotonic reads.

3. Consistent prefix reads

Figure: if some partitions are replicated more slowly than others, an observer may see the answer before they see the question.

Solutions for replication lag: transactions (at a cost in performance and availability) and other alternative mechanisms

Multi-leader replication

Use cases

1. Multi-datacenter operation

Figure: multi-leader replication across multiple datacenters

2. Clients with offline operation

Consider the calendar apps on your mobile phone, laptop, and other devices. You need to be able to see your meetings (make read requests) and enter new meetings (make write requests) at any time, regardless of whether your device currently has an internet connection. If you make any changes while offline, they need to be synced with a server and your other devices when the device is next online.

In this case, every device has a local database that acts as a leader (it accepts write requests), and there is an asynchronous multi-leader replication process that syncs the replicas of your calendar across all of your devices. The replication lag may be hours or even days, depending on when you have internet access.

From an architectural point of view, this setup is essentially the same as multi-leader replication between datacenters: each device is a "datacenter", and the network connection between them is extremely unreliable. As the long history of broken calendar sync implementations shows, multi-leader replication is difficult to get right.

3. Collaborative editing

We don't usually think of collaborative editing as a database replication problem, but it has a lot in common with the offline editing use case above. When one user edits a document, the changes are instantly applied to their local replica (the state of the document in their web browser or client application) and asynchronously replicated to the server and to any other users who are editing the same document.

If you want to guarantee that there will be no editing conflicts, the application must obtain a lock on the document before a user can edit it. If another user wants to edit the same document, they first have to wait until the first user has committed their changes and released the lock. This collaboration model is equivalent to single-leader replication with transactions on the leader.

However, for faster collaboration, you may want to make the unit of change very small (for example, a single keystroke) and avoid locking. This approach allows multiple users to edit simultaneously, but it also brings all the challenges of multi-leader replication, including the need for conflict resolution.

Handling write conflicts (the biggest problem with multi-leader replication)

Figure: a write conflict caused by two leaders concurrently updating the same record

Leaderless replication

Writing to the database when a node is down

Figure 5-10: a quorum write, quorum read, and read repair after a node outage.

Detecting concurrent writes

Figure: concurrent writes in a Dynamo-style datastore: there is no well-defined ordering.

Capturing the happens-before relationship

Initially, the shopping cart is empty. Between them, the two clients make five writes to the database:

  1. Client 1 adds milk to the cart. This is the first write to that key, so the server successfully stores it and assigns it version 1. The server also echoes the value back to the client, along with the version number.
  2. Client 2 adds eggs to the cart, not knowing that client 1 concurrently added milk (client 2 thinks that its eggs are the only item in the cart). The server assigns version 2 to this write and stores the eggs and the milk as two separate values. It then returns both values to client 2, along with version number 2.
  3. Client 1, oblivious to client 2's write, wants to add flour to the cart, so it thinks the current cart contents should be [milk, flour]. It sends this value to the server, along with the version number 1 that the server gave client 1 previously. The server can tell from the version number that the write of [milk, flour] supersedes the prior value of [milk] but that it is concurrent with [egg]. Thus, the server assigns version 3 to [milk, flour], overwrites the version 1 value [milk], but keeps the version 2 value [egg] and returns both remaining values to client 1.
  4. Meanwhile, client 2 wants to add ham to the cart, unaware that client 1 just added flour. Client 2 received the two values [milk] and [egg] from the server in the last response, so it now merges those values and adds ham to form a new value, [egg, milk, ham]. It sends that value to the server, along with the previous version number 2. The server detects that the new value overwrites the version 2 value [egg] but is concurrent with the version 3 value [milk, flour], so the two remaining values are [milk, flour] with version 3 and [egg, milk, ham] with version 4.
  5. Finally, client 1 wants to add bacon. It previously received [milk, flour] and [egg] from the server at version 3, so it merges those, adds bacon, and sends the final value [milk, flour, egg, bacon] to the server, along with version number 3. This overwrites the version 3 value [milk, flour] (note that [egg] was already overwritten in the previous step) but is concurrent with the version 4 value [egg, milk, ham], so the server keeps those two concurrent values.
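
The five steps above can be replayed with a small sketch of the server-side bookkeeping for a single key. It assumes one replica and a plain integer version counter. The `VersionedKey` class and its `based_on` parameter are names chosen for this illustration, and merging sibling values is left to the client, as in the walkthrough.

```python
class VersionedKey:
    """Server-side state for one key: concurrent values tagged with versions."""
    def __init__(self):
        self.version = 0
        self.siblings = {}  # version number -> value (a list of cart items)

    def write(self, value, based_on=0):
        """Store a value. based_on is the version number the client received
        in its previous response (0 if it has not read the key before)."""
        self.version += 1
        # Everything at or below based_on was seen by the client and merged
        # into the new value, so it is superseded. Anything newer is concurrent
        # with this write and must be kept.
        self.siblings = {v: val for v, val in self.siblings.items() if v > based_on}
        self.siblings[self.version] = value
        return self.version, list(self.siblings.values())

cart = VersionedKey()
cart.write(["milk"])                                  # step 1, client 1
cart.write(["egg"])                                   # step 2, client 2
cart.write(["milk", "flour"], based_on=1)             # step 3, client 1
cart.write(["egg", "milk", "ham"], based_on=2)        # step 4, client 2
version, values = cart.write(["milk", "flour", "egg", "bacon"], based_on=3)  # step 5
```

After the last write the server holds two concurrent values, [egg, milk, ham] and [milk, flour, egg, bacon], matching the end of the walkthrough. No write was silently dropped, but a client still has to merge the siblings.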


Chapter VI partitioning

Partitioning and replication

Partitioning is usually combined with replication so that copies of each partition are stored on multiple nodes.

Figure 6-1: combining replication and partitioning: each node acts as leader for some partitions and as follower for other partitions.

Partitioning of key-value data

1. Partitioning by key range (as in a dictionary)

2. Partitioning by hash of key
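
The two approaches can be contrasted in a few lines. This is a sketch under assumptions: the partition boundaries, the choice of SHA-256 as the hash function, and the number of partitions are arbitrary picks for the example.

```python
import bisect
import hashlib

# Hypothetical boundaries: partition 0 holds keys < "g", partition 1 holds
# keys in ["g", "n"), partition 2 holds ["n", "t"), partition 3 the rest.
BOUNDARIES = ["g", "n", "t"]

def range_partition(key: str) -> int:
    """Key-range partitioning: keys that sort next to each other stay together."""
    return bisect.bisect_right(BOUNDARIES, key)

def hash_partition(key: str, num_partitions: int = 4) -> int:
    """Hash partitioning: a stable hash of the key decides the partition."""
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_partitions

keys = ["apple", "apricot", "grape", "melon", "zebra"]
by_range = [range_partition(k) for k in keys]
by_hash = [hash_partition(k) for k in keys]
```

With key ranges, "apple" and "apricot" share a partition, so a range scan over neighboring keys touches few partitions. With hashing, neighboring keys are scattered, which spreads load more evenly but gives up the sort order.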

Partitioning and secondary indexes

Secondary indexes are the bread and butter of relational databases, and they are common in document databases too. Many key-value stores (such as HBase and Voldemort) have avoided secondary indexes because of their added implementation complexity, but some (such as Riak) have started adding them because they are so useful for data modeling. Secondary indexes are also the cornerstone of search servers such as Solr and Elasticsearch.

The problem with secondary indexes is that they don't map neatly to partitions. There are two main approaches to partitioning a database with secondary indexes: document-based partitioning and term-based partitioning.

Partitioning secondary indexes by document

Partitioning secondary indexes by term


Rebalancing partitions

Over time, things change in a database:

  • Query throughput increases, so you want to add more CPUs to handle the load.
  • The dataset size increases, so you want to add more disks and RAM to store it.
  • A machine fails, and other machines need to take over the failed machine's responsibilities.

All of these changes call for data and requests to be moved from one node to another. The process of moving load from one node in the cluster to another is called rebalancing.

No matter which partitioning scheme is used, rebalancing is usually expected to meet some minimum requirements:

  • After rebalancing, the load (data storage, read and write requests) should be shared fairly between the nodes in the cluster.
  • While rebalancing is happening, the database should continue accepting reads and writes.
  • No more data than necessary should be moved between nodes, to make rebalancing fast and to minimize the network and disk I/O load.

Rebalancing strategies:

* How not to do it: hash mod N
* Fixed number of partitions
* Dynamic partitioning
* Partitioning proportionally to nodes
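
To see why hash mod N is the counterexample, the sketch below counts how many keys change node when a cluster grows from 10 to 11 nodes. It compares hash mod N against a fixed number of partitions, where the new node simply takes over a few whole partitions. The key names, the 1,000 partitions, and the way partitions are handed to the new node are assumptions made for illustration.

```python
import hashlib

def stable_hash(key: str) -> int:
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big")

keys = [f"user{i}" for i in range(10_000)]

# Strategy 1: node = hash mod N. Changing N changes the node of most keys.
moved_mod_n = sum(1 for k in keys if stable_hash(k) % 10 != stable_hash(k) % 11)

# Strategy 2: fixed number of partitions. A key's partition never changes;
# only the partition -> node assignment is updated when a node joins.
NUM_PARTITIONS = 1000
old_owner = {p: p % 10 for p in range(NUM_PARTITIONS)}
new_owner = dict(old_owner)
for p in range(NUM_PARTITIONS // 11):
    new_owner[p] = 10  # new node 10 takes 9 partitions from each old node

moved_fixed = sum(
    1 for k in keys
    if old_owner[stable_hash(k) % NUM_PARTITIONS]
    != new_owner[stable_hash(k) % NUM_PARTITIONS]
)
```

With hash mod N roughly ten out of every eleven keys move, whereas with fixed partitions only about one in eleven does, which is close to the minimum needed to give the new node a fair share.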

Request routing (when a client wants to make a request, which node should it connect to?)

  1. Allow clients to contact any node (for example, via a round-robin load balancer). If that node happens to own the partition to which the request applies, it can handle the request directly; otherwise, it forwards the request to the appropriate node, receives the reply, and passes the reply along to the client.
  2. Send all requests from clients to a routing tier first, which determines the node that should handle each request and forwards it accordingly. This routing tier does not itself handle any requests; it only acts as a partition-aware load balancer.
  3. Require that clients be aware of the partitioning and the assignment of partitions to nodes. In this case, a client can connect directly to the appropriate node, without any intermediary.

Many distributed data systems rely on a separate coordination service such as ZooKeeper to keep track of this cluster metadata. Each node registers itself in ZooKeeper, and ZooKeeper maintains the authoritative mapping of partitions to nodes. Other actors, such as the routing tier or the partition-aware client, can subscribe to this information in ZooKeeper. Whenever a partition changes ownership, or a node is added or removed, ZooKeeper notifies the routing tier so that it can keep its routing information up to date.
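
The routing tier from approach 2 can be sketched as a lookup table that is updated whenever the partition assignment changes. This does not use any real ZooKeeper client API: `on_assignment_change` is a hypothetical callback standing in for a notification from the coordination service, and the node names are invented.

```python
import hashlib

def partition_of(key: str, num_partitions: int) -> int:
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_partitions

class RoutingTier:
    """Forwards each request to the node that currently owns the key's partition."""
    def __init__(self, assignment):
        self.assignment = dict(assignment)  # partition id -> node name

    def on_assignment_change(self, partition, node):
        # Hypothetical hook: called when the coordination service reports
        # that a partition has been moved to another node.
        self.assignment[partition] = node

    def route(self, key):
        return self.assignment[partition_of(key, len(self.assignment))]

router = RoutingTier({0: "node-a", 1: "node-b", 2: "node-a", 3: "node-b"})
p = partition_of("user42", 4)
before = router.route("user42")

router.on_assignment_change(p, "node-c")  # rebalancing moved this partition
after = router.route("user42")
```

The same table could live inside a partition-aware client (approach 3) instead of a separate routing tier. What matters is that every participant learns about assignment changes.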


Chapter VII transactions

Atomicity: in multi-threaded programming, if one thread executes an atomic operation, another thread cannot see the half-finished result of that operation. The system can only be in the state it was in before the operation or after the operation, not something in between.

The defining feature of ACID atomicity is the ability to abort a transaction on error and have all writes from that transaction discarded. If the writes are grouped together into an atomic transaction, and the transaction cannot be completed (committed) due to a fault, then the transaction is aborted and the database must discard or undo any writes it has made so far in that transaction.

Consistency: certain statements about your data (invariants) must always be true. For example, in an accounting system, credits and debits across all accounts must always be balanced. If a transaction starts with a database that is valid according to these invariants, and any writes during the transaction preserve the validity, then you can be sure that the invariants are always satisfied.

Atomicity, isolation, and durability are properties of the database, whereas consistency (in the ACID sense) is a property of the application. The application may rely on the database's atomicity and isolation properties in order to achieve consistency, but it is not up to the database alone.

Isolation: concurrently executing transactions are isolated from each other and cannot interfere with each other. Most databases are accessed by several clients at the same time. That is no problem if they are reading and writing different parts of the database, but if they access the same database records, they can run into concurrency problems.

Durability: the promise that once a transaction has committed successfully, any data it has written will not be lost, even if there is a hardware fault or the database crashes.

Single-object and multi-object operations

Figure: violating isolation: one transaction reads another transaction's uncommitted writes (a "dirty read")

Without atomicity, error handling becomes much more complicated, and the lack of isolation can cause concurrency problems.

Transaction isolation levels

Read committed

  1. When reading from the database, you will only see data that has been committed (no dirty reads).
  2. When writing to the database, you will only overwrite data that has been committed (no dirty writes).

Figure: no dirty reads: user 2 sees the new value for x only after user 1's transaction has committed.

Figure: with dirty writes, conflicting writes from different transactions can be mixed up

Read skew (nonrepeatable reads)

Within the same transaction, the client sees different states of the database at different points in time. Snapshot isolation is commonly used to solve this problem. Implementations of snapshot isolation typically use write locks to prevent dirty writes. From a performance point of view, a key principle of snapshot isolation is: readers never block writers, and writers never block readers. To implement snapshot isolation, the database must potentially keep several different committed versions of an object. This technique is known as multi-version concurrency control (MVCC).

If a database only needs to provide read committed isolation, but not snapshot isolation, it is sufficient to keep two versions of an object: the committed version and the overwritten-but-not-yet-committed version. However, storage engines that support snapshot isolation typically use MVCC for their read committed isolation level as well. A typical approach is that read committed uses a separate snapshot for each query, while snapshot isolation uses the same snapshot for an entire transaction.

In the figure, when transaction 12 reads from account 2, it sees a balance of $500, because the deletion of the $500 balance was made by transaction 13 (according to rule 3, transaction 12 cannot see a deletion made by transaction 13), and the creation of the $400 balance is not yet visible either (by the same rule).

Figure: implementing snapshot isolation using multi-version objects
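
The account 2 example can be reproduced with a small sketch of MVCC visibility. It assumes that each row version records the transaction ID that created it and, if applicable, the one that deleted it, and that a reader ignores writes by transactions that have a later ID, were still in progress when the reader started, or were aborted. The `RowVersion` class and the creating transaction ID 5 are illustrative assumptions.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class RowVersion:
    balance: int
    created_by: int                   # txid that inserted this version
    deleted_by: Optional[int] = None  # txid that marked it deleted, if any

def write_is_visible(writer, reader, in_progress, aborted):
    """A write counts only if its transaction is not later than the reader,
    was not in progress when the reader started, and was not aborted."""
    return writer <= reader and writer not in in_progress and writer not in aborted

def visible_balance(versions, reader, in_progress=frozenset(), aborted=frozenset()):
    for v in versions:
        created = write_is_visible(v.created_by, reader, in_progress, aborted)
        deleted = v.deleted_by is not None and write_is_visible(
            v.deleted_by, reader, in_progress, aborted)
        if created and not deleted:
            return v.balance
    return None

# Transaction 13 replaced the $500 balance of account 2 with $400.
account2 = [
    RowVersion(500, created_by=5, deleted_by=13),
    RowVersion(400, created_by=13),
]

seen_by_12 = visible_balance(account2, reader=12)
seen_by_14 = visible_balance(account2, reader=14)
seen_by_14_early = visible_balance(account2, reader=14, in_progress={13})
```

Transaction 12 sees $500 because both the deletion and the new version were written by the later transaction 13. A transaction that starts after 13 has committed sees $400, but one that starts while 13 is still in progress keeps seeing $500.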

Lost updates

Two clients concurrently perform a read-modify-write cycle. One of the writes overwrites the result of the other without incorporating its changes, so data is lost. Some implementations of snapshot isolation prevent this anomaly automatically, while others require a manual lock (`SELECT FOR UPDATE`).
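
The anomaly and the locking remedy can be shown in miniature. In this sketch an in-memory counter stands in for a database row, and a `threading.Lock` plays the role that an explicit row lock would play in a database. The unsafe interleaving is written out step by step so that it is deterministic.

```python
import threading

counter = {"value": 0}
lock = threading.Lock()

# Unsafe read-modify-write, interleaved by hand:
a = counter["value"]       # client A reads 0
b = counter["value"]       # client B reads 0
counter["value"] = a + 1   # A writes 1
counter["value"] = b + 1   # B writes 1 and clobbers A's update
after_lost_update = counter["value"]

def locked_increment(times):
    for _ in range(times):
        with lock:  # the whole read-modify-write cycle holds the lock
            counter["value"] += 1

counter["value"] = 0
threads = [threading.Thread(target=locked_increment, args=(1000,)) for _ in range(2)]
for t in threads:
    t.start()
for t in threads:
    t.join()
after_locked = counter["value"]
```

Two increments ran in the first part but the counter only reached 1. With the lock held around each cycle, all 2,000 increments are preserved.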

Phantom reads

A transaction reads objects that match some search condition. Another client makes a write that affects the results of that search. Snapshot isolation prevents straightforward phantom reads, but phantoms in the context of write skew require special treatment, such as index-range locks.

Encapsulating transactions in stored procedures

Even once humans have been taken out of the critical path, transactions have continued to be executed in an interactive client/server style, one statement at a time. An application makes a query, reads the result, perhaps makes another query depending on the result of the first query, and so on. The queries and results are sent back and forth between the application code (running on one machine) and the database server (running on another machine).

In this interactive style of transaction, a lot of time is spent in network communication between the application and the database. If you were to disallow concurrency in the database and only process one transaction at a time, the throughput would be dreadful, because the database would spend most of its time waiting for the application to issue the next query for the current transaction. In this kind of database, it is necessary to process multiple transactions concurrently in order to get reasonable performance.

For this reason, systems with single-threaded serial transaction processing don't allow interactive multi-statement transactions. Instead, the application must submit the entire transaction code to the database ahead of time, as a stored procedure. The differences between these approaches are shown in the figure. Provided that all data required by a transaction is in memory, the stored procedure can execute very fast, without waiting for any network or disk I/O.

Figure: the difference between an interactive transaction and a stored procedure

With stored procedures and in-memory data, executing all transactions on a single thread becomes feasible. Because they don't need to wait for I/O and they avoid the overhead of other concurrency control mechanisms, they can achieve quite good throughput on a single thread.

Serializable snapshot isolation (SSI)

Detecting stale MVCC reads (an uncommitted write occurred before the read)

When a transaction reads from a consistent snapshot in an MVCC database, it ignores writes that were made by any other transactions that hadn't yet committed at the time when the snapshot was taken. In the figure, transaction 43 sees Alice as having `on_call = true`, because transaction 42 (which modified Alice's on-call status) is uncommitted. However, by the time transaction 43 wants to commit, transaction 42 has already committed. This means that the write that was ignored when reading from the consistent snapshot has now taken effect, and transaction 43's premise is no longer true.

In order to prevent this anomaly, the database needs to track when a transaction ignores another transaction's writes due to MVCC visibility rules. When the transaction wants to commit, the database checks whether any of the ignored writes have now been committed. If so, the transaction must be aborted.

Why wait until committing? Why not abort transaction 43 immediately when the stale read is detected? Because if transaction 43 is a read-only transaction, it does not need to be aborted, since there is no risk of write skew. At the time when transaction 43 makes its read, the database doesn't yet know whether that transaction is going to perform a write later. Moreover, transaction 42 may yet abort, or may still be uncommitted at the time when transaction 43 commits, so the read may turn out not to have been stale after all. By avoiding unnecessary aborts, SSI preserves snapshot isolation's support for long-running reads from a consistent snapshot.

Detecting writes that affect prior reads (the write occurs after the read)

In the figure, transactions 42 and 43 both search for on-call doctors during shift 1234. If there is an index on `shift_id`, the database can use index entry 1234 to record the fact that transactions 42 and 43 read this data. If there is no index, this information can be tracked at the table level. The information only needs to be kept for a while: after a transaction has finished (committed or aborted), and all concurrent transactions have finished, the database can forget what data it read.

When a transaction writes to the database, it must look in the indexes for any other transactions that have recently read the affected data. This process is similar to acquiring a write lock on the affected key range, but rather than blocking until the readers have committed, the lock acts as a tripwire: it simply notifies the other transactions that the data they read may no longer be up to date.

In the figure, transaction 43 notifies transaction 42 that its prior read is outdated, and vice versa. Transaction 42 is the first to commit, and it succeeds: although transaction 43's write affected 42, transaction 43 hasn't yet committed, so the write has not yet taken effect. However, when transaction 43 wants to commit, the conflicting write from transaction 42 has already been committed, so transaction 43 must abort.

Chapter VIII troubles of distributed system

Partial failureIs the decisive characteristic of distributed system. To tolerate mistakes, the first step istestingThey, but even that is difficult. Most systems do not have an accurate mechanism to detect whether a node fails, so most distributed algorithms rely onovertimeTo determine if the remote node is still available. Once a fault is detected, it is not easy for the system to tolerate it: no global variables, no shared memory, no common knowledge, or any other kind of shared state between machines.

Most non-safety-critical systems choose cheap and unreliable over expensive and reliable. A distributed system can in principle run forever without interruption at the service level, because all faults and maintenance can be handled at the node level, at least in theory. (In practice, if a bad configuration change is rolled out to all nodes, the distributed system will still be brought to its knees.)

Chapter IX consistency and consensus

Consistency guarantee

Eventual consistency: a very weak guarantee

Linear consistency (linearizability): one of the strongest consistency models

​ Causal consistency

Linear consistency


Figure: this system is not linearly consistent, which confuses the fans

​ The basic idea behind linear consistency is simple: make the system look like there is only one copy of data.


Figure: visualizing the points in time at which reads and writes appear to have taken effect. The final read by B is not linearly consistent.

​ There are some interesting details to point out in the above figure:

  • First client B sends a request to read `x`, then client D sends a request to set `x` to `0`, and then client A sends a request to set `x` to `1`. Nevertheless, the value returned to B's read is `1` (the value written by A). This is fine: it means that the database first processed D's write, then A's write, and finally B's read. Although this is not the order in which the requests were sent, it is an acceptable order, because the three requests are concurrent. Perhaps B's read request was slightly delayed in the network, so it only reached the database after the two writes.

  • Client B's read returned `1` before client A received the response from the database saying that the write of the value `1` had succeeded. This is also fine: it does not mean the value was read before it was written, only that the OK response from the database to client A was slightly delayed in the network.

  • This model does not assume any transaction isolation: another client may change a value at any time. For example, C first reads `1` and then reads `2`, because the value was changed by B between the two reads. An atomic compare-and-set (`cas`) operation can be used to check that the value has not been changed concurrently by another client: B's and C's `cas` requests succeed, but D's `cas` request fails (by the time the database processes it, the value of `x` is no longer `0`).

  • The final read by client B (in the shaded bar) is not linearly consistent. The operation is concurrent with C's `cas` write, which updates `x` from `2` to `4`. In the absence of other requests, it would be fine for B's read to return `2`. However, client A has already read the new value `4` before B's read started, so B is not allowed to read a value older than the one A read. Again, this is the same situation as Alice and Bob in Figure 9-1.

    This is the intuition behind linear consistency. The formal definition [6] describes it more accurately. It is possible (though computationally expensive) to test the linear consistency of a system’s behavior by recording the timing of all requests and responses and checking whether they can be arranged in a valid order [11].
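
The test mentioned above (record the timing of every request and response, then look for a valid ordering) can be sketched as a brute-force search. This is an illustrative toy for a single register with only reads and writes, assuming tiny histories; it tries every permutation, which shows why the check is computationally expensive. The names `Op` and `is_linearizable` are hypothetical.

```python
from itertools import permutations
from typing import NamedTuple


class Op(NamedTuple):
    kind: str     # "read" or "write"
    value: int    # value written, or value the read returned
    start: float  # time the request was sent
    end: float    # time the response was received


def _replay_ok(order, initial):
    """Check that every read returns the most recently written value."""
    current = initial
    for op in order:
        if op.kind == "write":
            current = op.value
        elif op.value != current:
            return False
    return True


def is_linearizable(history, initial):
    for order in permutations(history):
        # Real-time constraint: an operation that finished before another
        # one started must come first in the ordering.
        respects_time = all(
            not (later.end < earlier.start)
            for i, earlier in enumerate(order)
            for later in order[i + 1:]
        )
        if respects_time and _replay_ok(order, initial):
            return True
    return False


# One slow write of 1, concurrent with two reads that do not overlap each other.
stale_history = [Op("write", 1, 0, 10), Op("read", 1, 2, 3), Op("read", 0, 5, 6)]
valid_history = [Op("write", 1, 0, 10), Op("read", 0, 2, 3), Op("read", 1, 5, 6)]
```

In `stale_history` the second read returns the old value after an earlier read already saw the new one, so no valid ordering exists; in `valid_history` the order read, write, read works.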

Scenarios that rely on linear consistency

Locking and leader election (ZooKeeper and etcd use consensus algorithms to stay fault tolerant)

Constraints and uniqueness guarantees (a username or email address must uniquely identify one user)

Cross-channel timing dependencies


Figure: the web server and the image resizer communicate through both file storage and a message queue, which opens the possibility of race conditions

This problem arises because there are two different communication channels between the web server and the resizer: the file storage and the message queue. Without the recency guarantee of linear consistency, race conditions between the two channels are possible.

Causal consistency

Causality imposes an ordering on the events in a system (what happened before what, based on cause and effect). Linear consistency puts all operations into a single, totally ordered timeline. Causal consistency, by contrast, is a weaker model: some events can be concurrent, so the version history is like a timeline that keeps branching and merging. Causal consistency does not have the coordination overhead of linear consistency and is much less sensitive to network problems.

Distributed transactions and consensus

Consensus: all nodes agree on the decision, and the decision is irrevocable.

Problems equivalent to consensus

Linearly consistent CAS register

The register must atomically **decide** whether to set a new value, based on whether its current value equals the parameter given in the operation.

Atomic transaction commit

The database must **decide** whether to commit or abort a distributed transaction.

Total order broadcast

The messaging system must **decide** on the order in which messages are delivered.

Locks and leases

When several clients compete for a lock or lease, the lock **decides** which client successfully acquires it.

Membership / coordination services

Given a failure detector (e.g. timeouts), the system must **decide** which nodes are alive and which should be declared dead.

Uniqueness constraint

When several transactions concurrently try to create conflicting records with the same key, the constraint must **decide** which one to allow and which should fail with a constraint violation.
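
To make the first item in the list above concrete, here is a minimal compare-and-set register. It is a single-process stand-in that uses a mutex for atomicity; a fault-tolerant distributed version would need a consensus algorithm underneath. The class name `CasRegister` is hypothetical.

```python
import threading


class CasRegister:
    """Single-process register with an atomic compare-and-set operation."""

    def __init__(self, value):
        self._value = value
        self._lock = threading.Lock()

    def read(self):
        with self._lock:
            return self._value

    def compare_and_set(self, expected, new):
        # The comparison and the update happen under one lock, so no other
        # thread can change the value in between.
        with self._lock:
            if self._value != expected:
                return False
            self._value = new
            return True


x = CasRegister(1)
b_ok = x.compare_and_set(1, 2)  # succeeds: the value was 1
c_ok = x.compare_and_set(2, 4)  # succeeds: the value was 2
d_ok = x.compare_and_set(0, 3)  # fails: the value is no longer 0
```

A lock can be built on the same primitive: acquiring it amounts to `compare_and_set(None, client_id)`, and only one competing client sees `True`.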

Chapter 10 batch processing

Three system types

*Service (online system)*: each time a request is received, the service tries to handle it as quickly as possible and sends back a response. Response time is usually the primary measure of a service's performance, and availability is often very important.

*Batch processing system (offline system)*: takes a large amount of input data, runs a *job* to process it, and produces some output data. Jobs often take a while (from a few minutes to several days), so there is normally no user waiting for the job to finish. Instead, batch jobs are often scheduled to run periodically (for example, once a day). The primary performance measure of a batch job is usually throughput (the time it takes to process an input of a certain size).

*Stream processing system (near-real-time system)*: somewhere between the two. A stream processor consumes inputs and produces outputs, but it operates on events shortly after they happen rather than waiting for a fixed set of input data as a batch job does. It therefore has lower latency than the equivalent batch system.

Simple log analysis with Unix tools

cat /var/log/nginx/access.log | #1
    awk '{print $7}' | #2
    sort             | #3
    uniq -c          | #4
    sort -r -n       | #5
    head -n 5          #6
1. Read the log file.
2. Split each line into fields by whitespace and output only the seventh field of each line, which happens to be the requested URL. In our example, this is `/css/typography.css`.
3. Sort the list of requested URLs alphabetically. If a URL has been requested n times, then after sorting, the list contains that URL repeated n times in a row.
4. The `uniq` command filters out repeated lines in its input by checking whether two adjacent lines are the same. The `-c` option tells it to also output a counter: for every distinct URL, it reports how many times that URL appeared in the input.
5. The second `sort` sorts by the number (`-n`) at the start of each line, which is the number of times the URL was requested. It then returns the results in reverse order (`-r`), largest number first.
6. Finally, `head` outputs just the first five lines (`-n 5`) and discards the rest.

    4189 /favicon.ico
    3631 /2013/05/24/improving-security-of-ssh-private-keys.html
    2124 /2012/12/05/schema-evolution-in-avro-protocol-buffers-thrift.html
    1369 /
     915 /css/typography.css
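
For comparison, the same analysis can be sketched as a short program. This version keeps an in-memory counter per URL instead of sorting, so it assumes the set of distinct URLs fits in memory. The function name `top_urls` and the sample log lines are made up for illustration.

```python
from collections import Counter


def top_urls(lines, n=5):
    """Return the n most requested URLs as (url, count) pairs."""
    counts = Counter()
    for line in lines:
        fields = line.split()           # split on whitespace, like awk
        if len(fields) >= 7:
            counts[fields[6]] += 1      # seventh field, like awk's $7
    return counts.most_common(n)


sample = [
    '10.0.0.1 - - [27/Feb/2015:17:55:11 +0000] "GET /a.html HTTP/1.1" 200 3377',
    '10.0.0.2 - - [27/Feb/2015:17:55:12 +0000] "GET /b.html HTTP/1.1" 200 1024',
    '10.0.0.3 - - [27/Feb/2015:17:55:13 +0000] "GET /a.html HTTP/1.1" 200 3377',
]
result = top_urls(sample, n=1)
```

To run it on a real log, pass an open file object, which yields one line at a time.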

Advantages: the tools accept input in any form, in keeping with the idea that everything is a file (the output of one process can be the input of the next).

Commands are connected by pipes, which pass data between them efficiently.

Disadvantage: the tools can only run on a single machine (process to process), which is where Hadoop comes in.


MapReduce is a programming framework with which you can write code to process large datasets in a distributed filesystem such as HDFS. It supports parallel computation across many machines. A mapper or reducer only handles one record at a time; it does not need to know where its input comes from or where its output goes, so the framework can handle the complexity of moving data between machines.

​ To create a MapReduce job, you need to implement two callback functions, mapper and reducer.


The mapper is called once for every input record, and its job is to extract the key and value from that record. For each input, it may generate any number of key-value pairs (including none). It keeps no state from one input record to the next, so each record is handled independently.


The MapReduce framework takes the key-value pairs produced by the mappers, collects all the values belonging to the same key, and calls the reducer with an iterator over that collection of values. The reducer can produce output records (such as the number of occurrences of the same URL).
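
The two callbacks and the framework's role can be sketched in a few lines. This is an in-memory simulation on one machine, not the Hadoop API; `run_mapreduce`, `mapper` and `reducer` are hypothetical names, and the records are assumed to be web server log lines with the URL in the seventh field.

```python
from itertools import groupby
from operator import itemgetter


def mapper(record):
    """Emit (url, 1) for a log line; emit nothing for malformed lines."""
    fields = record.split()
    if len(fields) >= 7:
        yield fields[6], 1


def reducer(key, values):
    """Receive one key and an iterator over all of its values."""
    yield key, sum(values)


def run_mapreduce(records, mapper, reducer):
    # Map phase: each record is processed independently.
    pairs = [pair for record in records for pair in mapper(record)]
    # Sorting by key brings all values for the same key together.
    pairs.sort(key=itemgetter(0))
    output = []
    for key, group in groupby(pairs, key=itemgetter(0)):
        output.extend(reducer(key, (value for _, value in group)))
    return output


log = [
    'h - - [t +0000] "GET /x HTTP/1.1" 200 1',
    'h - - [t +0000] "GET /y HTTP/1.1" 200 1',
    'h - - [t +0000] "GET /x HTTP/1.1" 200 1',
]
url_counts = run_mapreduce(log, mapper, reducer)
```

Because neither callback keeps state between calls, the framework is free to run them on different machines and to rerun them after a failure.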

Distributed execution of MapReduce


Each input file is typically hundreds of megabytes in size. The MapReduce scheduler (not shown in the figure) tries to run each mapper on one of the machines that stores a replica of the input file, provided that machine has enough spare RAM and CPU resources to run the map task [26]. This principle is known as putting the computation near the data [27]: it saves copying the input file over the network, reducing network load and increasing locality.

The reduce side of the computation is also partitioned. While the number of map tasks is determined by the number of input file blocks, the number of reduce tasks is configured by the job author (it can differ from the number of map tasks). To ensure that all key-value pairs with the same key end up at the same reducer, the framework uses a hash of the key to determine which reduce task should receive a particular key-value pair.

Whenever a mapper finishes reading its input file and writing its sorted output files, the MapReduce scheduler notifies the reducers that they can start fetching the output files from that mapper. The reducers connect to each of the mappers and download the files of sorted key-value pairs for their partition. The process of partitioning by reducer, sorting, and copying data partitions from mappers to reducers is known as the shuffle.

The reduce task takes the files from the mappers and merges them together, preserving the sort order. Thus, if different mappers produced records with the same key, they will be adjacent in the merged reducer input. The reducer is called with a key and an iterator that sequentially scans over all records with that key (which may in some cases not all fit in memory). The reducer can use arbitrary logic to process these records and can generate any number of output records. These output records are written to a file on the distributed filesystem.
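
The shuffle can be illustrated with in-memory lists standing in for files. This is only a sketch: `partition_for`, `map_side` and `reduce_side` are hypothetical names, and CRC32 is used as the hash simply because it is stable across runs (Python's built-in `hash` for strings is randomized per process).

```python
import heapq
import zlib


def partition_for(key, num_reducers):
    """Pick the reducer for a key from a hash of the key."""
    return zlib.crc32(key.encode("utf-8")) % num_reducers


def map_side(pairs, num_reducers):
    """Split one mapper's output into sorted per-reducer files (lists here)."""
    partitions = [[] for _ in range(num_reducers)]
    for key, value in pairs:
        partitions[partition_for(key, num_reducers)].append((key, value))
    return [sorted(p) for p in partitions]


def reduce_side(sorted_files):
    """Merge the sorted files fetched from every mapper, keeping the order."""
    return list(heapq.merge(*sorted_files))


mapper1 = map_side([("b", 1), ("a", 1)], num_reducers=2)
mapper2 = map_side([("a", 1), ("c", 1)], num_reducers=2)
# Reducer r fetches partition r from every mapper and merges them.
reducer_inputs = [reduce_side([mapper1[r], mapper2[r]]) for r in range(2)]
```

`heapq.merge` streams through inputs that are already sorted, so the reducer never has to re-sort or hold everything in memory at once.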

Sort-merge join


To achieve good throughput in a batch process, the computation must be (as much as possible) local to one machine. Making a random-access request over the network for every record to be processed is too slow. A better approach is to take a copy of the user database and put it in the same distributed filesystem as the log of user activity events.

When the MapReduce framework partitions the mapper output by key and then sorts the key-value pairs, the effect is that all the activity events and the user record with the same user ID become adjacent to each other in the reducer input. The MapReduce job can even arrange the records so that the reducer always sees the record from the user database first, followed by the activity events in timestamp order. This technique is known as a secondary sort.

Since the reducer processes all the records for a particular user ID in one go, it only needs to keep one user record in memory at any one time, and it never needs to make any requests over the network. This algorithm is known as a sort-merge join, because the mapper output is sorted by key and the reducers then merge together the sorted lists of records from both sides of the join.
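
A sketch of the sort-merge join with a secondary sort, run in memory on one machine. The record layouts (a user as `(user_id, birth_year)`, an event as `(user_id, timestamp, url)`) and all function names are assumptions made for this example.

```python
from itertools import groupby

USER, EVENT = 0, 1  # tag order makes the user record sort before its events


def map_user(user):
    user_id, birth_year = user
    return (user_id, USER, 0), birth_year


def map_event(event):
    user_id, timestamp, url = event
    return (user_id, EVENT, timestamp), url


def sort_merge_join(users, events):
    pairs = [map_user(u) for u in users] + [map_event(e) for e in events]
    # Secondary sort: by user id, then user record first, then by timestamp.
    pairs.sort(key=lambda pair: pair[0])
    joined = []
    for _, group in groupby(pairs, key=lambda pair: pair[0][0]):
        birth_year = None  # only one user record is held in memory at a time
        for (_, tag, _), value in group:
            if tag == USER:
                birth_year = value
            else:
                joined.append((value, birth_year))
    return joined


users = [(104, 1989), (173, 1975)]
events = [(173, 20, "/y"), (104, 10, "/x"), (104, 5, "/z")]
joined = sort_merge_join(users, events)
```

Because the user record always arrives first within its group, the reducer loop can attach it to each event as the events stream past.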

Handling skew

Collecting all activity related to a celebrity (such as replies to something they posted) in a single reducer can lead to significant skew (also known as hot spots), that is, one reducer that must process significantly more records than the others (see "load skew and hot spot elimination"). Since a MapReduce job is only complete when all of its mappers and reducers have completed, any subsequent jobs must wait for the slowest reducer to complete before they can start.

The skewed join method in Pig first runs a sampling job to determine which keys are hot. When performing the actual join, the mappers send any records relating to a hot key to one of several reducers, chosen at random (in contrast to conventional MapReduce, which chooses a reducer deterministically based on a hash of the key). For the other input to the join, records relating to the hot key need to be replicated to all reducers handling that key.
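
The routing rule of a skewed join can be sketched as below. This is not Pig's implementation: it simplifies by spreading a hot key over every reducer rather than a chosen subset, and it assumes the set of hot keys is already known from the sampling job. All names are hypothetical.

```python
import random
import zlib


def hash_partition(key, num_reducers):
    return zlib.crc32(key.encode("utf-8")) % num_reducers


def route_activity(key, hot_keys, num_reducers, rng=random):
    """Reducers that should receive one activity record."""
    if key in hot_keys:
        return [rng.randrange(num_reducers)]    # spread the hot key at random
    return [hash_partition(key, num_reducers)]  # usual deterministic choice


def route_user_record(key, hot_keys, num_reducers):
    """Reducers that should receive the matching record from the other input."""
    if key in hot_keys:
        return list(range(num_reducers))        # replicate to every reducer
    return [hash_partition(key, num_reducers)]
```

Replicating the other side is what keeps the join correct: whichever reducer an activity record lands on, the matching record is already there.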

Broadcast hash join

One of the two join inputs is small, so it is not partitioned and can be loaded entirely into a hash table. You can therefore start a mapper for each partition of the large join input, load the hash table for the small input into each mapper, and then scan over the large input one record at a time, querying the hash table for each record.
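
A minimal sketch of what each mapper does in a broadcast hash join, assuming the small input is a list of `(user_id, birth_year)` pairs and the large input is a list of `(user_id, url)` events. Those layouts and the function name are assumptions for illustration.

```python
def broadcast_hash_join(large_partition, small_input):
    """Join one partition of the large input against the whole small input."""
    table = dict(small_input)  # the small side fits in an in-memory hash table
    for user_id, url in large_partition:
        if user_id in table:   # inner join: drop events with no matching user
            yield url, table[user_id]


small = [(104, 1989), (173, 1975)]
large = [(104, "/x"), (999, "/y"), (173, "/z")]
matches = list(broadcast_hash_join(large, small))
```

No sorting or shuffling is needed, which is why this join can run entirely on the map side.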

Partitioned hash join

​ If two join inputs are partitioned in the same way (using the same keys, the same hash function, and the same number of partitions), you can apply the hash method to each partition independently.

Callback function

​ The distributed batch engine has a deliberately restricted programming model: callback functions (such as mapper and reducer) are assumed to be stateless, and there must be no externally visible side effects except for the specified output. This limitation allows the framework to hide some difficult distributed system problems under its abstraction: when encountering crashes and network problems, tasks can be retried safely, and the output of any failed task will be discarded. If multiple tasks of a partition succeed, only one of them can make its output actually visible.

Thanks to the framework, your code in a batch job does not need to worry about implementing fault-tolerance mechanisms: the framework can guarantee that the final output of a job is the same as if no faults had occurred, even though in reality various tasks may have had to be retried. These reliable semantics are much stronger than what you usually have in online services that handle user requests and that write to databases as a side effect of processing a request.

Summary of batch characteristics

The distinguishing feature of a batch job is that it reads some input data and produces some output data without modifying the input; in other words, the output is derived from the input. Crucially, the input data is bounded: it has a known, fixed size (for example, it consists of a set of log files at some point in time, or a snapshot of a database's contents). Because it is bounded, a job knows when it has finished reading the entire input, and so the job eventually completes.

Chapter XI stream processing

​ Message brokers and event logs can be viewed as streaming equivalents of file systems.

Message broker (message queue)

Difference from database

  • Databases typically retain data until it is explicitly deleted, and most message brokers automatically delete messages when they are successfully delivered to consumers. Such message brokers are not suitable for long-term data storage.
  • Because they delete messages quickly, most message brokers assume that their working set is fairly small, that is, the queues are short. If the broker needs to buffer a lot of messages because the consumers are slow (perhaps spilling messages to disk if they no longer fit in memory), each individual message takes longer to process, and the overall throughput may degrade [6].
  • Databases often support secondary indexes and various ways of searching for data, while message brokers often support some way of subscribing to a subset of topics matching some pattern. The mechanisms are different, but both are essentially ways for a client to select the portion of the data that it wants to know about.
  • When querying the database, the results are usually based on the data snapshot at a certain point in time; If another client subsequently writes something to the database that changes the query results, the first client will not find that its previous results are now out of date (unless it repeats the query or polls for changes). In contrast, message brokers do not support arbitrary queries, but they notify clients when data changes (that is, when new messages are available).

Multiple consumers

Load balancing and fan out


Figure: (a) load balancing: sharing the work of consuming a topic among consumers; (b) fan-out: delivering each message to multiple consumers.

The two patterns can be combined: for example, two separate groups of consumers may each subscribe to a topic, such that each group collectively receives all messages, but within each group only one of the nodes receives each message.
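
The combination of the two patterns can be sketched as follows. Consumers are modeled as plain lists, and messages are assigned round-robin within a group; that assignment policy is an assumption of this sketch, since a real broker may pick any consumer in the group. The class name `Topic` is hypothetical.

```python
from itertools import cycle


class Topic:
    def __init__(self):
        self.groups = {}  # group name -> round-robin iterator over its consumers

    def subscribe(self, group, consumers):
        self.groups[group] = cycle(consumers)

    def publish(self, message):
        # Fan-out across groups, load balancing within each group.
        for members in self.groups.values():
            next(members).append(message)


a1, a2, b1 = [], [], []
topic = Topic()
topic.subscribe("group-a", [a1, a2])
topic.subscribe("group-b", [b1])
for message in ["m1", "m2", "m3"]:
    topic.publish(message)
```

After the loop, group A has split the three messages between its two consumers, while the single consumer in group B has received all of them.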

Partitioned logs (log-based message brokers)


Figure: producers send messages by appending them to a topic-partition file, and consumers read these files sequentially
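
A toy in-memory version of such a log, to show how appending and offset-based reading fit together. Lists stand in for partition files, and the class name `PartitionedLog` and the CRC32-based partitioning are assumptions of this sketch.

```python
import zlib


class PartitionedLog:
    def __init__(self, num_partitions):
        self.partitions = [[] for _ in range(num_partitions)]

    def append(self, key, message):
        """Append to the partition chosen by the key; return (partition, offset)."""
        p = zlib.crc32(key.encode("utf-8")) % len(self.partitions)
        self.partitions[p].append(message)
        return p, len(self.partitions[p]) - 1

    def read(self, partition, offset):
        """Return all messages from `offset` onward; reading deletes nothing."""
        return self.partitions[partition][offset:]


log = PartitionedLog(num_partitions=1)
first_pos = log.append("user-1", "m1")
second_pos = log.append("user-1", "m2")
tail = log.read(0, 1)    # a consumer that has already processed offset 0
replay = log.read(0, 0)  # a consumer that starts again from the beginning
```

Because reading is non-destructive, a consumer only needs to remember its offset, and replaying old messages is just a matter of reading from a smaller offset.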

Change data capture (CDC)


Figure: taking data in the order it was written to one database, and applying the changes to other systems in the same order

Three types of joins in stream processing

Stream-stream joins

Both input streams consist of activity events, and the join operator searches for related events that occur within some window of time. For example, it may match two actions taken by the same user within 30 minutes of each other. The two join inputs may in fact be the same stream (a self-join) if you want to find related events within that one stream.
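
A windowed self-join can be sketched as below. It assumes events arrive as `(timestamp, user_id, action)` tuples already sorted by timestamp, which a real stream processor cannot take for granted; the function name `window_join` is hypothetical.

```python
def window_join(events, window):
    """Pair each event with earlier events by the same user within `window`."""
    recent = {}  # user_id -> list of (timestamp, action) still inside the window
    for timestamp, user, action in events:
        earlier = [(t, a) for t, a in recent.get(user, []) if timestamp - t <= window]
        for _, earlier_action in earlier:
            yield user, earlier_action, action
        # Events that fell out of the window are dropped from the state here.
        recent[user] = earlier + [(timestamp, action)]


events = [(0, "u1", "search"), (10, "u1", "click"), (100, "u1", "search")]
pairs = list(window_join(events, window=30))
```

The operator has to keep state (here the `recent` dictionary), and the window bounds how much of it must be retained.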

Stream-table joins

One input stream consists of activity events, while the other is a database changelog. The changelog keeps a local copy of the database up to date. For each activity event, the join operator queries the database and outputs an enriched activity event.
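
A sketch of a stream-table join, assuming both inputs have been interleaved into one ordered sequence of `(kind, user_id, payload)` tuples, where `kind` is `"change"` for a changelog entry and `"activity"` for an event. That encoding and the function name are assumptions made for this example.

```python
def stream_table_join(events):
    """Enrich each activity event with the latest known profile of its user."""
    table = {}  # local copy of the database, maintained from the changelog
    for kind, user_id, payload in events:
        if kind == "change":
            table[user_id] = payload
        else:
            yield user_id, payload, table.get(user_id)


stream = [
    ("change", 1, {"name": "Ann"}),
    ("activity", 1, "login"),
    ("change", 1, {"name": "Anna"}),
    ("activity", 1, "logout"),
    ("activity", 2, "login"),
]
enriched = list(stream_table_join(stream))
```

The output depends on how the two inputs are interleaved: the same activity event is enriched differently depending on whether it is processed before or after a profile change.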

Table-table joins

Both input streams are database changelogs. In this case, every change on one side is joined with the latest state of the other side. The result is a stream of changes to the materialized view of the join between the two tables.

Chapter 12 the future of data systems

Some systems are designated as systems of record, and other data is derived from them through transformations. In this way we can maintain indexes, materialized views, machine learning models, statistical summaries, and more. By making these derivations and transformations asynchronous and loosely coupled, a problem in one area is prevented from spreading to unrelated parts of the system, increasing the robustness and fault tolerance of the system as a whole.

​ Representing data flows as transformations from one dataset to another can also help evolve applications: if you want to change one of the processing steps, such as changing the structure of an index or cache, you can rerun the new transformation code on the entire input dataset to re derive the output. Similarly, when a problem occurs, you can repair the code and reprocess the data for recovery.

These processes are quite similar to what databases already do internally, so we recast the idea of dataflow applications as unbundling the components of a database and building an application by composing these loosely coupled components.

Derived state can be updated by observing changes in the underlying data. Moreover, the derived state itself can be further observed by downstream consumers. We can even take this dataflow all the way through to the end-user device that displays the data, and thus build user interfaces that update dynamically to reflect data changes and continue to work offline.

Next, we discussed how to ensure that all of this processing remains correct in the presence of faults. We saw that strong integrity guarantees can be implemented scalably with asynchronous event processing, by using end-to-end operation identifiers to make operations idempotent and by checking constraints asynchronously. Clients can either wait until the check has passed, or go ahead without waiting and risk having to apologize for a constraint violation. This approach is much more scalable and robust than the traditional approach of using distributed transactions, and it fits with how many business processes work in practice.
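
The idea of making an operation idempotent with an operation identifier can be sketched as follows. It assumes the client generates a unique ID per request and reuses it on every retry; the class `Account` and the ID `"op-1"` are hypothetical, and a real system would have to store the set of applied IDs durably together with the balance.

```python
class Account:
    def __init__(self, balance):
        self.balance = balance
        self.applied = set()  # ids of operations that were already processed

    def apply(self, op_id, amount):
        """Apply a balance change once, however often the request is delivered."""
        if op_id in self.applied:
            return False      # duplicate delivery: suppress it
        self.applied.add(op_id)
        self.balance += amount
        return True


account = Account(100)
first_try = account.apply("op-1", -11)
retry = account.apply("op-1", -11)  # e.g. the client timed out and resent
```

Because the retry carries the same identifier, the money is only deducted once even though the request arrived twice.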

By structuring applications around dataflow and checking constraints asynchronously, we can avoid most coordination and create systems that maintain integrity but still perform well, even in geographically distributed scenarios and in the presence of faults. We then discussed using audits to verify the integrity of data and detect corruption.

Finally, we took a step back and looked at some of the ethical aspects of building data-intensive applications. We saw that although data can be used to do good, it can also do significant harm: making decisions that seriously affect people's lives and are difficult to appeal against, leading to discrimination and exploitation, normalizing surveillance, and exposing private information. We also run the risk of data breaches, and we may find that even a well-intentioned use of data has unintended consequences.

Because software and data have such a large impact on the world, we engineers must remember that we carry a responsibility to work toward the kind of world we want to live in: a world that treats people with humanity and respect. I hope we can work together toward that goal.