Top 10 high-performance development gems: master these and stand out from half of all programmers!


A question programmers often face is: how do you improve program performance?

In this article, we progress step by step through memory, disk I/O, network I/O, CPU, caching, architecture, algorithms, and more, stringing together the ten core technologies that must be mastered for high-performance development.

- I/O optimization: zero-copy technology
- I/O optimization: multiplexing technology
- Thread pool technology
- Lock-free programming technology
- Interprocess communication technology
- RPC && serialization technology
- Database indexing technology
- Cache technology & Bloom filters
- Full-text search technology
- Load balancing technology

Are you ready? Sit tight, and off we go!

First, let’s start with the simplest model.

The boss gives you a task: develop a static web server that sends disk files (web pages and images) over the network.

You spent two days rolling out a version 1.0:

  • The main thread loops, waiting for connections
  • Each new connection spawns a worker thread to handle it
  • The worker thread waits for the client’s request, then reads the file from disk and writes the data to the socket
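
A minimal Python sketch of this version 1.0 (the file path, port, and 4 KB request buffer are illustrative assumptions, not part of the original design):

```python
import socket
import threading

def handle(conn, path):
    """Worker thread: wait for the client's request, then send the file."""
    conn.recv(4096)                      # block until the request arrives
    with open(path, "rb") as f:          # read the file from disk
        conn.sendall(f.read())           # push the data into the socket
    conn.close()

def serve_once(srv, path):
    """One iteration of the main loop: accept a connection, spawn a worker."""
    conn, _addr = srv.accept()
    worker = threading.Thread(target=handle, args=(conn, path))
    worker.start()
    return worker
```

The main thread would simply call `serve_once` in a `while True:` loop after `bind`/`listen`. One thread per connection is simple, but as we will see, it does not scale.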

One day after going live, the boss finds it is too slow: larger images load with noticeable stutter. Time to optimize:

I/O optimization: zero-copy technology

The worker thread above reads a file from disk and then sends the data out over the network. Between disk and network card, the data is copied four times, and the CPU itself performs two of those copies.


Zero-copy technology liberates the CPU: file data is sent directly from the kernel, no longer copied into an application buffer only to be copied right back out, wasting resources for nothing.


Linux API:

ssize_t sendfile(
  int out_fd,
  int in_fd,
  off_t *offset,
  size_t count);

The function name says it all: send a file. Pass the file descriptor to read from and the network socket descriptor to send to, and one call does the job!
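
Python exposes the same syscall as `os.sendfile`; here is a sketch of sending a whole file through an already-open socket (the loop handling partial sends is an illustrative detail, and this assumes Linux, where `sendfile(2)` is available):

```python
import os

def send_file_zero_copy(sock_fd, file_path):
    """Send a file to a socket via sendfile(2): data flows kernel-to-kernel,
    never copied into a user-space buffer."""
    fd = os.open(file_path, os.O_RDONLY)
    try:
        remaining = os.fstat(fd).st_size
        offset = 0
        while remaining > 0:
            sent = os.sendfile(sock_fd, fd, offset, remaining)
            if sent == 0:                # peer closed the connection
                break
            offset += sent
            remaining -= sent
        return offset                    # total bytes sent
    finally:
        os.close(fd)
```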

With zero-copy in place, version 2.0 ships and image loading speeds up noticeably. But the boss finds that when many people visit at the same time, things slow down again, and asks you to keep optimizing. Now you need:

I/O optimization: multiplexing technology

In the previous version, each thread blocks in recv, waiting for the client’s request. As more visitors arrive, more threads are spawned; with large numbers of threads blocked, overall system performance drops.

This is where multiplexing comes in. With the select model, all the waiting (accept, recv) is moved into the main thread, and worker threads no longer need to wait.


After a while, the site draws even more visitors, and even select starts to buckle. The boss tells you to keep optimizing.

Now you need to upgrade the multiplexing model to epoll.

Select has three disadvantages and epoll has three advantages.

  • select manages socket descriptors with an array underneath, capping the number of sockets it can watch at a few thousand; epoll manages them with a tree plus a ready list, so it can watch a very large number of sockets at once.

  • select will not tell you which socket the data arrived on; you have to ask them one by one. epoll tells you directly which sockets are ready, no polling required.

  • Every select system call copies the whole socket list between user space and kernel space, which is wasteful when select runs in a loop. epoll keeps the socket descriptors managed inside the kernel, with no copying back and forth.
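
Python’s `selectors` module picks epoll automatically on Linux; a sketch of one pass of such an event loop (the 4 KB read size and the close-after-one-request behavior are simplifying assumptions):

```python
import selectors

def make_selector(srv_sock):
    """Register the listening socket; DefaultSelector uses epoll on Linux."""
    sel = selectors.DefaultSelector()
    srv_sock.setblocking(False)
    sel.register(srv_sock, selectors.EVENT_READ)
    return sel

def poll_once(sel, srv_sock, on_request, timeout=1.0):
    """One pass of the loop: epoll reports exactly which sockets are ready."""
    for key, _events in sel.select(timeout):
        if key.fileobj is srv_sock:          # a new connection is waiting
            conn, _addr = srv_sock.accept()
            conn.setblocking(False)
            sel.register(conn, selectors.EVENT_READ)
        else:                                # a client sent a request
            data = key.fileobj.recv(4096)
            sel.unregister(key.fileobj)
            on_request(key.fileobj, data)    # hand off, e.g. to a worker
```

The main thread runs `poll_once` in a `while True:` loop; no thread ever blocks on an individual `recv`.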

Using epoll, you develop version 3.0, and your website can handle many simultaneous user requests.

But the greedy boss is still not satisfied, and, unwilling to upgrade the hardware, asks you to squeeze out even more throughput. You discover that in the current scheme, worker threads are created when needed and destroyed when done; under heavy load, threads are endlessly created and torn down, and that churn is expensive. Now you need:

Thread pool technology

We can start a batch of worker threads when the program launches, rather than creating one per request, and use a shared task queue: incoming requests are delivered to the queue, and each worker thread pulls tasks from it for processing. This is thread pool technology.
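
A compact sketch of this in Python: a fixed set of workers started up front, all pulling from one shared queue (the worker count and daemon threads are illustrative choices):

```python
import queue
import threading

class ThreadPool:
    """Pre-started worker threads consuming tasks from one shared queue."""
    def __init__(self, num_workers=4):
        self.tasks = queue.Queue()
        for _ in range(num_workers):
            threading.Thread(target=self._worker, daemon=True).start()

    def _worker(self):
        while True:
            func, args = self.tasks.get()    # blocks until a task arrives
            try:
                func(*args)
            finally:
                self.tasks.task_done()

    def submit(self, func, *args):
        """Deliver a task to the queue; no thread is created per request."""
        self.tasks.put((func, args))

    def wait(self):
        self.tasks.join()                    # block until all tasks finish
```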


Multithreading improves the server’s concurrency to a degree, but synchronizing data across threads usually requires heavyweight mechanisms such as mutexes, semaphores, and condition variables, which trigger switches between user mode and kernel mode; system calls and thread switches are no small overhead.

The thread pool above relies on a shared task queue from which every worker thread extracts tasks, so multiple worker threads must synchronize their access to this shared queue.

Is there a more lightweight way to make multithreaded data access safe? Now you need:

Lock free programming technology

In multithreaded concurrent programming, shared data requires thread synchronization, which comes in two flavors: blocking synchronization and non-blocking synchronization.

Blocking synchronization is easy to understand: the mechanisms the operating system provides that we use every day, such as mutexes, semaphores, and condition variables, are all blocking synchronization. In essence, each of them takes a “lock”.


Non-blocking synchronization, by contrast, achieves synchronization without locks. There are currently three classes of techniques:

  • Wait-free
  • Lock-free
  • Obstruction-free

All three achieve synchronization without blocking and waiting, through particular algorithms and techniques; among them, lock-free is the most widely used.

Lock-free can be applied so widely because mainstream CPUs provide atomic read-modify-write primitives, the most famous example being the CAS (Compare-And-Swap) operation. On Intel x86 processors, it is implemented by the cmpxchg family of instructions.

// Lock-free update through CAS: retry until our swap wins
do {
    old_data = *ptr;
    new_data = compute(old_data);   // build the new value from the old
} while (!CAS(ptr, old_data, new_data));

We often see lock-free data structures such as lock-free queues, lock-free stacks, and lock-free hash maps. At the core of most of them is this CAS retry loop. In daily development, the judicious use of lock-free programming can cut the extra overhead of thread blocking and switching and improve performance.

After the server had been live for a while, you notice the service often crashes. Troubleshooting reveals a bug in the worker thread code: one crash takes the whole service down. So you decide to split the worker threads and the main thread into separate processes, so that a worker crash cannot bring down the overall service. With multiple processes in play, you now need:

Interprocess communication technology

When it comes to interprocess communication, what comes to mind?

  • Pipes
  • Named pipes
  • Sockets
  • Message queues
  • Signals
  • Semaphores
  • Shared memory

Detailed introductions and comparisons of these IPC mechanisms are easy to find elsewhere, so they will not be repeated here.

For high-frequency, high-volume data exchange between local processes, the first choice is shared memory.

Modern operating systems generally use virtual-memory-based management, under which processes are strictly isolated from one another. The memory addresses used in program code are virtual addresses; the operating system’s memory manager allocates and maps them to physical memory pages in advance, and the CPU translates each accessed address on the fly as it executes instructions.


Different processes may use identical memory addresses, yet with the cooperation of the operating system and the CPU, the physical pages actually backing them are different.

The core idea of shared-memory IPC is this: map the same physical memory page into both processes’ address spaces, and both sides can read and write it directly, with no copying at all.


Of course, shared memory is only the carrier for the data. To actually communicate, the two sides still rely on notification mechanisms such as signals and semaphores.
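
Python’s `multiprocessing.shared_memory` maps the same physical pages into multiple processes; a sketch (the region size and fixed-length message are illustrative, and a real system would add a semaphore or similar to signal that data is ready):

```python
from multiprocessing import shared_memory

def create_region(size=64):
    """Create a shared memory region; its name lets other processes attach."""
    return shared_memory.SharedMemory(create=True, size=size)

def write_message(shm, msg):
    """Write directly into the shared pages; no per-message copy via the kernel."""
    shm.buf[:len(msg)] = msg

def read_message(name, length):
    """Attach to the same physical pages by name and read the data out."""
    other = shared_memory.SharedMemory(name=name)
    try:
        return bytes(other.buf[:length])
    finally:
        other.close()
```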

With the high-performance shared memory communication mechanism, multiple service processes can work happily. Even if a work process crashes, the whole service will not be paralyzed.

Soon the boss raises his demands again: static page browsing is no longer enough; the site needs dynamic interaction. This time, to his credit, the boss adds a hardware server for you.

So you built a web development framework in Java / PHP / Python and other languages, and set up a separate service to provide dynamic web page support and work with the original static content server.

At this time, you find that communication is often required between static services and dynamic services.

At first you use HTTP-based RESTful interfaces for server-to-server communication, but you find that shipping data back and forth as JSON is inefficient. You need a more efficient communication scheme.

You need to:

RPC && serialization technology

What is RPC technology?

RPC stands for Remote Procedure Call. In everyday programming we call functions all the time, and those functions are basically local: blocks of code somewhere in the current process. But what if the function to call is not local, but lives on a server across the network? That is where remote procedure calls come from.


A function call across the network involves packing and unpacking the parameters, transmitting them over the network, and packing and unpacking the result. The packing and unpacking rely on serialization technology.

What is serialization technology?


Serialization simply converts in-memory objects into data that can be transmitted and stored; the reverse operation is deserialization. Together they let in-memory objects travel between local and remote machines. Like putting an elephant into a refrigerator, it takes three steps:

  • Encode the local in-memory object into a data stream
  • Transmit that data stream over the network
  • Rebuild the received data stream into an in-memory object
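
The first and third steps can be sketched in Python with JSON as the wire format (JSON is just one choice of encoding; protobuf, thrift, and Avro below play the same role with better performance):

```python
import json

def serialize(obj):
    """Step 1: encode the in-memory object into a byte stream."""
    return json.dumps(obj).encode("utf-8")

def deserialize(data):
    """Step 3: rebuild an in-memory object from the received byte stream."""
    return json.loads(data.decode("utf-8"))
```

Step 2 is simply shipping the bytes across the network.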

There are many free and open source frameworks for serialization technology. There are several indicators to measure a serialization framework:

  • Whether it supports cross-language use, and which languages
  • Whether it is a pure serialization library or also bundles an RPC framework
  • Serialization and transport performance
  • Extensibility (compatibility when fields are added to or removed from data objects)
  • Whether it supports dynamic parsing (parsing data on the fly from a format definition file, without compiling ahead of time)

Below, three popular serialization frameworks, protobuf, thrift, and Avro, are compared:



Protobuf

Supported languages: C++, Java, Python, etc.

Dynamic support: poor; compiling ahead of time is generally required

Includes RPC: no

Brief introduction: Protobuf is a serialization framework from Google: mature, stable, and powerful, and used by many large companies. It is purely a serialization framework with no RPC capability, but paired with Google’s gRPC framework it makes a golden partner for building back-end RPC services.


Its weakness is dynamic support, though newer versions have been improving on this. Overall, protobuf is a highly recommended serialization framework.



Thrift

Supported languages: C++, Java, Python, PHP, C#, Go, JavaScript, etc.

Dynamic support: poor

Includes RPC: yes

Brief introduction: Thrift is an RPC framework from Facebook. It includes a binary serialization scheme, but its RPC layer is decoupled from data serialization; you can even choose other formats such as XML or JSON. A number of large companies use it as well, and in usage and performance it is on par with protobuf. Its drawback, like protobuf’s, is that support for dynamic parsing is not very friendly.


Avro

Supported languages: C, C++, Java, Python, C#, etc.

Dynamic support: good

Includes RPC: yes

Brief introduction: Avro is a serialization framework born in the Hadoop ecosystem. It ships with its own RPC framework but can also be used standalone. Its biggest advantage over the previous two is its support for dynamic data parsing.


Why do I keep bringing up this dynamic parsing capability? In a past project, I faced a choice among exactly these three technologies: a service developed in C++ and a service developed in Java needed to talk to each other over RPC.

With both protobuf and thrift, the data protocol definition file must be “compiled” into C++/Java source code, which is then built into the project to do the parsing.

The students on the Java team strongly rejected this approach: compiling business-specific code into their business-agnostic framework service was not elegant, especially with the business requirements constantly changing.

After testing, Avro was finally chosen. On the Java side, the data format definition file is simply loaded dynamically at run time to parse incoming data, and performance was good. (The C++ side, of course, still compiled ahead of time.)

Now that your website supports dynamic content, you have to deal with a database. As your user base grows, you find database queries getting slower and slower.

At this time, you need to:

Database indexing technology

Picture the math textbook in your hand, but with the table of contents torn out. You need to turn to the page about trigonometric functions. What do you do?

Without a table of contents, you have only two options: flip page by page, or jump around at random, until you stumble on the trigonometric functions.

A database is the same. If a data table has no “table of contents”, every query must scan the whole table to find the matching rows, which is maddeningly slow. So to speed up queries, you build a table of contents for the data table; in the database world, this is called an index.

Generally, the data table will have multiple fields, so different indexes can be set according to different fields.
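
The speed-up an index brings is easy to see with SQLite (bundled with Python); the table, column names, and row count here are made up purely for the demonstration:

```python
import sqlite3

def query_plan(con, sql):
    """Ask SQLite how it would run a query: full scan vs. index search."""
    rows = con.execute("EXPLAIN QUERY PLAN " + sql).fetchall()
    return " ".join(str(row[-1]) for row in rows)

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT)")
con.executemany("INSERT INTO users (name) VALUES (?)",
                [("user%d" % i,) for i in range(1000)])

# Without an index on name, the query must scan the whole table
before = query_plan(con, "SELECT * FROM users WHERE name = 'user500'")

con.execute("CREATE INDEX idx_users_name ON users (name)")

# With the index (a B-tree in SQLite), it becomes a direct search
after = query_plan(con, "SELECT * FROM users WHERE name = 'user500'")
```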

Classification of indexes

  • Primary key index
  • Clustered index
  • Non-clustered index

As we all know, a primary key is a field that uniquely identifies a record (several fields can also jointly identify a record, forming a composite key); the corresponding index is the primary key index.

Clustered index refers to an index whose logical order is consistent with the physical storage order of table records. Generally, the primary key index conforms to this definition, so generally speaking, the primary key index is also a clustered index. However, this is not absolute. There are still differences in different databases or different storage engines under the same database.

The leaf nodes of a clustered index store the data itself (they are data nodes), while the leaf nodes of a non-clustered index do not hold the actual data, so a second lookup is required.

How indexes are implemented

There are three main implementations of indexes:

  • B+ tree
  • Hash table
  • Bitmap

Of these, the B+ tree is used most. Compared with a binary tree, it is a multi-way tree: short, wide, and flat. Reducing the tree’s depth reduces the number of disk I/Os, which suits the storage characteristics of databases well.


An index implemented with a hash table is called a hash index; data is located through a hash function. Hashing is fast, with constant-time complexity, but it only works for exact matches, not for fuzzy matching or range queries.


Bitmap indexes are relatively rare. Imagine a field with only a handful of possible values, such as gender, province, or blood type. What happens if you build a B+ tree index on such a field? You get masses of leaf nodes with identical index values, which is mostly wasted storage.

Bitmap indexes optimize for exactly this case: when a column’s values come from a small fixed set and repeat heavily across the table, the bitmap index gets to shine.

A bitmap is just what it sounds like: for each possible value of the field, build a binary bit array that marks, record by record, whether the column holds that value.


Indexes are great, but they must not be abused. They are ultimately stored on disk, which adds storage overhead; more importantly, inserts and deletes on the table generally require updating the index too, so indexes also cost some database write speed.

Your website now attracts more and more visitors, and the number of concurrent users has surged. All those user requests turn into a flood of back-end database accesses; gradually the database becomes the bottleneck and can no longer keep up with the growing user base. Once again, the boss hands you the task of improving performance.

Cache Technology & Bloom filter

From the CPU caching memory data to the browser caching web content, cache technology is everywhere in the computer world.

Facing the current database bottleneck, cache technology can come to the rescue as well.

Every database access means looking up tables (granted, the database has its own optimizations), which at the bottom means one or more disk I/Os, and anything involving disk I/O is slow. For data that is read frequently but rarely changes, why not cache it in memory and skip the trip to the database, relieving the pressure on it?


Where there is demand there is a market, and where there is a market there are products: in-memory object caching systems, represented by memcached and Redis, came into being.

The cache system has three famous problems:

  • Cache penetration: the point of a cache is to intercept, at some level, requests headed for the database storage layer. Penetration means the interception failed: the request reached the database anyway, and the cache delivered none of its intended value.

  • Cache breakdown: if the cache is a wall in front of the database that “blocks” query traffic, breakdown is a hole punched through that wall. It typically happens when a hot item’s cache entry expires just as a flood of queries for that item arrives, and they all pile onto the database at once.

  • Cache avalanche: once you understand breakdown, avalanche is easy: breakdown is one item’s avalanche; avalanche is breakdown en masse. If the wall is riddled with holes, how can it stand? It is doomed to collapse.

With a cache system in place, before querying the database we can first ask the cache whether it has the data we need. If it does, we save ourselves a database query; if not, we fall back to the database.
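
This read path is the classic cache-aside pattern; a minimal in-memory sketch (the TTL and the `db_lookup` callback are stand-ins for a real cache server and database):

```python
import time

class Cache:
    """A tiny in-memory cache with per-entry expiry."""
    def __init__(self, ttl_seconds=60):
        self.store = {}
        self.ttl = ttl_seconds

    def get(self, key):
        entry = self.store.get(key)
        if entry is None:
            return None
        value, expires_at = entry
        if time.time() > expires_at:     # expired entries count as misses
            del self.store[key]
            return None
        return value

    def put(self, key, value):
        self.store[key] = (value, time.time() + self.ttl)

def cached_query(cache, db_lookup, key):
    """Ask the cache first; only on a miss fall back to the database."""
    value = cache.get(key)
    if value is None:
        value = db_lookup(key)           # the expensive database query
        cache.put(key, value)
    return value
```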

Note the key problem hiding here: how do we judge whether the data we want is in the cache system?

Going further, let’s abstract the problem: how do you quickly determine whether a very large set contains a given element?


This is where the Bloom filter gets to show its skills; it was born to solve exactly this problem. So how does it?

Back to the question above: it is really a lookup problem, and the most common solutions to lookup problems are search trees and hash tables.

The problem has two key constraints: it must be fast, and the data volume is huge. Tree structures are ruled out first. A hash table can deliver constant-time lookups, but with a huge data set it would need an enormous table, and designing a hash function that maps that much data well is itself a hard problem.

For the capacity problem: since we only need to know whether an object exists, not retrieve it, each hash table slot can shrink to a single bit, 1 for present and 0 for absent, which slashes the table’s size.

For the hash function problem: if one hash function collides too easily, use several; the probability of multiple independent hash functions all colliding at once is far smaller.

Bloom filter is based on this design idea:


When a key is inserted, a group of hash functions is computed over it and the corresponding bit positions are set to 1.

When a key is deleted, however, those bit positions cannot simply be cleared back to 0, because some hash of another key may map to the same position.

It is precisely this that gives the Bloom filter its other defining property: what it judges to exist does not necessarily exist, but what it judges not to exist definitely does not exist.
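
A minimal Bloom filter sketch in Python (the bit-array size, the number of hash functions, and the use of salted SHA-256 as the hash family are all illustrative choices):

```python
import hashlib

class BloomFilter:
    """k hash functions over an m-bit array: answers are 'maybe present'
    or 'definitely absent'."""
    def __init__(self, num_bits=1024, num_hashes=3):
        self.m = num_bits
        self.k = num_hashes
        self.bits = bytearray((num_bits + 7) // 8)

    def _positions(self, key):
        # Derive k positions by salting one hash function with the index
        for i in range(self.k):
            digest = hashlib.sha256(bytes([i]) + key).digest()
            yield int.from_bytes(digest[:8], "big") % self.m

    def add(self, key):
        for pos in self._positions(key):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def might_contain(self, key):
        return all(self.bits[pos // 8] & (1 << (pos % 8))
                   for pos in self._positions(key))
```

Note there is no `remove`: clearing bits could erase another key’s positions, which is exactly the asymmetry described above.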

Your company’s website carries more and more content, and users’ demand for fast site-wide search grows ever stronger. Now you need:

Full text search technology

For simple queries, a traditional relational database still copes. But once the search requirements get complex, say, keyword search over article content, or many search conditions combined with boolean logic, the database is out of its depth. A separate indexing system is needed to support it.
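
The data structure at the heart of such index systems (including the Lucene core inside ES, introduced below) is the inverted index: a map from each term to the documents containing it. A toy sketch with naive whitespace tokenization (real engines add analyzers, relevance scoring, and distribution):

```python
from collections import defaultdict

def build_inverted_index(docs):
    """Map each word to the set of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for word in text.lower().split():
            index[word].add(doc_id)
    return index

def search(index, *keywords):
    """AND-combine keywords: return documents containing every keyword."""
    hit_sets = [index.get(word.lower(), set()) for word in keywords]
    return set.intersection(*hit_sets) if hit_sets else set()
```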


ElasticSearch (ES), now widely used across the industry, is a powerful search engine that combines full-text retrieval, data analysis, and distributed deployment, and it has become the first choice for enterprise search.


ES uses a RESTful interface with JSON as the transport format, supports many kinds of query matching, and provides SDKs for all mainstream languages, making it easy to adopt.

In addition, ES often teams up with two other open-source tools, Logstash and Kibana, to form a complete solution for log collection, analysis, and display: the ELK stack.


Logstash handles data collection and parsing, Elasticsearch handles search, and Kibana handles visualization and interaction. Together they have become the iron triangle of log analysis and management for many enterprises.

No matter how much we optimize, the power of a single server is finite. As the company’s business grew rapidly, the original server became overwhelmed, so the company bought more servers and deployed additional copies of the services to meet the growing demand.

With multiple servers now providing the same service, user requests need to be spread evenly across them. Now you need:

Load balancing technology

As the name suggests, load balancing means distributing the load evenly across multiple business nodes.


Like cache technology, load balancing technology also exists in every corner of the computer world.

By implementation, load balancing can be divided into software load balancing (such as LVS, Nginx, and HAProxy) and hardware load balancing (such as A10 and F5).

By network layer, it can be divided into layer-4 load balancing (based on network connections) and layer-7 load balancing (based on application content).

By balancing strategy, it can be divided into round robin, hashing, weighting, random, or combinations of these algorithms.
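
Two of these strategies sketched in Python (the server names are placeholders):

```python
import hashlib
import itertools

def round_robin(servers):
    """Round robin: hand out servers in rotation, one per request."""
    return itertools.cycle(servers)

def ip_hash(servers, client_ip):
    """IP hash: the same client is always routed to the same server."""
    digest = hashlib.md5(client_ip.encode()).hexdigest()
    return servers[int(digest, 16) % len(servers)]
```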

For the problem at hand, Nginx can do the load balancing. Nginx supports many balancing configurations: round robin, weight, IP hash, least connections, shortest response time, and so on.


Weighted round robin (the upstream server addresses below are placeholders; the originals were lost from the source):

upstream web-server {
    server 192.168.1.10 weight=1;
    server 192.168.1.11 weight=2;
}

IP hash:

upstream web-server {
    ip_hash;
    server 192.168.1.10;
    server 192.168.1.11;
}

Minimum number of connections:

upstream web-server {
    least_conn;
    server 192.168.1.10;
    server 192.168.1.11;
}

Minimum response time (note: least_time is a commercial NGINX Plus directive; open-source nginx needs a third-party module such as fair for this strategy):

upstream web-server {
    least_time header;
    server 192.168.1.10;
    server 192.168.1.11;
}


High performance is an eternal topic, involving far more technologies and knowledge than the ones listed above.

From the physical hardware (CPU, memory, disk, network card) to the software level (communication, caching, algorithms, architecture), optimizing every link is the road to high performance.

The road ahead is long and far, yet I will keep seeking, high and low.
