How to read source code?


The ability to read source code is one of a programmer’s fundamental skills. It matters for two reasons:

  • Sooner or later you will need to read or take over other people’s projects, for example when researching an open source project or inheriting a codebase from someone else.
  • Reading the source code of excellent projects is one of the most important ways to learn from other people’s experience, as I know well from my own.

Reading code and writing code are two different skills: writing code is expressing yourself, while reading code is understanding others. There are many projects and every author has their own style, so understanding them takes a lot of energy.

In recent years I have read the source code of many projects, some broadly and some in depth, and I have written a number of code analysis articles. In this article I briefly summarize my methods.

Run first

The first step in reading a project’s source code is to compile the project yourself and get it running smoothly. This is particularly important.

Some projects are complex and depend on many components, so building a debugging environment is not easy, and not every project can be made to run smoothly. But if you can compile and run it yourself, then the scenario analysis, debugging statements, and debugger sessions described later all have a foundation to build on.

In my experience, whether or not you can set up a debugging environment for a project makes a large difference in how efficiently you can read its code.

Once the project runs, simplify the environment to reduce noise while debugging. For example, nginx handles requests with multiple worker processes. To debug and trace its behavior, I often set the number of workers to one, so that I know which process to follow.

As another example, many projects enable compiler optimizations or strip debugging information by default, which can cause trouble when debugging. In that case I modify the Makefile to compile with -O0 -g, that is, to build a version with debugging information and without optimization.

In short, getting the project to run greatly improves debugging efficiency, and once it runs you should simplify the environment and eliminate sources of interference.

Clarify your purpose

Although reading project source code is important, not every project needs to be understood from beginning to end. Before you start, clarify your purpose: do you want to understand the implementation of one module, the general structure of the framework, the implementation of one algorithm, or something else?

For example, many people read the nginx code. The project has many modules, including the basic core modules (epoll, network I/O, memory pool, and so on) and modules that each add a specific feature. Not all of them need to be understood in detail. When I read the nginx code, I mainly focused on the following:

  • Understanding the basic flow and data structures of the nginx core.
  • Learning how nginx implements a module.

With this general understanding of the project, the rest is a matter of looking up the specific implementation when you run into a specific problem.

In short, I do not recommend reading a project’s code aimlessly. Wandering around like a headless fly only burns through your time and enthusiasm.

Distinguish between the main line and the branch lines

With a clear reading purpose in mind, you can distinguish the main line from the branch lines as you read. For example:

Suppose you want to understand how a piece of business logic is implemented, and one of its functions uses a dictionary to store data. Here, “how the dictionary data structure is implemented” is a branch line, and there is no need to study it.

Following this principle, for branch-line code, such as a class whose implementation you do not need to understand, you only need to know its external interface: the input parameters, the output parameters, and what each function does. Treat that part as a “black box”.
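
As a small illustration of the black-box idea (a sketch with a hypothetical function, count_words, not taken from any project discussed here): the counting logic is the main line, and std::unordered_map is the branch line. To follow the logic you only need to know what the map’s operator[] does, not how the hash table is built.

```cpp
#include <sstream>
#include <string>
#include <unordered_map>

// Hypothetical "business logic": count how often each word occurs in a text.
// This is the main line we want to understand.
std::unordered_map<std::string, int> count_words(const std::string& text) {
    // The dictionary is the branch line. We rely only on its interface:
    // operator[] returns a reference to the value stored for a key and
    // inserts a zero-initialized value first if the key is absent.
    std::unordered_map<std::string, int> counts;
    std::istringstream in(text);
    std::string word;
    while (in >> word) {  // split on whitespace
        ++counts[word];
    }
    return counts;
}
```

Nothing in this function depends on how the map hashes keys or resolves collisions, which is why that part can safely stay a black box on a first reading.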

Incidentally, years ago I came across a way of writing C++ in which the header file contains only the class’s external interface declaration, and the implementation is delegated to an internal Impl class in the C++ source file. For example:

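Header file:

// test.h
class Test {
public:
  void fun();

private:
  class Impl;    // defined only in the source file
  Impl *impl_;   // allocated in the constructor (omitted here)
};

C++ file:

class Test::Impl {
public:
  void fun() {
    // Specific implementation
  }
};

void Test::fun() {
  impl_->fun();  // forward the call to the hidden implementation
}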

This way of writing keeps the header file clean: it contains no private members or private functions tied to the implementation, only the exposed interface, so users can see at a glance what the class provides.
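
For readers who want to try the idiom, here is one possible complete variant, written as a single file for brevity. It is a sketch under assumptions: the names Widget and describe are hypothetical, and std::unique_ptr replaces the raw pointer, which is a common modern choice rather than what the code I originally saw used.

```cpp
#include <memory>
#include <string>

// ---- widget.h (hypothetical header): interface only ----
class Widget {
public:
    Widget();
    ~Widget();  // declared here, defined where Impl is a complete type
    std::string describe() const;

private:
    class Impl;  // forward declaration; its layout stays hidden
    std::unique_ptr<Impl> impl_;
};

// ---- widget.cpp (hypothetical source file): the hidden details ----
class Widget::Impl {
public:
    std::string describe() const { return "widget #" + std::to_string(id_); }

private:
    int id_ = 42;  // private state that never appears in the header
};

Widget::Widget() : impl_(std::make_unique<Impl>()) {}
Widget::~Widget() = default;  // Impl is complete here, so unique_ptr can delete it
std::string Widget::describe() const { return impl_->describe(); }
```

The destructor is defined in the source part on purpose: std::unique_ptr needs the complete Impl type at the point where it destroys the object, and the header only has the forward declaration.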


Over the whole course of reading code, you will often switch between the “main line” and the “branch lines”. It takes some experience to know which part of the code belongs to the main line.

Vertical and horizontal

Code can be read in two different directions:

  • Vertical: read along the order of the code. When you need to understand a process or an algorithm, you usually read vertically.
  • Horizontal: read module by module. When you first need to get the overall framework straight, you usually read horizontally.

The two directions should alternate, which requires the reader to have some experience and to keep track of the current direction. My suggestion is to put the whole first: do not dig too deeply into a detail before you understand the overall picture. Treat a function or data structure as a black box and learn only its inputs and outputs. As long as it does not get in the way of the overall understanding, set it aside for now and keep moving forward.

Scenario analysis

Once the earlier groundwork is in place, that is, the project runs smoothly in your own debugging environment and you know which functionality you want to understand, you can apply scenario analysis to the project code.

“Scenario analysis” means constructing some scenarios yourself and then analyzing the program’s behavior in those scenarios by adding breakpoints and debugging statements.

Taking myself as an example: when I was writing “Lua Design and Implementation”, I had to explain how the Lua virtual machine interprets and executes instructions, which meant analyzing each instruction. I used scenario analysis: I wrote Lua scripts that use a given instruction, and then debugged the program’s behavior in those scenarios with breakpoints.

My usual practice is to set a breakpoint on an important entry function and then construct the debugging code that triggers the scenario. When execution stops at the breakpoint, I observe the behavior of the code by inspecting the stack, variable values, and so on.

For example, in the Lua interpreter, generating an opcode eventually calls the function luaK_code. I set a breakpoint on this function and then construct the scenario I want to debug. Once the program stops at the breakpoint, the call stack shows the complete call chain:

(lldb) bt
* thread #1: tid = 0xb1dd2, 0x00000001000071b0 lua`luaK_code, queue = '', stop reason = breakpoint 1.1
* frame #0: 0x00000001000071b0 lua`luaK_code
frame #1: 0x000000010000753e lua`discharge2reg + 238
frame #2: 0x000000010000588f lua`exp2reg + 31
frame #3: 0x000000010000f15b lua`statement + 3131
frame #4: 0x000000010000e0b6 lua`luaY_parser + 182
frame #5: 0x0000000100009de9 lua`f_parser + 89
frame #6: 0x0000000100008ba5 lua`luaD_rawrunprotected + 85
frame #7: 0x0000000100009bf4 lua`luaD_pcall + 68
frame #8: 0x0000000100009d65 lua`luaD_protectedparser + 69
frame #9: 0x00000001000047e1 lua`lua_load + 65
frame #10: 0x0000000100018071 lua`luaL_loadfile + 433
frame #11: 0x0000000100000eb9 lua`pmain + 1545
frame #12: 0x00000001000090cd lua`luaD_precall + 589
frame #13: 0x00000001000098c1 lua`luaD_call + 81
frame #14: 0x0000000100008ba5 lua`luaD_rawrunprotected + 85
frame #15: 0x0000000100009bf4 lua`luaD_pcall + 68
frame #16: 0x00000001000046fb lua`lua_cpcall + 43
frame #17: 0x00000001000007af lua`main + 63
frame #18: 0x00007fff6468708d libdyld.dylib`start + 1

The advantage of scenario analysis is that you are no longer looking for a needle in a haystack: you narrow the problem down to a limited scope and expand your understanding from there.

“Scenario analysis” is not a term I invented. Several books on code analysis use it in their titles, such as “Scenario Analysis of Linux Kernel Source Code” and “Scenario Analysis of the Windows Kernel”.
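
Debugging statements, the other tool mentioned in this section, can record a call path similar to a backtrace when attaching a debugger is inconvenient. The following is a minimal sketch, not code from Lua or any other project: ScopeTrace, parse_statement, and emit_code are hypothetical names, and the two functions are stand-ins for the code being read.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical helper: records function entry and exit so that the call path
// of one scenario can be read back afterwards.
class ScopeTrace {
public:
    explicit ScopeTrace(const std::string& name) : name_(name) {
        log().push_back(indent() + "-> " + name_);
        ++depth();
    }
    ~ScopeTrace() {
        --depth();
        log().push_back(indent() + "<- " + name_);
    }

    // Collected trace lines, in the order they happened.
    static std::vector<std::string>& log() {
        static std::vector<std::string> lines;
        return lines;
    }

private:
    // Current nesting depth of traced calls (single-threaded sketch).
    static std::size_t& depth() {
        static std::size_t d = 0;
        return d;
    }
    static std::string indent() { return std::string(depth() * 2, ' '); }

    std::string name_;
};

// Stand-ins for functions of the project being read.
void emit_code() { ScopeTrace trace("emit_code"); }

void parse_statement() {
    ScopeTrace trace("parse_statement");
    emit_code();
}
```

After one call to parse_statement(), ScopeTrace::log() holds four lines, with the entry and exit of emit_code indented inside those of parse_statement. In a real project you would add one such line at the top of each function on the path you care about and remove them once the scenario is understood.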

Make good use of test cases

Good projects come with many test cases. Examples include etcd and several open source projects from Google.

If the test cases are written carefully, they are well worth studying. A test case usually constructs its own data for a single scenario in order to verify one path through the program. So, like the “scenario analysis” above, it is a way to shift your attention from a large project to one specific scenario.
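
To make this concrete, here is a minimal sketch of what such a test looks like. Everything in it is hypothetical: KeyValueStore is a toy stand-in for a class buried in a large project, and the test uses plain assert instead of a real test framework.

```cpp
#include <cassert>
#include <map>
#include <string>

// Hypothetical class under test; imagine it sits deep inside a large project.
class KeyValueStore {
public:
    void put(const std::string& key, const std::string& value) { data_[key] = value; }

    // Returns a pointer to the stored value, or nullptr if the key is absent.
    const std::string* find(const std::string& key) const {
        auto it = data_.find(key);
        return it == data_.end() ? nullptr : &it->second;
    }

    void remove(const std::string& key) { data_.erase(key); }

private:
    std::map<std::string, std::string> data_;
};

// One test, one scenario: "a key is written twice, then removed".
// Reading it shows the intended behavior without opening the internals.
void test_overwrite_then_remove() {
    KeyValueStore store;
    store.put("k", "v1");
    store.put("k", "v2");
    const std::string* value = store.find("k");
    assert(value != nullptr && *value == "v2");  // the last write wins
    store.remove("k");
    assert(store.find("k") == nullptr);          // removed keys read as absent
}
```

The test builds exactly the data it needs and states the expected result, so it doubles as a ready-made scenario to step through in a debugger.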

Clarify the relationship between core data structures

Although it is often said that “programs = algorithms + data structures”, in my practical experience data structures are the more important of the two.

Data structures define the architecture of a program; the concrete implementation only follows once the structures are settled. It is like building a house: the data structures are the frame of the house. If the house is large and you do not know its structure, you will get lost in it. As for algorithms, if one belongs to the details that you do not need to study in depth for now, you can follow the earlier advice on distinguishing the main line from the branch lines and first learn only its input parameters, output parameters, and purpose.

Linus Torvalds said: “Bad programmers worry about the code. Good programmers worry about data structures and their relationships.”

Therefore, when reading code, it is particularly important to clarify the relationships between the core data structures. It helps to use a tool to draw these relationships. There are many examples on my source code analysis blog, such as the leveldb code reading notes and the etcd storage implementation.
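
One lightweight way to draw such relationships is to write them down as you read and let a graph tool lay them out. The sketch below is only an illustration under assumptions: Relation and to_dot are hypothetical names, and the output is Graphviz DOT text, chosen here as an example of a drawing tool rather than the one I necessarily use.

```cpp
#include <string>
#include <vector>

// One "A refers to B" relationship noted while reading the code.
struct Relation {
    std::string from;   // the structure that holds the reference
    std::string to;     // the structure it points to or contains
    std::string field;  // the member that creates the link
};

// Renders the relationships as Graphviz DOT text, one edge per relation.
std::string to_dot(const std::vector<Relation>& relations) {
    std::string out = "digraph structs {\n";
    for (const Relation& r : relations) {
        out += "  \"" + r.from + "\" -> \"" + r.to +
               "\" [label=\"" + r.field + "\"];\n";
    }
    out += "}\n";
    return out;
}
```

Saving the returned text to a file and running it through Graphviz (for example `dot -Tpng structs.dot -o structs.png`) produces a diagram that can be extended as more structures are understood.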

Note that there is no strict order between scenario analysis and clarifying the core data structures. You do not have to finish one before starting the other; the two are interleaved.

For example, if you have just taken over a project and need a brief understanding of it, you can first read the code to understand the core data structures. If you then still do not understand the flow in certain scenarios, use scenario analysis. In short, alternate between the two until your questions are answered.

Ask yourself a few more questions

Learning is inseparable from interaction.

Reading code is only input; output is required as well. Input alone is like being fed: food only becomes nutrition once it is digested, and output is an important way to digest knowledge.

This idea is very common. Students do exercises (output) after attending class (input), and you write code (output) after learning an algorithm (input). In short, output is timely feedback in the learning process: the higher its quality, the more efficient the learning.

There are many forms of output. When reading code, I recommend asking yourself more questions, such as:

  • Why was this data structure chosen to describe this problem? How do other projects design for similar scenarios? Which data structures do they use?
  • If I were to design such a project, what would I do?

And so on. The more actively you think, the better your output, and the quality of your output is directly proportional to the quality of your learning.

Write your own code-reading notes

Since I started blogging, I have written many articles interpreting the code of various projects. My online name, “codedump”, comes from this as well: it means “dumping the internal implementation principles of code”.

As mentioned earlier, the quality of learning is directly proportional to the quality of output, which I have felt deeply myself. For this reason, you should make a habit of writing your own analysis notes after reading source code.

There are several points to keep in mind when writing such notes.

Although they are notes, imagine that you are explaining the principle to someone unfamiliar with the project, or that you are rereading the article months or even years later. With that in mind, I try my best to organize my language and explain things clearly.

Avoid pasting large sections of code. I think large blocks of code in this kind of article are a form of self-deception: it looks as if you understand, but that is not necessarily true. If you really need to interpret a piece of code, use pseudocode or a trimmed-down version. Remember: do not deceive yourself or others; make sure you truly understand. If you do want to add your own comments to the code, my suggestion is to fork the project’s code at a specific version to your own GitHub account, where you can add comments and commit them at any time. For example, my annotated copy of the etcd 3.1.10 code is etcd-3.1.10-codedump. Likewise, for other projects I read, I fork a copy on GitHub with the codedump suffix.

Draw more diagrams. A picture is worth a thousand words, so use graphics to show the code flow and the relationships between data structures. I recently realized that drawing is also a very important skill, and I am learning from scratch how to express my ideas with images.

Writing is a very important basic skill. A friend recently taught me something along these lines: if you are strong in some area, then good writing and good English will greatly amplify that strength. However, basic skills such as writing and English are not acquired overnight; they require long practice. For technical people, blogging is a good way to practice writing.

PS: The world would be a much better place if, in whatever you do, you kept in mind that the person who will have to face the output in the future may be you: you will maintain the code you write, you will reread the articles you write, and so on. For example, when I write a technical blog post, I assume that its future reader may be me, so I try to write clearly and simply enough that I can recall the details when I come back to it after a while. That is why I rarely post large sections of code on my blog and add diagrams wherever I can.


The above is my brief summary of methods and precautions for reading source code. Overall, there are a few points:

  • Knowledge is digested well only through good output. Building a debugging environment, scenario analysis, asking yourself questions, and writing code-reading notes all revolve around output. In a word, you cannot expect to fully understand how code works by staring at it like a dead fish; you need to find ways to interact with it.
  • Writing is one of a person’s basic core skills. It not only trains your ability to express yourself but also helps you organize your thoughts. For programmers, one way to practice writing is to blog, and the earlier you start, the better.

Finally, like any learned skill, reading code takes a long time and a lot of repeated practice. So next, start practicing it on a project you are interested in.

Author: codedump

