How to refactor in open source projects?

Time:2022-4-26

Author: [email protected] -Labs

I recently completed a large-scale refactoring of Databend's storage module. The new implementation landed almost painlessly, without blocking development of existing features. This article summarizes some of my personal experience, in the hope that it offers some inspiration.


Refactoring is not easy, especially in a very active codebase. Databend now has 40+ PRs merged every week. In the past week, 800+ files were changed, 21k lines of code were added, and 12k lines were deleted. In such a codebase, the cost of finishing all the work in a single push is prohibitive. Therefore, throughout the whole life cycle of a refactoring, we need to stay in close communication with the community, so that it knows what you want to do, how you plan to do it, and how far along you are. From this refactoring, I drew the following lessons:

Write proposal

As the Apache Way puts it: Community over code. A good open source project is not made of code alone. Talking about technology and code in the abstract, apart from the open source community, is meaningless. Therefore, before submitting large-scale changes to an open source project, we must clarify our ideas, explain our motivation, and let the community know what we want to do and how we plan to do it.

Putting ideas in writing lets the discussion fill in missing information, refine the ideas, and arrive at a better design. In the long run, the document helps newcomers understand why the design was proposed at the time, so that they do not fall into the same pitfalls again. Moreover, a good design document can often influence and inspire the design of other open source projects, and so help the whole industry move forward.

@tison mentioned in "How to participate in the Apache project community":

Any non-trivial change needs some description of its motivation; major changes need design documents even more, to preserve memory. Human memory is not permanent, and people always forget why they did something in the first place. Accumulated design documents play a vital role in freeing the community's evolution from the uncertainty of individual people.

Before this refactoring, I publicly laid out my vision and expectations to all community members in Databend's Discussions: "proposal: Vision of Databend DAL". I then talked with the maintainers of several related modules and reached broad consensus before starting the refactoring. I think reaching agreement with the maintainers is a key step. Otherwise, a maintainer may well discover a conflict of ideas halfway through, and the work has to be abandoned or restarted, which is very frustrating.

In addition, open source communities essentially follow a meritocracy based on open source contributions. Contributors must prove their value through contributions and earn the community's trust before they can carry out their own ideas. Therefore, before proposing a major change, it is best to join the community by working on some good first issues: learn the community's norms, get familiar with the project's build process, stay in touch with the module maintainers, and build your standing in the community. Before this refactoring, I helped the Databend community launch its new official website, reworked the CI pipeline, and got to know the maintainers of most modules.

It is worth noting that Databend, like many young open source projects, does not have a mature proposal process, but this does not mean that we cannot or need not submit proposals. The point of a proposal is to communicate with the community and reach agreement. Do not be bound by form: any approach is acceptable as long as agreement is reached in the end. At the same time, the governance of open source projects keeps improving and evolving. In most projects, in fact, the formal proposal process grows out of the community's ongoing practice of receiving and handling proposals.

Create tracking issue

After submitting the proposal, it is best to create a tracking issue to follow its implementation.

We usually name it "Tracking Issue for Xxxx". In this issue, we need to:

  • Link to the previously approved proposal so that community members can understand the context of the current work
  • List your work plan and todo list
  • Update the issue as the work progresses

Besides our own plans, another common situation is that maintainers suggest follow-up improvements during PR review. These can also be collected in the tracking issue.

The point of a tracking issue is to let the community know the current progress and offer help at the right time. By looking at the tracking issue, the community can tell whether the proposal is under active development or has stalled. Members interested in the implementation can also use it to share their ideas and their willingness to participate.
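
The structure described above can be sketched in a few lines of Python. This is only an illustration: the Task and TrackingIssue names, the placeholder URL, and the text layout are assumptions made for this example, not part of Databend or of any issue tracker's API.

```python
from dataclasses import dataclass, field


@dataclass
class Task:
    """One entry in the work plan (hypothetical helper type)."""
    title: str
    done: bool = False


@dataclass
class TrackingIssue:
    """Hypothetical model of a tracking issue: proposal link, plan, follow-ups."""
    proposal_title: str
    proposal_url: str
    tasks: list = field(default_factory=list)       # planned work items
    follow_ups: list = field(default_factory=list)  # suggestions from review

    def title(self) -> str:
        return f"Tracking Issue for {self.proposal_title}"

    def progress(self) -> tuple:
        """Return (finished tasks, total tasks)."""
        done = sum(1 for task in self.tasks if task.done)
        return done, len(self.tasks)

    def render(self) -> str:
        """Build the issue body: context first, then plan, then follow-ups."""
        done, total = self.progress()
        lines = [
            f"Proposal: {self.proposal_url}",
            "",
            f"Progress: {done}/{total}",
            "",
            "Tasks:",
        ]
        lines += [f"- [{'x' if t.done else ' '}] {t.title}" for t in self.tasks]
        if self.follow_ups:
            lines += ["", "Follow-ups from review:"]
            lines += [f"- [ ] {item}" for item in self.follow_ups]
        return "\n".join(lines)


if __name__ == "__main__":
    # All values below are made up for illustration.
    issue = TrackingIssue(
        proposal_title="Xxxx",
        proposal_url="https://example.invalid/proposal",
        tasks=[Task("Introduce the new module", done=True), Task("Migrate callers")],
        follow_ups=["Improve error messages"],
    )
    print(issue.title())
    print(issue.render())
```

Re-rendering the body after each merged PR keeps the progress line and the checklist in step with the actual state of the work.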

In this refactoring, I used "Tracking issue for Vision of Databend DAL" to track the proposal's progress. In addition to the features I had planned myself, I also recorded the feedback that many maintainers gave during review, along with some long-term ideas that are not yet mature. These point to directions for future improvement.

Split pull requests

When implementing the proposal, split the work into PRs according to the actual situation.

Splitting PRs too finely puts an extra burden on maintainers, and the resulting pile of unnecessary CI runs wastes resources and energy. If a PR is too large, it is hard for maintainers to review: either it is approved hastily or nobody reviews it for a long time, neither of which helps the work move forward. A large PR is also more likely to run into code conflicts.

Each PR should be a self-contained unit that achieves one specific goal. Take two of my PRs as an example:

Each PR here does exactly one clearly defined thing. Maintainers can quickly understand what the PR does by reading its title and description, which makes code review far more efficient.

How to split PRs depends largely on personal experience and style. When you are unsure how best to split the work, ask the maintainers for advice.

Stay focused

While implementing the proposal, we need to stay focused and not expand the scope of the work indefinitely.

During implementation, we often run into new problems that are related to the current proposal. At this point, it is best to follow the principle of minimal scope and give priority to delivering the current proposal. On the one hand, one person's capacity is limited. You cannot take on every related task just because you are responsible for implementing the proposal. Doing so often leaves the tasks of related modules blocked on you, and fails to make the most of the open source community's strength. On the other hand, always chasing the next thing and expanding the scope without limit leaves your work without a clear delivery point. You will feel your energy draining away while the community's patience and expectations wear thin.

Therefore, we need to stay focused, resist the temptation of new functions and features, and give priority to delivering what the current proposal promised. Wait until the proposal is fully implemented and merged, give yourself a short break, and then open a new proposal and implement it. Repeat this cycle. Only delivery creates the motivation to take on more work. Do not set a goal that can never be reached.

Ask for help

In the process of implementing the proposal, we should actively communicate with the community and seek help from the community.

Keep in mind that we are not fighting alone. Behind us is the whole open source community. When you run into a problem, do not struggle with it alone. Actively seek help from the community, on anything from language features (especially when you use Rust) to functional modules. When a day of research on a problem has turned up nothing, asking a maintainer often yields a more reasonable solution or a feasible workaround.

Do not worry about exposing your shortcomings. Everyone started out that way. Members of an open source community tend to share the same interests, so maintainers are willing and motivated to help solve problems. My favorite Rust developer, dtolnay, is an excellent role model: in the PR "Add try_reserve and try_reserve_exact for OsString", dtolnay gave detailed and clear review comments, which helped me understand the details of that part of the logic.

While implementing "query: Replace dal with dal2, let's rock!", I ran into a problem that I could not figure out for a long time, so I left a comment asking the maintainer @dantengsky for help. In the comment, I gave a description of the problem, a complete backtrace, and the simplest reproduction steps. With @dantengsky's help, the problem was soon solved.

Reading "How To Ask Questions The Smart Way" will help a lot, but it does not matter if you have not read it. The core of it is mutual respect: be neither arrogant nor servile. We respect maintainers for their past contributions rather than for their current status in the community. Compared with empty flattery such as calling someone a "big shot", a simple thank-you after the problem is solved is often much more welcome.


In general, the most important thing for large-scale refactoring in open source projects is to communicate, and to keep communicating. Writing a proposal, creating a tracking issue, and splitting PRs all serve communication. On top of that, we need to pay attention to some implementation skills: stay focused, and ask the community for help in time. These are some of the lessons I drew from this refactoring. I hope they help you. Feel free to share your own in the comments.

About us

Databend is a new open source data warehouse written in Rust and designed entirely for cloud architecture. It provides fast elastic scaling and aims to deliver an on-demand, usage-based data cloud product experience.

Founded in March 2021, "Datafuse Labs" is the team behind the open source project Databend. The team has rich engineering experience in cloud native databases and is also an active contributor to the database open source community. It currently has R&D centers in China, the United States, and Singapore, and focuses on innovation and practice in cutting-edge technology, as well as on building Databend's open source ecosystem and community.