My first 5K-star open source project: experience and lessons to share with you

Time: 2020-11-13

Preface

Developing a product is always both painful and rewarding. It is painful because you do not understand what users really need, you fear building in isolation, and you worry that the technology cannot be made to work. It is rewarding because small achievements, recognition from users, and the chance to keep solving their problems give you a reason to persist. Crawlab is exactly that kind of open source project for me. Since its first commit in March last year, it has accumulated 5K stars on GitHub and has grown into the most popular open source crawler management platform. Along the way, Crawlab has appeared on GitHub Trending many times and has become known to developers around the world. It has also been featured on Gitee and OSChina, which made it known to more developers in China. The community keeps growing as well. The WeChat group has close to 1.2K members, and people ask questions and exchange experience there every day. Many enthusiastic users on GitHub also file issues that help us improve the product.

From the initial Flask + Celery architecture to today's self-developed scheduling engine written in Golang, Crawlab has gone through many iterations, large and small. The product keeps maturing and is still under active development. I believe more practical features will arrive in the near future, many of them based on users' suggestions.

The following figure shows the cumulative GitHub star trend of Crawlab. On its way to 5K stars, Crawlab went through two large jumps along with steady small-scale growth.

The main purpose of this article is to record a milestone that my friends and I reached together. In August last year (eight months ago), I wrote an article, "How to build a GitHub project with thousands of stars", which discussed how to gain attention on GitHub and build a popular product.

Project introduction

Crawlab is a distributed crawler management platform based on Golang that supports multiple programming languages and crawler frameworks. If you are not familiar with crawler management platforms, you can refer to the article "How to quickly build a practical crawler management platform", which introduces them in detail.

View demo

Since the project was launched in April 2019, it has been well received by crawler enthusiasts and developers. More than half of the users said that they had already adopted Crawlab as their company's crawler management platform. After several months of iteration, we have successively launched scheduled tasks, data analysis, configurable crawlers, an SDK, message notifications, Scrapy support, Git synchronization, and more. These features make Crawlab more practical and comprehensive, so that it can genuinely help users solve their crawler management problems.

Crawlab mainly solves the problem of managing a large number of crawlers. For example, a team may need to monitor hundreds of websites with a mix of Scrapy and Selenium projects, which are not easy to manage together. Managing them from the command line is costly and error-prone. Crawlab supports any language and any framework, and with task scheduling and task monitoring it makes large-scale crawler projects easy to monitor and manage effectively.

Crawlab can easily integrate developers' existing crawlers. With the CLI tool, you can upload any crawler project to Crawlab and have it synchronized to all nodes, forming a distributed architecture. In addition, the SDK provided by Crawlab makes it easy to display the captured data in the Crawlab interface, where you can view and download the task results (as shown in the figure below).

Project development

About a year in, Crawlab has more than 5K stars on GitHub, but that by itself does not mean much. Some people have pointed out that the number of stars on GitHub does not prove anything; in fact, you can even buy stars on Taobao. Another interesting fact is that many projects with thousands of stars on GitHub are "markdown projects". What is a markdown project? It is a project with few executable code files, made up mostly of markdown files filled with technical knowledge: interview questions, knowledge summaries, even poems of the Tang and Song dynasties. The prevalence of these markdown projects reflects the knowledge anxiety of developers. In fact, you can often keep improving simply by concentrating on using one project and reading and understanding its source code, even if it is only a few lines. I am a self-taught programmer. I am not fond of underlying principles and theoretical derivations; I prefer to pick up the keyboard and dive straight into building things. That is why I like to find users' pain points in a product and solve them with technology.

The following figure shows the development history of Crawlab.

In the initial stage of the project, Flask + Celery was used to implement the distributed scheduling logic, which was a choice made out of necessity. At that time, the language I knew best was Python, and I did not know Java, Golang, or C++, so I chose Python as the main programming language. This set the stage for the later change of framework.

After the project gained its first users, they offered all kinds of opinions and feedback. One suggestion was to deploy with Docker, which later became the preferred way to deploy Crawlab. Others proposed the idea of a configurable crawler (it did not have that name at the time; the name is mine), and I implemented it in Python.

However, it was very frustrating that in v0.2.x, scheduled tasks often had all kinds of bugs: sometimes a task ran twice or more, sometimes it did not run at the scheduled time, and sometimes it did not run at all. Even more worrying, as the number of crawlers grew, the load on the backend increased, and each request took one second or even several seconds to return. It was hard even for me to use. So I began to ask a fundamental question: could Python no longer meet our needs?
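
As an illustration of the "task runs twice" symptom, one common guard is to key every run on the time slot it belongs to and refuse to start a second run for the same slot. The following is only a minimal sketch of that idea under stated assumptions: it is not Crawlab's code, and `SlotGuard`, `slot_start`, `should_run`, and the task ID `spider-a` are hypothetical names chosen for this example.

```python
from datetime import datetime, timedelta


class SlotGuard:
    """Hypothetical helper: remembers which time slot each task last ran in."""

    def __init__(self, interval: timedelta):
        self.interval = interval
        self._last_slot = {}  # task_id -> start of the last slot that ran

    def slot_start(self, now: datetime) -> datetime:
        # Round `now` down to the beginning of its interval-sized slot.
        epoch = datetime(1970, 1, 1)
        steps = (now - epoch) // self.interval
        return epoch + steps * self.interval

    def should_run(self, task_id: str, now: datetime) -> bool:
        slot = self.slot_start(now)
        if self._last_slot.get(task_id) == slot:
            return False  # this slot already fired; ignore the duplicate trigger
        self._last_slot[task_id] = slot
        return True


guard = SlotGuard(interval=timedelta(minutes=5))
first = guard.should_run("spider-a", datetime(2020, 1, 1, 12, 0, 1))
duplicate = guard.should_run("spider-a", datetime(2020, 1, 1, 12, 0, 3))
next_slot = guard.should_run("spider-a", datetime(2020, 1, 1, 12, 5, 0))
print(first, duplicate, next_slot)  # True False True
```

This in-memory version only protects a single process. In a distributed setup the "last slot" record would have to live in shared storage, and the check-and-set would need to be atomic.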

At that time, I had bought a Golang course from Juejin Booklets, so I naturally thought of using Golang to rewrite Crawlab's backend. Learning as I went, I rewrote Crawlab from Python to Golang and released it as v0.3. The rewritten Crawlab felt several levels better and easily outperformed the Python version in performance and stability. The bugs were gone, responses were no longer slow, and it handled high concurrency. Better still, Golang is a statically typed language, which makes it easy to avoid some low-level type errors (at the cost of writing more code). I feel that rewriting Crawlab in Golang was the most successful decision in this project.

Compared with the immediate payoff of the Golang rewrite, I think the impact of v0.4.x was less direct. Many of the features added in v0.4.x came from user feedback, including message notifications, permission management, installing dependencies from the UI, Scrapy support, and so on. These features were developed for the many users who need to run a crawler management platform inside a company. I do not know how many companies are actually using Crawlab today, but I believe that as Crawlab keeps improving, more small and medium-sized companies, and even large ones, will be able to deploy and use it out of the box and recommend it to others with the same need.

Project experience

There are many lessons from Crawlab. A lot of people have asked me what has kept me developing a free product for so long. Many also ask why I do not develop a commercial version. These questions are only natural. In my opinion, if you want to do an open source project well, making money cannot be your only motivation; a focus on making money from it will lead the project astray. The following are the elements that I feel are essential for building a popular open source project.

Look for pain points and try to solve them

Many people have many pain points in their work and life. If you can spot these pain points (note: real pain, not a mere itch), you will probably find an opportunity to solve them. We can look for pain points in what is around us. For example, Crawlab was born while I was thinking about a problem at work. My department had hundreds of crawlers, including Selenium crawlers and other types. At that time, the way crawlers were managed and run had many limitations, which led to poor scalability and difficult troubleshooting. We had a web UI, but it was business-oriented and not focused on the crawlers themselves. I wondered whether this problem was unique to our company or whether it was a common problem that almost every company using crawlers would run into.

Of course, finding a pain point is not enough. It needs to be validated. To validate the hypothesis above, I spent half a month building a minimum viable product (MVP), Crawlab v0.1, which had only the most basic function of executing crawler scripts. After the first version was released, it received positive feedback along with many suggestions for improvement. On the first day, it reached 30 stars, and over the next two days it rose to 100. This confirmed my hypothesis that the problem of crawler management is common. People thought Crawlab was a good idea and were willing to try it. This early response gave me more motivation to keep improving the product. So starting from the problems around you is a good way to begin.

Improve products through user research

Many people develop products behind closed doors and expect users to fall in love with them. This is a trap for technical people; we need to stay vigilant so that we do not fall into self-indulgence. How can we understand what users need? A very effective method is user research.

In "How to build a GitHub project with thousands of stars", I mentioned two ways to conduct user research. One is direct inquiry. I often ask users in the WeChat group how they use Crawlab, whether anything can be improved, which parts are hard to use, and what bugs they have found. Most of the time I get useful feedback, and sometimes very important feedback. The other way is a questionnaire survey. This method is more objective and gives quantitative data on how users actually use the product. For example, I regularly design a questionnaire with Wenjuanxing and post it in the WeChat group. I usually receive dozens or hundreds of answers. This sample is enough for a survey, and Wenjuanxing can analyze the distribution of answers to each question, so usage and demand can be seen at a glance.

Don’t underestimate the power of product promotion

This part is really about marketing and operations. When the product is launched, you should let users know about it and try it as early as possible, because only then can you get immediate feedback from users and keep improving the product. There are various promotion channels. First, write articles. Every time I release a version, I write articles on Juejin, SegmentFault, V2EX, OSChina, and other platforms to introduce new features and the product roadmap, so that more users can learn about and try Crawlab. Second, do SEO, for example by getting search-engine crawlers to index the Crawlab website so that it ranks near the top of search results. Third, build a demo platform, which is the simplest way for users to try the product. Users see your product at first glance and decide whether to install and use it based on its appearance and features. Practice has shown that this is a very effective approach.

Summary

Crawlab is now in its second year. It is a rising star. Compared with its predecessors Gerapy, SpiderKeeper, and ScrapydWeb, it is younger, more flexible, and more practical. That is why so many people are trying Crawlab. Building an open source product is a long-term undertaking that not everyone can see through, so it takes patience and craftsmanship. The spirit of craftsmanship is not about making the product more perfect, but about making it more down-to-earth, more user-friendly, more satisfying to users, and better at solving their problems. Therefore, we cannot blindly pursue technical perfection behind closed doors and ignore users' real problems. Crawlab still has a long way to go in solving users' problems. But we are not worried, because we now have a strong development team, a growing community, and users who keep giving feedback. I believe that in the coming year, Crawlab will solve the problems of more users, make crawling easier, and earn its second 5K stars.

I hope this article is helpful to your work and study. If you have any questions, please add the author on WeChat (tikazyq1) or leave a message below, and the author will do his best to answer them. Thank you!

References

  • GitHub: https://github.com/crawlab-te…
  • Demo: https://crawlab.cn/demo