Recently, Liu Qi, CEO of our company, gave an interview to iFenxi in which he analyzed current trends in the database market, described TiDB’s characteristics and application scenarios, and revealed the company’s plans for future development. The iFenxi report and the interview transcript follow. There is a lot of information here, so enjoy :)
Research: Li Zhe and Wang Qi
Written by Li Zhe
Even if we narrow the scope from big data to databases, PingCAP is still an unusual company: its product, TiDB, is one of the few databases on the market that targets HTAP scenarios.
Traditionally, databases are divided into transactional (TP) databases and analytical (AP) databases.
In recent years, the rising NoSQL databases, such as MongoDB and the Hadoop-based HBase, have leaned toward analytical use, solving large-scale data query and analysis problems through a distributed architecture.
However, the transactional databases that carry production systems have remained in the hands of traditional database vendors. Oracle, IBM and others dominate the large-enterprise market, while most small and medium-sized enterprises and Internet companies use the open-source MySQL. Few new technologies or companies have managed to break in.
In 2012, Google published Spanner, a transactional database built on a distributed architecture. Inspired by Google, a number of emerging database vendors abroad, such as CockroachDB, set out to solve TP problems, but the Chinese market was almost empty, with no startups developing such databases.
In 2015, PingCAP was founded to fill that gap in China.
A team with an Internet background, building a database the open-source way
Unlike other database vendors on the market, most of PingCAP’s founding team came from large Internet companies such as Wandoujia and JD.com; almost none came from traditional IT or database vendors.
Coming from the Internet industry, every member of the founding team has lived through periods of exponential data growth and has experience handling massive data volumes, so scalability is given priority when they build database products.
At the same time, because most Internet companies use MySQL, the first protocol TiDB made itself compatible with was MySQL’s, which makes it easier for PingCAP to win customers.
Another hallmark of the Internet industry is that open source comes first. PingCAP committed to building its database as open source from day one. Unlike other teams, however, Liu Qi and PingCAP’s other founders were the authors of the distributed cache project Codis, so they have experience running an open-source community and know how to develop a product with the community’s help.
On the one hand, the open-source community extends the reach of PingCAP’s products and brings in potential customers. On the other hand, by running the community, PingCAP can concentrate on developing its core product, TiDB, while some other features are implemented by users in the community.
In addition, user feedback helps PingCAP understand users’ latent needs and serves as a reference for TiDB R&D.
The product supports both TP and AP, with strong consistency and scalability as its main features
At first, TiDB addressed only TP. In practice, however, customers find it very difficult to replace an existing MySQL database outright with a new one, especially when the vendor is an unknown startup.
Most enterprise customers therefore keep their traditional MySQL databases at the front end and place TiDB behind them as a data mart connected to the front-end databases. This data mart has far better real-time performance than a Hadoop-based one, and it can run inside the actual production system.
After running this way for a while, once the customer has come to trust PingCAP’s product, the MySQL databases are gradually replaced and TiDB becomes the front-end database.
When a customer uses TiDB as a data mart, the front end queries data from it, which places higher demands on TiDB’s query capabilities. TiDB therefore adjusted its executor to extend its AP capabilities.
In this way, TiDB came to support both TP and AP and became a distributed HTAP (hybrid transactional/analytical processing) database.
In TP scenarios, TiDB offers strong consistency, so it can serve industries such as finance that are highly sensitive to data consistency. Compared with traditional databases, scalability is TiDB’s biggest advantage: performance can be increased by adding machines.
In AP scenarios, compared with HBase, TiDB has better real-time performance and processes data faster.
At this stage, it mainly serves Internet finance, gaming and other Internet sectors, and sales leads come mainly from the open-source community
Compared with traditional enterprises, Internet companies are more willing to try new technologies, and a team with an Internet background is better able to understand how Internet companies’ businesses work.
At the same time, Internet companies grow far faster than traditional enterprises and their data volumes grow extremely quickly, so their need to upgrade the underlying architecture and improve database performance is more pressing, especially in gaming and Internet finance.
As a result, most of PingCAP’s early customers were Internet companies; Tongcheng Travel, 360 Finance, Mobike and others became PingCAP customers one after another.
By the end of 2017, PingCAP’s team had grown to about 100 people, more than 80% of them in R&D, with only one full-time salesperson.
A single salesperson can acquire only a limited number of customers. PingCAP wins customers mainly through the open-source community, and the salesperson is responsible only for following up with interested companies. In 2017, 200 users were running TiDB in production, and more than ten of them became paying customers.
At this stage, PingCAP is still focused on polishing the product and running the community, and has not yet entered a phase of broad product promotion. In 2018, PingCAP will consider entering traditional industries such as finance, healthcare and logistics, but it will not expand its sales team significantly and will continue to follow a fairly cautious market strategy.
Recently, iFenxi interviewed Liu Qi, founder of PingCAP. He discussed PingCAP’s business model, its future strategy, and where the database industry is heading. Part of the interview is shared below.
Founded to solve database scalability, the product now meets both TP and AP business requirements
iFenxi: What was your original motivation for founding PingCAP?
Liu Qi: I had the idea when I was working at JD.com. At that time no database scaled well. The most common approach was sharding across databases and tables, but it has several drawbacks: first, poor elastic scalability; second, poor availability; third, a heavy mental burden on programmers; and fourth, weak expressiveness.
I was working on a project that also needed a distributed database, but there was no satisfactory product on the market.
So the initial aim was to solve our own problem. Along the way we also built a distributed cache. Then we set out to solve the database scalability problem and left to start a company.
iFenxi: A database is foundational technology, and customers are very cautious when choosing a supplier. How did you win your first customers?
Liu Qi: In 2016, after we closed our Series A round from Yunqi Partners, we began to think about how to get our first batch of users. Putting a new database into production really is risky for users. Who is willing to risk their online business to try a new database?
Gaea Interactive Entertainment was our first user. Their MySQL database had a problem at the time: online queries were extremely slow and the whole system was too sluggish to use, so they could hardly run their business without trying a new technology. Our product was still in testing when they put the database into production.
Because using a new database in production carries real risk, many users take a different route. They have a large number of MySQL instances running online, and they build a large cluster behind them and aggregate all the data there, so it looks like a data warehouse. Because we are compatible with the MySQL protocol, the data can be replicated over and they can query it in real time.
Gaming and risk control, which have demanding real-time requirements, urgently need this kind of technology.
We have now published many financial case studies, a considerable share of which are real-time risk control scenarios. The advantage is that TiDB does not sit directly in the path of the online business, so the risk is lower than replacing the online MySQL, and it solves exactly their pain point.
After this stage, if the customer finds the technology stable enough, they take MySQL offline and move our product to the front to support all of their business.
When customers treat our database as a warehouse, their queries become very complex. Our database lets customers do things they did not dare to do before; a single SQL query can run to several pages.
So the problem was that we had not designed for AP workloads, yet the queries were heavily AP. When we optimized the executor, we therefore made corresponding adjustments and extended the AP capabilities.
In this way, our product came to support both online TP and AP workloads and became HTAP.
Once we had built the product well, we found that its characteristics were very distinctive, that there was no strong competitor in this area, and that the product met real user needs. In many cases a user’s requirements cannot be neatly classified as TP or AP. There is no clear boundary, and customers do not care about the distinction; they just want their problems solved.
iFenxi: Row storage and column storage behave differently for writes and queries. How can TiDB provide both in a single table?
Liu Qi: Row and column are just storage formats. Technically, data can be converted between them.
For example, cold data can be converted to column storage slowly in the background, while newly written data still uses row storage. The foreground remains a standard row store, and data can be moved between row and column storage according to how hot or cold it is.
In fact, a recent paper has proposed a new view: storage need not be purely row-based or purely column-based. Instead, the format is chosen by access frequency, with frequently accessed data kept in row storage so that reads do not need to scan the whole table. There are many ways to implement this.
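The hot/cold idea described in this answer can be illustrated with a small Python sketch. It is a toy model under stated assumptions, not TiDB’s actual storage engine: the class `HybridTable`, its methods and the `hot_threshold` parameter are all hypothetical names introduced here. New writes land in a row store, and a background `compact` pass moves rows that were not touched since the previous pass into per-column storage.

```python
from collections import defaultdict


class HybridTable:
    """Toy table: hot rows stay row-oriented, cold rows move to columns.

    Hypothetical illustration only; names and thresholds are assumptions.
    """

    def __init__(self, columns, hot_threshold=1):
        self.columns = list(columns)
        self.hot_threshold = hot_threshold  # accesses needed to stay hot
        self.row_store = {}  # key -> full row (dict)
        self.col_store = {c: {} for c in self.columns}  # column -> {key: value}
        self.access_count = defaultdict(int)  # accesses since last compact

    def write(self, key, row):
        # New or updated data always lands in the row store.
        for c in self.columns:
            self.col_store[c].pop(key, None)
        self.row_store[key] = dict(row)
        self.access_count[key] += 1  # a write counts as an access

    def read(self, key):
        if key in self.row_store:
            row = dict(self.row_store[key])
        elif key in self.col_store[self.columns[0]]:
            # Cold rows are reassembled from the columns (not promoted back).
            row = {c: self.col_store[c][key] for c in self.columns}
        else:
            raise KeyError(key)
        self.access_count[key] += 1
        return row

    def compact(self):
        """Background pass: move rows untouched since the last pass."""
        moved = []
        for key in list(self.row_store):
            if self.access_count[key] < self.hot_threshold:
                row = self.row_store.pop(key)
                for c in self.columns:
                    self.col_store[c][key] = row[c]
                moved.append(key)
        self.access_count.clear()
        return moved

    def column_sum(self, column):
        # Cold data is summed straight from one column; hot rows are scanned.
        cold = sum(self.col_store[column].values())
        hot = sum(row[column] for row in self.row_store.values())
        return cold + hot
```

In this sketch a row that is written and then never read again survives one `compact` pass and is moved to column storage on the next, while a row that keeps being read stays in the row store.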
iFenxi: When Google built Spanner, it emphasized scalability. Was computing capability a lower priority?
Liu Qi: That was Google’s earlier approach. With it, however, the database’s response time grows when you run relatively complex operations, and that is determined by the storage format.
However, in Google’s 2017 paper the storage format was changed to a partially hybrid one. Our iteration path is the same as Google’s, and we changed our storage format earlier because we encountered real user needs earlier.
iFenxi: Is there a tension between algorithms and scalability? Will complex algorithms hurt scalability?
Liu Qi: Algorithms have nothing to do with scalability. They mainly affect execution efficiency.
Take column storage, which executes some operations more efficiently. Suppose a bank wants to sum the balances of all accounts. With column storage this is very simple. With row storage, every row has to be scanned to read its amount, so execution is much less efficient, although the underlying arithmetic is not much different.
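The bank example can be made concrete with a short Python sketch. The table contents and function names below are invented for illustration, and the “fields touched” counts are only a rough stand-in for how much data a scan would have to read in each layout.

```python
# Row-oriented layout: each record keeps all of its fields together.
row_store = [
    {"account_id": 1, "owner": "alice", "amount": 120},
    {"account_id": 2, "owner": "bob", "amount": 80},
    {"account_id": 3, "owner": "carol", "amount": 300},
]

# Column-oriented layout: each column is stored contiguously.
column_store = {
    "account_id": [1, 2, 3],
    "owner": ["alice", "bob", "carol"],
    "amount": [120, 80, 300],
}


def total_from_rows(rows):
    # Every record is visited just to pick out one field.
    return sum(row["amount"] for row in rows)


def total_from_columns(columns):
    # Only the one column that the query needs is read.
    return sum(columns["amount"])


def fields_touched_by_row_scan(rows):
    # A row scan drags every field of every record along with it.
    return sum(len(row) for row in rows)


def fields_touched_by_column_scan(columns):
    return len(columns["amount"])
```

Both functions add up the same three numbers, which matches the point that the arithmetic itself barely differs; what changes is that the row scan touches nine fields and the column scan touches three.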
iFenxi: When the database is moved to the front end, what adjustments need to be made?
Liu Qi: We decide how much concurrency to use based on the load of the whole system, and we do some optimization.
Suppose there is a cluster of 100 machines and work is pushed evenly to every machine. Under high concurrency every machine may already be very busy. Adding more tasks at that point does not help, and the machines may crash.
If, however, a “smart” scheduler controls the instructions and assigns different operations to different machines while maintaining high concurrency, the machines will not be overloaded, though the trade-off is higher latency.
Of course, the same data does not have to be processed on a CPU; a GPU or FPGA can be used instead, which places higher demands on the scheduler. Judging by current trends, the scheduler’s capability will be an important measure of a database’s performance.
iFenxi: How does TiDB achieve real-time performance?
Liu Qi: Because it has a distributed architecture, its performance can keep scaling. It does not matter how much data comes in at the front. If it is not fast enough, you can add machines.
Speed also depends on the computation. Some computations cannot be pushed down to all nodes. For example, if I need to pull all the data back to sort it, the nodes cannot do that on their own.
In that case the optimizer becomes more important. It identifies which computations should be pushed down to the nodes and run in parallel, and which should not.
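The difference between a computation that can be pushed down and one that has to gather data first can be sketched in Python as follows. This follows the speaker’s two examples (a sum and a global sort) and is not TiDB’s actual optimizer or executor; the function names and the “values shipped” counter are assumptions made for illustration.

```python
def node_partial_sum(rows):
    """Runs on one storage node: returns a single number, not the rows."""
    return sum(row["amount"] for row in rows)


def pushed_down_total(nodes):
    """Sum with pushdown: each node reduces its own data first.

    Returns (total, values_shipped). In a real cluster the per-node calls
    would run in parallel; here they run one after another.
    """
    partials = [node_partial_sum(rows) for rows in nodes]
    return sum(partials), len(partials)


def gathered_sort(nodes):
    """Global sort as described in the interview: pull every row back first.

    Returns (sorted_rows, values_shipped).
    """
    gathered = [row for rows in nodes for row in rows]
    return sorted(gathered, key=lambda row: row["amount"]), len(gathered)
```

With two nodes holding three rows in total, the pushed-down sum ships two partial results to the coordinator, while the gathered sort ships all three rows. The gap grows with the size of the data, which is why deciding what to push down matters.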
iFenxi: Can an existing MySQL architecture and its data be migrated to TiDB seamlessly?
Liu Qi: We considered this from the very start of the design. For MySQL, we can migrate seamlessly. For other protocols such as Oracle or DB2, code changes may be required.
iFenxi: For other protocols, how long does migration take?
Liu Qi: It depends on the complexity of the business. For example, if the existing business has 100,000 SQL statements, we need to verify each of them. The simpler the business, the faster it goes. On the MySQL protocol side, we can do a POC quickly.
iFenxi: Are you considering support for rapid migration from Oracle or DB2 as a next step?
Liu Qi: We have no plans for that, because new business no longer uses these technologies. Supporting them would mean targeting old projects, and cutting into old projects raises compatibility problems. Users need to know how compatible the new technology is and whether it can really replace the old one.
Compatibility means not only compatibility of features but also compatibility of bugs, and 100% compatibility is very hard to achieve. The enterprise’s original programmers may have left, too. Replacing an old system involves a great deal of work and risk.
Internet finance, gaming and other Internet-leaning industries are the current focus; the product suits scenarios with large data volumes and high business complexity
iFenxi: Which industries do your main customers come from?
Liu Qi: In commercialization, the most important thing is to build the product first and then improve its features according to customers’ needs.
In addition, our product is open source. The advantage is that users give timely feedback on their experience and problems as they use it, and in the process we discover who our potential users are.
Our first user was a game company, which we did not expect. We had assumed Internet companies would come first, because they are more aggressive about adopting new technology.
The game industry has its own characteristics. A game company makes most of its money operating hit games, whose daily revenue may be in the tens of millions. They want their infrastructure to be stable and powerful enough, because any bottleneck means heavy losses, so they are also willing to solve problems with new technology.
Then there are Internet companies and traditional industries. Internet companies are still quite conservative about adopting our new product: with so many MySQL instances in use, suddenly switching to a new technology feels risky.
However, companies in areas such as Internet finance have demanding real-time requirements. If they want to do risk control on real-time information, their previous solutions cannot cope, so they choose our product.
iFenxi: What are TiDB’s application scenarios?
Liu Qi: Our database is highly general-purpose and is typically adopted for new business needs. We do not design it as a product for any particular industry.
Our product shows its advantages when a customer’s data volume is above the 100 million mark. If the data volume is small, there is no need for a distributed database. And the more complex the customer’s business, the more obvious our advantages.
iFenxi: Which industries will you focus on next?
Liu Qi: From a revenue perspective, finance should be one of our key industries. Data is also growing relatively fast in other fields such as logistics and healthcare.
The team comes mainly from Internet companies and has few salespeople
iFenxi: How did PingCAP’s user adoption progress in 2017?
Liu Qi: We had 200 users running in production in 2017. The price per customer is relatively high, so the number of paying users is smaller.
iFenxi: TiDB is open source. What enhancements do you make for the enterprise product?
Liu Qi: Although what we provide is open source, some parts are closed source, such as monitoring and operations components, backup tools and security tools.
An enterprise product must have a polished user interface and plenty of operations tools, and that is what our enterprise edition provides.
There is another part that we call Database as a Service. What we provide is not just a database but a database platform, on which enterprise users can request a TiDB cluster. Without it, an administrator may have to handle the request manually, and the user experience is quite different.
iFenxi: How is TiDB priced?
Liu Qi: We currently have two models. The first is cloud deployment; you can see the database listing on Tencent Cloud. This business model is fairly simple: like other cloud products, it is billed on a rental basis.
The second is to buy a subscription or a license from us, priced by the number of nodes.
iFenxi: How large is the company’s team?
Liu Qi: The company now has about 100 people, with a relatively high share in R&D: 82 people. There is only one salesperson. We have few salespeople because users find us themselves, so we have not invested much in this area.
Our requirements for R&D are still very high, including how well and how quickly R&D staff support and respond to people outside the company. Although we do not look as large as Oracle, many outside companies contribute to us.
For example, much of the scheduler code was contributed by Mobike, and optimizations for many scenarios were contributed by Toutiao. Contributors also include the Samsung research institute in South Korea, and many people help us with testing, which shows the benefits of open source.
iFenxi: Do R&D staff take on some of the pre-sales work?
Liu Qi: Some R&D staff were still doing pre-sales work in 2017, but we will make adjustments in 2018, which is a very important task for us.
The staffing structure should form a complete system in which pre-sales, implementation and R&D each do their own job, with different people assigned to solve problems at different stages.
iFenxi: With so few salespeople, does that place higher demands on community operations?
Liu Qi: I think having more R&D staff makes communication with the community faster. The most important users in the community are developers, and communication with developers goes more smoothly through R&D staff; salespeople cannot fill that role. For example, when users raise code problems, R&D responds very quickly.
Large users such as Toutiao, Mobike and Tongcheng all contacted us on their own initiative because of their pain points, without any extra work from sales.
Of course, there are also many small users in the community. Although small users are not able to pay, they too have a direct effect on the community.
They test with their own scenarios and find many problems we have never encountered. The information they provide is very important to us, so we put a lot of effort into running the community.
iFenxi: Do most of PingCAP’s team members have an Internet background?
Liu Qi: Yes, most come from Internet companies, all of them large ones, and they have experienced the pain that large data volumes bring.
We also have people from traditional industries, and our pre-sales colleague previously worked in the financial industry and knows its usage scenarios well.
iFenxi: If you move into traditional industries, will the staffing structure need to change?
Liu Qi: Not for now. We hope to win customers directly through the product and let its advantages speak for themselves. If a customer would be equally well served by any vendor’s database, we will not fight for that business, because it is not where our strength lies.
iFenxi: How do you balance effort between product development and community maintenance?
Liu Qi: We always build a basic version before promoting it in the community. When we encounter a bug we must fix it, otherwise it will affect many users. The two advance together without conflict.
On the internal R&D side, we develop many new features quickly. They do not go straight into the stable release; instead we first release a beta to the community, find bugs through user testing, and fix them. After continued communication, we release the stable version.
Throughout this process we need users to keep testing through the community and giving us feedback, because it is the users, not us, who have the final say on the product.
The integration of TP and AP is the future trend, and the database market will be more diversified in the future
iFenxi: The CAP theorem implies a trade-off between consistency and availability. How do you optimize for it?
Liu Qi: In the future we will provide an option that lets users choose high consistency or high availability according to their needs. For example, a bank’s data requires high consistency, while Internet applications care more about high availability.
iFenxi: How does NewSQL differ from earlier technologies?
Liu Qi: Historically, SQL came first. Why did NoSQL appear later? Because SQL databases could not scale. NoSQL can scale, but it is weak in expressiveness, may not support transactions, and lacks the traditional strengths of SQL.
NewSQL combines both advantages: it scales well and also offers the transaction processing and expressive power of SQL.
iFenxi: Are TP and AP converging?
Liu Qi: We think so. Users do not care whether it is TP or AP; solving the problem is what counts. Whether the workload is online or offline, nobody wants to wait a day for something that could be done in real time.
TP and AP were separated for historical reasons; when databases were first created there was no such distinction. Now that the technology makes it possible, we hope to bring them back together. For very complex data analysis there may still be dedicated AP systems, but our product is iterating rapidly. In the end it comes down to whose performance is better.
iFenxi: Will there be another Oracle in the distributed database platform space?
Liu Qi: For historical reasons, Oracle’s position cannot be displaced in the short term, but new database architectures are rising rapidly and Oracle now faces unprecedented challenges. I think that within the next two years, 20% of traditional databases will be replaced by new databases.
Judging by the growth rate of our users, this trend is quite clear.
iFenxi: How will the market landscape change?
Liu Qi: I think the market will become more diverse.
First, today’s requirements are very fragmented and traditional databases cannot serve them well; for example, the demand for stream processing keeps rising.
The strength of the relational database is its generality and balance. However, some scenarios are hard to fit into the existing database framework, and there a relational database will never be as smooth as a purpose-built one such as a graph database.
Looking at the trend, when NoSQL came out, people considered which scenarios it could take over. Later they found that NoSQL still has many constraints. The emergence of NewSQL will indeed change the market landscape. In the future two or three large companies will probably take most of the market, but small companies will still exist.
iFenxi: Will the growth of open-source technology affect database companies’ business?
Liu Qi: Open-source technology has actually been around for a long time. MySQL, for example, has a history of more than 20 years. But enterprise applications are not that simple, and many problems still need a team to solve.
In the future there will be no free databases; even open-source ones will charge.
iFenxi: Internet companies generally build their own infrastructure. Will that affect PingCAP?
Liu Qi: This differs between China and abroad. Chinese companies like to build private clouds, whereas many foreign companies have dismantled theirs. The reason is simple: deploying a private cloud is less efficient than using a mature public cloud directly.
Many Internet companies no longer want to be locked in by vendors like Oracle. They want to use a database while keeping some control over it. Because Internet companies grow quickly and their requirements change noticeably, they want a degree of understanding and control over the database so that they can modify the code and meet their customized needs.
iFenxi: Will cloud vendors eventually become competitors of database companies?
Liu Qi: The relationship between databases and the cloud is a bit like that between apps and the app store. Cloud vendors may build databases too, but the relationship should be more of a partnership.