Why split databases and tables (how do you design the database layer of a high-concurrency system)? Which middleware for splitting databases and tables have you used? What are the advantages and disadvantages of the different options? How do you split a database vertically or horizontally?
What the interviewer is thinking
In fact, this question is all about high concurrency, because splitting databases and tables exists to solve two problems: supporting high concurrency and supporting large data volumes. To be honest, interviews at Internet companies in particular almost always end up here. Splitting databases and tables is such a common technical topic that interviewers can hardly avoid asking about it, and not knowing it is hard to justify.
Analysis of interview questions
Why split databases and tables? (How do you design the database layer of a high-concurrency system?)
To be clear, splitting databases and splitting tables are two different things; don't confuse them. You can split only the database without splitting tables, or split only tables without splitting the database. Both are possible.
Let me walk you through a scenario first.
Suppose we are a small start-up (or a newly incubated department inside a BAT company) with 200,000 registered users, 10,000 daily active users, about 1,000 new rows per table per day, and at most 10 requests per second at peak. Honestly, for a system like this you can pick anyone with a few years of experience, have them lead a couple of freshly trained juniors, and they will handle it.
Then, unexpectedly, we got lucky. The CEO led us onto a road to prosperity and the business took off. Within a few months the number of registered users reached 20 million, with 1 million daily active users, 100,000 new rows per table per day, and a peak of 1,000 requests per second. The company also closed two rounds of financing and received several billion yuan, and its valuation reached an amazing several hundred million dollars. This is the pace of a little unicorn!
All is well, but now everyone feels some pressure. Why? Because each table gains 100,000 rows a day, which is 3 million rows a month. A single table already holds millions of rows and will soon pass ten million. Still, we can barely cope. Peak traffic is now 1,000 requests per second; we have deployed several machines behind a load balancer, and the database can still handle about 1,000 QPS. But everyone is starting to worry: what happens next?
Over the next few months, my God, the CEO turned out to be brilliant. The number of users reached 100 million, the company raised several billion yuan more, and its valuation reached an amazing several billion dollars, making it the hottest star start-up in China this year. We are so lucky.
But we are also unlucky, because by now there are tens of millions of daily active users and up to 500,000 new rows per table per day. A single table now holds 20 to 30 million rows and can no longer cope. Database disk capacity is being eaten up steadily, and peak concurrency has grown to an astonishing 5,000 to 8,000 requests per second. No kidding, I guarantee that your system cannot hold out at this point. It will go down.
So by now you can roughly see what is going on. Splitting databases and tables is driven by your company's business growth: the better the business does, the more users you have, the more data you store and the more requests you serve, until a single database can no longer carry the load.
For example, a single table holds tens of millions of rows. Are you sure it can cope? Absolutely not. When a single table holds too much data, SQL execution performance suffers badly, and eventually your SQL may run very slowly. In my experience, once a single table reaches several million rows, performance gets noticeably worse and it is time to split the table.
What does splitting tables mean? It means spreading the data of one table across multiple tables, so that a query only has to touch one of them. For example, split by user ID so that all of a user's data lives in one table; then every operation for that user goes to that one table. This keeps the data volume of each table within a controllable range, for example no more than 2 million rows per table.
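As a minimal sketch of the idea, the snippet below picks a sub-table from a user ID with a simple modulo. The table prefix `t_order_`, the count of 8 sub-tables, and both function names are hypothetical choices for illustration, not something prescribed by any particular middleware.

```python
TABLE_COUNT = 8  # hypothetical number of sub-tables


def order_table_for(user_id, table_count=TABLE_COUNT):
    """Return the name of the sub-table that holds this user's rows."""
    return f"t_order_{user_id % table_count}"


def build_query(user_id):
    """Build a parameterized query that touches only one sub-table."""
    table = order_table_for(user_id)
    sql = f"SELECT * FROM {table} WHERE user_id = %s"
    return sql, (user_id,)


if __name__ == "__main__":
    print(build_query(10))
```

Because the table name is derived only from the user ID, all of a user's rows land in the same sub-table and a per-user query never has to scan the others.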
What does splitting databases mean? In our experience, a single database can support at most about 2,000 concurrent requests per second, at which point you must scale out. A healthy single database is better kept at around 1,000 requests per second, not much more. So you split the data of one database across multiple databases, and each access goes to just one of them.
That is what splitting databases and tables means, and that is why you need it.
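The rules of thumb above (about 1,000 requests per second per healthy database, about 2 million rows per table) can be turned into a rough sizing calculation. This is only a back-of-the-envelope sketch; the function names are made up for illustration, and real capacity depends on hardware and query complexity.

```python
import math


def databases_needed(peak_qps, healthy_qps_per_db=1000):
    """Rough number of databases so each stays near the healthy load."""
    return max(1, math.ceil(peak_qps / healthy_qps_per_db))


def tables_needed(total_rows, max_rows_per_table=2_000_000):
    """Rough number of sub-tables so each stays under the row limit."""
    return max(1, math.ceil(total_rows / max_rows_per_table))


if __name__ == "__main__":
    # Figures from the scenario: 8,000 peak requests/s, 30 million rows.
    print(databases_needed(8000), tables_needed(30_000_000))
```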
Which middleware for splitting databases and tables have you used? What are the advantages and disadvantages of the different options?
This question checks which of these middleware products you know, what the strengths and weaknesses of each are, and which ones you have actually used.
Cobar was developed and open-sourced by Alibaba's B2B team. It is a proxy-layer solution that sits between the application servers and the database servers. The application accesses the Cobar cluster through a JDBC driver; Cobar breaks down each SQL statement according to the SQL and the database-splitting rules, then dispatches it to the appropriate database instances in the MySQL cluster for execution. It was usable in its early years, but it has not been updated recently, few people use it, and it is close to abandoned. It also does not support read-write separation, stored procedures, cross-database joins, or pagination.
TDDL was developed by the Taobao team and is a client-layer solution. It supports basic CRUD syntax and read-write separation, but not joins, multi-table queries, and similar syntax. It is not widely used today, because it still depends on Taobao's Diamond configuration management system.
Atlas was open-sourced by 360 and is a proxy-layer solution. Some companies used to use it, but its big problem is that the community last maintained it about five years ago, so few companies use it now.
Sharding-JDBC was open-sourced by Dangdang and is a client-layer solution. It is the client-layer offering of ShardingSphere (https://shardingsphere.apache.org/), which also provides a proxy-layer offering, Sharding-Proxy. It has indeed been widely used, because it supports a broad range of SQL syntax with few restrictions. As of April 2019 it had reached version 4.0.0-RC1, which supports database and table splitting, read-write separation, distributed ID generation, and flexible transactions (best-effort delivery transactions and TCC transactions). Quite a few companies have used it (the official website lists registered adopters, and many companies have been using it since 2017). The community continues to develop and maintain it and is fairly active. Personally, I think it is still a viable option today.
Mycat is a reworking of Cobar and is a proxy-layer solution with a very complete feature set. It is currently a very popular database middleware and is gaining popularity. The community is very active and a number of companies have started to use it. Compared with Sharding-JDBC, however, it is younger and less battle-tested.
To sum up, Sharding-JDBC and Mycat are the two options worth considering.
Sharding-JDBC is a client-layer solution. Its advantages are that nothing has to be deployed, operation and maintenance costs are low, there is no extra forwarding hop through a proxy layer, and performance is high. The drawback is that every system is coupled to the Sharding-JDBC dependency, so an upgrade means upgrading and redeploying each system.
Mycat is a proxy-layer solution. Its disadvantage is that it must be deployed and operated as a separate piece of middleware, so the operation and maintenance cost is high. Its advantage is that it is transparent to every project: when an upgrade is needed, you upgrade the middleware alone.
Either of the two can be chosen, but I personally suggest that small and medium-sized companies choose Sharding-JDBC. The client-layer approach is lightweight, cheap to maintain, and needs no extra staff, and small and medium-sized companies tend to have less complex systems and fewer projects. Medium and large companies are better off with a proxy-layer solution such as Mycat: they may have many systems and projects, large teams, and enough people, so it makes sense to assign dedicated staff to study and maintain Mycat, and then let a large number of projects use it directly and transparently.
How do you split the database vertically or horizontally?
Horizontal splitting means spreading the data of one table across multiple tables in multiple databases. The table structure is the same in every database; only the data stored differs, and the data of all the tables together makes up the full data set. The point of horizontal splitting is to distribute data evenly across more databases so that multiple databases can carry higher concurrency, and also to use their combined storage to expand capacity.
Vertical splitting means splitting a table with many fields into multiple tables, or across multiple databases. Each resulting table has a different structure and contains only some of the fields. Typically, the small number of frequently accessed fields go in one table and the larger number of rarely accessed fields go in another. Because the database has a cache, the fewer fields there are in the rows you access frequently, the more rows fit in the cache and the better the performance. This is usually done at the table level.
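To make the vertical split concrete, here is a small sketch that divides one wide row into a "hot" part and a "cold" part that share the primary key. The column names and the `HOT_COLUMNS` set are hypothetical; which fields count as frequently accessed depends entirely on your workload.

```python
# Hypothetical set of frequently accessed columns.
HOT_COLUMNS = {"order_id", "user_id", "status", "amount"}


def split_vertically(row, key="order_id", hot_columns=HOT_COLUMNS):
    """Split one wide row into a hot row and a cold row."""
    hot = {k: v for k, v in row.items() if k in hot_columns}
    cold = {k: v for k, v in row.items() if k not in hot_columns}
    cold[key] = row[key]  # both halves keep the key so they can be rejoined
    return hot, cold


if __name__ == "__main__":
    wide_row = {
        "order_id": 1,
        "user_id": 7,
        "status": "paid",
        "amount": 100,
        "remark": "leave at the door",
        "invoice_title": "ACME Ltd",
    }
    print(split_vertically(wide_row))
```

The hot rows are narrower than the original, so more of them fit into the same amount of database cache, which is the performance argument made above.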
This is actually quite common, and many readers have probably done it themselves: taking one large table apart into, for example, an order table, an order payment table, and an order item table.
There is also table-level splitting, which divides one table into N tables so that the data volume of each stays within a certain range and SQL performance is preserved. Otherwise, the more data a single table holds, the worse SQL performs. About 2 million rows per table is a common target, but it depends on how you use the table; it could be 5 million or 1 million. The more complex your SQL, the fewer rows you should keep in a single table.
The database middleware mentioned above basically all supports splitting both databases and tables. Given a field value that you specify, such as userid, the middleware automatically routes a request to the corresponding database and then to the corresponding table.
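The two-step routing described here (first the database, then the table) might look roughly like the sketch below. It is an illustration under assumed values, 4 databases with 8 tables each and made-up names, and it is not the actual algorithm of Sharding-JDBC, Mycat, or any other middleware.

```python
DB_COUNT = 4       # hypothetical number of databases
TABLES_PER_DB = 8  # hypothetical number of tables in each database


def route(user_id, db_count=DB_COUNT, tables_per_db=TABLES_PER_DB):
    """Map a user ID to a (database, table) pair."""
    slot = user_id % (db_count * tables_per_db)  # one of 32 logical slots
    db_index = slot // tables_per_db             # which database
    table_index = slot % tables_per_db           # which table inside it
    return f"order_db_{db_index}", f"t_order_{table_index}"


if __name__ == "__main__":
    for uid in (37, 45, 100):
        print(uid, route(uid))
```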
You have to think about how to split databases and tables in your own project. Generally, vertical splitting is done at the table level, splitting up tables that have many fields. Horizontal splitting is for when a single database cannot carry the concurrency, or when the data volume exceeds its capacity; which field to split on is your decision. As for splitting tables: if, after splitting into multiple databases, concurrency and capacity are fine but the tables in each database are still too large, split the tables as well so that no single table holds too much data.
There are two ways to split databases and tables:
- One is to split by range, so that each database holds a contiguous segment of data, typically a time range. This is used less often because it easily creates hotspots: most traffic hits the newest data.
- The other, more commonly used, is to distribute data evenly by hashing a field.
The advantage of splitting by range is that scaling out is very simple: prepare one database per month in advance, and when a new month begins, writes naturally go to the new database. The disadvantage is that most requests access the newest data. Whether to use ranges in production depends on the scenario.
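A range split by month can be sketched in a few lines. The naming scheme `order_db_YYYYMM` and the function name are assumptions made for this example only.

```python
from datetime import date


def database_for(created_on):
    """Pick the monthly database for a record by its creation date."""
    return f"order_db_{created_on:%Y%m}"


if __name__ == "__main__":
    print(database_for(date(2019, 4, 15)))
    print(database_for(date(2019, 5, 1)))
```

Scaling out only requires creating next month's database ahead of time, but every write for the current month goes to the same database, which is where the hotspot comes from.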
The advantage of hash distribution is that it spreads the data volume and request load evenly across databases. The disadvantage is that scaling out is troublesome because it involves a data migration: the hash of existing data must be recalculated and the data reassigned to different databases or tables.
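The migration cost is easy to see with a small experiment. The sketch below assumes the simplest possible scheme, plain modulo on an integer user ID, and counts what fraction of keys land on a different shard when the shard count changes. Both function names are hypothetical.

```python
def shard_for(user_id, shard_count):
    """Simplest hash-style placement: modulo on the integer ID."""
    return user_id % shard_count


def moved_fraction(user_ids, old_count, new_count):
    """Fraction of IDs whose shard changes after resizing."""
    ids = list(user_ids)
    moved = sum(
        1 for uid in ids
        if shard_for(uid, old_count) != shard_for(uid, new_count)
    )
    return moved / len(ids)


if __name__ == "__main__":
    print(moved_fraction(range(1000), 4, 8))  # doubling the shard count
    print(moved_fraction(range(1200), 4, 5))  # adding a single shard
```

Under this modulo scheme, doubling the shard count moves about half of the keys, while adding a single shard moves most of them, which is one reason expansions are often planned as doublings.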