https://v.youku.com/v_show/id…

In the last issue, we introduced the complex query function. In this issue, we will continue to explain the tables with different splitting rules.

Now I have four nodes, four nodes are divided by modulus, TB_ There are five pieces of data on mod. There is a bracket for 67, which means that if there is 67, it will be such a data distribution. After split by jump hash algorithm, 12.3 is distributed on DN2 and 46 is distributed on DN1. Now I’m going to make a TB_ Mod and jump_ Hash join. It is wrong to issue the join directly. As you can see from the figure, only the 1 of DN2 and 4 of DN1 can be associated, and both 2 and 3 will be lost, so let’s take a look at cross database join.

## Cross database join

https://v.youku.com/v_show/id…

I will have three examples: inner join, left join and right join. Because jump_ The ID and code in hash are the same, so I write ID here.

Let’s run it. We can see that the data of 1234 has been published. The result of the left join is also right. We are familiar with the left join. If the data on the left is not on the right, it will be found out. The same is true of right join. Let’s see how to do it. We add an explain before it, which is our distributed query plan.

## Distributed query plan

Because my terminal is a little bit large and a little messy for clarity, we simplify the query result, change the select * to only the columns needed, and slightly reduce the horizontal length. Otherwise it looks a little tired. Let’s add another g and take a look at the table.

We can see that there are 12 lines. From the beginning, we focus on the first few lines. Then we find that the first four lines are the same, the only difference is the datanode. The other columns in the first four rows are the same, representing that my query is distributed to these four nodes. After the results of the four nodes come back, they are merged and sorted in my middleware. We’ll explain that later. Then there will be column integration and filtering. For the current example, there is no actual work to do in this step, but it is just forwarded to the next step.

Then, the second table jump hash is issued to two nodes by SQL. Then the two nodes will also merge after they come back. We can see that there is a serial number after DN2, which is to distinguish it from the previous DN1 DN2. Then shuffle the field, and finally make a join. Join is to merge shuffle just now_ The result set of field is merged. The third shuffle field is still a forwarding process. Finally, I can get the result after a join. Through such an implementation plan, we can understand cross database join.

In this example, the join is done completely in the dble middleware, so the two nodes collect and sort the data, and then merge to join after collecting the data.

## Implementation of sorting

Now I’d like to tell you why we just have a sort. In theory, I don’t do sorting. I go to each data node and get the data back. Joining in middleware is actually to match each row of data in the right table with each row of data on my left, that is, to make a Cartesian product. If I sort the nodes in the middleware, you can imagine. For example, in our example, one table is 12345 and the other is 12346. When joining, if I am an ordered one, it does not need Cartesian product. It only needs to compare the team head, which means the first row comparison. This 1 and 1 are compared, and they are matched. Next, 1 and 2 are compared and found to be not matched. Because the table data is ordered, once it fails to match, it will no longer match, so the 1 on the left is out.

So through this optimization, we can reduce the time complexity. In middleware, the time complexity is much smaller than Cartesian product when the sum of data rows of two tables is multiplied by 2. Of course, the corresponding consumption is the consumption of MySQL. There was no need to sort, but now it needs sorting. For us, we may think that it is an effective way for middleware to do as little calculation as possible and send the calculation to MySQL. This is a table of different splitting rules that we show.

OK, let’s introduce it here.

In order to facilitate reading, some spoken words are optimized without affecting learning. The manuscript and the video will be consistent as far as possible.