Some thoughts on a DISTINCT question

Date: 2020-7-15

Someone raised the following question, and I have written it up here for everyone's reference.

Suppose there is a table like this:

The data here has the following characteristics: one DepartmentId may have multiple names, and one name may appear under multiple DepartmentIds. In other words, the relationship between Name and DepartmentId is many-to-many.

Now we want to implement a query that sorts by DepartmentId (step 1), then takes the non-duplicate values of the Name column (step 2) while keeping the relative order from step 1. In this case, it should return three values, in order: A, C, B.
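To try the queries that follow, a table matching this description is needed. The sketch below is a stand-in with hypothetical rows of my own, chosen only so that they fit the description (many-to-many, expected result A, C, B). The column types are assumptions, and the multi-row VALUES syntax needs SQL Server 2008 or later.

```sql
-- Hypothetical sample data; not the original rows.
CREATE TABLE Sample (
    Name         varchar(10) NOT NULL,  -- assumed type
    DepartmentId int         NOT NULL   -- assumed type
);

INSERT INTO Sample (Name, DepartmentId)
VALUES ('A', 1),
       ('C', 2),
       ('A', 2),   -- A appears in departments 1 and 2
       ('B', 3),
       ('C', 3);   -- C appears in departments 2 and 3
```

With these rows, the first department in which each name shows up is 1 for A, 2 for C and 3 for B, which is why the wanted order is A, C, B.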

The first thing that comes to mind is the following:


select distinct name from Sample order by DepartmentId

Semantically, this is natural. Unfortunately, this statement cannot be executed at all. The error message is:

ORDER BY items must appear in the select list if SELECT DISTINCT is specified.

This error means that if DISTINCT (duplicate removal) is used, every field in the ORDER BY clause must also appear in the SELECT list. However, if DepartmentId really is added to the SELECT list, DISTINCT applies to the (Name, DepartmentId) pair, so the same name can come back more than once and the result is certainly wrong.


select distinct name,DepartmentId from Sample order by DepartmentId


 

So, since combining DISTINCT and ORDER BY directly has this problem, can we modify the query a little, for example like this:


SELECT distinct a.Name
FROM (select top 100 percent name from Sample order by DepartmentId) a

Compared with the previous version, this one uses a subquery. Semantically it is still very intuitive: sort by DepartmentId first, and then remove the duplicate values. But the returned result is as follows:

Although the duplicate values are indeed removed, the rows come back in the wrong order. We want to sort by DepartmentId first, then remove duplicate values, and keep the relative order produced by the sort.

Why does this result appear? Here DISTINCT is itself carried out with a sort on Name (as can be seen in the execution plan below), and that sort replaces whatever order the rows had before. In addition, SQL Server only guarantees the order of a result set through the ORDER BY of the outermost query, so the ORDER BY we wrote inside the subquery loses its meaning here. [In fact, if you look at a similar query generated by ORM tools such as ADO.NET Entity Framework, the ORDER BY setting is automatically discarded.]

In that case, is the requirement impossible to meet? Admittedly it is a rare requirement: most of the time sorting is the last operation, which is the reasonable place for it.

My thinking was: since the behavior of DISTINCT is built in, can the operation be bypassed altogether? The solution I ended up with was to give each occurrence of a name a number. For example, if there are two A's, the first A is numbered 1, the second 2, and so on. Then, when querying, I sort first and keep only the rows whose number is 1. That removes the duplicate values without using DISTINCT.

SQL Server 2005 introduced the ROW_NUMBER function. Using it, I implemented the following query:


select a.Name from 
(select top 100 percent
Name,DepartmentId,ROW_NUMBER() over(partition by name order by departmentid) row
from Sample order by DepartmentId) a
where a.row=1
order by a.DepartmentId

I then got the following result, which I believe meets the requirement described earlier.
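To see why filtering on the number 1 works, it can help to look at the numbering on its own, before the filter is applied. A minimal sketch against the same Sample table; the alias rn is my own choice:

```sql
-- Number each name's rows by DepartmentId.
-- rn = 1 marks the row with the smallest DepartmentId for that name.
SELECT Name,
       DepartmentId,
       ROW_NUMBER() OVER (PARTITION BY Name ORDER BY DepartmentId) AS rn
FROM Sample
ORDER BY Name, DepartmentId;
```

Each name has exactly one row with rn = 1, so keeping those rows and sorting them by DepartmentId yields each name once, in order of first appearance. Since only the outermost ORDER BY determines the order of the final result, the inner TOP 100 PERCENT ... ORDER BY in the query above should not be needed for correctness.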

By comparison, this query is less efficient, which is to be expected (see the figure below). But if the requirement is a hard one, sacrificing some performance is not surprising. Of course, we can keep studying it to see whether there is a better way to write it. In any case, implementations that rely on built-in, standard features are usually relatively fast.
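One possible alternative, which is not part of the original discussion, is to group by Name and sort the groups by the smallest DepartmentId in which each name appears. This is only a sketch against the same Sample table; whether it is actually faster would have to be checked against the execution plan.

```sql
-- One row per name, ordered by the first department the name appears in.
-- The second sort key (Name) is an added tie-breaker for names that share
-- the same smallest DepartmentId.
SELECT Name
FROM Sample
GROUP BY Name
ORDER BY MIN(DepartmentId), Name;
```

Because the aggregate is used only in ORDER BY, DepartmentId does not have to appear in the SELECT list, which sidesteps the restriction that DISTINCT imposes.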

Those are my thoughts on this DISTINCT question. I hope they are helpful for everyone's study.