This article is a source-reading-style, preliminary introduction to the PostgreSQL statistics module. We chose PostgreSQL rather than another database because, in papers that compare cardinality estimation across systems, PostgreSQL consistently performs well.

## Statistics collected in PG

In PostgreSQL, the collected statistical information is divided into three categories: statistical information collected for a table, statistical information collected for a column, and statistical information collected for a group of columns.

### Statistics collected for tables

Statistics collected for a table mainly record how many rows and pages the table has. The same two numbers are also maintained for each index. Although every index on a table covers the same number of rows, the number of pages naturally differs from index to index.

### Statistics collected for individual columns

The statistics collected for a single column roughly describe the column's data distribution and size. PostgreSQL collects the following information for each column:

- Histogram: describes the distribution of the column's data. The same structure is described in detail in "TiDB Source Code Reading: Statistics (Part 1)"; interested readers can find a more thorough introduction there.
- Most common values: the most frequent values in the column. Excluding them from the histogram reduces the estimation error caused by extreme values.
- Distinct number: the number of distinct values in the column. It is worth noting that PostgreSQL does not maintain a distinct count per histogram bucket.
- Null values: how many rows are NULL. Because NULL is a very special value, it is tracked separately.
- Average width in bytes: the average length of the column's values. Recording this makes it possible to estimate a query's memory usage and IO overhead more precisely.
- Correlation: the degree of ordering correlation between the index and the primary key (or row ID). For example, if a time-based index always receives the newest day's data, its order will be highly correlated with the primary key's. Once we know this correlation, we can estimate how much random IO a scan that fetches rows back from the table will cause. For a query such as `where index_col = xxx order by primary_key limit y`, this lets us decide more accurately whether to read the index or scan the table directly.
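As a rough illustration, PostgreSQL's cost model interpolates between the best case (index order matches physical order, mostly sequential IO) and the worst case (uncorrelated, one random fetch per row) using the squared correlation. The sketch below is a simplification under that assumption; the function name and parameters are illustrative, not PostgreSQL's actual API:

```python
def interpolate_index_io_cost(min_io_cost, max_io_cost, correlation):
    """Interpolate between the best-case IO cost (perfectly correlated,
    sequential page reads) and the worst-case IO cost (random reads),
    weighting by the squared correlation (correlation lies in [-1, 1])."""
    c2 = correlation * correlation
    return max_io_cost + c2 * (min_io_cost - max_io_cost)
```

With correlation close to 1 (or -1) the cost approaches the sequential case, which is what makes the index scan attractive for the query above.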

### Statistics collected for multiple columns

PostgreSQL does not collect statistics for indexes directly; instead, extended statistics must be explicitly created on a chosen set of columns via a statement (`CREATE STATISTICS`). The statistics collected here are functional dependencies and multivariate most-common-value (MCV) lists. We introduce the two kinds next.

#### Functional Dependency

In database courses, we learned that if whenever column a takes a given value, column b always takes the same value, then column b functionally depends on column a. In real data it is hard to find perfectly strict functional dependencies, so PostgreSQL records the *degree* of each functional dependency. With this value maintained, PG can reduce the estimation error caused by assuming that dependent columns are independent and uncorrelated.

For example: `where zip_code = … and province = ...`

Obviously, the zip code and the province are not completely independent. If we maintain the functional dependency information, we can estimate this predicate without assuming that zip code and province are independent.

For a given set of n columns, PostgreSQL uses the following simple algorithm to maintain the dependency of the n-th column on the preceding n-1 columns:

- Because the computation involves many sort operations, computing functional dependencies exactly would be too time-consuming, so it is done on sampled data;
- First, enumerate all possible orderings of the columns;
- For each ordering, sort the sampled rows in that order;
- After sorting, group the rows by the first n-1 columns;
- For each group, check whether the last column takes only a single value; if so, add the group's size to the supporting count (`supporting_size += current_group_size`);
- Finally, the degree of the functional dependency is supporting_size / total_rows.
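The steps above can be sketched in a few lines of Python. This is a toy illustration of the grouping step (rows represented as dicts, one fixed column ordering), not PostgreSQL's actual C implementation:

```python
from itertools import groupby

def dependency_degree(rows, determinant_cols, dependent_col):
    """Estimate the degree to which `dependent_col` functionally depends
    on `determinant_cols`: sort by the determinant columns, group, and
    count the rows in groups where the dependent column is constant."""
    key = lambda r: tuple(r[c] for c in determinant_cols)
    rows = sorted(rows, key=key)              # sort in the chosen column order
    supporting = 0
    for _, group in groupby(rows, key=key):
        group = list(group)
        if len({r[dependent_col] for r in group}) == 1:
            supporting += len(group)          # the group supports the dependency
    return supporting / len(rows)
```

For instance, on a sample where one zip code maps to one province half the time and to two provinces the other half, the degree comes out at 0.5.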

#### Multivariate Most Common Values (MCV)

This information maintained by PostgreSQL can be roughly regarded as the most common values over multiple columns. Because the structure is looser than a histogram, it can be used to estimate predicates such as:

- (a = 1) and (b = 2)
- (a < 1) and (b >= 2)
- (a is null) or (b is not null)
- (a < 1) or (b >= 2)

Unlike prefix-based schemes, it does not require the prefix columns to be fixed by equality conditions before the next column can be estimated.

PostgreSQL builds the MCV list in a way similar to functional dependencies. What is special is that it records not only the frequency of each most common value combination (occurrences / total rows), but also the frequency the combination would have if the columns were completely uncorrelated. The logic here is rather involved, and this article will not explain it in detail.
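To make the idea concrete, here is a toy sketch of how an MCV list can estimate the clause shapes listed above. The representation and the example frequencies are made up for illustration:

```python
def mcv_selectivity(mcv_items, predicate):
    """Sum the frequencies of MCV items whose value combination satisfies
    `predicate`. `mcv_items` is a list of (values_tuple, frequency) pairs,
    loosely modelled on a multivariate MCV list over columns (a, b)."""
    return sum(freq for values, freq in mcv_items if predicate(values))

# Hypothetical MCV list over (a, b); frequencies are fractions of the table.
mcv = [((1, 2), 0.30), ((1, 3), 0.20), ((2, 2), 0.10)]

sel_and = mcv_selectivity(mcv, lambda v: v[0] == 1 and v[1] == 2)  # (a = 1) AND (b = 2)
sel_or = mcv_selectivity(mcv, lambda v: v[0] < 2 or v[1] >= 3)     # (a < 1+1) OR (b >= 3)
```

Because each item stores a full value combination, equality, range, and OR conditions can all be evaluated against it directly, without any independence assumption among the covered columns.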

## How PG estimates a single table using statistical information

### clauselist_selectivity

The entry point of PostgreSQL's estimation is the function `clauselist_selectivity`.

This function accepts an array of predicates in CNF form (the predicates in the array are ANDed together). It first tries to estimate predicates using extended statistics (i.e., multi-column statistics), and then estimates the remaining predicates with single-column statistics. The two corresponding entry points are `statext_clauselist_selectivity` and `clauselist_selectivity_simple`.

### statext_clauselist_selectivity

In `statext_clauselist_selectivity`, PostgreSQL uses only the single MCV list it considers best, instead of combining multiple MCV lists under the independence assumption. To decide which MCV list is best, it uses a simple greedy rule: the list that covers the most predicates wins. The MCV estimation logic lives in the function `statext_mcv_clauselist_selectivity`.

### dependencies_clauselist_selectivity

After the MCV processing, PostgreSQL uses functional dependencies to handle further predicates. For a functional dependency between two columns, P(a, b) = P(a) * (f + (1 - f) * P(b)), where f is the degree of the functional dependency between (a, b). For multi-column functional dependencies, P(a, b, c) = P(a, b) * (f + (1 - f) * P(c)), which reduces the problem to computing P(a, b) and P(c). This logic lives in the function `dependencies_clauselist_selectivity`.

After both kinds of multi-column statistics have been applied, the remaining predicates are estimated with single-column statistics, under the assumption that the columns / predicates are independent and uncorrelated.
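The combination formula above is a one-liner; written out as a trivial sketch (p_a and p_b stand for the individual predicate selectivities):

```python
def combine_with_dependency(p_a, p_b, f):
    """P(a, b) = P(a) * (f + (1 - f) * P(b)), where f is the degree of
    the functional dependency of b on a."""
    return p_a * (f + (1 - f) * p_b)

# f = 1 (a fully determines b) collapses to P(a);
# f = 0 (no dependency) falls back to the independence product P(a) * P(b).
```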

### clauselist_selectivity_simple

The function `clauselist_selectivity_simple` is a simple wrapper. Its main job is to handle range predicates on a single column, to avoid the error the independence assumption would otherwise introduce: for `a > 100 and a < 1000`, it maintains one interval for column a, and only after all predicates have been processed does it estimate the column from that interval. Predicates that do not need interval tracking are estimated directly with the function `clause_selectivity`.
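A minimal sketch of this interval bookkeeping, using a hypothetical helper (PostgreSQL tracks lower and upper bounds per column inside `clauselist_selectivity` itself; the tuple representation here is made up):

```python
def collect_column_ranges(clauses):
    """Merge all range predicates per column into one interval, so that
    `a > 100 AND a < 1000` is estimated as a single interval rather than
    as two independent predicates. `clauses` is a list of (col, op, value)."""
    intervals = {}
    for col, op, value in clauses:
        lo, hi = intervals.get(col, (float("-inf"), float("inf")))
        if op in (">", ">="):
            lo = max(lo, value)   # tighten the lower bound
        elif op in ("<", "<="):
            hi = min(hi, value)   # tighten the upper bound
        intervals[col] = (lo, hi)
    return intervals
```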

### clause_selectivity

In `clause_selectivity`, the most basic estimates are produced under a variety of assumptions:

- For a DNF (predicates connected by OR), s = s1 + s2 - s1 * s2 is used;
- For a NOT expression, s = 1 - s1 is used;
- For a CNF, it jumps back to the upper layer and uses `clauselist_selectivity`;
- ….

As you can see, these estimates are all based on the assumption of independent distributions.

## PG stats summary

The following figure shows the whole process of PG estimation

### Similarities and differences with TiDB

TiDB lacks PostgreSQL's multi-column statistics (MCV lists and functional dependencies), but it has multi-column histograms, which PostgreSQL currently does not maintain. PostgreSQL's current approach decouples statistics from indexes, so the required statistics can be built directly on a column combination that is not indexed; to some extent this also makes statistics easier to maintain and manage.

Currently, TiDB does not use s = s1 + s2 - s1 * s2 to estimate a DNF; it simply uses a magic number (0.8) as the selectivity of a DNF.

In the other parts of the process, TiDB and PostgreSQL are basically the same.

## How PG estimates multiple tables using statistical information

Here we mainly introduce how PostgreSQL estimates inner join.

PostgreSQL mainly uses the most common values to estimate the size of a join's result set. First, it computes the following quantities:

- match_prod_freq: the selectivity obtained using only the MCV lists of the left and right tables, i.e., the summed contribution of values appearing in both MCV lists;
- match_freq1: how much frequency mass in MCV 1 is matched in MCV 2;
- match_freq2: likewise, from the other side;
- unmatch_freq1: how much frequency mass in MCV 1 is not matched in MCV 2;
- unmatch_freq2: likewise;
- other_freq1: how much of table 1's frequency mass does not appear in its MCV at all.

Then the complete selectivity is: the selectivity computed between the two MCV lists + the selectivity of values not in MCV 1 that match MCV 2 + the selectivity of values not in MCV 2 that match MCV 1 + the selectivity of values appearing in neither MCV list matching each other.

The formula above is computed from table 1's perspective only; adding the corresponding selectivity from table 2's perspective completes the calculation.

The logic above can be found in the function `eqjoinsel_inner`.
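A simplified sketch of the MCV matching quantities described above. This is a toy representation (dicts from value to frequency); the real `eqjoinsel_inner` works on PostgreSQL's internal statistics structures and also estimates the non-MCV remainder, which is omitted here:

```python
def mcv_join_match(mcv1, mcv2):
    """Compute the MCV-vs-MCV part of an inner equi-join estimate:
    the matched product frequency plus the unmatched frequency mass
    remaining on each side. Each MCV list is a dict {value: frequency}."""
    match_prod_freq = 0.0
    match_freq1 = match_freq2 = 0.0
    for value, freq1 in mcv1.items():
        if value in mcv2:
            match_prod_freq += freq1 * mcv2[value]  # both sides hit this value
            match_freq1 += freq1
            match_freq2 += mcv2[value]
    unmatch_freq1 = sum(mcv1.values()) - match_freq1
    unmatch_freq2 = sum(mcv2.values()) - match_freq2
    return match_prod_freq, unmatch_freq1, unmatch_freq2

# Hypothetical MCV lists for the two join keys.
mcv1 = {1: 0.5, 2: 0.3}
mcv2 = {1: 0.4, 3: 0.2}
```

The unmatched masses are then combined with the non-MCV frequency (other_freq) and the distinct counts to fill in the remaining terms of the formula above.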

Strangely enough, there seems to be no special handling for the case where the join key spans multiple columns (`t1 join t2 where t1.a = t2.a and t1.b = t2.b`). Normally, relying entirely on the independence assumption here is prone to large estimation errors. Perhaps I have overlooked some of the logic; in the future I plan to debug PostgreSQL with real data to understand its estimation logic further.

## Summary

**This article briefly introduced, by following the code, the statistics PostgreSQL maintains and its estimation framework; the finer-grained estimation logic has not been touched yet. We will continue to investigate PostgreSQL's estimation framework in more detail to see what is worth learning from it.**