Histogram, a special type of column statistics, can provide more detailed data distribution information of the columns in the table. Histogram stores the values in buckets. Based on the number of different values and the distribution of data, the database selects the type of histogram to be created. The types of histogram are as follows:

- Frequency histograms and to frequency histograms;
- Height balanced histograms (Legacy): height balanced histograms;
- Mixed histogram: hybrid histograms;

# 1 histogram introduction

## 1.1 purpose of histogram

By default, the optimizer assumes that different values of a column are evenly distributed. For columns with skewed data (columns with uneven distribution of data in columns), histograms enable the optimizer to generate more accurate cardinality estimates for filtering or join predicates involving these columns, thereby generating more accurate execution plans.

## 1.2 when to create histogram in database

Using DBMS_ Stats collects the statistical information of the table. When querying the columns in the reference table, the database will automatically create the histogram according to the previous query load. The basic process is as follows:

- Using DBMS_ Stats collects statistics of the table and specifies method_ The opt parameter is size auto by default;
- The user queries the corresponding table;
- The database records the predicates used in the previous query and updates the data dictionary table SYS.COL_ USAGE$；
- Run the DBMS again_ In stats, DBMS_ Stats will query SYS.COL_ Use $view and decide which columns need histogram based on the previous query load.

Example:

1) Create test table

```
SQL> create table sh.sales_new as select * from sh.sales;
Table created.
```

2) View statistics

```
SQL> select column_name,notes,histogram from dba_tab_col_statistics where owner='SH' and table_name='SALES_NEW';
COLUMN_NAME NOTES HISTOGRAM
------------------------------ ------------------------------ ---------------
AMOUNT_SOLD STATS_ON_LOAD NONE
QUANTITY_SOLD STATS_ON_LOAD NONE
PROMO_ID STATS_ON_LOAD NONE
CHANNEL_ID STATS_ON_LOAD NONE
TIME_ID STATS_ON_LOAD NONE
CUST_ID STATS_ON_LOAD NONE
PROD_ID STATS_ON_LOAD NONE
7 rows selected.
```

3) Execute query

```
SQL> select column_name,notes,histogram from dba_tab_col_statistics where owner='SH' and table_name='SALES_NEW';
COLUMN_NAME NOTES HISTOGRAM
------------------------------ ------------------------------ ---------------
AMOUNT_SOLD STATS_ON_LOAD NONE
QUANTITY_SOLD STATS_ON_LOAD NONE
PROMO_ID STATS_ON_LOAD NONE
CHANNEL_ID STATS_ON_LOAD NONE
TIME_ID STATS_ON_LOAD NONE
CUST_ID STATS_ON_LOAD NONE
PROD_ID STATS_ON_LOAD NONE
7 rows selected.
```

4) Collect statistics

```
SQL> EXEC DBMS_STATS.GATHER_TABLE_STATS('SH','SALES_NEW',OPTIONS=>'GATHER AUTO');
PL/SQL procedure successfully completed.
```

5) View statistics

```
SQL> select column_name,notes,histogram from dba_tab_col_statistics where owner='SH' and table_name='SALES_NEW';
COLUMN_NAME NOTES HISTOGRAM
------------------------------ ------------------------------ ---------------
AMOUNT_SOLD STATS_ON_LOAD NONE
QUANTITY_SOLD STATS_ON_LOAD NONE
PROMO_ID STATS_ON_LOAD NONE
CHANNEL_ID STATS_ON_LOAD NONE
TIME_ID STATS_ON_LOAD NONE
CUST_ID STATS_ON_LOAD NONE
PROD_ID HISTOGRAM_ONLY FREQUENCY
7 rows selected.
```

6) View column usage

```
SQL> select * from sys.col_usage$ where obj#=93264;
OBJ# INTCOL# EQUALITY_PREDS EQUIJOIN_PREDS NONEQUIJOIN_PREDS RANGE_PREDS LIKE_PREDS NULL_PREDS TIMESTAMP
---------- ---------- -------------- -------------- ----------------- ----------- ---------- ---------- ---------
93264 1 1 0 0 0 0 0 25-APR-20
```

## 1.3 how to select histogram type

As mentioned earlier, there are many types of histograms. How can the database select the type of histograms when creating histograms? Here are several reference variables:

- NDV: the number of different values of a column;
- n: Represents the number of buckets in the histogram, 254 by default;
- p: Represents the internal percentage threshold, equal to (1 – (1 / N)) * 100;
- DBMS_ Estimate in stats_ Is the percentage parameter set to auto_ sample_ Size (the default).

The following figure shows the decision tree of histogram creation

# 2 histogram cardinality algorithm

For histograms, the cardinality algorithm depends on factors such as the number of endpoints and values, and whether column values are popular.

## 2.1 endpoint numbers and values

The endpoint number is the number that uniquely identifies the bucket. In the frequency and mixed histogram, the endpoint number is the cumulative frequency of all the values contained in the current bucket and the previous bucket. For example, a bucket with an endpoint number of 100 indicates that the total frequency of the values of the current bucket and all the previous buckets is 100. In the highly balanced histogram, the optimizer numbers the buckets in order, starting from 0 or 1. In all cases, the endpoint number is the bucket number.

The endpoint value is the maximum value in the bucket median range. For example, if a bucket contains only 52794 and 52795, the endpoint value is 52795.

## 2.2 popular and unpopular values

The popularity of a value in the histogram will affect the cardinality estimation algorithm, as follows:

- Popular value: the popular value appears in the endpoint value of multiple buckets. The optimizer determines whether a value is popular by checking whether it is the endpoint value of the bucket. If so, for the frequency histogram, the optimizer subtracts the endpoint number of the previous bucket from the endpoint number of the current bucket. The mixed histogram stores the information of each endpoint. If the value is greater than 1, the value is popular Welcome. For popular values, the optimizer calculates cardinality estimation by the following formula: cardinality of popular value = (num of rows in table) * (num of endpoints spanned by this value / total num of endpoints);
- Unpopular value: all unpopular values are unpopular values. For unpopular values, the optimizer calculates cardinality estimation through the following formula: cardinality of nonpopular value = (num of rows in table) * density.

## 2.3 bucket compression

In some cases, in order to reduce the total number of buckets, the optimizer compresses multiple buckets into one bucket. For example, the following frequency histogram shows that the first bucket number is 1 and the last bucket number is 23

```
ENDPOINT_NUMBER ENDPOINT_VALUE
--------------- --------------
1 52792
6 52793
8 52794
9 52795
10 52796
12 52797
14 52798
23 52799
```

As you can see, several buckets are “lost”. Initially, buckets 2 to 6 each contain an instance with a value of 52793. The optimizer compresses all these buckets into the bucket with the highest end points (bucket 6). The bucket now contains the value of 52793 of five instances. This value is popular because the difference between the number of endpoints of the current bucket and the previous bucket is 5. Therefore, before compression, 52793 is 5 The endpoint of a bucket. The following figure shows which buckets are compressed and which values are popular:

```
ENDPOINT_NUMBER ENDPOINT_VALUE
--------------- --------------
1 52792 -> nonpopular
6 52793 -> buckets 2-6 compressed into 6; popular
8 52794 -> buckets 7-8 compressed into 8; popular
9 52795 -> nonpopular 10 52796 -> nonpopular
12 52797 -> buckets 11-12 compressed into 12; popular
14 52798 -> buckets 13-14 compressed into 14; popular
23 52799 -> buckets 15-23 compressed into 23; popular
```

# 3 frequency histogram

In frequency histogram, each column value corresponds to a histogram bucket. Because each value has its own special bucket, some buckets have many values, while others have few.

## 3.1 conditions of frequency histogram

The database creates a frequency histogram when the following conditions are met:

- NDV is less than or equal to the number of barrels (default is 254);
- DBMS_ The parameters of the process corresponding to stats are set to auto_ SAMPLE_ Size or specify a specific value;

## 3.2 generating frequency histogram

The experiment was carried out in the laboratory sh.countries_ New column country_ subregion_ ID generates frequency histogram.

1) Generate test data

```
SQL> create table sh.countries_new as select * from sh.countries;
Table created.
SQL> select country_subregion_id,count(1) from sh.countries_new group by country_subregion_id order by 1;
COUNTRY_SUBREGION_ID COUNT(1)
-------------------- ----------
52792 1
52793 5
52794 2
52795 1
52796 1
52797 2
52798 2
52799 9
8 rows selected.
```

2) Collect statistics

```
begin
dbms_stats.gather_table_stats(ownname => 'SH',
tabname => 'COUNTRIES_NEW',
method_opt => 'for columns country_subregion_id');
end;
/
PL/SQL procedure successfully completed.
```

3) View column statistics

```
SQL> select column_name,notes,histogram from dba_tab_col_statistics where owner='SH' and table_name='COUNTRIES_NEW';
COLUMN_NAME NOTES HISTOGRAM
------------------------------ ------------------------------ ---------------
COUNTRY_NAME_HIST STATS_ON_LOAD NONE
COUNTRY_TOTAL_ID STATS_ON_LOAD NONE
COUNTRY_TOTAL STATS_ON_LOAD NONE
COUNTRY_REGION_ID STATS_ON_LOAD NONE
COUNTRY_REGION STATS_ON_LOAD NONE
COUNTRY_SUBREGION_ID FREQUENCY
COUNTRY_SUBREGION STATS_ON_LOAD NONE
COUNTRY_NAME STATS_ON_LOAD NONE
COUNTRY_ISO_CODE STATS_ON_LOAD NONE
COUNTRY_ID STATS_ON_LOAD NONE
10 rows selected.
```

You can see country_ SUBREGION_ ID has collected histogram information.

4) View histogram information

```
select t.endpoint_number, t.endpoint_value
from dba_histograms t
where t.owner = 'SH'
and t.table_name = 'COUNTRIES_NEW'
and t.column_name = 'COUNTRY_SUBREGION_ID';
ENDPOINT_NUMBER ENDPOINT_VALUE
--------------- --------------
1 52792
6 52793
8 52794
9 52795
10 52796
12 52797
14 52798
23 52799
8 rows selected.
```

5) The optimizer evaluates the cardinality of 52799

```
SQL> select count(1) from sh.countries_new;
COUNT(1)
----------
23
SQL> select count(1) from sh.countries_new where country_subregion_id=52799;
COUNT(1)
----------
9
```

cardinality of popular value = (num of rows in table) * (num of endpoints spanned by this value / total num of endpoints)

That is: C = 23 * (9 / 23) = 9

The result is the same as that of the query

# 4 maximum frequency histogram

The highest frequency histogram is a variant of the frequency histogram, which ignores the unpopular values which are not statistically important.

## 4.1 conditions of the highest frequency histogram

It is useful to create a frequency histogram on a small number of values, even if the NDV is larger than the number of buckets in the requested histogram, if a small number of values account for most of the branches. To create a higher quality histogram for the popular values, the optimizer ignores the unpopular values and creates a histogram.

When the following conditions are met, the database creates the highest frequency histogram:

- NDV is greater than the number of histogram buckets (default is 254);
- The percentage of rows occupied by the first N frequency values is equal to or greater than the threshold P, P is equal to (1 – (1 / N)) * 100;
- BMS_ The parameters of the process corresponding to stats are set to auto_ SAMPLE_ SIZE；

## 4.2 generating the highest frequency histogram

The experiment was carried out in the laboratory sh.countries_ New column country_ subregion_ ID generates frequency histogram.

1) Generate test data

```
SQL> select country_subregion_id,count(1) from sh.countries_new group by country_subregion_id order by 1;
COUNTRY_SUBREGION_ID COUNT(1)
-------------------- ----------
52792 1
52793 5
52794 2
52795 1
52796 1
52797 2
52798 2
52799 9
8 rows selected.
```

2) Collect statistics

```
begin
dbms_stats.gather_table_stats(ownname => 'SH',
tabname => 'COUNTRIES_NEW',
method_opt => 'for columns country_subregion_id size 7 ');
end;
/
PL/SQL procedure successfully completed.
```

3) View column statistics

```
SQL> select column_name,notes,histogram from dba_tab_col_statistics where owner='SH' and table_name='COUNTRIES_NEW';
COLUMN_NAME NOTES HISTOGRAM
------------------------------ ------------------------------ ---------------
COUNTRY_NAME_HIST STATS_ON_LOAD NONE
COUNTRY_TOTAL_ID STATS_ON_LOAD NONE
COUNTRY_TOTAL STATS_ON_LOAD NONE
COUNTRY_REGION_ID STATS_ON_LOAD NONE
COUNTRY_REGION STATS_ON_LOAD NONE
COUNTRY_SUBREGION_ID TOP-FREQUENCY
COUNTRY_SUBREGION STATS_ON_LOAD NONE
COUNTRY_NAME STATS_ON_LOAD NONE
COUNTRY_ISO_CODE STATS_ON_LOAD NONE
COUNTRY_ID STATS_ON_LOAD NONE
10 rows selected.
```

You can see country_ SUBREGION_ ID has collected the top-frequency histogram information.

4) View histogram information

```
select t.endpoint_number, t.endpoint_value
from dba_histograms t
where t.owner = 'SH'
and t.table_name = 'COUNTRIES_NEW'
and t.column_name = 'COUNTRY_SUBREGION_ID';
ENDPOINT_NUMBER ENDPOINT_VALUE
--------------- --------------
1 52792
6 52793
8 52794
9 52796
11 52797
13 52798
22 52799
7 rows selected.
```

# 5 height balance histogram (Legacy)

In the height balanced histogram, the column values are divided into buckets so that each bucket contains approximately the same row of data.

To be continued…