# R data visualization – ggplot statistical layer

Time: 2022-5-9

## Preface

Although we have covered `ggplot2` in many sections, we have mostly used `geom_*()` functions to draw layers and have rarely used `stat_*()` functions.

Of course, `geom_*()` functions can already handle most plotting work, so is it really necessary to use `stat_*()` functions?

Let's look at an example. Suppose we have the following data:

``````
> select(diamonds, cut, price)
# A tibble: 53,940 x 2
   cut       price
   <ord>     <int>
 1 Ideal       326
 2 Premium     326
 3 Good        327
 4 Premium     334
 5 Good        335
 6 Very Good   336
 7 Very Good   336
 8 Very Good   337
 9 Fair        337
10 Very Good   338
# … with 53,930 more rows
``````

We want to draw a bar chart showing the average price of each cut.

The conventional approach is to use `tidyverse` to wrangle the data, compute the required statistics, and then map them to the corresponding aesthetics, that is:

``````
library(tidyverse)

select(diamonds, cut, price) %>%
  group_by(cut) %>%
  summarise(
    mean_price = mean(price),
    .groups = "drop"
  ) %>%
  ggplot(aes(cut, mean_price, fill = cut)) +
  geom_col()
``````

Now suppose this is not enough, and we want to add error bars to the bar chart.

This is also simple. We can compute the statistics from the data first and then plot them:

``````
select(diamonds, cut, price) %>%
  group_by(cut) %>%
  summarise(
    mean_price = mean(price),
    # standard error of the mean
    se = sqrt(var(price) / length(price)),
    .groups = "drop"
  ) %>%
  mutate(lower = mean_price - se, upper = mean_price + se) %>%
  ggplot(aes(cut, mean_price, fill = cut)) +
  geom_col() +
  geom_errorbar(aes(ymin = lower, ymax = upper), width = 0.5)
``````

Hmm... to draw such a simple plot, the code we wrote is longer than the plot itself.

That is because our approach is still to prepare the data first and then map it to aesthetics.

This requires a lot of statistical computation on the data up front, which does not fit the tidy data way of working.

Think of it this way: since all the statistics come from the same data, why not pass the data directly to `ggplot` and let the statistical computation happen internally?

We can rewrite it like this:

``````
select(diamonds, cut, price) %>%
  ggplot(aes(cut, price, fill = cut)) +
  stat_summary(geom = "bar") +
  stat_summary(geom = "errorbar", width = 0.5)
``````

Two lines of code get the job done, so why write so much more? Better to save the time and have a cup of tea.

## How it works

Once you understand how `stat_summary` works, the other `stat_*` functions are easy to understand.

So how should we understand `stat_summary`? Let's look at an example.

Using the data above, we draw a scatter plot of cut against price:

``````
select(diamonds, cut, price) %>%
  ggplot(aes(cut, price, colour = cut)) +
  geom_point()
``````

Then replace `geom_point` with `stat_summary` called without any arguments and see what happens:

``````
select(diamonds, cut, price) %>%
  ggplot(aes(cut, price, colour = cut)) +
  stat_summary()
``````

What gets drawn is a `pointrange` object.

Let's first look at the signature of the `stat_summary` function:

``````
stat_summary(
  mapping = NULL,
  data = NULL,
  geom = "pointrange",
  position = "identity",
  ...,
  fun.data = NULL,
  fun = NULL,
  fun.max = NULL,
  fun.min = NULL,
  fun.args = list(),
  na.rm = FALSE,
  orientation = NA,
  show.legend = NA,
  inherit.aes = TRUE,
  fun.y,
  fun.ymin,
  fun.ymax
)
``````

The default geom is `pointrange`. Which aesthetics does `pointrange` need?

• `x` or `y`
• `ymin` or `xmin`
• `ymax` or `xmax`

However, we never defined `ymin` or `ymax`, so `stat_summary` must have computed the corresponding values and passed them to `pointrange`.

How can we verify this conjecture? First, notice that running the code above prints the following message:

``````
No summary function supplied, defaulting to `mean_se()`
``````

In other words, by default the data is transformed with the `mean_se()` function.

Let's see what `mean_se()` does:

``````
> mean_se
function (x, mult = 1)
{
    x <- stats::na.omit(x)
    se <- mult * sqrt(stats::var(x)/length(x))
    mean <- mean(x)
    new_data_frame(list(y = mean, ymin = mean - se, ymax = mean +
        se), n = 1)
}
<bytecode: 0x7fca56dfa5d0>
<environment: namespace:ggplot2>
``````

We can see that the data frame returned by this function contains three values, `y`, `ymin` and `ymax`, which are exactly the aesthetics that `pointrange` needs.
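To make this concrete, here is a minimal sketch that calls `mean_se()` directly on a small vector and compares the result with the same quantities computed by hand. The vector `x` is made up purely for illustration and is not part of the diamonds data.

```r
library(ggplot2)

# A small made-up vector, used only for illustration
x <- c(2, 4, 4, 6, 9)

# Result from ggplot2's helper
res <- mean_se(x)

# The same quantities computed by hand:
# mean, and mean minus/plus one standard error
m  <- mean(x)
se <- sqrt(var(x) / length(x))
manual <- data.frame(y = m, ymin = m - se, ymax = m + se)

print(res)
print(manual)
```

Both data frames should hold the same three columns, which is all that `stat_summary()` needs from a summary function.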

We can use the `layer_data()` function to extract the data used by a layer:

``````
> p <- select(diamonds, cut, price) %>%
+   ggplot(aes(cut, price, colour = cut)) +
+   stat_summary()
>
> layer_data(p, 1)
No summary function supplied, defaulting to `mean_se()`
     colour x group        y     ymin     ymax PANEL flipped_aes size linetype shape fill alpha stroke
1 #440154FF 1     1 4358.758 4270.025 4447.491     1       FALSE  0.5        1    19   NA    NA      1
2 #3B528BFF 2     2 3928.864 3876.302 3981.426     1       FALSE  0.5        1    19   NA    NA      1
3 #21908CFF 3     3 3981.760 3945.953 4017.567     1       FALSE  0.5        1    19   NA    NA      1
4 #5DC863FF 4     4 4584.258 4547.223 4621.293     1       FALSE  0.5        1    19   NA    NA      1
5 #FDE725FF 5     5 3457.542 3431.600 3483.484     1       FALSE  0.5        1    19   NA    NA      1
``````

Then compare this with the results of calling the `mean_se()` function directly:

``````
> select(diamonds, cut, price) %>%
+   group_by(cut) %>%
+   summarise(mean_se(price))
# A tibble: 5 x 4
  cut           y  ymin  ymax
* <ord>     <dbl> <dbl> <dbl>
1 Fair      4359. 4270. 4447.
2 Good      3929. 3876. 3981.
3 Very Good 3982. 3946. 4018.
4 Premium   4584. 4547. 4621.
5 Ideal     3458. 3432. 3483.
``````

As we can see, the values of `y`, `ymin` and `ymax` in the layer data match the results computed by `mean_se()`.
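The signature shown earlier also lists `fun`, `fun.min` and `fun.max`. As a rough sketch of the same mechanism, these three arguments can each take a function that returns a single number, and the results fill `y`, `ymin` and `ymax` respectively. Using the median with the minimum and maximum here is only an illustrative choice, and the variable names `p_range` and `ranges` are ours.

```r
library(ggplot2)

# Median as the point, min and max as the ends of the range
p_range <- ggplot(diamonds, aes(cut, price, colour = cut)) +
  stat_summary(fun = median, fun.min = min, fun.max = max)

# Inspect the values the stat computed for each cut
ranges <- layer_data(p_range, 1)[, c("x", "y", "ymin", "ymax")]
print(ranges)
```

Because a summary function is supplied, this call should not print the "No summary function supplied" message.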

## Usage

Since the transformation function is configurable, we can define our own statistical transformations and adjust the plot to our needs.

The `fun.data` parameter of `stat_summary()` specifies the statistical transformation function. When no summary function is supplied, it defaults to `mean_se()`.

The function passed to `fun.data` must return a data frame whose column names are the names of the aesthetics to be mapped.
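As a minimal sketch of that contract, the function below returns a one-row data frame with the columns `y`, `ymin` and `ymax`. The helper `median_iqr()` is hypothetical (it is not part of ggplot2) and summarises each group by its median and interquartile range.

```r
library(ggplot2)

# Hypothetical helper, not part of ggplot2:
# median with the 25% and 75% quantiles as the range
median_iqr <- function(x) {
  data.frame(
    y    = median(x),
    ymin = unname(quantile(x, 0.25)),
    ymax = unname(quantile(x, 0.75))
  )
}

p_iqr <- ggplot(diamonds, aes(cut, price, colour = cut)) +
  stat_summary(fun.data = median_iqr)

# Inspect what the layer computed for each cut
iqr_data <- layer_data(p_iqr, 1)[, c("x", "y", "ymin", "ymax")]
print(iqr_data)
```

The column names are what matter: `stat_summary()` hands them to the geom as aesthetics, so a `pointrange` can be drawn without any further mapping.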

Let's draw some customized plots.

### 1. 95% confidence interval error bars

``````
select(diamonds, cut, price) %>%
  ggplot(aes(cut, price, fill = cut)) +
  stat_summary(geom = "bar") +
  stat_summary(
    geom = "errorbar", width = 0.5,
    fun.data = ~mean_se(., mult = 1.96)
  )
``````

Note: we use the `~` symbol to construct an anonymous function, which is equivalent to

``````
function(x) {mean_se(x, mult = 1.96)}
``````

### 2. Specify the fill color

We use the transformation function to set the fill color by condition, so that groups whose median is below the threshold and groups whose median is at or above it get different colors:

``````
# Return the group median as y, plus a fill color chosen by the threshold
func_median_color <- function(x, cut_off) {
  tibble(y = median(x)) %>%
    mutate(fill = if_else(y < cut_off, "#80b1d3", "#fb8072"))
}

select(diamonds, cut, price) %>%
  ggplot(aes(cut, price)) +
  stat_summary(
    fun.data = func_median_color,
    fun.args = list(cut_off = 2800),
    geom = "bar"
  )
``````

Here we pass the extra argument through `fun.args` instead of using an anonymous function. This is equivalent to

``````
fun.data = ~ func_median_color(., cut_off = 2800)
``````

### 3. Set the size of the points in a pointrange plot

We set the size of the point in each pointrange according to the number of observations in the group:

``````
select(diamonds, cut, price) %>%
  ggplot(aes(cut, price, colour = cut)) +
  stat_summary(
    fun.data = function(x) {
      # scale the point size by the group's share of all observations
      mean_se(x) %>%
        mutate(size = length(x) * 5 / nrow(diamonds))
    }
  )
``````
