Preface
Although we have covered ggplot2 over many sections, we have almost always drawn layers with the geom_*() functions and have rarely used the stat_*() functions.
The geom_*() functions can already handle most plotting work, so are the stat_*() functions really necessary?
Let's take a look at an example. Suppose we have the following data:
> select(diamonds, cut, price)
# A tibble: 53,940 x 2
cut price
<ord> <int>
1 Ideal 326
2 Premium 326
3 Good 327
4 Premium 334
5 Good 335
6 Very Good 336
7 Very Good 336
8 Very Good 337
9 Fair 337
10 Very Good 338
# … with 53,930 more rows
We want to draw a bar chart showing the average price of each cut.
The conventional approach is to use the tidyverse to tidy the data, compute the required summary statistics, and then map them to the corresponding aesthetics:
library(tidyverse)  # loads ggplot2 and dplyr

select(diamonds, cut, price) %>%
  group_by(cut) %>%
  summarise(
    mean_price = mean(price),
    .groups = "drop"
  ) %>%
  ggplot(aes(cut, mean_price, fill = cut)) +
  geom_col()

Now suppose this is not enough and we also want to add error bars to the bar chart.
This is simple too: we compute the statistics on the data first and then plot them:
select(diamonds, cut, price) %>%
  group_by(cut) %>%
  summarise(
    mean_price = mean(price),
    # standard error of the mean
    se = sqrt(var(price) / length(price)),
    .groups = "drop"
  ) %>%
  mutate(lower = mean_price - se, upper = mean_price + se) %>%
  ggplot(aes(cut, mean_price, fill = cut)) +
  geom_col() +
  geom_errorbar(aes(ymin = lower, ymax = upper), width = 0.5)


Hmm... to draw such a simple plot, the code we wrote ends up longer than the plot itself.
That is because our approach is still to prepare the data first and then map it to aesthetics.
This forces us to do a lot of statistical computation on the data up front, which does not fit well with the tidy-data way of working.
Think of it this way: since all the statistics come from the same data, why not pass the data directly to ggplot and let the statistical computation happen internally?
We can rewrite it like this:
select(diamonds, cut, price) %>%
  ggplot(aes(cut, price, fill = cut)) +
  stat_summary(geom = "bar") +
  stat_summary(geom = "errorbar", width = 0.5)


Two lines of code do the job. Why write so much more? Better to save the time and have a cup of tea.
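The two stat_summary() calls above rely on defaults. As a minimal sketch, assuming ggplot2 3.3.0 or later (where the fun argument is available), the same plot can be written with the summary functions spelled out, which also avoids the "No summary function supplied" message:

```r
library(ggplot2)

# Same plot as above, but each layer states which summary it computes.
p <- ggplot(diamonds, aes(cut, price, fill = cut)) +
  # fun summarises each group to a single y value (here the mean)
  stat_summary(fun = mean, geom = "bar") +
  # fun.data returns y, ymin and ymax (here mean plus/minus one standard error)
  stat_summary(fun.data = mean_se, geom = "errorbar", width = 0.5)

p
```

Spelling out fun and fun.data is optional; it mainly documents what each layer computes.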
How it works
Once you understand how the stat_summary() function works, the other stat_*() functions are easy to understand.
So how should we understand stat_summary()? Let's take an example.
Using the data above, we draw a scatter plot of price against cut:
select(diamonds, cut, price) %>%
  ggplot(aes(cut, price, colour = cut)) +
  geom_point()

Then replace geom_point() with stat_summary() called without any arguments and see what happens:
select(diamonds, cut, price) %>%
  ggplot(aes(cut, price, colour = cut)) +
  stat_summary()

What gets drawn is a pointrange object.
Let's first look at the signature of stat_summary():
stat_summary(
  mapping = NULL,
  data = NULL,
  geom = "pointrange",
  position = "identity",
  ...,
  fun.data = NULL,
  fun = NULL,
  fun.max = NULL,
  fun.min = NULL,
  fun.args = list(),
  na.rm = FALSE,
  orientation = NA,
  show.legend = NA,
  inherit.aes = TRUE,
  fun.y,
  fun.ymin,
  fun.ymax
)
The default geom is pointrange. Which aesthetic mappings does pointrange require?
- x or y
- ymin or xmin
- ymax or xmax
However, we did not define ymin or ymax, so stat_summary() must be computing the corresponding values and passing them to pointrange.
How can we verify this conjecture? First, notice that running the code above prints the following message:
No summary function supplied, defaulting to `mean_se()`
That is, by default the data are transformed with the mean_se() function.
Let's see what mean_se() does:
> mean_se
function (x, mult = 1)
{
    x <- stats::na.omit(x)
    se <- mult * sqrt(stats::var(x)/length(x))
    mean <- mean(x)
    new_data_frame(list(y = mean, ymin = mean - se, ymax = mean +
        se), n = 1)
}
<bytecode: 0x7fca56dfa5d0>
<environment: namespace:ggplot2>
We can see that the data frame returned by this function contains three values, y, ymin, and ymax, which are exactly the aesthetics that pointrange needs.
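To make the return value concrete, here is a small sketch that calls mean_se() on a toy vector (the values are made up for illustration) and repeats the same computation by hand:

```r
library(ggplot2)

# Toy input, chosen only for illustration.
x <- c(1, 2, 3, 4, 5)

# mean_se() returns a one-row data frame with columns y, ymin and ymax.
res <- mean_se(x)
res

# The same quantities computed by hand: mean plus/minus one standard error.
se <- sqrt(var(x) / length(x))
by_hand <- c(y = mean(x), ymin = mean(x) - se, ymax = mean(x) + se)
by_hand
```

Both results should agree, which is what lets stat_summary() hand the columns straight to the geom.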
We can use the layer_data() function to extract the data used in a layer:
> p <- select(diamonds, cut, price) %>%
+ ggplot(aes(cut, price, colour = cut)) +
+ stat_summary()
>
> layer_data(p, 1)
No summary function supplied, defaulting to `mean_se()`
colour x group y ymin ymax PANEL flipped_aes size linetype shape fill alpha stroke
1 #440154FF 1 1 4358.758 4270.025 4447.491 1 FALSE 0.5 1 19 NA NA 1
2 #3B528BFF 2 2 3928.864 3876.302 3981.426 1 FALSE 0.5 1 19 NA NA 1
3 #21908CFF 3 3 3981.760 3945.953 4017.567 1 FALSE 0.5 1 19 NA NA 1
4 #5DC863FF 4 4 4584.258 4547.223 4621.293 1 FALSE 0.5 1 19 NA NA 1
5 #FDE725FF 5 5 3457.542 3431.600 3483.484 1 FALSE 0.5 1 19 NA NA 1
Then compare this with the results of calling mean_se() on each group directly:
> select(diamonds, cut, price) %>%
+ group_by(cut) %>%
+ summarise(mean_se(price))
# A tibble: 5 x 4
cut y ymin ymax
* <ord> <dbl> <dbl> <dbl>
1 Fair 4359. 4270. 4447.
2 Good 3929. 3876. 3981.
3 Very Good 3982. 3946. 4018.
4 Premium 4584. 4547. 4621.
5 Ideal 3458. 3432. 3483.
As we can see, the values of y, ymin, and ymax in the layer data match the results computed by mean_se().

Usage
Since the transformation function can be supplied by us, we can define our own statistical transformations and tailor the plot as needed.
The fun.data parameter of stat_summary() specifies the statistical transformation function. When no summary function is supplied, mean_se() is used.
The function passed to fun.data must return a data frame whose column names are the names of the aesthetics to be set.
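As a minimal sketch of this contract, the helper below (median_iqr is a hypothetical name, not part of ggplot2) returns the group median as y and the quartiles as ymin and ymax, so the default pointrange geom can draw it directly:

```r
library(ggplot2)

# Hypothetical helper: median with the interquartile range as the interval.
# It returns a one-row data frame whose columns are named after aesthetics.
median_iqr <- function(x) {
  q <- stats::quantile(x, probs = c(0.25, 0.5, 0.75), names = FALSE)
  data.frame(y = q[2], ymin = q[1], ymax = q[3])
}

p <- ggplot(diamonds, aes(cut, price, colour = cut)) +
  stat_summary(fun.data = median_iqr)

# One row per cut, with the y, ymin and ymax computed by median_iqr().
summary_data <- layer_data(p, 1)
summary_data[, c("x", "y", "ymin", "ymax")]
```

Any function with this shape can be passed to fun.data; the choice of median and quartiles here is only an illustration.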
Let's draw some customized plots.
1. Error bars for a 95% confidence interval
select(diamonds, cut, price) %>%
  ggplot(aes(cut, price, fill = cut)) +
  stat_summary(geom = "bar") +
  stat_summary(
    geom = "errorbar", width = 0.5,
    # 1.96 standard errors on each side approximates a 95% interval
    fun.data = ~mean_se(., mult = 1.96)
  )

Note: we use the ~ symbol to construct an anonymous function, which is equivalent to
function(x) {mean_se(x, mult = 1.96)}
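The equivalence can be checked outside of a plot. This sketch assumes the formula is converted to a function the way rlang::as_function() does it, and uses a toy vector made up for illustration:

```r
library(ggplot2)

# Toy input, chosen only for illustration.
x <- c(2, 4, 4, 6, 9)

# Assumption: a ~ formula is turned into a function as rlang::as_function() would,
# with `.` standing for the first argument.
lambda_fun <- rlang::as_function(~ mean_se(., mult = 1.96))
plain_fun <- function(x) {
  mean_se(x, mult = 1.96)
}

# Both forms should return the same one-row data frame.
same <- isTRUE(all.equal(lambda_fun(x), plain_fun(x)))
same
```

The formula form is shorter, while the explicit function is easier to read when the body grows beyond a single call.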
2. Specify the fill color
We use the transformation function to set the fill color per group: groups whose median price is below the threshold get one color, and the remaining groups get another.
# Returns the group median as y, plus a fill color chosen by comparing it with cut_off
func_median_color <- function(x, cut_off) {
  tibble(y = median(x)) %>%
    mutate(fill = if_else(y < cut_off, "#80b1d3", "#fb8072"))
}
select(diamonds, cut, price) %>%
  ggplot(aes(cut, price)) +
  stat_summary(
    fun.data = func_median_color,
    fun.args = list(cut_off = 2800),  # extra arguments, passed as a named list
    geom = "bar"
  )

We pass the extra argument through fun.args instead of wrapping the call in an anonymous function. This is equivalent to
fun.data = ~ func_median_color(., cut_off = 2800)
3. Set the point size in the pointrange plot
We set the size of the central point of each pointrange according to the number of observations in the group:
select(diamonds, cut, price) %>%
  ggplot(aes(cut, price, colour = cut)) +
  stat_summary(
    fun.data = function(x) {
      mean_se(x) %>%
        # scale the size by the group's share of all observations
        mutate(size = length(x) * 5 / nrow(diamonds))
    }
  )
