R data visualization – ggplot statistical layer

Time:2022-5-9

Preface

Although we have spent many sections on ggplot2, we have drawn layers almost entirely with geom_*() functions and have rarely used stat_*() functions.

Of course, geom_*() functions can already handle most plotting work, so is it really necessary to use stat_*() functions?

Let's look at an example. Suppose we have the following data:

> select(diamonds, cut, price)
# A tibble: 53,940 x 2
   cut       price
   <ord>     <int>
 1 Ideal       326
 2 Premium     326
 3 Good        327
 4 Premium     334
 5 Good        335
 6 Very Good   336
 7 Very Good   336
 8 Very Good   337
 9 Fair        337
10 Very Good   338
# … with 53,930 more rows

We want to draw a bar chart showing the average price for each cut.

The conventional approach is to use the tidyverse to wrangle the data, compute the required statistics, and then map them to the corresponding aesthetics:

library(tidyverse)  # attaches dplyr and ggplot2 (which provides diamonds)

select(diamonds, cut, price) %>%
  group_by(cut) %>%
  summarise(
    mean_price = mean(price),
    .groups = "drop"
  ) %>%
  ggplot(aes(cut, mean_price, fill = cut)) +
  geom_col()

Now suppose this is not enough and we also want to add error bars to the bar chart.

This is also simple: we can compute the statistics on the data first and then plot them.

select(diamonds, cut, price) %>%
  group_by(cut) %>%
  summarise(
    mean_price = mean(price),
    se = sqrt(var(price) / length(price)),  # standard error of the mean
    .groups = "drop"
  ) %>%
  mutate(lower = mean_price - se, upper = mean_price + se) %>%
  ggplot(aes(cut, mean_price, fill = cut)) +
  geom_col() +
  geom_errorbar(aes(ymin = lower, ymax = upper), width = 0.5)

Hmm... to draw such a simple figure, we wrote code that takes up more space than the figure itself.

That is because our approach is still to prepare the data first and then map it to aesthetics.

This forces us to run a lot of statistical calculations on the data up front, which is at odds with the tidy-data way of working.

We can think about it like this: since all of the statistics come from the same data, why not pass the data directly to ggplot and let the statistical computation happen internally?

We can rewrite it like this:

select(diamonds, cut, price) %>%
  ggplot(aes(cut, price, fill = cut)) +
  stat_summary(geom = "bar") +
  stat_summary(geom = "errorbar", width = 0.5)

Two lines of code do the job. Why write so much more? Better to save the time and have a cup of tea.
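The two stat_summary() calls above rely on the function's default summary, which the next section identifies as mean_se(). As a minimal sketch, the same plot can be written with the summary function named explicitly, and layer_data() can then show what each layer computed. The variable names p, bar_heights, and errorbar_limits are only illustrative.

```r
library(ggplot2)

# Same bar chart with error bars, but naming the summary function explicitly.
p <- ggplot(diamonds, aes(cut, price, fill = cut)) +
  stat_summary(fun.data = mean_se, geom = "bar") +
  stat_summary(fun.data = mean_se, geom = "errorbar", width = 0.5)

# Values computed inside ggplot2 for each layer (one row per cut).
bar_heights <- layer_data(p, 1)$y
errorbar_limits <- layer_data(p, 2)[, c("ymin", "ymax")]

bar_heights
errorbar_limits
```

Naming the function makes the intent visible to readers of the code; the plot itself should be unchanged.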

How it works

Once you understand how stat_summary() works, the other stat_*() functions become easy to understand.

So how should we understand stat_summary()? Let's look at an example.

Using the data above, we draw a point plot of price against cut:

select(diamonds, cut, price) %>%
  ggplot(aes(cut, price, colour = cut)) +
  geom_point()

Then replace geom_point() with stat_summary() called without arguments and see what happens:

select(diamonds, cut, price) %>%
  ggplot(aes(cut, price, colour = cut)) +
  stat_summary()

What gets drawn is a pointrange object.

Let's first look at the signature of stat_summary():

stat_summary(
  mapping = NULL,
  data = NULL,
  geom = "pointrange",
  position = "identity",
  ...,
  fun.data = NULL,
  fun = NULL,
  fun.max = NULL,
  fun.min = NULL,
  fun.args = list(),
  na.rm = FALSE,
  orientation = NA,
  show.legend = NA,
  inherit.aes = TRUE,
  fun.y,
  fun.ymin,
  fun.ymax
)

The default geom is pointrange. So which aesthetic mappings does pointrange require?

  • x or y
  • ymin or xmin
  • ymax or xmax

However, we did not define ymin or ymax, so stat_summary() must have computed the corresponding values and passed them to pointrange.

How can we verify this conjecture? First, notice that running the code above prints a message:

No summary function supplied, defaulting to `mean_se()`

In other words, the data are transformed with the mean_se() function by default.

Let's see what mean_se() does:

> mean_se
function (x, mult = 1) 
{
    x <- stats::na.omit(x)
    se <- mult * sqrt(stats::var(x)/length(x))
    mean <- mean(x)
    new_data_frame(list(y = mean, ymin = mean - se, ymax = mean + 
        se), n = 1)
}
<bytecode: 0x7fca56dfa5d0>
<environment: namespace:ggplot2>

We can see that the data frame returned by this function contains three values (y, ymin, and ymax), exactly the aesthetics that pointrange needs.
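To make the return value concrete, here is a small sketch that calls mean_se() on a toy vector and recomputes the same three quantities by hand. The vector x and the names res and by_hand are made up for illustration.

```r
library(ggplot2)

x <- c(1, 2, 3, 4, 5)

# mean_se() returns a one-row data frame with columns y, ymin, ymax.
res <- mean_se(x)
res

# The same quantities computed by hand: mean plus/minus one standard error.
se <- sqrt(var(x) / length(x))
by_hand <- c(y = mean(x), ymin = mean(x) - se, ymax = mean(x) + se)
by_hand
```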

We can use the layer_data() function to extract the data used by a layer:

> p <- select(diamonds, cut, price) %>%
+   ggplot(aes(cut, price, colour = cut)) +
+   stat_summary()
>
> layer_data(p, 1)
No summary function supplied, defaulting to `mean_se()`
     colour x group        y     ymin     ymax PANEL flipped_aes size linetype shape fill alpha stroke
1 #440154FF 1     1 4358.758 4270.025 4447.491     1       FALSE  0.5        1    19   NA    NA      1
2 #3B528BFF 2     2 3928.864 3876.302 3981.426     1       FALSE  0.5        1    19   NA    NA      1
3 #21908CFF 3     3 3981.760 3945.953 4017.567     1       FALSE  0.5        1    19   NA    NA      1
4 #5DC863FF 4     4 4584.258 4547.223 4621.293     1       FALSE  0.5        1    19   NA    NA      1
5 #FDE725FF 5     5 3457.542 3431.600 3483.484     1       FALSE  0.5        1    19   NA    NA      1

Then compare this with the result of calling mean_se() directly:

> select(diamonds, cut, price) %>%
+   group_by(cut) %>%
+   summarise(mean_se(price))
# A tibble: 5 x 4
  cut           y  ymin  ymax
* <ord>     <dbl> <dbl> <dbl>
1 Fair      4359. 4270. 4447.
2 Good      3929. 3876. 3981.
3 Very Good 3982. 3946. 4018.
4 Premium   4584. 4547. 4621.
5 Ideal     3458. 3432. 3483.

As we can see, the values of y, ymin, and ymax match the results computed by mean_se().
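The comparison can also be done programmatically rather than by eye. The following is a minimal sketch using base R to apply mean_se() per cut and all.equal() to compare each column against the layer data; the names from_layer, by_hand, and matches are illustrative.

```r
library(ggplot2)

p <- ggplot(diamonds, aes(cut, price)) +
  stat_summary(fun.data = mean_se)

# Values computed inside the layer (rows follow the order of the cut levels).
from_layer <- layer_data(p, 1)[, c("y", "ymin", "ymax")]

# The same summary computed outside ggplot2, one row per cut level.
by_hand <- do.call(rbind, lapply(split(diamonds$price, diamonds$cut), mean_se))

# Compare column by column, allowing for floating-point tolerance.
matches <- sapply(c("y", "ymin", "ymax"), function(col) {
  isTRUE(all.equal(from_layer[[col]], by_hand[[col]]))
})
matches
```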


Usage

Since the transformation function can be specified, we can define our own statistical transformation and tailor the plot as needed.

The fun.data parameter of stat_summary() specifies the statistical transformation function; when no summary function is supplied at all, mean_se() is used.

The function passed to fun.data must return a data frame whose column names are the names of the aesthetics to be mapped.
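As a minimal sketch of that contract, the helper below returns the median together with the first and third quartiles under the column names y, ymin, and ymax. The name median_iqr is hypothetical; it is not part of ggplot2.

```r
library(ggplot2)

# Hypothetical helper (not part of ggplot2): median with the interquartile range.
# It returns a one-row data frame whose column names are aesthetic names.
median_iqr <- function(x) {
  q <- quantile(x, c(0.25, 0.5, 0.75), names = FALSE)
  data.frame(ymin = q[1], y = q[2], ymax = q[3])
}

p <- ggplot(diamonds, aes(cut, price, colour = cut)) +
  stat_summary(fun.data = median_iqr)

# One row per cut, holding the values the pointrange geom will draw.
iqr_data <- layer_data(p, 1)[, c("x", "y", "ymin", "ymax")]
iqr_data
```

Because the default geom is pointrange, each cut is drawn as a point at the median with a line spanning the quartiles.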

Let's draw a few customized plots.

1. Error bars showing a 95% confidence interval

select(diamonds, cut, price) %>%
  ggplot(aes(cut, price, fill = cut)) +
  stat_summary(geom = "bar") +
  stat_summary(
    geom = "errorbar", width = 0.5,
    fun.data = ~mean_se(., mult = 1.96)
  )

Note: we use the ~ symbol to construct an anonymous function, which is equivalent to

function(x) {mean_se(x, mult = 1.96)}
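The equivalence can be checked directly. This sketch assumes the formula is turned into a function the way rlang::as_function() does it (ggplot2 depends on rlang, so the package should already be installed); the names lambda, explicit, and same_result are illustrative.

```r
library(ggplot2)

# Formula shorthand converted into a function; "." stands for the first argument.
lambda <- rlang::as_function(~ mean_se(., mult = 1.96))

# The explicit anonymous function from the text.
explicit <- function(x) {
  mean_se(x, mult = 1.96)
}

# Apply both to the prices of one cut and compare the results.
x <- diamonds$price[diamonds$cut == "Fair"]
same_result <- isTRUE(all.equal(lambda(x), explicit(x)))
same_result
```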

2. Specify the fill color

We use the transformation function to set the color of groups that meet a condition: groups whose median is below the threshold get one color, and the remaining groups get another.

func_median_color <- function(x, cut_off) {
  tibble(y = median(x)) %>%
    mutate(fill = if_else(y < cut_off, "#80b1d3", "#fb8072"))
}

select(diamonds, cut, price) %>%
  ggplot(aes(cut, price)) +
  stat_summary(
    fun.data = func_median_color,
    fun.args = list(cut_off = 2800),  # fun.args expects a list of extra arguments
    geom = "bar"
  )

We pass the extra parameter through fun.args instead of using an anonymous function; this is equivalent to

fun.data = ~ func_median_color(., cut_off = 2800)
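To see which color each bar receives, the layer data can be inspected again. The sketch below repeats the function from this example so that it runs on its own and compares the fill column with medians computed in base R; bar_data, med, and expected_fill are illustrative names.

```r
library(dplyr)
library(ggplot2)

func_median_color <- function(x, cut_off) {
  tibble(y = median(x)) %>%
    mutate(fill = if_else(y < cut_off, "#80b1d3", "#fb8072"))
}

p <- ggplot(diamonds, aes(cut, price)) +
  stat_summary(
    fun.data = func_median_color,
    fun.args = list(cut_off = 2800),
    geom = "bar"
  )

# The fill column returned by the summary function is carried into the layer data.
bar_data <- layer_data(p, 1)
bar_data[, c("x", "y", "fill")]

# Cross-check against medians computed outside ggplot2.
med <- as.numeric(tapply(diamonds$price, diamonds$cut, median))
expected_fill <- ifelse(med < 2800, "#80b1d3", "#fb8072")
```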

3. Set the point size in the pointrange plot

We set the size of the point in each pointrange according to the number of observations in the group.

select(diamonds, cut, price) %>%
  ggplot(aes(cut, price, colour = cut)) +
  stat_summary(
    fun.data = function(x) {
      mean_se(x) %>%
        mutate(size = length(x) * 5 / nrow(diamonds))
    }
  )
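Separately from the three examples, the signature shown earlier also lists fun, fun.min, and fun.max, which take functions that each return a single number. As a sketch, assuming a ggplot2 version that provides these arguments (they appear in the signature above), the median with the full price range per cut could be drawn as follows; range_data is an illustrative name.

```r
library(ggplot2)

# fun gives y, fun.min gives ymin, fun.max gives ymax; each returns one number.
p <- ggplot(diamonds, aes(cut, price, colour = cut)) +
  stat_summary(fun = median, fun.min = min, fun.max = max)

# One row per cut with the computed centre and limits.
range_data <- layer_data(p, 1)[, c("x", "y", "ymin", "ymax")]
range_data
```

This form is convenient when the three values come from existing single-value functions, while fun.data suits cases where they are computed together or extra aesthetics such as fill or size are returned.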