R data processing (XIII) — tidyr

Time:2022-1-2

1. Preface

In this section, we will introduce the clean way of data, andtidyverseOne of the core packages oftidyr

It provides a series of tools to help sort out chaotic data, which is also the focus of this section.

library(tidyverse)

2. Clean data

Each data can be expressed in different ways. In the following example, we show the same data in four different ways.

Each data includescountry, year, populationandcases 4Variables have the same value, but they organize data differently.

table1
#> # A tibble: 6 x 4
#>   country      year  cases population
#>   <chr>       <int>  <int>      <int>
#> 1 Afghanistan  1999    745   19987071
#> 2 Afghanistan  2000   2666   20595360
#> 3 Brazil       1999  37737  172006362
#> 4 Brazil       2000  80488  174504898
#> 5 China        1999 212258 1272915272
#> 6 China        2000 213766 1280428583

table2
#> # A tibble: 12 x 4
#>   country      year type           count
#>   <chr>       <int> <chr>          <int>
#> 1 Afghanistan  1999 cases            745
#> 2 Afghanistan  1999 population  19987071
#> 3 Afghanistan  2000 cases           2666
#> 4 Afghanistan  2000 population  20595360
#> 5 Brazil       1999 cases          37737
#> 6 Brazil       1999 population 172006362
#> # … with 6 more rows

table3
#> # A tibble: 6 x 3
#>   country      year rate             
#> * <chr>       <int> <chr>            
#> 1 Afghanistan  1999 745/19987071     
#> 2 Afghanistan  2000 2666/20595360    
#> 3 Brazil       1999 37737/172006362  
#> 4 Brazil       2000 80488/174504898  
#> 5 China        1999 212258/1272915272
#> 6 China        2000 213766/1280428583

# Spread across two tibbles
table4a  # cases
#> # A tibble: 3 x 3
#>   country     `1999` `2000`
#> * <chr>        <int>  <int>
#> 1 Afghanistan    745   2666
#> 2 Brazil       37737  80488
#> 3 China       212258 213766

table4b  # population
#> # A tibble: 3 x 3
#>   country         `1999`     `2000`
#> * <chr>            <int>      <int>
#> 1 Afghanistan   19987071   20595360
#> 2 Brazil       172006362  174504898
#> 3 China       1272915272 1280428583

These are all representations of the same basic data, but they are not all easy to use.

There are three interrelated rules for data neatness

  1. Each variable must have its own column
  2. Each observation must be in its own row
  3. Each value must be in its own cell

The following picture also shows these three rules

R data processing (XIII) -- tidyr

image.png

These three rules are interrelated, and it is impossible to satisfy only two of them. This interrelationship leads to a simpler and practical set of instructions

  1. Save each dataset astibble
  2. Each variable represents a column

In the above example, onlytable1Each variable is expressed as a column

Why make sure the data is clean? There are two main advantages

  1. There are advantages to storing data in a consistent way. If your data structure is consistent, it is easier to learn and use these tool functions

  2. Each variable as a column has an advantage that it can give full play toRVectorization characteristics of.RMost built-in functions can be used with value vectors, which makes the conversion of data particularly natural

dplyrggplot2andtidyverseAll other packages in are designed to handle clean data. Here are some small examples to show how to usetable1

> table1 %>% 
+     mutate(rate = cases / population * 10000)
# A tibble: 6 x 5
  country      year  cases population  rate
  <chr>       <dbl>  <dbl>      <dbl> <dbl>
1 Afghanistan  1999    745   19987071 0.373
2 Afghanistan  2000   2666   20595360 1.29 
3 Brazil       1999  37737  172006362 2.19 
4 Brazil       2000  80488  174504898 4.61 
5 China        1999 212258 1272915272 1.67 
6 China        2000 213766 1280428583 1.67

> table1 %>% 
+     count(year, wt = cases)
# A tibble: 2 x 2
   year      n
  <dbl>  <dbl>
1  1999 250740
2  2000 296920

> ggplot(table1, aes(year, cases)) + 
+     geom_line(aes(group = country), colour = "grey50") + 
+     geom_point(aes(colour = country))
R data processing (XIII) -- tidyr

2.1 thinking exercises

  1. calculationtable2andtable4a + table4bRatio ofrate, you may need to perform the following4Operations
  • The number of TB cases in each country is extracted every year.
  • Extract the matching population per country / region per year
  • Divide the number of cases by the number of people and multiply by10000
  • Then put the ratio in the right place

Which representation is the easiest to use? Which is the hardest? Why?

  1. usetable2replacetable1Recreate the graph of cases over time. What do you need to do first?

3. Pivot table

Usually in real analysis, the data we encounter need to be further sorted out.

The first step is to figure out what variables and observations are. Sometimes this is easy, but sometimes you may need to ask the data provider

The second step is to solve the following two common problems:

  1. A variable may be distributed over multiple columns
  2. An observation may be scattered in multiple rows

To solve these problems, you can usetidyrThe two most important functions provided are:pivot_longer()andpivot_wider()

3.1 Longer

A common problem is that the column names in the dataset are not variable names, but variable values

Like a watchtable4a, its column name1999and2000Represents the value of the year variable, and the values of these two columns represent the corresponding yearcasesThe value of the variable, and each line is the value of two variables, not one.

> table4a
# A tibble: 3 x 3
  country     `1999` `2000`
  <chr>        <dbl>  <dbl>
1 Afghanistan    745   2666
2 Brazil       37737  80488
3 China       212258 213766

To organize such a dataset, we need to pivot the problematic column into a new pair of variables.

To describe this operation, we need three parameters:

  • Column names are values, not variables, in this case, yes1999and2000column
  • The variable name to move the column name to, in this caseyear
  • The name of the variable to which you want to move the value of the column. Here it iscases

Pass these parameters topivot_longer()In function

> table4a %>% 
+     pivot_longer(c(`1999`, `2000`), names_to = "year", values_to = "cases")
# A tibble: 6 x 3
  country     year   cases
  <chr>       <chr>  <dbl>
1 Afghanistan 1999     745
2 Afghanistan 2000    2666
3 Brazil      1999   37737
4 Brazil      2000   80488
5 China       1999  212258
6 China       2000  213766

Column names can be useddplyr::select()Style representation assignment. But there are only two columns here, so they are listed separately

be careful1999and2000Yes and no grammatical names (because they don’t start with letters), so we have to tick them back

becausetable4aDoes not exist inyearandcasesColumn name, so we enclose its name in quotation marks.

R data processing (XIII) -- tidyr

image.png

In the final result, we get two new columns and delete the two columns that need perspective.

pivot_longer()By increasing the number of rows and reducing the number of columns, the data set becomes longer. Of course, the length here is mainly relative to the original data.

We can usepivot_longer()Organize in a similar waytable4b, the only difference is the value of the variable stored in the cell

> table4b %>% 
+     pivot_longer(c(`1999`, `2000`), names_to = "year", values_to = "population")
# A tibble: 6 x 3
  country     year  population
  <chr>       <chr>      <int>
1 Afghanistan 1999    19987071
2 Afghanistan 2000    20595360
3 Brazil      1999   172006362
4 Brazil      2000   174504898
5 China       1999  1272915272
6 China       2000  1280428583

To put the sortedtable4aandtable4bMerge into onetibleIn, we can usedplyr::left_join()Connect

> tidy4a <- table4a %>% 
+     pivot_longer(c(`1999`, `2000`), names_to = "year", values_to = "cases")
> tidy4b <- table4b %>% 
+     pivot_longer(c(`1999`, `2000`), names_to = "year", values_to = "population")
> left_join(tidy4a, tidy4b)
Joining, by = c("country", "year")
# A tibble: 6 x 4
  country     year   cases population
  <chr>       <chr>  <dbl>      <int>
1 Afghanistan 1999     745   19987071
2 Afghanistan 2000    2666   20595360
3 Brazil      1999   37737  172006362
4 Brazil      2000   80488  174504898
5 China       1999  212258 1272915272
6 China       2000  213766 1280428583

3.2 Wider

pivot_wilder()Andpivot_longer()Instead, it can be used when observations are scattered across multiple rows.

For example, intable2intypeObservations observations are distributed in two rows of country and year

> table2
# A tibble: 12 x 4
   country      year type            count
   <chr>       <int> <chr>           <int>
 1 Afghanistan  1999 cases             745
 2 Afghanistan  1999 population   19987071
 3 Afghanistan  2000 cases            2666
 4 Afghanistan  2000 population   20595360
 5 Brazil       1999 cases           37737
 6 Brazil       1999 population  172006362
 7 Brazil       2000 cases           80488
 8 Brazil       2000 population  174504898
 9 China        1999 cases          212258
10 China        1999 population 1272915272
11 China        2000 cases          213766
12 China        2000 population 1280428583

We start with something likepivot_longer()Analysis of the way. However, we only need two parameters this time

  • The column from which to get the variable name istype
  • The column from which to get the value, in this casecount

Once we understand this, we can use itpivot_wilder()

> table2 %>%
+     pivot_wider(names_from = type, values_from = count)
# A tibble: 6 x 4
  country      year  cases population
  <chr>       <int>  <int>      <int>
1 Afghanistan  1999    745   19987071
2 Afghanistan  2000   2666   20595360
3 Brazil       1999  37737  172006362
4 Brazil       2000  80488  174504898
5 China        1999 212258 1272915272
6 China        2000 213766 1280428583
R data processing (XIII) -- tidyr

image.png

Maybe you’ve guessed from their names,pivot_wilder()andpivot_longer()They are complementary.

pivot_longer()Narrow and lengthen the wide table;pivot_wider()Make the long watch shorter and wider

3.3 thinking practice

  1. Why?pivot_wilder()andpivot_longer()Incomplete symmetry? Think carefully about the following example.
stocks <- tibble(
  year   = c(2015, 2015, 2016, 2016),
  half  = c(   1,    2,     1,    2),
  return = c(1.88, 0.59, 0.92, 0.17)
)
stocks %>% 
  pivot_wider(names_from = year, values_from = return) %>% 
  pivot_longer(`2015`:`2016`, names_to = "year", values_to = "return")

Tips: note the variable type and consider the column name

pivot_longer()There is onenames_ptypesParameters, for examplenames_ptypes=list(year=double)。 What is its function?

  1. Why is the following code wrong?
> table4a %>% 
+     pivot_longer(c(1999, 2000), names_to = "year", values_to = "cases")
Error: can't subset columns that don't exist
x Locations 1999 and 2000 don't exist.
ℹ There are only 3 columns.
Run `rlang::last_error()` to see where the error occurred.
  1. What happens if you want to widen the table? Why? How do you add new columns to uniquely identify each value?
people <- tribble(
  ~name,             ~names,  ~values,
  #-----------------|--------|------
  "Phillip Woods",   "age",       45,
  "Phillip Woods",   "height",   186,
  "Phillip Woods",   "age",       50,
  "Jessica Cordero", "age",       37,
  "Jessica Cordero", "height",   156
)
  1. Organize the following simple table. Do you need to make the watch wider or longer? What should the variable be set to?
preg <- tribble(
  ~pregnant, ~male, ~female,
  "yes",     NA,    10,
  "no",      20,    12
)

Recommended Today

Vue2 technology finishing 3 – Advanced chapter – update completed

3. Advanced chapter preface Links to basic chapters:https://www.cnblogs.com/xiegongzi/p/15782921.html Link to component development:https://www.cnblogs.com/xiegongzi/p/15823605.html 3.1. Custom events of components 3.1.1. Binding custom events There are two implementation methods here: one is to use v-on with vuecomponent$ Emit implementation [PS: this method is a little similar to passing from child to parent]; The other is to use ref […]