R for Data Science (notes): data tidying (splitting and merging columns)

Time:2022-5-16

After some thought, I decided to keep an index of these notes so they are easier to look up:

R for Data Science (notes): data transformation (using filter)
R for Data Science (notes): data transformation (basic use of select)
R for Data Science (notes): data transformation (combining select with other functions)
R for Data Science (notes): data transformation (creating new variables)
R for Data Science (notes): data transformation (sorting rows)
R for Data Science (notes): data transformation (summaries)
R for Data Science (notes): data tidying (pivot functions)

I think the widespread use of the tidyverse for data processing owes a great deal to the pipe operator %>% and to its data-processing verbs.

Solving the most important and most common problems in the least time is what I call efficiency; working through the difficult cases that remain is what I call improvement.

1. Splitting a column: the separate() function

separate() splits one column into several columns, breaking each value wherever the separator appears.

Sample data

table3
#> # A tibble: 6 x 3
#>   country      year rate             
#> * <chr>       <int> <chr>            
#> 1 Afghanistan  1999 745/19987071     
#> 2 Afghanistan  2000 2666/20595360    
#> 3 Brazil       1999 37737/172006362  
#> 4 Brazil       2000 80488/174504898  
#> 5 China        1999 212258/1272915272
#> 6 China        2000 213766/1280428583

1.1 A first look at splitting

Now split the rate column into two columns, using "/" as the separator:

table3 %>% 
  separate(rate, into = c("cases", "population"))
#> # A tibble: 6 x 4
#>   country      year cases  population
#>   <chr>       <int> <chr>  <chr>     
#> 1 Afghanistan  1999 745    19987071  
#> 2 Afghanistan  2000 2666   20595360  
#> 3 Brazil       1999 37737  172006362 
#> 4 Brazil       2000 80488  174504898 
#> 5 China        1999 212258 1272915272
#> 6 China        2000 213766 1280428583

The code above does not specify a separator because, by default, separate() splits wherever it finds a non-alphanumeric character (anything that is neither a letter nor a digit). It therefore treats "/" as the separator automatically.
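The default can be surprising when the values contain other non-alphanumeric characters. The sketch below uses a small made-up data frame (the column name `ratio` and the value "1.5/2.5" are assumptions for illustration, not part of table3) to show why giving the separator explicitly is safer in that case:

```r
library(tidyr)

# Hypothetical data: the value contains "." as well as "/".
df <- data.frame(ratio = "1.5/2.5", stringsAsFactors = FALSE)

# Default separator: "." also counts as non-alphanumeric, so the value
# is broken into more than two pieces and the extras are dropped
# (tidyr warns about this; the warning is silenced here).
by_default <- suppressWarnings(
  separate(df, ratio, into = c("a", "b"))
)

# Explicit separator: only "/" is used, so the decimals survive.
by_slash <- separate(df, ratio, into = c("a", "b"), sep = "/")
by_slash$a  # "1.5"
by_slash$b  # "2.5"
```

For data like table3, where "/" is the only non-alphanumeric character in the column, the two calls give the same result.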

[Figure: diagram of the split operation]

Of course, the separator can also be given explicitly:

table3 %>% 
  separate(rate, into = c("cases", "population"), sep = "/")

The sep argument is actually a regular expression, which makes separate() extremely powerful for splitting strings with more complicated separators.
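As a small illustration of sep being a regular expression, the sketch below splits values in which the slash may or may not be surrounded by spaces. The data frame is made up for the example (its numbers are borrowed from table3, but the inconsistent spacing is an assumption):

```r
library(tidyr)

# Hypothetical data: the same kind of values as table3$rate,
# but with inconsistent spacing around the slash.
df <- data.frame(
  rate = c("745 / 19987071", "2666/20595360"),
  stringsAsFactors = FALSE
)

# Pattern: zero or more spaces, a slash, zero or more spaces.
clean <- separate(df, rate, into = c("cases", "population"), sep = " */ *")
clean$cases       # "745"  "2666"
clean$population  # "19987071" "20595360"
```

Because the spaces are part of the matched separator, they are removed along with the slash.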

1.2 Converting column types after splitting

Take a closer look at the column types in the default result above: cases and population are both <chr>.

By default, the new columns keep the character type of the original column. For this data we want numbers. The conversion could be done in a later step, but separate() can do it directly with convert = TRUE:

table3 %>% 
  separate(rate, into = c("cases", "population"), convert = TRUE)
#> # A tibble: 6 x 4
#>   country      year  cases population
#>   <chr>       <int>  <int>      <int>
#> 1 Afghanistan  1999    745   19987071
#> 2 Afghanistan  2000   2666   20595360
#> 3 Brazil       1999  37737  172006362
#> 4 Brazil       2000  80488  174504898
#> 5 China        1999 212258 1272915272
#> 6 China        2000 213766 1280428583
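To make the comparison concrete, here is a minimal sketch of both routes on a two-row data frame (built by hand from the first two rate values of table3, so the example does not depend on the built-in dataset): converting in a later step with as.integer(), and converting inside separate().

```r
library(tidyr)

# Two rate values copied from table3.
df <- data.frame(
  rate = c("745/19987071", "2666/20595360"),
  stringsAsFactors = FALSE
)

# Route 1: split first (columns stay character), convert afterwards.
later <- separate(df, rate, into = c("cases", "population"), sep = "/")
later$cases <- as.integer(later$cases)
later$population <- as.integer(later$population)

# Route 2: let separate() convert the new columns itself.
direct <- separate(df, rate, into = c("cases", "population"),
                   sep = "/", convert = TRUE)

class(direct$cases)  # "integer"
```

Both routes end with integer columns; convert = TRUE simply saves the extra step.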

2. Merging columns: the unite() function

Where there is an operation for splitting a column, there is also one for merging columns: unite().
For example:

table5 %>% 
  unite(new, century, year)
#> # A tibble: 6 x 3
#>   country     new   rate             
#>   <chr>       <chr> <chr>            
#> 1 Afghanistan 19_99 745/19987071     
#> 2 Afghanistan 20_00 2666/20595360    
#> 3 Brazil      19_99 37737/172006362  
#> 4 Brazil      20_00 80488/174504898  
#> 5 China       19_99 212258/1272915272
#> 6 China       20_00 213766/1280428583

By default, unite() joins the values of the two columns with an underscore ("_"). To join them differently, set the sep argument:

table5 %>% 
  unite(new, century, year, sep = "")
#> # A tibble: 6 x 3
#>   country     new   rate             
#>   <chr>       <chr> <chr>            
#> 1 Afghanistan 1999  745/19987071     
#> 2 Afghanistan 2000  2666/20595360    
#> 3 Brazil      1999  37737/172006362  
#> 4 Brazil      2000  80488/174504898  
#> 5 China       1999  212258/1272915272
#> 6 China       2000  213766/1280428583

This is easy to follow: just add the sep argument to the call.
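Since unite() is the counterpart of separate(), the two can undo each other. The sketch below uses a hand-made data frame with the same century and year values as table5 (the object names are assumptions for illustration) and relies on the fact that a numeric sep in separate() is read as a character position to split at:

```r
library(tidyr)

# Hypothetical two-row data frame in the shape of table5's columns.
df <- data.frame(
  century = c("19", "20"),
  year = c("99", "00"),
  stringsAsFactors = FALSE
)

# Merge the two columns with no separator between them.
united <- unite(df, new, century, year, sep = "")
united$new  # "1999" "2000"

# Split again after the second character to recover the originals.
restored <- separate(united, new, into = c("century", "year"), sep = 2)
restored$century  # "19" "20"
restored$year     # "99" "00"
```

Note that the restored columns are still character, which keeps the leading zero in "00".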
