# Data mining 20210112 learning notes

Time: 2022-1-21

## 1. Data cleaning

``````
## Raw data
> library(tidyr)
> test <- data.frame(geneid = paste0("gene",1:4),
+                  sample1 = c(1,4,7,10),
+                  sample2 = c(2,5,0.8,11),
+                  sample3 = c(0.3,6,9,12))
``````

### Wide to long

``````
> test_gather <- gather(data = test,
+                     key = sample_nm,
+                     value = exp,
+                     - geneid)
``````

### Long to wide

``````
> test_re <- spread(data = test_gather,
+                 key = sample_nm,
+                 value = exp)
``````
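
gather() and spread() still work, but the tidyr documentation describes them as superseded by pivot_longer() and pivot_wider(). The following is a minimal sketch of the same round trip with the newer functions, assuming tidyr 1.0.0 or later is installed. The names test_long and test_wide are illustrative and do not appear in the notes above.

```r
library(tidyr)

test <- data.frame(geneid = paste0("gene", 1:4),
                   sample1 = c(1, 4, 7, 10),
                   sample2 = c(2, 5, 0.8, 11),
                   sample3 = c(0.3, 6, 9, 12))

# Wide to long: every column except geneid becomes a (sample_nm, exp) pair
test_long <- pivot_longer(test,
                          cols = -geneid,
                          names_to = "sample_nm",
                          values_to = "exp")

# Long to wide: spread the sample names back out into columns
test_wide <- pivot_wider(test_long,
                         names_from = sample_nm,
                         values_from = exp)
```

Both functions return a tibble, so as.data.frame() may be needed when a plain data frame is expected downstream.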

## 2. Separate and unite

``````
# Raw data
> test <- data.frame(x = c( "a,b", "a,d", "b,c"))
``````

### Split

``````
> test_separate <- separate(test,x, c("X", "Y"),sep = ",")
``````

### Merge

``````
> test_re <- unite(test_separate,"x",X,Y,sep = ",")
``````

## 3. Handling NA

``````
### Raw data
> X<-data.frame(X1 = LETTERS[1:5],X2 = 1:5)
> X[2,2] <- NA
> X[4,1] <- NA
> X
    X1 X2
1    A  1
2    B NA
3    C  3
4 <NA>  4
5    E  5
``````

### Remove rows containing NA, optionally checking only certain columns

``````
> drop_na(X) # remove every row that contains an NA
  X1 X2
1  A  1
2  C  3
3  E  5
> drop_na(X, X1) # check only column X1 and remove the rows where X1 is NA. Note that nothing is assigned, so X itself is unchanged
  X1 X2
1  A  1
2  B NA
3  C  3
4  E  5
``````

### Replace NA

``````
> replace_na(X$X2,0) # replace the NA values in column X2 with 0
[1] 1 0 3 4 5
``````

### Fill NA with the value from the previous row

``````
> X
    X1 X2
1    A  1
2    B NA
3    C  3
4 <NA>  4
5    E  5
> fill(X, X2) # fill the missing values in column X2 with the value from the previous row
    X1 X2
1    A  1
2    B  1
3    C  3
4 <NA>  4
5    E  5
``````
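
The three NA helpers above also accept a few options that are easy to miss. This is a small sketch, assuming tidyr is installed. The object names X_replaced, X_filled_up and X_clean are illustrative, and the replacement value "unknown" is an arbitrary choice.

```r
library(tidyr)

X <- data.frame(X1 = LETTERS[1:5], X2 = 1:5, stringsAsFactors = FALSE)
X[2, 2] <- NA
X[4, 1] <- NA

# On a data frame, replace_na() takes a named list: one replacement per column
X_replaced <- replace_na(X, list(X1 = "unknown", X2 = 0L))

# fill() defaults to .direction = "down"; "up" borrows from the next row instead
X_filled_up <- fill(X, X2, .direction = "up")

# None of these calls modify X; assign the result to keep it
X_clean <- drop_na(X)
```

Because none of the functions change X in place, X still contains both NA values after this code runs.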

For the full list of functions, see the cheatsheets: https://rstudio.com/resources/cheatsheets/

## dplyr core functions

#### Data preparation

``````
> library(dplyr)
> test <- iris[c(1:2,51:52,101:102),]
> rownames(test) =NULL
``````

### Five core functions

``````
### 1. mutate(): add a new column
> mutate(test, new = Sepal.Length * Sepal.Width)
``````
``````
### 2. select(): select columns
#### (1) Select by column number
> select(test,1) # select the first column
> select(test,c(1,5)) # select the first and fifth columns

#### (2) Select by column name
> select(test,Sepal.Length)
> select(test, Petal.Length, Petal.Width)
> vars <- c("Petal.Length", "Petal.Width")
> select(test, one_of(vars))

> select(test, starts_with("Petal"))
> select(test, ends_with("Width"))
> select(test, contains("etal"))
> select(test, matches(".t."))
> select(test, everything())
> select(test, last_col())
> select(test, last_col(offset = 1))

#### (3) With everything(), columns can be reordered
> select(test,Species,everything())
``````
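
When the column names are stored in a character vector, recent versions of dplyr (through tidyselect 1.0.0 or later, which is an assumption about the installed version) also offer all_of() and any_of() alongside one_of(). A minimal sketch follows. The column name "not_a_column" is deliberately made up to show the difference between the two.

```r
library(dplyr)

test <- iris[c(1:2, 51:52, 101:102), ]
rownames(test) <- NULL
vars <- c("Petal.Length", "Petal.Width")

# all_of() is strict: every name in vars must exist, otherwise it errors
strict <- select(test, all_of(vars))

# any_of() is lenient: names that do not exist are silently skipped
lenient <- select(test, any_of(c(vars, "not_a_column")))
```

The strict form is usually the safer choice in scripts, because a misspelled column name fails immediately instead of being dropped.
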
``````
### 3. filter(): filter rows
> filter(test, Species == "setosa")
> filter(test, Species == "setosa"&Sepal.Length > 5 )
> filter(test, Species %in% c("setosa","versicolor"))
``````
``````
### 4. arrange(): sort the whole table by a column
> arrange(test, Sepal.Length) # ascending order by default
> arrange(test, desc(Sepal.Length)) # use desc() for descending order
``````
``````
### 5. summarise(): summarise
> summarise(test, mean(Sepal.Length), sd(Sepal.Length)) # compute the mean and standard deviation of Sepal.Length
  mean(Sepal.Length) sd(Sepal.Length)
1           5.916667        0.8084965

# First group by Species, then compute the mean and standard deviation of Sepal.Length for each group
> group_by(test, Species)
> tmp = summarise(group_by(test, Species),mean(Sepal.Length), sd(Sepal.Length))
> tmp
# A tibble: 3 x 3
  Species    `mean(Sepal.Length)` `sd(Sepal.Length)`
* <fct>                     <dbl>              <dbl>
1 setosa                     5                 0.141
2 versicolor                 6.7               0.424
3 virginica                  6.05              0.354
``````
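
The column names produced above, such as `mean(Sepal.Length)`, are awkward to reuse. A small sketch of the same grouped summary with named outputs and the pipe follows. The names mean_len, sd_len and n_rows are illustrative choices, not part of the original notes.

```r
library(dplyr)

test <- iris[c(1:2, 51:52, 101:102), ]
rownames(test) <- NULL

# Name each summary explicitly so the result has usable column names
tmp <- test %>%
  group_by(Species) %>%
  summarise(mean_len = mean(Sepal.Length),
            sd_len = sd(Sepal.Length),
            n_rows = n())
```

n() counts the rows in each group, which is a quick check that the grouping did what was intended.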

Supplementary usage of the arrange() function

``````
> library(dplyr)
> test = iris[c(1,2,51,52,101,102),] # select rows 1, 2, 51, 52, 101 and 102 of iris
> rownames(test) = NULL # remove the row names
> arrange(test, Sepal.Length) # sort by Sepal.Length in ascending order
> arrange(test, desc(Sepal.Length)) # sort by Sepal.Length in descending order
> arrange(test, Sepal.Length, Sepal.Width) # sort by two columns; rows that tie on the first column are sorted by the second
> o = order(test$Sepal.Length) # the return value is the vector of position indices
> test$Sepal.Length[o]
[1] 4.9 5.1 5.8 6.3 6.4 7.0
# x[order(x)] is equivalent to sort(x), but order() can sort data frames as well as vectors
> test[o,]
``````

### Two practical tips

##### 1: The pipe operator %>% (Cmd/Ctrl + Shift + M)
``````
> library(dplyr)
> x1 = filter(iris,Sepal.Width>3)
> x2 = select(x1,c("Sepal.Length","Sepal.Width" ))
> x3 = arrange(x2,Sepal.Length)

> iris %>%
+   filter(Sepal.Width>3) %>%
+   select(c("Sepal.Length","Sepal.Width" ))%>%
+   arrange(Sepal.Length)
``````
##### 2: count() counts the unique values of a column
``````
> count(test,Species)
     Species n
1     setosa 2
2 versicolor 2
3  virginica 2
``````

### Working with relational data: joining two tables. Note: do not introduce factors

Raw data

``````
> options(stringsAsFactors = F)
> test1 <- data.frame(name = c('jimmy','nicker','doodle'),
+                     blood_type = c("A","B","O"))
> test1
name blood_type
1  jimmy          A
2 nicker          B
3 doodle          O
> test2 <- data.frame(name = c('doodle','jimmy','nicker','tony'),
+                     group = c("group1","group1","group2","group2"),
+                     vision = c(4.2,4.3,4.9,4.5))
> test2
name  group vision
1 doodle group1    4.2
2  jimmy group1    4.3
3 nicker group2    4.9
4   tony group2    4.5
> test3 <- data.frame(NAME = c('doodle','jimmy','lucy','nicker'),
+                     weight = c(140,145,110,138))
> test3
NAME weight
1 doodle    140
2  jimmy    145
3   lucy    110
4 nicker    138
``````
``````
> merge(test1,test2,by="name")
name blood_type  group vision
1 doodle          O group1    4.2
2  jimmy          A group1    4.3
3 nicker          B group2    4.9
> merge(test1,test3,by.x = "name",by.y = "NAME")
name blood_type weight
1 doodle          O    140
2  jimmy          A    145
3 nicker          B    138
``````
##### 1. inner_join: take the intersection
``````
> inner_join(test1, test2, by = "name")
name blood_type  group vision
1  jimmy          A group1    4.3
2 nicker          B group2    4.9
3 doodle          O group1    4.2
> inner_join(test1,test3,by = c("name"="NAME"))
name blood_type weight
1  jimmy          A    145
2 nicker          B    138
3 doodle          O    140
``````
##### 2. left_join: keep every row of the left table
``````
> left_join(test1, test2, by = 'name')
name blood_type  group vision
1  jimmy          A group1    4.3
2 nicker          B group2    4.9
3 doodle          O group1    4.2
> left_join(test2, test1, by = 'name')
name  group vision blood_type
1 doodle group1    4.2          O
2  jimmy group1    4.3          A
3 nicker group2    4.9          B
4   tony group2    4.5       <NA>
``````
##### 3. full_join: full join (union)
``````
> full_join(test1, test2, by = 'name')
name blood_type  group vision
1  jimmy          A group1    4.3
2 nicker          B group2    4.9
3 doodle          O group1    4.2
4   tony       <NA> group2    4.5
``````
##### 4. semi_join: return all records in table x that have a match in table y
``````
> semi_join(x = test1, y = test2, by = 'name')
name blood_type
1  jimmy          A
2 nicker          B
3 doodle          O
``````
##### 5. anti_join: return the records in table x that have no match in table y
``````
> anti_join(x = test2, y = test1, by = 'name')
name  group vision
1 tony group2    4.5
``````
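
One way to keep the joins apart is to compare how many rows each one returns for the same pair of tables. The sketch below joins test2 (as x) against test1 (as y). right_join() is not covered in the notes above and is included here only for comparison; join_rows is an illustrative name.

```r
library(dplyr)

test1 <- data.frame(name = c("jimmy", "nicker", "doodle"),
                    blood_type = c("A", "B", "O"),
                    stringsAsFactors = FALSE)
test2 <- data.frame(name = c("doodle", "jimmy", "nicker", "tony"),
                    group = c("group1", "group1", "group2", "group2"),
                    vision = c(4.2, 4.3, 4.9, 4.5),
                    stringsAsFactors = FALSE)

# Row counts of each join of test2 (x) against test1 (y)
join_rows <- c(
  inner = nrow(inner_join(test2, test1, by = "name")),
  left  = nrow(left_join(test2, test1, by = "name")),
  right = nrow(right_join(test2, test1, by = "name")),
  full  = nrow(full_join(test2, test1, by = "name")),
  semi  = nrow(semi_join(test2, test1, by = "name")),
  anti  = nrow(anti_join(test2, test1, by = "name"))
)
```

Only "tony" is missing from test1, so the joins that keep unmatched rows of test2 return four rows, the matching joins return three, and anti_join returns one.
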
##### 6. Simple consolidation of data

These correspond to the cbind() and rbind() functions in the base package. Note that bind_rows() is meant for tables with the same columns (it matches columns by name and fills any missing ones with NA), and bind_cols() requires that the two data frames have the same number of rows.

``````
> test1 <- data.frame(x = c(1,2,3,4), y = c(10,20,30,40))
> test1
x  y
1 1 10
2 2 20
3 3 30
4 4 40
> test2 <- data.frame(x = c(5,6), y = c(50,60))
> test2
x  y
1 5 50
2 6 60
> test3 <- data.frame(z = c(100,200,300,400))
> test3
z
1 100
2 200
3 300
4 400
> bind_rows(test1, test2)
x  y
1 1 10
2 2 20
3 3 30
4 4 40
5 5 50
6 6 60
> bind_cols(test1, test3)
x  y   z
1 1 10 100
2 2 20 200
3 3 30 300
4 4 40 400
``````
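
What happens when the shapes do not line up can be checked directly. This is a sketch under the assumption that a current dplyr release is installed. The names stacked and mismatch, and the column x2, are illustrative.

```r
library(dplyr)

test1 <- data.frame(x = c(1, 2, 3, 4), y = c(10, 20, 30, 40))
test3 <- data.frame(z = c(100, 200, 300, 400))

# bind_rows() matches columns by name; columns missing from one table become NA
stacked <- bind_rows(test1, test3)

# bind_cols() cannot combine 4 rows with 2 rows; tryCatch() captures the error
mismatch <- tryCatch(
  bind_cols(test1, data.frame(x2 = c(5, 6))),
  error = function(e) "error"
)
```

So bind_rows() tolerates different columns at the cost of NA padding, while bind_cols() fails when the row counts differ.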

## stringr functions

``````
> library(stringr)
> x <- "The birch canoe slid on the smooth planks."
> x
[1] "The birch canoe slid on the smooth planks."
``````

### 1. Detect string length

``````
> length(x)
[1] 1
> str_length(x) # how many characters x contains
[1] 42
``````

### 2. String splitting and combination

``````
> str_split(x," ") # split the string on spaces
[[1]]
[1] "The"     "birch"   "canoe"   "slid"    "on"
[6] "the"     "smooth"  "planks."

> x2 = str_split(x," ")[[1]]
> str_c(x2,collapse = " ") # join the pieces with spaces
[1] "The birch canoe slid on the smooth planks."
> str_c(x2,1234,sep = "+")
[1] "The+1234"     "birch+1234"   "canoe+1234"
[4] "slid+1234"    "on+1234"      "the+1234"
[7] "smooth+1234"  "planks.+1234"
``````

### 3. Extract part of the string

``````
> str_sub(x,5,9) # characters 5 to 9
[1] "birch"
``````

### 4. Case conversion

``````
> str_to_upper(x2) # convert the strings to uppercase
[1] "THE"     "BIRCH"   "CANOE"   "SLID"    "ON"
[6] "THE"     "SMOOTH"  "PLANKS."
> str_to_lower(x2) # convert the strings to lowercase
[1] "the"     "birch"   "canoe"   "slid"    "on"
[6] "the"     "smooth"  "planks."
> str_to_title(x2) # capitalise the first letter
[1] "The"     "Birch"   "Canoe"   "Slid"    "On"
[6] "The"     "Smooth"  "Planks."
``````

### 5. String sorting

``````
> str_sort(x2)
[1] "birch"   "canoe"   "on"      "planks." "slid"
[6] "smooth"  "the"     "The"
``````

### 6. Character detection: the return value is a logical vector

``````
> str_detect(x2,"h") # does the string contain "h"?
[1]  TRUE  TRUE FALSE FALSE FALSE  TRUE  TRUE FALSE
> str_starts(x2,"T") # does the string start with "T"?
[1]  TRUE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
> str_ends(x2,"e") # does the string end with "e"?
[1]  TRUE FALSE  TRUE FALSE FALSE  TRUE FALSE FALSE

### With sum() and mean() you can count the matches and compute their proportion
> sum(str_detect(x2,"h"))
[1] 4
> mean(str_detect(x2,"h"))
[1] 0.5
``````

### 7. Extract the matching string

``````
> str_subset(x2,"h")
[1] "The"    "birch"  "the"    "smooth"
``````

### 8. Character count

``````
> str_count(x," ")
[1] 7
> str_count(x2,"o")
[1] 0 0 1 0 1 0 2 0
``````

### 9. String substitution

``````
> str_replace(x2,"o","A") # replace only the first o in each string with A
[1] "The"     "birch"   "canAe"   "slid"    "An"
[6] "the"     "smAoth"  "planks."
> str_replace_all(x2,"o","A") # replace every o with A
[1] "The"     "birch"   "canAe"   "slid"    "An"
[6] "the"     "smAAth"  "planks."
``````

Practice

``````
# Bioinformatics is a new subject of genetic data collection,analysis and dissemination to the research community.
# 1. Assign the sentence above to tmp as one long string
# 2. Split it into a vector of words and assign it to tmp2 (pay attention to the punctuation)
# 3. Use a function to return the number of words in the sentence.
# 4. Use a function to return how many letters each word in the sentence has.
# 5. Count how many words in tmp2 contain the letter "e"

> tmp <- "Bioinformatics is a new subject of genetic data collection,analysis and dissemination to the research community."

> tmp2 <- tmp %>%
+   str_replace(","," ") %>%  # change the comma to a space
+   str_remove("[.]") %>%     # inside [], "." matches a literal period instead of any character
+   str_split(" ")
> tmp2 <- tmp2[[1]]

> length(tmp2)
[1] 16

> str_length(tmp2)
 [1] 14  2  1  3  7  2  7  4 10  8  3 13  2  3  8  9

> table(str_detect(tmp2,"e"))

FALSE  TRUE
    9     7
> sum(str_detect(tmp2,"e"))
[1] 7
# str_count(tmp2,"e") # gives the number of "e" in each string
``````
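
The exercise can also be solved by splitting on a character class, so the comma does not have to be replaced first. This is a sketch of one alternative, not the solution given in the notes. The names words, n_words and n_with_e are illustrative.

```r
library(stringr)

tmp <- "Bioinformatics is a new subject of genetic data collection,analysis and dissemination to the research community."

# Drop the final period, then split wherever a space or a comma occurs
words <- str_split(str_remove(tmp, "[.]$"), "[ ,]")[[1]]

n_words <- length(words)                  # number of words
n_with_e <- sum(str_detect(words, "e"))   # words containing the letter "e"
```

This works here because the comma is never followed by a space. For text where a comma and a space can appear together, a pattern such as "[ ,]+" would avoid empty strings in the result.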

## 1. Conditional statements

### 1.if(){ }

#### (1) If there is only if but no else, then nothing will be done when the condition is false

``````
> i = -1
> if (i<0) print('up')
[1] "up"
> if (i>0) print('up') # the condition is FALSE, so nothing is printed
``````

#### (2) With else

``````
> i =1
> if (i>0){
+   cat('+')
+ } else {
+   print("-")
+ }
# the output is +
``````

#### The ifelse() function

ifelse() takes three arguments:
ifelse(x,yes,no)
x: a logical vector
yes: the value returned where x is TRUE
no: the value returned where x is FALSE

``````
> i=c(0.11548,-5.123,2.654)
> ifelse(i>0,"+","-")
[1] "+" "-" "+"

> x=rnorm(10)
> x
 [1]  0.6425792 -0.6829069  0.1632753 -0.2406404
 [5] -0.3182894 -0.7686996 -0.1892211 -0.1442053
 [9]  1.0053013 -1.4639149
> y=ifelse(x>0,"+","-")
> y
 [1] "+" "-" "+" "-" "-" "-" "-" "-" "+" "-"
``````

#### (3) Multiple conditions

``````
> i = 0
> if (i>0){
+   print('+')
+ } else if (i==0) {
+   print('0')
+ } else if (i< 0){
+   print('-')
+ }
[1] "0"

> ifelse(i>0,"+",ifelse((i<0),"-","0"))
[1] "0"
``````
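
Nested ifelse() calls become hard to read once there are more than two branches. dplyr's case_when() expresses the same logic as a list of condition ~ value pairs. This is a sketch of an alternative, not something used in the notes above; sign_label and nested are illustrative names.

```r
library(dplyr)

i <- c(0.11548, -5.123, 0)

# Conditions are checked in order; TRUE acts as the final catch-all branch
sign_label <- case_when(
  i > 0 ~ "+",
  i < 0 ~ "-",
  TRUE  ~ "0"
)

# The equivalent nested ifelse() from the notes, for comparison
nested <- ifelse(i > 0, "+", ifelse(i < 0, "-", "0"))
```

Both versions are vectorised, unlike if/else, which only looks at a single condition.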

### 2. switch()

``````
> cd = 3
> foo <- switch(EXPR = cd,
+               #EXPR = "aa",
+               aa=c(3.4,1),
+               bb=matrix(1:4,2,2),
+               cc=matrix(c(T,T,F,T,F,F),3,2),
+               dd="string here",
+               ee=matrix(c("red","green","blue","yellow")))
> foo
      [,1]  [,2]
[1,]  TRUE  TRUE
[2,]  TRUE FALSE
[3,] FALSE FALSE
``````
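
switch() behaves differently for numeric and character input: a number selects an alternative by position, while a string selects it by name. The sketch below uses a hypothetical helper pick() to show name-based selection with a default value.

```r
# pick() is an illustrative helper, not part of the original notes
pick <- function(key) {
  switch(key,
         aa = c(3.4, 1),
         bb = "string here",
         "no match")   # an unnamed final argument is the default for strings
}

by_name  <- pick("bb")   # selected by name
fallback <- pick("zz")   # no alternative is named "zz", so the default is used

# A numeric EXPR that is out of range returns NULL (invisibly)
out_of_range <- switch(9, "a", "b")
```

The default only applies to character input, so numeric selection needs its own range check.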

Practice

``````
# 1. Use a loop to check the data types of "a", TRUE and 3
> a <- list("a",TRUE,3)
> for (i in 1:length(a)) {
+   print(class(a[[i]]))
+
+ }
[1] "character"
[1] "logical"
[1] "numeric"
``````
``````
# 2. Generate 10 random numbers and build a new vector from them: values > median correspond to "A" and values < median correspond to "B".
> b <- rnorm(10)
> ifelse(b>median(b),"A","B")
 [1] "A" "B" "A" "B" "B" "A" "B" "B" "A" "A"
``````
``````
# 3. Build a new vector from tmp2 in the previous exercise: words containing "e" correspond to "A", words without "e" correspond to "B"
> tmp2 <-  tmp %>%
+   str_replace(","," ") %>%
+   str_remove("[.]") %>%
+   str_split(" ")
> tmp2
[[1]]
 [1] "Bioinformatics" "is"
 [3] "a"              "new"
 [5] "subject"        "of"
 [7] "genetic"        "data"
 [9] "collection"     "analysis"
[11] "and"            "dissemination"
[13] "to"             "the"
[15] "research"       "community"

> tmp2 <- tmp2[[1]]
> ifelse(str_detect(tmp2,"e"),"A","B")
 [1] "B" "B" "B" "A" "A" "B" "A" "B" "A" "B" "B" "A"
[13] "B" "A" "A" "B"
``````
``````
# 4. Load deg.rdata and generate a vector x from the values of columns a and b according to the following conditions:
# If a < 1 and b < 0.05, the corresponding value of x is "down";
# if a > 1 and b < 0.05, the corresponding value of x is "up";
# in all other cases, the value of x is "no"
> k1 = deg$a<1 & deg$b<0.05
> k2 = deg$a>1 & deg$b<0.05
> x = ifelse(k1,"down",ifelse(k2,"up","no"))
``````
``````
# 5. Count how many times each value of x occurs
> table(x)
x
 down    no    up
 3828 26094   853
``````
``````
# 6. Add x to the deg data frame as a new column
> deg$x <- x
``````

## 2. Loop statements

### 1. For loop

``````
> x <- c(5,6,0,3)
> s=0
> for (i in x){
+   s=s+i
+   #if(i == 0) next
+   #if (i == 0) break
+   print(c(which(x==i),i,1/i,s))
+ }
[1] 1.0 5.0 0.2 5.0
[1]  2.0000000  6.0000000  0.1666667 11.0000000
[1]   3   0 Inf  11
[1]  4.0000000  3.0000000  0.3333333 14.0000000

> x <- c(5,6,0,3)
> s=0
> for (i in x){
+   s=s+i
+   if(i == 0) next
+   #if (i == 0) break
+   print(c(which(x==i),i,1/i,s))
+ }
[1] 1.0 5.0 0.2 5.0
[1]  2.0000000  6.0000000  0.1666667 11.0000000
[1]  4.0000000  3.0000000  0.3333333 14.0000000

> x <- c(5,6,0,3)
> s=0
> for (i in x){
+   s=s+i
+   #if(i == 0) next
+   if (i == 0) break
+   print(c(which(x==i),i,1/i,s))
+ }
[1] 1.0 5.0 0.2 5.0
[1]  2.0000000  6.0000000  0.1666667 11.0000000
``````
``````
# How to save the results?
> s = 0
> result = list()
> for(i in 1:length(x)){
+   s=s+x[[i]]
+   result[[i]] = c(i,x[[i]],1/i,s)
+ }
> do.call(cbind,result)
     [,1] [,2]       [,3]  [,4]
[1,]    1  2.0  3.0000000  4.00
[2,]    5  6.0  0.0000000  3.00
[3,]    1  0.5  0.3333333  0.25
[4,]    5 11.0 11.0000000 14.00
``````
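
Growing a list and binding it afterwards works, but when the size of the result is known in advance, a preallocated matrix can be filled directly. This is a sketch of that alternative. The row labels are illustrative, and the third row is 1 divided by the loop index, as in the code above.

```r
x <- c(5, 6, 0, 3)

# Preallocate one column per element of x; the row labels are illustrative
res <- matrix(NA_real_, nrow = 4, ncol = length(x),
              dimnames = list(c("index", "value", "inverse_index", "running_sum"),
                              NULL))

s <- 0
for (i in seq_along(x)) {
  s <- s + x[i]
  res[, i] <- c(i, x[i], 1 / i, s)   # fill column i in place
}
```

seq_along(x) is used instead of 1:length(x) so that the loop is skipped entirely when x is empty.
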
``````
# Exercise 4----
# 1. Use a loop to plot columns 1 to 4 of iris, one plot per column
> par(mfrow = c(2,2)) # par() can divide the plotting area into a regular grid; mfrow fills it row by row, mfcol column by column
> for(i in 1:4){
+   plot(iris[,i],col = iris[,5])
+ }

# 2. Generate a matrix of random numbers (rnorm) with 10 rows and 6 columns. The column names are sample1, sample2, ..., sample6 and the row names are gene1, gene2, ..., gene10. sample1, 2 and 3 belong to group A, and sample4, 5 and 6 belong to group B. Use a loop to draw a ggplot2 box plot for each gene, and try to put the 10 plots together.
> exp = matrix(rnorm(60),nrow = 10)
> colnames(exp) <- paste0("sample",1:6)
> rownames(exp) <- paste0("gene",1:10)
> exp[1:4,1:4]
         sample1     sample2     sample3      sample4
gene1  0.3756800 -0.35824521  0.04884076  0.004333555
gene2  1.3406486  1.29023800 -0.18444678 -0.379581765
gene3 -0.2858732 -0.03525992  0.46980022  0.582935510
gene4 -1.2478246 -0.47409951 -0.72981205  1.374565803
> #dat = cbind(t(exp),group = rep(c("A","B"),each = 3))
> dat = data.frame(t(exp))
> dat = mutate(dat,group = rep(c("A","B"),each = 3))
> p = list()
> library(ggplot2)
> for(i in 1:(ncol(dat)-1)){
+   p[[i]] = ggplot(data = dat,aes_string(x = "group",y = colnames(dat)[i]))+ # plotting in a loop over character column names needs aes_string()
+     geom_boxplot(aes(color = group))+
+     geom_jitter(aes(color = group))+
+     theme_bw()
+ }
> library(patchwork)
> wrap_plots(p,nrow = 2,guides = "collect")
``````
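
aes_string() is marked as deprecated in recent ggplot2 releases, where the .data pronoun is the suggested way to refer to a column by its name as a string. The sketch below builds the same ten box plots that way. It assumes ggplot2 3.0.0 or later, uses a fixed seed only for reproducibility, and renames the matrix to expr_mat (an illustrative name) to avoid masking the base function exp().

```r
library(ggplot2)

set.seed(1)  # assumption: fixed seed only so the sketch is reproducible
expr_mat <- matrix(rnorm(60), nrow = 10)
colnames(expr_mat) <- paste0("sample", 1:6)
rownames(expr_mat) <- paste0("gene", 1:10)

dat <- data.frame(t(expr_mat))
dat$group <- rep(c("A", "B"), each = 3)

# One plot per gene column; .data[[gene]] looks the column up by its name
genes <- setdiff(colnames(dat), "group")
p <- lapply(genes, function(gene) {
  ggplot(dat, aes(x = group, y = .data[[gene]])) +
    geom_boxplot(aes(color = group)) +
    geom_jitter(aes(color = group)) +
    labs(y = gene) +
    theme_bw()
})
```

The resulting list p can be passed to wrap_plots() from patchwork in the same way as in the loop version.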

### 2. While loop

``````
> i = 0
> while (i < 5){
+   print(c(i,i^2))
+   i = i+1
+ }
[1] 0 0
[1] 1 1
[1] 2 4
[1] 3 9
[1]  4 16
``````

### 3. Repeat statement

Note: a repeat loop must contain a break, otherwise it never ends.

``````
> i=0L
> s=0L
> repeat{
+  i = i + 1
+  s = s + i
+  print(c(i,s))
+  if(i==10) break
+ }
[1] 1 1
[1] 2 3
[1] 3 6
[1]  4 10
[1]  5 15
[1]  6 21
[1]  7 28
[1]  8 36
[1]  9 45
[1] 10 55
``````

### apply() family functions

#### 1. apply(): process a matrix or data frame

##### MARGIN is 1 for rows and 2 for columns; FUN is the function to apply
``````
> test<- iris[,1:4]
> apply(test, 2, mean)
Sepal.Length  Sepal.Width Petal.Length  Petal.Width
    5.843333     3.057333     3.758000     1.199333
> apply(test, 1, sum)
  [1] 10.2  9.5  9.4  9.4 10.2 11.4  9.7 10.1  8.9
 [10]  9.6 10.8 10.0  9.3  8.5 11.2 12.0 11.0 10.3
 [19] 11.5 10.7 10.7 10.7  9.4 10.6 10.3  9.8 10.4
 [28] 10.4 10.2  9.7  9.7 10.7 10.9 11.3  9.7  9.6
 [37] 10.5 10.0  8.9 10.2 10.1  8.4  9.1 10.7 11.2
 [46]  9.5 10.7  9.4 10.7  9.9 16.3 15.6 16.4 13.1
 [55] 15.4 14.3 15.9 11.6 15.4 13.2 11.5 14.6 13.2
 [64] 15.1 13.4 15.6 14.6 13.6 14.4 13.1 15.7 14.2
 [73] 15.2 14.8 14.9 15.4 15.8 16.4 14.9 12.8 12.8
 [82] 12.6 13.6 15.4 14.4 15.5 16.0 14.3 14.0 13.3
 [91] 13.7 15.1 13.6 11.6 13.8 14.1 14.1 14.7 11.7
[100] 13.9 18.1 15.5 18.1 16.6 17.5 19.3 13.6 18.3
[109] 16.8 19.4 16.8 16.3 17.4 15.2 16.1 17.2 16.8
[118] 20.4 19.5 14.7 18.1 15.3 19.2 15.7 17.8 18.2
[127] 15.6 15.8 16.9 17.6 18.2 20.1 17.0 15.7 15.7
[136] 19.1 17.7 16.8 15.6 17.5 17.8 17.4 15.5 18.2
[145] 18.2 17.2 15.7 16.7 17.3 15.8
> res <- c()
> for(i in 1:nrow(test)){
+   res[[i]] <- sum(test[i,])
+ }
``````
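
For sums and means specifically, base R provides dedicated functions that give the same numbers as the apply() calls above. This is a short sketch for comparison; the names col_means, row_totals, same_cols and same_rows are illustrative.

```r
test <- iris[, 1:4]

# Dedicated helpers for the two most common apply() use cases
col_means  <- colMeans(test)   # same values as apply(test, 2, mean)
row_totals <- rowSums(test)    # same values as apply(test, 1, sum)

# Compare the values; names are dropped so only the numbers are checked
same_cols <- all.equal(unname(apply(test, 2, mean)), unname(col_means))
same_rows <- all.equal(unname(apply(test, 1, sum)), unname(row_totals))
```

apply() remains the general tool when the function is something other than a sum or a mean.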

#### 2.lapply(list, FUN, …)

###### Perform the same operation on each element (vector) in the list / vector
``````
> test <- list(x = 36:33,
+              y = 32:35,
+              z = 30:27)
``````
##### The return value is a list. Compute the mean of each element (vector) in the list (also try the variance var, or quantile)
``````
> lapply(test,mean)
$x
[1] 34.5

$y
[1] 33.5

$z
[1] 28.5

> class(lapply(test,mean))
[1] "list"
> x <- unlist(lapply(test,mean));x
   x    y    z
34.5 33.5 28.5
> class(x)
[1] "numeric"
``````

#### 3. sapply(): process a list and simplify the result, returning a vector or matrix directly

###### Note the difference between sapply(X, FUN, …) and lapply(): the return values differ
``````
> lapply(test,min)
$x
[1] 33

$y
[1] 32

$z
[1] 27

> sapply(test,min)
 x  y  z
33 32 27
> lapply(test,range)
$x
[1] 33 36

$y
[1] 32 35

$z
[1] 27 30

> sapply(test,range)
      x  y  z
[1,] 33 32 27
[2,] 36 35 30
> class(sapply(test,range))
[1] "matrix" "array"
``````
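
Because sapply() decides the shape of its result from whatever FUN returns, base R also offers vapply(), where the expected type and length of each result are declared up front. This is a sketch of the stricter variant; mins, ranges and bad are illustrative names.

```r
test <- list(x = 36:33,
             y = 32:35,
             z = 30:27)

# Each call to min() must return exactly one integer
mins <- vapply(test, min, integer(1))

# Each call to range() must return exactly two integers, giving a 2 x 3 matrix
ranges <- vapply(test, range, integer(2))

# Declaring the wrong length is an error instead of a silently different shape
bad <- tryCatch(vapply(test, range, integer(1)),
                error = function(e) "length mismatch")
```

This makes vapply() a reasonable choice inside functions, where an unexpected return shape should fail early.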
