12.1 analysis of language differences and geo code sets Data download and sorting

Time:2022-5-3

1、 Example introduction

In this section, Download gse1009 data set and use limma package for difference analysis.

GSE1009  

Sample size:There were 6 samples in total, of which 3 were glomerular samples of diabetes nephropathy (DN) and the other 3 were normal glomerular samples.

Use chip:Affymetrix Human Genome U95 Version 2 Array。

Platform:GPL8300。

2、 Examples of variance analysis:

###Step 1: geo data download

>setwd(“D:\\Rfile“)  

#Set the working directory as the rfile folder of disk D. you can set the working directory according to your needs.

>rm(list = ls()) 

#Clear all variables

>options(stringsAsFactors=F)

#R is not automatically converted to factor variables when importing data

##Next, you need to install the geoquery package to download the files of the geo database. The installed geoquery package does not need to be installed repeatedly. You can directly call the geoquery package.

#The installation command is as follows (I have installed it, so I only give the installation command, and remove the previous comment symbol to install it):

#if (!requireNamespace(“BiocManager”, quietly = TRUE))

#install.packages(“BiocManager”)

#BiocManager::install(“GEOquery”)

#The installation command of the package in Bioconductor is often updated. If you are afraid that the installation command has been updated and the current command cannot be run, you can enter the home page of Baidu “geoquery Bioconductor”, drop it down to the installation command, and copy the latest installation command on the website. This is the latest installation command. The installation of packages in other bioconductors is the same. The details are as follows:

12.1 analysis of language differences and geo code sets Data download and sorting

12.1 analysis of language differences and geo code sets Data download and sorting

Just copy the red paragraph above.

##The following is to download data with geoquery package

>library(GEOquery) 

#Load geoquery package

>gset <- getGEO(“GSE1009”,destdir = “.”,AnnotGPL = F,getGPL = F)                

#Download gse1009 expression matrix file and assign it to GSET variable, destdir = “” Indicates where the downloaded file is placed. The default is the current working directory. Annotgpl = f means that the annotation file is not downloaded, and getgpl = f means that the platform file is not downloaded. In order to download speed, we set it not to download platform files and comment files.

#After downloading, open disk D and you can see a file “gse1009_series_matrix. TXT. GZ” in rfile.

>save(gset,file = ‘GSE1009.gset.Rdata’)

#Save the GSET (that is, the downloaded gse1009 matrix file) as an R file named “gse1009.gset” for future calls.

>gset[[“GSE1009_series_matrix.txt.gz”]]@annotation

#View gse1009_ series_ Platform file of matrix matrix

>gpl <- getGEO(‘GPL8300’, destdir=”.”)

#Download this platform file. Used for later gene symbol annotation.

Generally, let’s take a look at the data download code above:

12.1 analysis of language differences and geo code sets Data download and sorting

Code operation process and results:

12.1 analysis of language differences and geo code sets Data download and sorting

Files in rfile after running code:

12.1 analysis of language differences and geo code sets Data download and sorting

###Step 2: data sorting (assign probe ID to gene symbol)

>a=gset[[1]]

#Extract the first element in GSET (including gene expression matrix and annotation information) and assign it to variable a

>dat=exprs(a)

#Extract the expression matrix in a and assign it to dat. At this time, the gene name of the expression matrix is probe ID

>head(dat)

#Show the first 6 lines of the expression matrix to see whether the data has been log converted. Generally, the data is within 20, which has been log converted. If there are hundreds or thousands of data, it means that the data needs further log processing.

12.1 analysis of language differences and geo code sets Data download and sorting

>ex <- dat

qx <- as.numeric(quantile(ex, c(0., 0.25, 0.5, 0.75, 0.99, 1.0), na.rm=T))

LogC <- (qx[5] > 100) ||

  (qx[6]-qx[1] > 50 && qx[2] > 0) ||

  (qx[2] > 0 && qx[2] < 1 && qx[4] > 1 && qx[4] < 2)

if (LogC) { ex[which(ex <= 0)] <- NA

dat <- log2(ex)

print(“log2 transform finished”)}else{print(“log2 transform not needed”)}

#You can also enter the above code. If the data has been logged, log2 transform not needed will be displayed; If the data does not have a full line log, the log program needs to convert the full line log according to the code, and log2 transform finished is displayed after completion.

12.1 analysis of language differences and geo code sets Data download and sorting

Take another look at the converted data:

12.1 analysis of language differences and geo code sets Data download and sorting

Not thousands or hundreds.

>pd=pData(a)

#View the clinical information of a and prepare for the variables selected for grouping later

>View(pd)

#View PD

12.1 analysis of language differences and geo code sets Data download and sorting

In this example, PD has 26 variables (not fully shown). We can see that only the fields under the title can be divided into normal and disease, and the fields under other variables are the same, so we choose the title field for subsequent grouping.

The record in the title variable is control 1A Diabetes 1a……., The middle is divided by spaces. We need to extract control and diabetes as groups, so we need to use character segmentation package to realize it.

##Install the character segmentation package stringr package. The command is as follows. Those that have been installed can be called directly without repeated installation.

#install.packages(“stringr”)

>library(stringr)

#Call stringr package

>group_list=str_split(pd$title,’ ‘,simplify = T)[,1]

#Divide the field of the title variable in PD according to the space. In order to show you, we first run the segmentation code. After segmentation, as shown in the figure below, so we choose the first column after segmentation as the grouping status

12.1 analysis of language differences and geo code sets Data download and sorting

>table(group_list)

#Count the number of each group

12.1 analysis of language differences and geo code sets Data download and sorting

>gpl1<-Table(gpl)

>save(gpl1,file = ‘gpl1.Rdata’)

#We convert the platform file to list format, assign a value to gpl1, and save gpl1 as an R file for future calls.

>colnames(Table(gpl))

#Looking at the column names of the platform files, we can see that there are ID and gene symbol. Remember which column the gene symbol is in, and here we are in column 11.

12.1 analysis of language differences and geo code sets Data download and sorting

>View(gpl1)

#Let’s learn more about the data of the platform file. Of course, you can directly select the gene symbol field here. We mainly know whether there is only one gene symbol value in the symbol.

12.1 analysis of language differences and geo code sets Data download and sorting

For the symbol in the gene symbol field here, some genes have more than one name, / / / followed by a duplicate name. We need the first name, so we need to use the character segmentation package to process it again. The principle is the same as that of the title above.

>gpl1$`Gene Symbol`=str_split(gpl1[,11],’///’,simplify = T)[,1]

12.1 analysis of language differences and geo code sets Data download and sorting

Is it the only one now.

>probe2symbol_df<-gpl1[,c(1,11)]

#Extract the ID and gene symbol in the platform file gpl1 and assign them to probe2symbol_ df

>colnames(probe2symbol_df)=c(‘probe_id’,’symbol’)

#Change the column name to probe_ ID and symbol

#These two steps are because I am lazy and don’t want to adjust the code. In order to facilitate the operation of the later code, I change it, and you can not change it.

>length(unique(probe2symbol_df$symbol))

#View the number of symbols with unique values

12.1 analysis of language differences and geo code sets Data download and sorting

>table(sort(table(probe2symbol_df$symbol)))

#Looking at the number of N occurrences of each gene, we can see that 7050 genes appear once and 1493 genes appear twice…

12.1 analysis of language differences and geo code sets Data download and sorting

>ids=probe2symbol_df[probe2symbol_df$symbol != ”,]

#Remove the probes without annotation symbol (in fact, the number of probes without annotation here is the 440 gene with the most frequent occurrence, that is, 440 probes without symbol)

>ids=probe2symbol_df[probe2symbol_df$probe_id %in%  rownames(dat),]   

##%In% is used to judge whether it matches,

#Note that dat here is the data of GSET expression matrix. This step is to match the ID of the platform file with the ID in the matrix.

>dat=dat[ids$probe_id,]

#Take those in the expression matrix that can match the probe name and remove the expression data that cannot be matched

>ids$mean=apply(dat,1,mean)  

#IDS creates a new column called mean. At the same time, it operates the dat matrix by row, takes the mean value of each row (that is, the average value of each sample on the probe), and gives the result to each row of the column mean. Here, it can also be changed to the median value, median

>ids=ids[order(ids$symbol,ids$mean,decreasing = T),]    

#That is, sort by symbol first, and then sort the same symbols according to the average value from large to small, so as to keep the first value later.

>ids=ids[!duplicated(ids$symbol),]  

#Take out the duplicate item of the symbol column, ‘!’ If it is no, the non duplicate items will be taken out and the duplicate genes will be removed

#If you choose medium, you can keep the maximum expression of each gene Finally, n genes were obtained.

>dat=dat[ids$probe_id,]

#Remove the probe from the new IDS_ In the column ID, the dat will form a new dat with each row of them according to the extracted column

>rownames(dat)=ids$symbol  

#Give each row in the symbol column of IDS to dat as the row name of dat

>View(dat)

12.1 analysis of language differences and geo code sets Data download and sorting

Now get the matrix with row name as gene name and column name as sample name.

>dat<-dat[-9173,]

But we can see that there is no symbol name in the last row of the matrix. We remove it and change the number ourselves.

12.1 analysis of language differences and geo code sets Data download and sorting

>save(dat,group_list,file = ‘GSE1009.Rdata’)

>write.csv(dat,file=”GSE1009expressionmetrix_GSE.csv”)

#Finally, we save the results. The next section will explain the difference analysis.

Now, the overall code is as follows:

12.1 analysis of language differences and geo code sets Data download and sorting

The files in the running rfile are as follows:

12.1 analysis of language differences and geo code sets Data download and sorting

The sorted data are as follows:

12.1 analysis of language differences and geo code sets Data download and sorting