Anyone who studies esophageal cancer, student or teacher, probably knows that the GSE53625 data set comes from the article "LncRNA profile study reveals a three-lncRNA signature associated with the survival of esophageal squamous cell carcinoma patients". As far as I know, it is currently the largest data set for esophageal squamous cell carcinoma, and the grouping is simple and clear: cancer versus normal tissue. The clinical data is complete, so I felt it was a very suitable starting point for data mining. That is why, after finishing the introductory course of the Shengxin Skill Tree, I decided to start with this data set.
1、 Download data
I used the code given by teacher Xiaojie. At first I wanted to download the data directly from R, but the school's network is really poor and this 104 MB data set would not download. I could only find the expression matrix file on GEO and download it manually, which took more than an hour (the school's network makes me want to cry). Afterwards I checked carefully that the file name and size of my manual download were exactly the same as those of the file the code would have downloaded. Then I extracted the expression matrix and the clinical data and saved them as an RData file.
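For anyone stuck on the same slow network, here is a minimal sketch in Python (not the R code from the course) of where the file lives and how to repeat my name-and-size check. It assumes GEO's usual FTP layout for a single-platform series matrix; the function names are my own, hypothetical helpers.

```python
import os
import re


def geo_series_matrix_url(accession: str) -> str:
    """Build the HTTPS URL of a single-platform series matrix file on the GEO FTP site."""
    match = re.fullmatch(r"GSE(\d+)", accession.upper())
    if match is None:
        raise ValueError(f"not a GEO series accession: {accession!r}")
    digits = match.group(1)
    # GEO groups series into directories by replacing the last three digits with "nnn".
    stub = "GSE" + digits[:-3] + "nnn"
    acc = "GSE" + digits
    return (
        "https://ftp.ncbi.nlm.nih.gov/geo/series/"
        f"{stub}/{acc}/matrix/{acc}_series_matrix.txt.gz"
    )


def same_name_and_size(path_a: str, path_b: str) -> bool:
    """Return True when two local files share a base name and a byte size."""
    return (
        os.path.basename(path_a) == os.path.basename(path_b)
        and os.path.getsize(path_a) == os.path.getsize(path_b)
    )


print(geo_series_matrix_url("GSE53625"))
```

Matching names and sizes is only a quick sanity check; it does not prove the two files are byte-for-byte identical.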
2、 Gene annotation
According to teacher Xiaojie's code, the next steps are grouping and gene annotation. As mentioned earlier, the grouping of this data set is super simple: the second column gives the sample names, which include normal and cancer. Just when I thought everything was going well, a problem came up. When I used http://www.bio-info-trainee.com/1399.html to look for the annotation file, I found that GPL18109 was not listed. I could only fall back on the second method and download the SOFT file from GEO.
The table in the SOFT file has only an ID column and the corresponding gene sequences, not the symbol column I am familiar with. Did I have to do the annotation myself? I nearly collapsed. After some searching, it seemed there is software that can map gene sequences to gene symbols. Did I really have to learn an unfamiliar tool just to convert gene IDs? So I searched the Shengxin Skill Tree official account with the keyword "gene sequence". Unexpectedly, I found a post about this very data set, which also provided the gene annotation file prepared by predecessors: https://mp.weixin.qq.com/s/pMVNYn-kMljavQvnhvkS_w
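Once a probe-to-symbol table is in hand, the annotation step itself is simple. Below is a minimal sketch in Python of one common way to do it, assuming that when several probes map to the same gene you keep the probe with the highest mean expression. That rule is my assumption for illustration, not necessarily what the course code does, and the probe IDs and symbols are made up.

```python
from statistics import mean


def collapse_probes_to_genes(expr, probe_to_symbol):
    """Collapse a probe-level matrix to one row per gene symbol.

    expr: dict mapping probe ID -> list of expression values (one per sample).
    probe_to_symbol: dict mapping probe ID -> gene symbol ("" or missing = unannotated).
    """
    best = {}  # symbol -> (mean expression, probe ID) of the probe kept so far
    for probe, values in expr.items():
        symbol = probe_to_symbol.get(probe, "")
        if not symbol:
            continue  # drop probes without a gene symbol
        avg = mean(values)
        if symbol not in best or avg > best[symbol][0]:
            best[symbol] = (avg, probe)
    return {symbol: expr[probe] for symbol, (_, probe) in best.items()}


# Hypothetical toy data: four probes, two samples.
toy_expr = {
    "probe_1": [1.0, 2.0],
    "probe_2": [5.0, 6.0],
    "probe_3": [3.0, 3.0],
    "probe_4": [9.0, 9.0],
}
toy_annotation = {
    "probe_1": "GENE_A",
    "probe_2": "GENE_A",
    "probe_3": "GENE_B",
    "probe_4": "",
}
print(collapse_probes_to_genes(toy_expr, toy_annotation))
```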
With all the files found, the next step was the key one: differential gene analysis. After carefully preparing the data according to teacher Xiaojie's code, I excitedly ran limma with a logFC cutoff of 1 and a P value cutoff of 0.05, then tabulated the change column. There was not a single differential gene!!!!
Even after lowering the logFC cutoff to 0.5, there were only a few hundred differential genes. I was completely confused at this point, and even doubted teacher Xiaojie's code for a second :). To check my results, I looked for data mining articles published with this data set and found one that reported more than 3000 differential genes, so my results had to be wrong. Then I thought of GEO2R, the web tool. I downloaded its differential expression table and marked the up-regulated and down-regulated genes with code. The results were exactly the same as in that article. After that everything went smoothly: volcano plot, heatmap, GO, KEGG.
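The up/down marking is just a threshold rule on logFC and P value. Here is a minimal sketch in Python (my analysis was in R, so this is only an illustration), with the cutoffs from this post as defaults and made-up numbers as input.

```python
from collections import Counter


def label_change(log_fc, p_value, lfc_cutoff=1.0, p_cutoff=0.05):
    """Label one gene as 'up', 'down' or 'stable' from its logFC and P value."""
    if p_value < p_cutoff and log_fc > lfc_cutoff:
        return "up"
    if p_value < p_cutoff and log_fc < -lfc_cutoff:
        return "down"
    return "stable"


def count_changes(rows, lfc_cutoff=1.0, p_cutoff=0.05):
    """Tabulate the labels for an iterable of (logFC, P value) pairs."""
    return Counter(label_change(lfc, p, lfc_cutoff, p_cutoff) for lfc, p in rows)


# Hypothetical (logFC, P value) pairs, not taken from GSE53625.
toy_rows = [(1.8, 0.001), (-1.2, 0.01), (0.7, 0.02), (2.5, 0.20), (-0.1, 0.90)]
print(count_changes(toy_rows))                  # logFC cutoff 1
print(count_changes(toy_rows, lfc_cutoff=0.5))  # looser logFC cutoff 0.5
```

If a table like this shows zero "up" and zero "down" genes in a cancer versus normal comparison, the input matrix deserves a second look before the cutoffs do.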
Because I wanted to do some other analyses with the original expression matrix, I decided to track down the problem. I noticed that the first step of data processing takes the log of the expression values. Could it be that the values become so small after the log that the differences disappear? (At that time I did not know that limma expects log-scale input.) So I skipped that step and reran all the code. Miraculously, I got the correct result. Then I looked back at the expression matrix.
Suddenly I realized why the expression values were so small: the matrix appeared to be log-transformed already, so taking the log a second time had squeezed the differences away, and that is why skipping the log step gave the correct result….
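In hindsight, a quick check of the value range before the log step would have saved me. The sketch below is a Python rendering of the quantile heuristic that, as far as I can tell, appears in the R script GEO2R generates; treat the thresholds as that heuristic's convention rather than a hard rule, and the helper names as my own.

```python
import math


def quantile(sorted_values, q):
    """Linear-interpolation quantile (the default 'type 7' rule in R)."""
    pos = (len(sorted_values) - 1) * q
    lower = math.floor(pos)
    upper = math.ceil(pos)
    frac = pos - lower
    return sorted_values[lower] + (sorted_values[upper] - sorted_values[lower]) * frac


def needs_log2(values):
    """Guess whether expression values are still on the raw scale.

    Returns True when the values look raw (log2 should be applied) and
    False when they look log-transformed already.
    """
    finite = sorted(v for v in values if v is not None and not math.isnan(v))
    if not finite:
        raise ValueError("no finite expression values to inspect")
    q0 = quantile(finite, 0.0)
    q25 = quantile(finite, 0.25)
    q99 = quantile(finite, 0.99)
    q100 = quantile(finite, 1.0)
    return q99 > 100 or (q100 - q0 > 50 and q25 > 0)


# Hypothetical values: one vector on a log2-like scale, one on the raw scale.
logged_values = [2.0 + 0.1 * i for i in range(120)]
raw_values = [2.0 ** v for v in logged_values]
print(needs_log2(logged_values), needs_log2(raw_values))
```

Flatten the whole expression matrix into one list of values before calling `needs_log2`; when it returns False, the log step should be skipped.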
Finally, I sincerely hope that teacher Xiaojie can cover this data set in class in the future. It is really unfriendly to novices!