# Understanding RNA-seq from a statistical perspective

Time: 2022-6-20

This post shares an overview of the review by Li Jingyi's team, "Modeling and analysis of RNA-seq data: a review from a statistical perspective", in order to understand RNA-seq analysis from a statistical point of view.

# Direction of analysis

Currently, there are four main directions for analyzing RNA-seq data (in fact there are more than these; you can collect and sort them out yourself, and you are welcome to discuss them with me):

1. Sample level, which is mainly used to look at the similarity of gene expression patterns between biological treatments, usually expressed by the Pearson or Spearman correlation coefficient
2. Gene level, which involves the quantification of gene expression
3. Transcript level, which involves the quantification of different transcripts
4. Exon level, which involves the detection of differential alternative splicing

The following sections look at each of these four parts from a statistical point of view.

## 1). Sample-level

Sample-level analysis aims to detect the similarity of expression patterns between samples, which is usually measured with the Pearson and Spearman correlation coefficients. If all genes are used to calculate the correlation coefficient, housekeeping genes, which are expressed in every sample, are bound to "exaggerate" the correlation, so a better method is to use the relevant genes instead of all genes. The R package TROM is used to solve this kind of problem: it calculates TROM scores to select the relevant genes, after which the correlation coefficient between samples is calculated.
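The two correlation coefficients mentioned above can be sketched in plain Python. This is a minimal illustration, not the TROM procedure: the expression values are made up, and the gene selection step is assumed to have been done already.

```python
import math

def pearson(x, y):
    """Pearson correlation of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def ranks(values):
    """1-based ranks; tied values share the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = avg
        i = j + 1
    return result

def spearman(x, y):
    """Spearman correlation = Pearson correlation of the ranks."""
    return pearson(ranks(x), ranks(y))

# Hypothetical log-expression values of 6 selected genes in two samples.
sample_a = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
sample_b = [1.1, 1.9, 3.5, 3.9, 5.2, 9.0]

r_pearson = pearson(sample_a, sample_b)
r_spearman = spearman(sample_a, sample_b)
print(r_pearson, r_spearman)
```

Because `sample_b` increases monotonically with `sample_a` but not linearly, the Spearman coefficient is higher than the Pearson coefficient here, which is one reason both are often reported.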

In addition to calculating correlation coefficients, we can use nonlinear methods such as t-SNE or UMAP for dimension reduction and clustering to observe the similarity between samples.

## 2). Gene-level

Research at the gene level mainly quantifies genes and performs differential expression analysis. The basic statistical model of differential expression analysis assumes that the count of a gene in each sample follows a Poisson distribution or a negative binomial distribution (log-transformed values are generally considered to follow a normal distribution):

$$Y_{k,ij} \sim \mathrm{Poisson}(s_{kj}\theta_{ki}) \quad \text{or} \quad Y_{k,ij} \sim \mathrm{NB}(s_{kj}\theta_{ki},\ \phi_i)$$

where:

1. $Y_{k,ij}$ represents the expression of gene $i$ in the $j$-th sample of condition $k$
2. $s_{kj}$ represents the size factor of the $j$-th sample in condition $k$
3. $\theta_{ki}$ represents the true expression level of gene $i$ in condition $k$ (which can be understood as the average expression level of gene $i$ across the samples of condition $k$)
4. $\phi_i$ represents the dispersion of gene $i$
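As a rough numerical illustration of the quantities listed above, the sketch below evaluates a negative binomial distribution with mean $s_{kj}\theta_{ki}$ and dispersion $\phi_i$. The parameterization (variance = mean + dispersion × mean²) is an assumption, chosen because it is a common convention; the numeric values are hypothetical.

```python
import math

def nb_log_pmf(y, mean, dispersion):
    """Log-probability of count y under a negative binomial distribution.

    Assumed parameterization: variance = mean + dispersion * mean**2.
    """
    r = 1.0 / dispersion          # "size" parameter
    p = r / (r + mean)            # success probability
    return (math.lgamma(y + r) - math.lgamma(r) - math.lgamma(y + 1)
            + r * math.log(p) + y * math.log1p(-p))

# Hypothetical values: size factor, true expression level, dispersion.
s_kj, theta_ki, phi_i = 1.2, 50.0, 0.1
mu = s_kj * theta_ki              # expected count for this sample

# Numerically check the mean and variance by summing over counts 0..1999.
probs = [math.exp(nb_log_pmf(y, mu, phi_i)) for y in range(2000)]
total = sum(probs)
mean_hat = sum(y * q for y, q in enumerate(probs))
var_hat = sum((y - mean_hat) ** 2 * q for y, q in enumerate(probs))
print(total, mean_hat, var_hat)
```

With a dispersion of zero the variance would equal the mean, as in the Poisson case; the dispersion term is what lets the model absorb the extra variability between biological replicates.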

The basic assumptions are illustrated by a figure (image.png in the original post, not reproduced here).

The figure shows the distribution of the expression of a gene A across all samples. Since there are usually few biological samples, statisticians often fit a negative binomial distribution directly, with mean $s_{kj}\theta_{ki}$.

After a statistical test of the difference between the two distributions, it is clear that the expression of this gene in condition 2 is lower than in condition 1. To calculate the p-value, we can consider a permutation test, which resamples from the two distributions.
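A permutation test of this kind can be sketched as follows. This is a minimal illustration under stated assumptions: the test statistic (absolute difference in group means), the function name, and the normalized counts are all hypothetical choices, not taken from the review.

```python
import random

def permutation_p_value(group1, group2, n_perm=10000, seed=0):
    """Two-sided permutation p-value for a difference in group means."""
    rng = random.Random(seed)
    n1 = len(group1)
    observed = abs(sum(group1) / n1 - sum(group2) / len(group2))
    pooled = list(group1) + list(group2)
    n2 = len(pooled) - n1
    hits = 0
    for _ in range(n_perm):
        # Randomly reassign the condition labels.
        rng.shuffle(pooled)
        diff = abs(sum(pooled[:n1]) / n1 - sum(pooled[n1:]) / n2)
        # Small tolerance so floating-point ties count as "at least as extreme".
        if diff >= observed - 1e-12:
            hits += 1
    # Add-one correction keeps the estimate away from exactly zero.
    return (hits + 1) / (n_perm + 1)

# Hypothetical normalized counts of one gene, 5 samples per condition.
condition1 = [98, 110, 105, 120, 101]
condition2 = [52, 61, 47, 58, 66]

p_diff = permutation_p_value(condition1, condition2)
p_same = permutation_p_value([5, 6, 7, 8], [5, 6, 7, 8])
print(p_diff, p_same)
```

With only a handful of samples per condition the number of distinct label assignments is small, so the smallest attainable p-value is limited; this is one reason parametric models such as the negative binomial are popular for RNA-seq.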

The other approach is based on co-expression analysis, in which:

1. $A_{ij}$ is the entry of the correlation matrix for gene $i$ and gene $j$
2. $k$ stands for gene $k$
3. $d_{ij} = 1 - T_{ij}$ characterizes the distance (dissimilarity) between genes
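The formula for $T_{ij}$ is not shown in the text. The sketch below assumes it is the topological overlap measure used in WGCNA-style co-expression analysis, $T_{ij} = \frac{\sum_{u} A_{iu}A_{uj} + A_{ij}}{\min(k_i, k_j) + 1 - A_{ij}}$ with connectivity $k_i = \sum_{u \ne i} A_{iu}$; that choice, the function name, and the toy adjacency matrix are assumptions for illustration.

```python
def topological_overlap(adj):
    """Topological overlap matrix from a symmetric adjacency matrix in [0, 1]."""
    n = len(adj)
    # Connectivity of each gene, excluding the self-connection.
    conn = [sum(adj[i][u] for u in range(n) if u != i) for i in range(n)]
    tom = [[1.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            if i == j:
                continue
            # Strength of connections shared through third genes u.
            shared = sum(adj[i][u] * adj[u][j]
                         for u in range(n) if u not in (i, j))
            tom[i][j] = (shared + adj[i][j]) / (
                min(conn[i], conn[j]) + 1.0 - adj[i][j])
    return tom

# Hypothetical adjacency matrix for three genes.
adjacency = [
    [1.0, 0.9, 0.8],
    [0.9, 1.0, 0.1],
    [0.8, 0.1, 1.0],
]

tom = topological_overlap(adjacency)
# Dissimilarity d_ij = 1 - T_ij, usable as input to hierarchical clustering.
dist = [[1.0 - t for t in row] for row in tom]
print(tom[0][1], dist[0][1])
```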

## 3). Transcript-level

A gene may have several different transcripts. Transcript-level analysis mainly quantifies the different transcripts of a gene. However, transcript quantification often runs into a problem: for the same gene, the sequences of some transcripts overlap, so when reads are aligned back it is difficult to tell which transcript they come from. Therefore, statisticians often use the EM algorithm to quantify transcripts.

The notation is defined as follows: $\theta_j$ denotes the probability that a read comes from isoform $j$.
The set of isoforms is $\{1, 2, 3, \dots, J\}$.

Region-based:

Let $X = \{X_s \mid s \in S\}$, where $X_s$ represents the total number of reads mapped to region $s$, and assume that $X_s$ follows a Poisson distribution with parameter $\lambda_s$:

$$X_s \sim \mathrm{Poisson}(\lambda_s)$$

The parameters $\lambda_s$ are assumed to satisfy a linear relationship with the $\theta_j$.

Consider the following example with three isoforms, where $X_s$ refers specifically to the reads mapped to exons. In this example there are four exons, $X_s = X_{s_1} + X_{s_2} + X_{s_3} + X_{s_4}$.

For each transcript, if the transcript lacks an exon, the number of reads on that exon is 0, and in the likelihood function the factor for the corresponding exon region is 1 (equivalent to no contribution). Following the idea of maximum likelihood estimation, the goal is to find the parameters $\lambda_s$ at which the likelihood function $L(\cdot)$ attains its maximum. Since $\lambda_s$ and $\theta_j$ satisfy the linear relationship, we first determine $\lambda_s$ and then use the EM algorithm to obtain $\theta_j$. For the principle of how reads are allocated, see: "Understanding the RSEM algorithm with a simple EM algorithm model".

After the calculation we obtain, for example, $\theta_1 = 0.37$, $\theta_2 = 0.33$, $\theta_3 = 0.30$. This is equivalent to saying that, out of a total of 100 reads assigned to the region (the gene), 37 come from isoform 1, 33 from isoform 2, and 30 from isoform 3.

The basic idea of the basic model is to calculate the probability that read $i$ comes from isoform $j$. By the conditional probability formula, $P(\text{read } i, \text{isoform } j) = P(\text{read } i \mid \text{isoform } j)\,\theta_j$ characterizes the probability of selecting both read $i$ and isoform $j$, and this gives the quantification result.
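The read-assignment idea can be sketched with a toy EM loop. This is a deliberately simplified illustration, not RSEM itself: it assumes a hypothetical 0/1 compatibility matrix (read $i$ either can or cannot come from isoform $j$), ignores isoform lengths and sequencing errors, and assumes every read is compatible with at least one isoform.

```python
def em_isoform_proportions(compat, n_iter=200):
    """Estimate theta_j from a read-by-isoform 0/1 compatibility matrix."""
    n_reads, n_iso = len(compat), len(compat[0])
    theta = [1.0 / n_iso] * n_iso          # start from equal proportions
    for _ in range(n_iter):
        counts = [0.0] * n_iso
        for row in compat:
            # E-step: split each read among compatible isoforms,
            # in proportion to the current theta.
            weights = [theta[j] * row[j] for j in range(n_iso)]
            total = sum(weights)
            for j in range(n_iso):
                counts[j] += weights[j] / total
        # M-step: new theta is the expected fraction of reads per isoform.
        theta = [c / n_reads for c in counts]
    return theta

# Hypothetical data for two isoforms: 30 reads unique to isoform 1,
# 10 unique to isoform 2, and 60 reads compatible with both.
reads = [[1, 0]] * 30 + [[0, 1]] * 10 + [[1, 1]] * 60

theta = em_isoform_proportions(reads)
print(theta)
```

In this toy example the ambiguous reads end up divided in the same 3:1 ratio as the uniquely assigned reads, which is the behavior one would expect from the maximum likelihood solution.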

Regression-based:

The regression-based method is similar to the region-based method, except that the region-based method estimates the parameters by maximum likelihood, whereas the regression-based method solves for the parameters by least squares.
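A least-squares version can be sketched as below. The setup is hypothetical: four equal-length exons, three isoforms, a 0/1 design matrix saying which isoform contains which exon, and noise-free exon counts. Plain unconstrained least squares is used for brevity; a practical method would likely add constraints such as non-negativity.

```python
def solve_linear(a, b):
    """Solve a square system a x = b by Gaussian elimination with pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            f = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= f * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        tail = sum(m[i][c] * x[c] for c in range(i + 1, n))
        x[i] = (m[i][n] - tail) / m[i][i]
    return x

def least_squares(design, y):
    """Ordinary least squares via the normal equations (A^T A) b = A^T y."""
    p = len(design[0])
    ata = [[sum(row[i] * row[j] for row in design) for j in range(p)]
           for i in range(p)]
    aty = [sum(row[i] * yi for row, yi in zip(design, y)) for i in range(p)]
    return solve_linear(ata, aty)

# Rows = exons 1..4, columns = isoforms 1..3 (1 if the isoform contains the exon).
design = [
    [1, 1, 1],
    [1, 1, 0],   # isoform 3 skips exon 2
    [1, 0, 1],   # isoform 2 skips exon 3
    [1, 1, 1],
]
# Hypothetical read counts per exon.
exon_counts = [100, 70, 67, 100]

abundance = least_squares(design, exon_counts)
print(abundance)
```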

## 4). Exon-level

This section mainly analyzes alternative splicing events. The PSI ($\psi$) of an alternative splicing event is defined as:

$$\psi = \frac{C_I / L_I}{C_I / L_I + C_E / L_E}$$

where:

1. $C_I$ denotes the number of reads supporting the inclusion isoform
2. $C_E$ denotes the number of reads supporting the exclusion isoform
3. $L_I$ and $L_E$ denote the lengths or the adjusted lengths
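A small worked example of PSI, assuming it is the length-normalized fraction of inclusion reads, $\psi = (C_I/L_I) / (C_I/L_I + C_E/L_E)$, built from the quantities listed above. The function name and the numbers are hypothetical.

```python
def psi(c_i, c_e, l_i, l_e):
    """Length-normalized fraction of reads supporting the inclusion isoform."""
    inclusion = c_i / l_i
    exclusion = c_e / l_e
    return inclusion / (inclusion + exclusion)

# Hypothetical event: 60 inclusion reads, 20 exclusion reads, and an
# inclusion isoform with twice the (adjusted) length of the exclusion isoform.
psi_event = psi(c_i=60, c_e=20, l_i=2, l_e=1)
# Without length normalization the raw read fraction would be higher.
raw_fraction = 60 / (60 + 20)
print(psi_event, raw_fraction)
```

The length normalization matters because a longer inclusion isoform collects more reads at the same abundance; dividing by the lengths removes that bias.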

The statistical model of alternative splicing is as follows. The number of reads supporting the inclusion isoform follows a binomial distribution in which the total number of reads is $n = C_I + C_E$ and the probability that a read belongs to the inclusion isoform is $\psi$ (PSI), so the mean is $\mu = n \times \psi$. To judge differential alternative splicing, construct the binomial distribution under each condition $k$ and compare them:

1. If the two distributions are statistically different (the distribution of $C_{I,k}$ differs between conditions), the event is judged to be a differential alternative splicing event
2. If there is no difference between the two distributions (the distribution of $C_{I,k}$ does not differ), the event is judged to be a non-differential alternative splicing event
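One way to compare the two binomial distributions is a likelihood ratio test. The sketch below is an illustration under simplifying assumptions, not the method of any particular tool: one pooled count per condition, no replicates, no overdispersion, and a chi-square approximation with one degree of freedom. All names and counts are hypothetical.

```python
import math

def binom_loglik(k, n, p):
    """Binomial log-likelihood, dropping the constant log C(n, k)."""
    ll = 0.0
    if k > 0:
        ll += k * math.log(p)
    if n - k > 0:
        ll += (n - k) * math.log(1.0 - p)
    return ll

def lrt_differential_splicing(ci1, ce1, ci2, ce2):
    """Test H0: psi_1 == psi_2 from inclusion/exclusion counts per condition."""
    n1, n2 = ci1 + ce1, ci2 + ce2
    psi1, psi2 = ci1 / n1, ci2 / n2
    psi0 = (ci1 + ci2) / (n1 + n2)       # shared estimate under H0
    ll_alt = binom_loglik(ci1, n1, psi1) + binom_loglik(ci2, n2, psi2)
    ll_null = binom_loglik(ci1, n1, psi0) + binom_loglik(ci2, n2, psi0)
    stat = max(0.0, 2.0 * (ll_alt - ll_null))
    # Upper tail of a chi-square distribution with 1 degree of freedom.
    p_value = math.erfc(math.sqrt(stat / 2.0))
    return stat, p_value

# Hypothetical counts: (inclusion, exclusion) reads in condition 1 and 2.
stat_diff, p_diff = lrt_differential_splicing(80, 20, 20, 80)
stat_same, p_same = lrt_differential_splicing(30, 10, 60, 20)
print(p_diff, p_same)
```

In the first call the inclusion fraction shifts from 0.8 to 0.2, so the event would be called differential; in the second call both conditions have a fraction of 0.75 and the test finds no difference.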
