What do the log2FC and FDR values mean for differentially expressed genes?
September 16, 2021


In transcriptome differential expression analysis, every gene in the results comes with a log2FC value and an FDR value. What do these two mean?

log2FC: FC stands for fold change, the ratio of expression between two samples (groups). Taking the base-2 logarithm of that ratio gives log2FC. By default, an absolute log2FC greater than 1 (more than a twofold change in either direction) is used as the screening standard for differential genes.
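As a minimal sketch of the definition above, log2FC can be computed from two expression values as follows. The function name and the optional pseudocount parameter are assumptions made for illustration (a pseudocount is a common way to avoid dividing by zero, but it is not part of the definition given here).

```python
import math

def log2_fold_change(treatment: float, control: float, pseudocount: float = 0.0) -> float:
    """Return log2(treatment / control).

    pseudocount is a hypothetical convenience: it is added to both values
    so that a zero expression value does not cause a division by zero.
    """
    return math.log2((treatment + pseudocount) / (control + pseudocount))

# A doubling of expression gives log2FC = 1, a halving gives log2FC = -1.
up = log2_fold_change(200.0, 100.0)
down = log2_fold_change(50.0, 100.0)
unchanged = log2_fold_change(100.0, 100.0)
```

Because the logarithm is symmetric around zero, up-regulation and down-regulation of the same magnitude get the same absolute value, which is why the threshold is stated on the absolute value of log2FC.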

FDR: the false discovery rate, obtained by correcting the p-value of the significance test. Differential expression analysis of transcriptome sequencing data runs a separate statistical hypothesis test on the expression values of a large number of genes, which creates a false positive problem. Therefore, during differential expression analysis, the widely accepted Benjamini-Hochberg correction is applied to the p-values obtained from the original hypothesis tests, and the resulting FDR is used as the key index for screening differentially expressed genes. Generally, FDR < 0.01 or FDR < 0.05 is taken as the default standard.

The thresholds for these two indicators are generally empirical values and usually do not need to be adjusted. If an experiment yields too few or too many differential genes, the thresholds can be fine-tuned.
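The screening described above can be sketched as a simple filter. The gene names and values below are made up for illustration, and the function name and default cutoffs (absolute log2FC > 1, FDR < 0.05) are assumptions based on the defaults mentioned in this article.

```python
def is_differential(log2fc: float, fdr: float,
                    log2fc_cutoff: float = 1.0, fdr_cutoff: float = 0.05) -> bool:
    """A gene passes only if it clears BOTH thresholds."""
    return abs(log2fc) > log2fc_cutoff and fdr < fdr_cutoff

# Hypothetical results table: gene name -> (log2FC, FDR)
genes = {
    "geneA": (2.3, 0.001),    # large change, significant
    "geneB": (0.4, 0.0005),   # significant, but change too small
    "geneC": (-1.8, 0.03),    # down-regulated, significant
    "geneD": (3.1, 0.20),     # large change, not significant
}

selected = [name for name, (lfc, fdr) in genes.items() if is_differential(lfc, fdr)]
```

Fine-tuning the screening then amounts to passing different cutoff values, for example `is_differential(lfc, fdr, fdr_cutoff=0.01)` for a stricter FDR.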

In fact, the dashed lines in a differential expression volcano plot (as shown in the figure below) mark these two thresholds.


Why use FDR

In transcriptome analysis, determining whether a transcript is expressed differently in different samples is one of the core tasks. Generally speaking, transcripts whose expression differs by more than twofold between samples are regarded as differentially expressed. To judge whether the expression difference between two samples comes from various errors or from a real difference, we need to run hypothesis tests on the expression data of all genes in the two samples. Commonly used hypothesis tests include the t-test, the chi-square test and so on.

Many people who are new to transcriptome analysis ask: to decide whether a transcript is differentially expressed, isn't it enough to run the hypothesis test and look at the p-value? Why introduce a new concept like FDR? The reason is that transcriptome analysis does not look at one or a few transcripts. It analyzes all transcripts expressed in a sample, so it needs as many hypothesis tests as there are transcripts. This leads to a serious problem: the small false positive rate of a single hypothesis test accumulates to an alarming degree across all tests. Take a less rigorous example.

Suppose there is such a project:

● Two samples were included, and expression data for 10000 transcripts were obtained.

● The expression levels of 100 transcripts really differed between the two samples.

● The differential expression analysis of an individual gene had a 1% false positive rate.

Because of the 1% false positive rate, analyzing these 10000 genes yields about 100 false positive results (10000 × 1%), plus the 100 real results, for a total of 200 results. In this example, 50% of the 200 differentially expressed genes obtained in one analysis are false positives, which is obviously unacceptable. To solve this problem, the concept of FDR was introduced to control the proportion of false positives in the final analysis results.
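The arithmetic of this example can be written out as a short sketch. The variable names are illustrative. The second calculation is a slightly stricter variant, added here as an assumption, in which only the 9900 transcripts with no real difference can produce false positives. Both versions assume that all 100 real differences are detected.

```python
total_transcripts = 10_000
truly_different = 100
false_positive_rate = 0.01  # per-test false positive rate

# The article's simplified count: 1% of all tests are false positives.
false_positives = total_transcripts * false_positive_rate
reported = truly_different + false_positives
false_share = false_positives / reported

# Stricter variant (assumption): only truly unchanged transcripts
# can be false positives.
strict_false_positives = (total_transcripts - truly_different) * false_positive_rate
strict_false_share = strict_false_positives / (truly_different + strict_false_positives)

print(f"simplified: {false_positives:.0f} false of {reported:.0f} reported ({false_share:.1%})")
print(f"stricter:   {strict_false_positives:.0f} false ({strict_false_share:.1%})")
```

Either way, roughly half of the reported genes are false positives even though each individual test is wrong only 1% of the time.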

How to calculate FDR

FDR is calculated by correcting the p-values of the hypothesis tests. Generally speaking, the Benjamini-Hochberg method (BH method for short) is used, and the calculation is as follows:

1. Sort all p-values in ascending order. Denote a p-value by P, its rank in the sorted list by i, and the total number of p-values by m.

2. FDR(i) = P(i) × m / i

3. Going from the largest i to the smallest, set FDR(i) = min{FDR(i), FDR(i+1)}
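The three steps above can be sketched in plain Python as follows. The function name `bh_adjust` is an assumption for illustration. Capping the result at 1 is a common convention (an FDR above 1 has no meaning) and is not part of the steps as written. The p-values are made up.

```python
def bh_adjust(p_values):
    """Return BH-adjusted values (FDR) in the same order as the input."""
    m = len(p_values)
    # Step 1: rank the p-values in ascending order (order[0] is the smallest).
    order = sorted(range(m), key=lambda k: p_values[k])
    adjusted = [0.0] * m
    running_min = 1.0  # assumption: cap FDR at 1
    # Steps 2 and 3 combined: walk from the largest rank down to the smallest,
    # compute P(i) * m / i and keep the minimum seen so far.
    for rank in range(m, 0, -1):
        idx = order[rank - 1]
        raw = p_values[idx] * m / rank
        running_min = min(running_min, raw)
        adjusted[idx] = running_min
    return adjusted

# Made-up p-values, deliberately not sorted.
p = [0.02, 0.5, 0.01, 0.021]
fdr = bh_adjust(p)
significant = [value < 0.05 for value in fdr]
```

In this made-up input the raw step-2 values for ranks 1 and 2 are both 0.04, but step 3 lowers them to the rank-3 value of 0.028, which shows how step 3 makes the FDR values non-decreasing in rank.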

Note: the original BH algorithm actually finds the largest i that satisfies P(i) ≤ i / m × (FDR threshold). All p-values ranked at or below that i are then considered significant. In practice, to make it convenient to analyze the data under different FDR thresholds, the method in step 3 is used instead. It ensures that, whichever FDR threshold is selected, all significant data can be found directly by comparing the FDR value with the threshold.
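For comparison, the original form of the procedure described in this note can be sketched as below. The function name `bh_step_up` and the parameter name `alpha` (the chosen FDR threshold) are assumptions for illustration, and the p-values are made up.

```python
def bh_step_up(p_values, alpha):
    """Return the set of input positions declared significant at FDR threshold alpha."""
    m = len(p_values)
    order = sorted(range(m), key=lambda k: p_values[k])  # ascending p-values
    largest_rank = 0
    for rank, idx in enumerate(order, start=1):
        # Keep the LARGEST rank whose p-value is under its own threshold.
        if p_values[idx] <= rank / m * alpha:
            largest_rank = rank
    # Everything ranked at or below that rank is significant, even if an
    # individual p-value misses its own per-rank threshold.
    return set(order[:largest_rank])

p = [0.01, 0.02, 0.021, 0.5]
hits_005 = bh_step_up(p, alpha=0.05)
hits_003 = bh_step_up(p, alpha=0.03)
hits_002 = bh_step_up(p, alpha=0.02)
```

With alpha = 0.03, the two smallest p-values miss their own per-rank thresholds, yet they are still reported because the third-ranked p-value passes. The drawback is that the whole search has to be repeated for every new threshold, which is what the step 3 formulation avoids.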

Let’s take the FDR calculation process as an example


In this example, the first column is the original p-value, the second column is its rank after sorting, the third column is the initial FDR computed from the p-value, and the fourth column is the FDR value finally used for filtering the data. If we set FDR < 0.05, the two values highlighted in green are the ones considered significant in the final analysis.

Choosing the FDR threshold is a very important step in transcriptome analysis. Commonly used thresholds include 0.01, 0.05, 0.1 and so on. In practice, the threshold can also be chosen flexibly according to actual needs. For example, in the transcriptome analysis of fungi or prokaryotes, these species have fewer transcripts and therefore less accumulation of false positives, so the FDR threshold can be set somewhat higher to obtain more differential expression results, which helps subsequent analysis.