Author: Andre Ye
Deep hub translation team: Meng Xiangjie
Many people did not expect that viruses, like other creatures on earth struggling to survive, would evolve or mutate.
Just look at a fragment of the RNA sequence of a virus carried by bats, the origin of the human virus.
… And at a fragment of the RNA sequence extracted from the human COVID-19 virus.
… It is clear that the coronavirus has changed its internal structure to accommodate its new host (more precisely, about 20% of its internal structure has mutated), yet it has kept enough of that structure to remain true to its original species.
In fact, studies have shown that the COVID-19 virus improves its chances of survival by mutating repeatedly. In the fight against the coronavirus, we need to find out not only how to eliminate the virus, but also how it mutates and how to contain those mutations.
In this article, I will
- Provide a simple explanation of RNA sequences
- Use k-means to create clusters of genome information
- Use PCA to visualize the clusters
… and analyze each procedure we run to gain insight.
What is a genome sequence?
If you have a basic understanding of RNA sequences, skip this section.
Genome sequencing, often compared to “decoding”, is the process of analyzing DNA extracted from a sample. Within each normal human cell, there are 23 pairs of chromosomes that hold the DNA.
The coiled double helix structure of DNA can be unwound into a ladder shape. The rungs of the ladder are made of pairs of chemical letters called bases. There are only four kinds of bases in DNA: adenine, thymine, guanine and cytosine. Adenine only binds to thymine, while guanine only binds to cytosine. These bases are represented by A, T, G and C, respectively.
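The pairing rule described above can be written as a small lookup table. The following is only an illustrative sketch: the function name complement and the example fragment are made up here and do not come from the article's data.

```python
# Base-pairing rules for DNA: A pairs with T, G pairs with C.
COMPLEMENT = {"A": "T", "T": "A", "G": "C", "C": "G"}


def complement(strand: str) -> str:
    """Return the base-by-base partner strand of a DNA strand."""
    return "".join(COMPLEMENT[base] for base in strand.upper())


example = "ATGCCGTA"  # made-up fragment, for illustration only
paired = complement(example)
print(example)
print(paired)
```

Applying the function twice returns the original strand, which is a quick way to check that the pairing table is consistent.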
These bases make up the various codes that tell an organism how to build proteins; it is the DNA that essentially controls how a virus behaves.
The process of converting DNA into RNA and then into protein
Special equipment, including sequencing instruments and special tags, can be used to reveal the DNA sequence of a specific fragment. The information obtained is then analyzed and compared, so that researchers can identify changes in genes and their associations with diseases and phenotypes, as well as potential drug targets.
A genome sequence is a long string of “A”, “T”, “G” and “C” that encodes how an organism responds to its environment. Mutations in an organism arise from changes to its DNA. Looking at genome sequences is therefore a powerful way to analyze coronavirus mutations.
The data, found on Kaggle, are as follows:
Each row represents one mutation of the bat virus. First, take a minute to appreciate how incredible nature is: within weeks, the coronavirus has produced 262 mutations of itself to improve its survival rate.
Some important columns:
query acc.ver represents the identifier of the original virus.
subject acc.ver is the identifier of the mutated virus.
% identity represents the percentage of the sequence that is identical to the original virus.
alignment length represents how many items in the sequence are identical or aligned.
mismatches represents the number of items at which the mutation and the original differ.
bit score represents a measure of how good the alignment is; the higher the score, the better the alignment.
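To make the columns above more concrete, here is a deliberately simplified sketch that compares two already-aligned sequences of equal length position by position. The function compare_aligned and both example strings are assumptions made for illustration. A real alignment tool also handles gaps and derives the bit score from a scoring scheme, neither of which is modeled here.

```python
def compare_aligned(original: str, mutated: str) -> dict:
    """Compare two already-aligned, equal-length sequences position by position."""
    if len(original) != len(mutated):
        raise ValueError("sequences must be aligned to the same length")
    alignment_length = len(original)
    # A mismatch is a position where the two sequences carry different bases.
    mismatches = sum(a != b for a, b in zip(original, mutated))
    identity = 100.0 * (alignment_length - mismatches) / alignment_length
    return {
        "alignment length": alignment_length,
        "mismatches": mismatches,
        "% identity": identity,
    }


# Made-up sequences that differ at two of ten positions.
result = compare_aligned("ATGCCGTAAC", "ATGACGTTAC")
print(result)
```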
Some statistical measures for each column (conveniently obtained in Python by calling data.describe()):
Interestingly, the minimum value of % identity is about 77.6%.
The standard deviation of % identity is 7%. This value is quite large, which means that there is a wide range of possible mutations.
The larger standard deviation of bit score supports this view: it is greater than the mean!
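Summary statistics like the ones discussed above could be produced with a sketch along these lines. It assumes pandas is installed. The rows are made up to stand in for the Kaggle table; only the column names are taken from the article.

```python
import pandas as pd

# Made-up rows standing in for the Kaggle table (values are NOT real data).
data = pd.DataFrame({
    "% identity": [99.9, 96.2, 88.0, 77.6],
    "alignment length": [29903, 29800, 21000, 1200],
    "mismatches": [30, 1100, 2500, 260],
    "bit score": [55000.0, 48000.0, 24000.0, 900.0],
})

# count, mean, std, min, quartiles and max for every numeric column
summary = data.describe()
print(summary)

# Ratio of standard deviation to mean; a value above 1 means the
# standard deviation is larger than the mean for that column.
spread = summary.loc["std"] / summary.loc["mean"]
print(spread)
```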
A good way to visualize the data is with a correlation heat map. Each cell represents the degree of correlation between one feature and another.
You can see that many of the columns are highly correlated with each other. This makes sense, because most of the columns are variations on the same alignment measurements. One thing to note is the high correlation involving bit score.
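The numbers behind a correlation heat map are simply a correlation matrix. The sketch below computes one with pandas on synthetic data. The relationship between bit score and alignment length is built into the fake data as an assumption, so the output says nothing about the real table.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 200

# Synthetic columns (assumption: bit score grows with alignment length).
alignment_length = rng.integers(500, 30000, size=n).astype(float)
bit_score = 1.8 * alignment_length + rng.normal(0, 500, size=n)
mismatches = rng.integers(0, 300, size=n).astype(float)

data = pd.DataFrame({
    "alignment length": alignment_length,
    "mismatches": mismatches,
    "bit score": bit_score,
})

# Pearson correlation between every pair of columns; each cell of a
# correlation heat map shows one entry of this matrix.
corr = data.corr()
print(corr.round(2))
```

The resulting matrix can be handed to any heat map plotting function, for example seaborn's heatmap, to get the colored grid.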
Using k-means to create mutation clusters
K-means is a clustering algorithm: a machine learning method that finds data points in feature space and combines them into groups. Our goal with k-means is to find mutation clusters, from which we can gain insight into the nature of the mutations and how to address them.
However, we still need to choose the number of clusters k. In two dimensions this is as simple as plotting the points, but it cannot be done in higher dimensions (if we want to keep the most information). Methods for choosing k such as the elbow method are subjective and inexact, so we will use the silhouette method.
The silhouette method evaluates how well the data fit the clustering produced by k cluster centers. In Python, the sklearn library makes implementing k-means and the silhouette method very simple.
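A minimal sketch of choosing k with the silhouette method might look as follows, assuming scikit-learn is installed. The data are synthetic blobs generated with make_blobs, not the article's table, and the range of k values tried is an arbitrary choice.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data: 300 points in 4 dimensions (NOT the real table).
X, _ = make_blobs(n_samples=300, centers=5, n_features=4,
                  cluster_std=0.6, random_state=0)
X = StandardScaler().fit_transform(X)

# Fit k-means for several values of k and record the mean silhouette
# score, which lies between -1 and 1 (higher means better separated).
scores = {}
for k in range(2, 9):
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
    scores[k] = silhouette_score(X, labels)

best_k = max(scores, key=scores.get)
for k, score in scores.items():
    print(k, round(score, 3))
print("best k:", best_k)
```

Whatever this sketch prints describes the synthetic blobs only; the value of k for the real data has to come from running the same loop on the real, standardized columns.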
It seems that five clusters are the most suitable. Now we can determine the cluster centers. These are the points around which each cluster is centered, and they represent (in this case) a numerical summary of the five major mutation types.
Note: the data have been standardized so that all columns are on the same scale. Otherwise, the columns would not be comparable.
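Standardizing the columns and then reading off the cluster centers could be sketched like this, assuming scikit-learn and pandas. The 262 rows are random placeholder values on deliberately different scales; only the column names and the choice of five clusters follow the article.

```python
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)

# Placeholder values on very different scales (NOT real data).
raw = pd.DataFrame({
    "% identity": rng.uniform(77, 100, 262),
    "alignment length": rng.uniform(100, 30000, 262),
    "mismatches": rng.uniform(0, 300, 262),
    "bit score": rng.uniform(100, 55000, 262),
})

# Rescale every column to mean 0 and standard deviation 1 so that
# no single column dominates the distance computations in k-means.
scaled = StandardScaler().fit_transform(raw)

kmeans = KMeans(n_clusters=5, n_init=10, random_state=0).fit(scaled)

# One row per cluster center, one column per (scaled) feature.
centers = pd.DataFrame(kmeans.cluster_centers_, columns=raw.columns)
print(centers.round(2))
```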
This heat map shows the attributes of each cluster, column by column. Because the data were scaled, the absolute values shown in the figure carry no meaning. However, you can compare the values within each column to get a visual sense of the relative properties of each mutation cluster. If scientists are to develop a vaccine, it should target these major viral mutation clusters.
In the next section, we’ll use PCA to visualize the data.
PCA data visualization
PCA (principal component analysis) is a dimensionality reduction method. It selects orthogonal vectors in the multidimensional space to serve as new axes, preserving as much information (variance) as possible.
With the popular Python library sklearn, PCA can be implemented in two lines of code. First, we can check the explained variance ratio. This is the fraction of the variance in the original data that the reduced representation retains. In this case, the explained variance ratio is 0.9838548580740327. That is very high! We can rest assured that whatever analysis we base on the PCA output will not be distorted.
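The sketch below shows one way the explained variance ratio can be checked with scikit-learn. It assumes the figure quoted above is the sum over the two retained components, which the article does not state explicitly. The data are synthetic: four columns driven mostly by two hidden factors.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)

# Synthetic data: 4 observed columns built from 2 hidden factors
# plus a little noise (NOT the article's table).
latent = rng.normal(size=(262, 2))
mixing = rng.normal(size=(2, 4))
X = latent @ mixing + 0.1 * rng.normal(size=(262, 4))
X = StandardScaler().fit_transform(X)

# The "two lines": create the model, then fit and project to 2D.
pca = PCA(n_components=2)
X_2d = pca.fit_transform(X)

# Fraction of the total variance kept by the two components together.
retained = pca.explained_variance_ratio_.sum()
print(pca.explained_variance_ratio_)
print(retained)
```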
Each new feature (principal component) is a linear combination of the original columns. We can visualize how important each column is to each of the two principal components.
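The weights of that linear combination are available from a fitted PCA model. The following sketch, again on random placeholder data with the article's column names attached purely as labels, puts them into a small table.

```python
import numpy as np
import pandas as pd
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

columns = ["% identity", "alignment length", "mismatches", "bit score"]

rng = np.random.default_rng(1)
# Random correlated placeholder data (NOT real measurements).
X = rng.normal(size=(262, 4)) @ rng.normal(size=(4, 4))
X = StandardScaler().fit_transform(X)

pca = PCA(n_components=2).fit(X)

# Each row holds the weights one principal component assigns to the
# original columns; a large absolute weight means a strong influence.
loadings = pd.DataFrame(pca.components_, columns=columns,
                        index=["component 1", "component 2"])
print(loadings.round(2))
```

Each row of the table has unit length and the two rows are orthogonal, which is what makes the components usable as new axes.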
It is important to understand what a high score on component 1 means: in this case, it is characterized by a longer alignment length (closer to the original virus), while component 2 is mainly characterized by a shorter alignment length (farther from the original virus). This is also reflected in the larger difference in bit score.
Clearly, there are five main types of virus mutation. We can learn a lot from this.
Only one of the mutation clusters lies on the right; the other four lie on the left. Component 1 is characterized by a high alignment length. This means that the higher the value of component 1, the longer the alignment length (closer to the original virus), and the lower the value of component 1, the farther the mutation is from the original virus. Most virus mutations are therefore quite different from the original virus. Scientists trying to make a vaccine should be aware that the virus produces a large number of mutations that are very different from the original.
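Reading cluster positions off a two-dimensional projection can also be done numerically. The sketch below is an illustration on synthetic blobs, not a reproduction of the article's figure: it clusters the data, projects them with PCA, and sorts the clusters by their average position along component 1.

```python
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data with 262 rows and 4 columns (NOT real data).
X, _ = make_blobs(n_samples=262, centers=5, n_features=4, random_state=0)
X = StandardScaler().fit_transform(X)

# Cluster in the full feature space, then project to 2D for viewing.
labels = KMeans(n_clusters=5, n_init=10, random_state=0).fit_predict(X)
coords = PCA(n_components=2).fit_transform(X)

projected = pd.DataFrame(coords, columns=["component 1", "component 2"])
projected["cluster"] = labels

# Average position of each cluster, ordered from left to right
# along component 1.
positions = projected.groupby("cluster").mean().sort_values("component 1")
print(positions.round(2))
```

The same projected table can be passed to a scatter plot, colored by the cluster column, to draw the picture itself.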
Using k-means and PCA, five major mutation clusters in coronavirus can be identified. Scientists developing coronavirus vaccines can use the information from the cluster centers to obtain knowledge about each cluster feature. We can use principal component analysis to visualize clusters in two-dimensional space and find that coronavirus has a high mutation rate. That may be why it’s so lethal.
Thank you for reading!