Using logistic function and LSTM to analyze epidemic data

Time: 2020-2-26

Author: Lin Zelong Mo

1. Background

The 2019 novel coronavirus (SARS-CoV-2), formerly known as 2019-nCoV, is an enveloped, positive-sense, single-stranded RNA coronavirus. It is the pathogen behind the novel coronavirus pneumonia outbreak that began at the end of 2019. During the outbreak, researchers identified the virus through nucleic acid testing and genome sequencing of samples from pneumonia patients.

The epidemic is now the topic of greatest public concern, and through the efforts of all parties it has been brought under a degree of control. Professionals differ in their predictions of how it will develop. This article forecasts and analyzes the epidemic data using two simple models; the results are for reference only.

2. Data collection

The data used here include the current novel coronavirus data and the 2003 SARS epidemic data for China. The novel coronavirus data come mainly from the National Health Commission website and other major portals; the SARS data come mainly from the WHO. The novel coronavirus data are fitted with a logistic growth function. The SARS data are used to train an LSTM model, which is then applied to the novel coronavirus data.

3. Fitting curve with logistic function

The logistic function, or logistic curve, is a common S-shaped function. It was named by Pierre François Verhulst in 1844 or 1845 in his study of population growth. The model is widely used to simulate biological reproduction, growth processes, and population growth. The formula is as follows:
$$P(t)=\frac{K P_0 e^{rt}}{K+P_0\left(e^{rt}-1\right)}$$
where $P_0$ is the initial value, $K$ is the final value, $r$ measures how fast the curve changes, and $t$ is the time.
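
As a quick check of the formula, the following minimal sketch evaluates the logistic function in plain Python. The parameter values (K = 1000, P0 = 10, r = 0.2) are made up for illustration and are not fitted to any epidemic data.

```python
import math

def logistic(t, K, P0, r):
    """Logistic growth: P(t) = K*P0*e^(r*t) / (K + P0*(e^(r*t) - 1))."""
    growth = math.exp(r * t)
    return K * P0 * growth / (K + P0 * (growth - 1))

# Illustrative values only (assumed, not fitted)
K, P0, r = 1000.0, 10.0, 0.2

print(logistic(0, K, P0, r))    # equals P0 at t = 0
print(logistic(23, K, P0, r))   # near the inflection point, roughly K / 2
print(logistic(200, K, P0, r))  # approaches K for large t
```

The curve starts at the initial value, grows fastest around the point where it reaches half of K, and then flattens out toward K.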

The next step is to fit this equation to the existing data to obtain the optimal parameters. The data format is shown below. The data cover January 10 to February 14, and we fit the number of confirmed cases nationwide.

Figure 1: latest epidemic data format

The code defines the function and then fits it to the data by the least squares method:

import numpy as np
from scipy.optimize import curve_fit

T0 = 1    # time index of the first observation
R = 0.2   # growth rate, set by hand: the larger r is, the faster the curve converges to K

def logistic_increase_function(t, K, P0):
    # Logistic growth with r held constant, so only K and P0 are fitted
    exp_value = np.exp(R * (t - T0))
    return (K * exp_value * P0) / (K + (exp_value - 1) * P0)

# Least-squares fit: t is the time index, P the number of confirmed cases at each time
popt, pcov = curve_fit(logistic_increase_function, t, P)
# popt holds the best-fit parameters
print("K:", popt[0], "P0:", popt[1], "r:", R)
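
To see how such a fit behaves, here is a small self-contained sketch on synthetic data, assuming NumPy and SciPy are installed. The "true" values of K and P0, the 60-day range, and the starting guess passed through `p0` are assumptions chosen for the demonstration; r is held at 0.2 and t0 at 1, as in the listing above.

```python
import numpy as np
from scipy.optimize import curve_fit

R_FIXED = 0.2  # growth rate held constant
T0 = 1         # index of the first day

def logistic_fixed_r(t, K, P0):
    exp_value = np.exp(R_FIXED * (t - T0))
    return (K * exp_value * P0) / (K + (exp_value - 1) * P0)

# Synthetic "observations" generated from known parameters (assumed values, not real data)
true_K, true_P0 = 70000.0, 40.0
t = np.arange(1, 61, dtype=float)
P = logistic_fixed_r(t, true_K, true_P0)

# Start the search from data-driven guesses: the largest count for K, the first count for P0
popt, pcov = curve_fit(logistic_fixed_r, t, P, p0=[P.max(), P[0]])
fit_K, fit_P0 = popt
print("K:", fit_K, "P0:", fit_P0)
```

On noise-free data the fit should recover the generating parameters closely. With real counts, the estimate of K is much less certain while the curve is still in its early, fast-growing phase.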

The comparison between the fitted function and the observed data is as follows:

Figure 2: results of logistic growth function fitting

If you are interested, you can view the latest data and all of the code at the project address at the bottom.

4. Using LSTM model to predict the number of infected people

Long short-term memory (LSTM) is a special kind of RNN. LSTM units can retain information across time steps, which makes them well suited to problems that depend on context and temporal order. In general, a well-trained LSTM model can use yesterday's and today's data to predict tomorrow's value with reasonable accuracy. LSTM was introduced in a previous article and is not covered again here; interested readers can see the introduction to LSTM stock market forecasting listed in the references.
Because too little novel coronavirus data is available for training, and somewhat more data were collected for SARS in 2003, the model is trained on the SARS data. They cover March 17, 2003 to July 11, 2003, in the following format.

Figure 3: SARS data format

4.1 Data preprocessing

Because the data are not continuous, the missing days have to be filled in. Each missing value is replaced by the average of the previous day's value and the value of the next day that has data. The code is as follows:

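import pandas as pd

dataframe = pd.read_csv('SARS.csv', usecols=[1])
for i in range(dataframe['total'].shape[0]):
    if dataframe.loc[i, 'total'] == 0:
        # Find the next day that has a reported (non-zero) value
        j = i + 1
        while dataframe.loc[j, 'total'] == 0:
            j += 1
        # Fill the gap with the integer mean of the previous day and that next day
        dataframe.loc[i, 'total'] = (dataframe.loc[i - 1, 'total'] + dataframe.loc[j, 'total']) // 2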
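
The same neighbour-averaging rule can be tried on a toy list without pandas. The sketch below uses a hypothetical helper, `fill_gaps`, and made-up numbers; as in the listing above, a zero marks a missing day, and the sketch assumes the first and last entries are not missing.

```python
def fill_gaps(values):
    """Replace each 0 with the integer mean of the previous value and the next non-zero value."""
    filled = list(values)
    for i, value in enumerate(filled):
        if value == 0:
            j = i + 1
            while filled[j] == 0:   # skip over consecutive missing days
                j += 1
            filled[i] = (filled[i - 1] + filled[j]) // 2
    return filled

print(fill_gaps([10, 0, 30, 0, 0, 60]))  # [10, 20, 30, 45, 52, 60]
```

Note that a run of several missing days is filled from left to right, each day using the value just filled in before it, so the result is not a straight-line interpolation across the gap.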

4.2 Normalization

Normalization usually makes model training faster and gives better results, so we apply it here as well.

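from sklearn.preprocessing import MinMaxScaler

# Normalization: scale the case counts into [0, 1]
dataset = dataframe.values.astype('float32')  # assumes the preprocessed 'total' column is the input
scaler = MinMaxScaler(feature_range=(0, 1))
dataset = scaler.fit_transform(dataset)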
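
Min-max scaling maps each value x to (x - min) / (max - min), and predictions made on the scaled data have to be mapped back before they can be read as case counts. A minimal NumPy sketch of both directions, using made-up counts:

```python
import numpy as np

counts = np.array([[100.0], [150.0], [300.0], [500.0]])  # made-up values, one column

lo, hi = counts.min(), counts.max()
scaled = (counts - lo) / (hi - lo)    # forward: values now lie in [0, 1]
restored = scaled * (hi - lo) + lo    # inverse: back to the original scale

print(scaled.ravel())    # 0, 0.125, 0.5, 1
print(restored.ravel())  # 100, 150, 300, 500
```

With scikit-learn, the fitted scaler's `inverse_transform` method performs the same inverse mapping.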

4.3 Data processing

Because data are scarce, we use a time step of 2: the values from the two most recent days are used to predict the value on the third day. The code is as follows:

import numpy as np

def create_dataset(dataset, timestep):
    # Build (input window, next value) pairs from the series
    dataX, dataY = [], []
    for i in range(len(dataset) - timestep):
        a = dataset[i:(i + timestep)]
        dataX.append(a)
        dataY.append(dataset[i + timestep])
    return np.array(dataX), np.array(dataY)

# There is too little training data, so the time step is 2
timestep = 2
trainX, trainY = create_dataset(dataset, timestep)
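
To make the windowing concrete, the sketch below applies the sliding-window idea to a toy sequence with a time step of 2. The helper name `make_windows` and the numbers are illustrative only.

```python
import numpy as np

def make_windows(series, timestep):
    """Split an (n, 1) series into inputs of `timestep` consecutive values and the value that follows."""
    X, y = [], []
    for i in range(len(series) - timestep):
        X.append(series[i:i + timestep])
        y.append(series[i + timestep])
    return np.array(X), np.array(y)

series = np.array([[1.0], [2.0], [3.0], [4.0], [5.0]])  # toy data, one feature per day
X, y = make_windows(series, timestep=2)

print(X.shape, y.shape)     # (3, 2, 1) (3, 1)
print(X[0].ravel(), y[0])   # days 1 and 2 are the input, day 3 is the target
```

The input array has the layout (samples, time steps, features), which is the shape a Keras LSTM layer expects.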

4.4 Network construction

Our data is very simple, so a complex network is unnecessary. The code uses the Keras framework.

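from keras.models import Sequential
from keras.layers import LSTM, Dense
model = Sequential([LSTM(4, input_shape=(timestep, 1)), Dense(1)])  # layer sizes are assumed; the source does not give them
model.compile(loss='mean_squared_error', optimizer='adam')  # loss and optimizer are assumed as well
model.fit(trainX, trainY, epochs=100, batch_size=1, verbose=2)
model.save("LSTM.h5")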

4.5 Final results

The figures below show the model's results on the SARS training data and its predictions for the novel coronavirus data. The LSTM can roughly predict the next day's development of the epidemic. The number of confirmed cases rose sharply on February 12 because Hubei changed its diagnostic criteria, and the epidemic has since stabilized. We hope that people's lives return to normal and that patients recover as soon as possible. Project address: https://momodel.cn/workspace/5e44fecd6f6696a6d279f612?type=app

Here are the actual training results

Figure 4: model results on training set

Figure 5: novel coronavirus curve prediction using model analysis

5. References

1. Using a logistic growth model to fit the number of confirmed cases of 2019-nCoV pneumonia
2. Introduction to LSTM stock market forecast
3. Using LSTM to predict the power consumption of home users