Introduction to data mining for beginners

Time:2021-6-10

This is the content of the student manual. Work through the code first, then gradually improve and revise it.

Understanding of task 1

Tip: this beginner competition is the fifth in the zero-basics entry series jointly sponsored by Datawhale and Tianchi: the ECG heartbeat signal multi-class prediction challenge.

In June 2016, the General Office of the State Council issued its guidance on promoting and standardizing the application and development of health care big data. The guidance points out that applying and developing health care big data will profoundly change the health care model and help improve the efficiency and quality of health care services.

The competition is based on ECG data. Contestants are required to predict the heartbeat signal category from ECG sensor data. The categories correspond to normal cases and to cases affected by different arrhythmias and myocardial infarction, so this is a multi-class classification problem. The competition is meant to introduce you to the application of medical big data and to help newcomers practice and improve on their own.

Project address:https://github.com/datawhalec…

Address:https://tianchi.aliyun.com/co…

1.1 learning objectives

  • Understand the data and objectives of the competition, and understand the scoring system.
  • Complete the registration, download the data and the sample results, submit a result to check in (the sample results can be submitted), and become familiar with the competition process.

1.2 understanding the competition

  • Overview of the competition
  • Data overview
  • Prediction index (evaluation metric)
  • Analysis of competition questions

1.2.1 overview of competition questions

Competitors are required to build models on the given data set to predict the type of each ECG heartbeat signal. The data set can be viewed and downloaded after registration. The data comes from the ECG records of a certain platform, with a total volume of more than 200,000 records. The main field is a column of heartbeat signal sequences, in which every sample has the same sampling frequency and the same length. To keep the competition fair, 100,000 records are selected as the training set, 20,000 as test set A and 20,000 as test set B, and the heartbeat signal category (label) information is anonymized.

The competition aims to guide you into the world of medical big data and is intended mainly for newcomers to practice and improve on their own.

1.2.2 data overview

Generally speaking, the competition page provides a data description for each column (except anonymous features) that explains what the column means. Understanding the columns helps us understand the data and carry out the later analysis.

Tip: an anonymous feature is a feature column whose meaning is not disclosed.

train.csv

  • ID: the unique identifier assigned to the heartbeat signal
  • heartbeat_signals: the heartbeat signal sequence (values separated by “,”)
  • Label: the heartbeat signal category (0, 1, 2, 3)

testA.csv

  • ID: the unique identifier assigned to the heartbeat signal
  • heartbeat_signals: the heartbeat signal sequence (values separated by “,”)
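Because each signal is stored as one comma-separated string, it has to be split into numbers before modelling. The following is a minimal sketch using only the Python standard library. The column names `id`, `heartbeat_signals` and `label`, and the inline sample values, are assumptions made for illustration; check them against the header of the downloaded file.

```python
import csv
import io

# Tiny inline sample standing in for train.csv. The column names and
# the values are assumptions for illustration, not real competition data.
SAMPLE_CSV = '''id,heartbeat_signals,label
0,"0.99,0.94,0.76,0.0",0
1,"0.97,0.92,0.55,0.0",2
'''


def parse_rows(handle):
    """Yield (id, signal, label) with the signal parsed into a list of floats."""
    for row in csv.DictReader(handle):
        signal = [float(v) for v in row["heartbeat_signals"].split(",")]
        # int(float(...)) also accepts labels written as "2.0"
        yield int(row["id"]), signal, int(float(row["label"]))


rows = list(parse_rows(io.StringIO(SAMPLE_CSV)))
for sample_id, signal, label in rows:
    print(sample_id, len(signal), label)
```

For the real file, the same function could be called on `open("train.csv", newline="")`. pandas would work equally well here; the standard library is used only to keep the sketch dependency-free.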

1.2.3 prediction index

Contestants submit the predicted probabilities of the four heartbeat signal categories. The submission is compared with the actual heartbeat categories, and the metric is the sum of the absolute differences between the predicted probabilities and the true values.

The specific calculation formula is as follows:

There are n cases in total. For a given signal, if the true value is [y1, y2, y3, y4] and the model's predicted probabilities are [a1, a2, a3, a4], then the model's evaluation metric, abs-sum, is

$$
\text{abs-sum} = \sum_{j=1}^{n} \sum_{i=1}^{4} \left| y_i - a_i \right|
$$

For example, if the category of a heartbeat signal is 1, one-hot encoding converts it into [0,1,0,0]. If the predicted probabilities for the four categories are [0.1,0.7,0.1,0.1], then the abs-sum of the prediction for this signal is

$$
\text{abs-sum} = \left| 0.1-0 \right| + \left| 0.7-1 \right| + \left| 0.1-0 \right| + \left| 0.1-0 \right| = 0.6
$$
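The metric can be sketched in a few lines of Python. This is an illustrative implementation of the formula above, not the official scoring script; the function names `one_hot` and `abs_sum` are made up for this example.

```python
def one_hot(label, num_classes=4):
    """Encode an integer label as a one-hot list, e.g. 1 -> [0, 1, 0, 0]."""
    vec = [0.0] * num_classes
    vec[label] = 1.0
    return vec


def abs_sum(labels, probabilities, num_classes=4):
    """Sum of |y_i - a_i| over all classes and all samples (lower is better)."""
    total = 0.0
    for label, probs in zip(labels, probabilities):
        target = one_hot(label, num_classes)
        total += sum(abs(y - a) for y, a in zip(target, probs))
    return total


# The worked example from the text: true category 1, predicted [0.1, 0.7, 0.1, 0.1]
score = abs_sum([1], [[0.1, 0.7, 0.1, 0.1]])
print(round(score, 6))  # 0.6
```

A perfect prediction scores 0, and a prediction that puts all probability on a wrong category scores 2 for that sample, so a lower abs-sum is better.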

The common evaluation metrics for multi-class classification algorithms are as follows:

In fact, multi-class evaluation metrics are calculated in exactly the same way as binary ones, except that precision, recall, accuracy and F1 score are calculated for each category, treating that category as positive and all the others as negative.

1. Confusion matrix

  • (1) If an instance is positive and is predicted as positive, it is a true positive (TP)
  • (2) If an instance is positive but is predicted as negative, it is a false negative (FN)
  • (3) If an instance is negative but is predicted as positive, it is a false positive (FP)
  • (4) If an instance is negative and is predicted as negative, it is a true negative (TN)

The first letter, T or F, indicates whether the prediction is correct. The second letter, P or N, indicates whether the predicted result is positive or negative. For example, TP means that the prediction is correct and the predicted result is positive, that is, a positive example is predicted as positive.
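For a multi-class problem, the four counts can be obtained per category by treating one category as positive and the rest as negative. The sketch below illustrates this with made-up labels; `confusion_counts` is a hypothetical helper, not a library function.

```python
def confusion_counts(y_true, y_pred, positive):
    """Count TP, FN, FP, TN when `positive` is the positive category."""
    tp = fn = fp = tn = 0
    for t, p in zip(y_true, y_pred):
        if t == positive and p == positive:
            tp += 1  # positive predicted as positive
        elif t == positive:
            fn += 1  # positive predicted as negative
        elif p == positive:
            fp += 1  # negative predicted as positive
        else:
            tn += 1  # negative predicted as negative
    return {"TP": tp, "FN": fn, "FP": fp, "TN": tn}


# Made-up labels for eight heartbeat signals in four categories
y_true = [0, 0, 1, 1, 2, 3, 3, 3]
y_pred = [0, 1, 1, 1, 2, 3, 0, 3]

counts_1 = confusion_counts(y_true, y_pred, positive=1)
print(counts_1)  # {'TP': 2, 'FN': 0, 'FP': 1, 'TN': 5}
```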

2. Accuracy is a commonly used evaluation metric, but it is not suitable when the classes are imbalanced. Most medical data sets are imbalanced.

$$
Accuracy = \frac{Correct}{Total}
$$

$$
Accuracy = \frac{TP + TN}{TP + TN + FP + FN}
$$
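A small made-up example shows why accuracy can mislead on imbalanced data. Assume 95 normal signals (category 0) and 5 abnormal ones (category 1), and a model that always predicts "normal":

```python
# Made-up, heavily imbalanced labels: 95 negatives and 5 positives
y_true = [0] * 95 + [1] * 5
# A useless model that predicts the majority category for every sample
y_pred = [0] * 100

correct = sum(1 for t, p in zip(y_true, y_pred) if t == p)
accuracy = correct / len(y_true)

# Fraction of the actual positives that the model detects
detected = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
positive_detection_rate = detected / y_true.count(1)

print(accuracy, positive_detection_rate)  # 0.95 0.0
```

The accuracy looks high although not a single abnormal signal is detected, which is why the metrics below are also needed.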

3. Precision (also called positive predictive value), abbreviated as P

$$
Precision = \frac{TP}{TP + FP}
$$

Precision is defined with respect to the prediction results: among all samples predicted as positive, it is the probability that a sample is actually positive. Precision and accuracy look similar, but they are two completely different concepts. Precision measures how accurate the positive predictions are, while accuracy measures the overall prediction correctness over both positive and negative samples.

4. Recall (also called sensitivity or true positive rate), abbreviated as R

$$
Recall = \frac{TP}{TP + FN}
$$

Recall is defined with respect to the original samples: among all samples that are actually positive, it is the probability that a sample is predicted as positive.

Let’s take a simple example to look at precision and recall. Suppose there are 10 articles in total, of which 4 are what you are looking for. Your algorithm returns five articles, but only three of them are ones you really want.

So the precision of the algorithm is 3 / 5 = 60%: three of the five articles it returned are correct. The recall of the algorithm is 3 / 4 = 75%: it found three of the four articles you need. Whether precision or recall should be used as the evaluation metric depends on the specific problem.
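The article example can be checked with a few lines of Python. The article numbers below are arbitrary placeholders chosen to match the counts in the text.

```python
# Articles 1-4 are the ones you are looking for (4 relevant out of 10)
relevant = {1, 2, 3, 4}
# The algorithm returns five articles, three of which are relevant
retrieved = {1, 2, 3, 5, 6}

true_positives = len(relevant & retrieved)   # returned and relevant
precision = true_positives / len(retrieved)  # TP / (TP + FP)
recall = true_positives / len(relevant)      # TP / (TP + FN)

print(precision, recall)  # 0.6 0.75
```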

5. Macro-P

Calculate the precision of each category, then take the average. Here n is the number of categories and P_i is the precision of category i.

$$
macroP = \frac{1}{n} \sum_{i=1}^{n} P_i
$$

6. Macro-R

Calculate the recall of each category, then take the average. Here n is the number of categories and R_i is the recall of category i.

$$
macroR = \frac{1}{n} \sum_{i=1}^{n} R_i
$$

7. Macro-F1

$$
macroF1 = \frac{2 \times macroP \times macroR}{macroP + macroR}
$$
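The three macro metrics can be sketched as follows, using made-up labels and following the formulas above. The helper names are hypothetical, and a category with no predicted (or no actual) samples is given a precision (or recall) of 0 here, which is an assumption of this sketch.

```python
def per_class_precision_recall(y_true, y_pred, classes):
    """Return {category: (precision, recall)} using one-vs-rest counts."""
    result = {}
    for c in classes:
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == c and p == c)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t != c and p == c)
        fn = sum(1 for t, p in zip(y_true, y_pred) if t == c and p != c)
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        result[c] = (precision, recall)
    return result


def macro_scores(y_true, y_pred, classes):
    """Macro-P, macro-R and macro-F1 as defined in the formulas above."""
    pr = per_class_precision_recall(y_true, y_pred, classes)
    macro_p = sum(p for p, _ in pr.values()) / len(classes)
    macro_r = sum(r for _, r in pr.values()) / len(classes)
    denom = macro_p + macro_r
    macro_f1 = 2 * macro_p * macro_r / denom if denom else 0.0
    return macro_p, macro_r, macro_f1


# Made-up labels for eight heartbeat signals in four categories
y_true = [0, 0, 1, 1, 2, 3, 3, 3]
y_pred = [0, 1, 1, 1, 2, 3, 0, 3]

macro_p, macro_r, macro_f1 = macro_scores(y_true, y_pred, classes=[0, 1, 2, 3])
print(round(macro_p, 4), round(macro_r, 4), round(macro_f1, 4))  # 0.7917 0.7917 0.7917
```

Note that some libraries define macro-F1 differently: scikit-learn's `f1_score(average="macro")`, for instance, averages the per-category F1 scores rather than combining macro-P and macro-R, so the two values can differ.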

Unlike the macro metrics above, the micro versions of precision and recall first average the corresponding entries TP, FP, TN and FN of the per-category confusion matrices, then obtain micro-P and micro-R from the formulas for P and R, and finally obtain micro-F1 from micro-P and micro-R.

8. Micro-P

$$
microP = \frac{\overline{TP}}{\overline{TP} + \overline{FP}}
$$

9. Micro-R

$$
microR = \frac{\overline{TP}}{\overline{TP} + \overline{FN}}
$$

10. Micro-F1

$$
microF1 = \frac{2 \times microP \times microR}{microP + microR}
$$
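A minimal sketch of the micro metrics, again with made-up labels: the per-category TP, FP and FN counts are averaged first, and the averages are then plugged into the earlier formulas P = TP / (TP + FP) and R = TP / (TP + FN). The function name `micro_scores` is hypothetical.

```python
def micro_scores(y_true, y_pred, classes):
    """Micro-P, micro-R and micro-F1 from averaged one-vs-rest counts."""
    tps, fps, fns = [], [], []
    for c in classes:
        tps.append(sum(1 for t, p in zip(y_true, y_pred) if t == c and p == c))
        fps.append(sum(1 for t, p in zip(y_true, y_pred) if t != c and p == c))
        fns.append(sum(1 for t, p in zip(y_true, y_pred) if t == c and p != c))

    n = len(classes)
    mean_tp, mean_fp, mean_fn = sum(tps) / n, sum(fps) / n, sum(fns) / n

    micro_p = mean_tp / (mean_tp + mean_fp) if mean_tp + mean_fp else 0.0
    micro_r = mean_tp / (mean_tp + mean_fn) if mean_tp + mean_fn else 0.0
    denom = micro_p + micro_r
    micro_f1 = 2 * micro_p * micro_r / denom if denom else 0.0
    return micro_p, micro_r, micro_f1


# Made-up labels for eight heartbeat signals in four categories
y_true = [0, 0, 1, 1, 2, 3, 3, 3]
y_pred = [0, 1, 1, 1, 2, 3, 0, 3]

micro_p, micro_r, micro_f1 = micro_scores(y_true, y_pred, classes=[0, 1, 2, 3])
print(micro_p, micro_r, micro_f1)  # 0.75 0.75 0.75
```

When every sample has exactly one true category and one predicted category, as in this competition, each error counts once as a false positive and once as a false negative, so micro-P, micro-R and micro-F1 all equal the overall accuracy (6 of 8 correct here).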

1.2.4 competition rules

  • After successful registration, contestants download the data, debug their algorithms locally, and may submit results three times a day;
  • Submissions are evaluated in real time; the ranking is updated daily at 12:00 and 20:00 and is ordered from high to low by the score of the evaluation metric; the ranking list shows each contestant's best result to date;

1.2.5 analysis of competition questions

  • This is a traditional data mining problem: the data is modelled with data science, machine learning and deep learning methods to obtain the results.
  • This is a typical multi-class classification problem with four different categories of heartbeat signals.
  • This tutorial mainly uses XGB, LGB, CatBoost, pandas, NumPy, Matplotlib, seaborn, sklearn, Keras and other data mining libraries or frameworks for the data mining tasks.