Author: ShiChao Wu
Liver disease may be asymptomatic in its early stages, or its symptoms may be vague, so it is not easily detected. The symptoms are closely related to the type and severity of the disease, which is generally diagnosed through liver function tests. Common liver function tests cover three main categories of indicators: serum enzymes, bilirubin and serum proteins. The serum enzyme indicators mainly include alanine aminotransferase, aspartate aminotransferase and alkaline phosphatase. When liver cells are destroyed, these enzymes are released into the blood in large quantities, causing the indicators to rise. The bilirubin indicators include total bilirubin, direct bilirubin and indirect bilirubin, and reflect the metabolism of bilirubin. The serum protein indicators, including albumin, globulin and total protein, reflect the synthetic function of the liver and can be used to detect chronic liver injury and assess immune status. Early diagnosis can improve the survival rate of patients with liver disease, so diagnosing liver disease from the levels of enzymes, bilirubin and serum proteins in the blood is a very important method.
Data source and preparation
The experimental data set, the Indian Liver Patient Dataset (ILPD), comes from the UCI Machine Learning Repository of the University of California, Irvine, USA. ILPD was collected by three Indian professors in the northeastern part of Andhra Pradesh, India. The dataset contains 416 records of patients with liver disease and 167 records of patients without liver disease; 441 records are from male patients and 142 from female patients. Patients older than 89 are all recorded as age 90.
Descriptive Statistical Analysis
The patients' condition is described below using their physiological indicators and medical test indicators (in the following figures, 1 represents disease and 2 represents no disease):
Figure 1 Distribution of age and total protein
It can be seen from Figure 1 that the median age of people with liver disease is higher than that of people without liver disease. It may be that older people are more likely to suffer from liver disease because of the pressure of life and work. The median level of total protein in the blood of people with liver disease is not significantly different from that of people without liver disease, so total protein probably carries little weight in determining whether someone has liver disease.
Figure 2 Distribution of albumin and globulin ratios
It can be seen from Figure 2 that the median level of albumin in the blood of people with liver disease is significantly lower than that of people without liver disease, which suggests that albumin is more strongly associated with liver disease. The median ratio of albumin to globulin in the blood is likewise significantly lower in people with liver disease than in people without it. The albumin and globulin indicators may therefore be more important for determining whether someone has liver disease.
Figure 3 Distribution of prevalence and gender
It can be seen from Figure 3 that the number of males in the diseased population is about three times the number of females. This differs somewhat from the real-world distribution of liver disease, possibly because fewer records were collected for women. Among men, the ratio of those without liver disease to those with it is about 3:7, and among women it is about 4:6, so gender may have some influence on the disease.
Figure 4 Distribution of medical indicators
From Figure 4, five features in the diseased population, namely total bilirubin (TBIL), direct bilirubin (DBIL), alkaline phosphatase (ALP), alanine aminotransferase (ALT) and aspartate aminotransferase (AST), show a markedly right-skewed distribution. This may be because these medical indicators are higher in people with liver disease than in the general population.
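As a rough illustration of what "right-skewed" means numerically, the sketch below computes a moment-based sample skewness in base R. The data are synthetic (a log-normal stand-in for an enzyme-like indicator), and the names `skewness` and `alt_like` are hypothetical helpers, not part of the original analysis. The log transform is shown only as a common way to check how much of the skew comes from a long upper tail.

```r
# Moment-based sample skewness: positive values indicate a right-skewed distribution.
skewness <- function(x) {
  x <- x[!is.na(x)]
  m <- mean(x)
  mean((x - m)^3) / mean((x - m)^2)^1.5
}

set.seed(1)
# Synthetic stand-in for an enzyme-like indicator (log-normal, so right-skewed by construction).
alt_like <- rlnorm(500, meanlog = 3.5, sdlog = 0.8)

skew_raw <- skewness(alt_like)       # clearly positive for the raw values
skew_log <- skewness(log(alt_like))  # close to zero after a log transform
```

A value near zero after the log transform indicates that the asymmetry is driven mainly by a small number of very high readings.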
Some experimental data
R language modeling
The binomial logistic regression model is a binary classification model based on the logistic distribution and is a supervised machine learning method. The basic idea is to compare conditional probabilities: samples with a predicted probability greater than 0.5 are assigned to the positive class, and samples with a predicted probability less than 0.5 are assigned to the negative class.
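The 0.5 decision rule can be sketched in base R with `glm()`. This is a minimal sketch on synthetic data, not the article's own code: the column names `age`, `alb` and `disease` and the simulated coefficients are assumptions chosen only for illustration.

```r
set.seed(42)
n <- 400
# Synthetic data; column names are illustrative, not taken from the article's code.
dat <- data.frame(
  age = round(runif(n, 20, 80)),
  alb = rnorm(n, mean = 3.2, sd = 0.8)
)
# Simulated outcome: risk rises with age and falls with albumin (an assumption).
eta <- -1 + 0.04 * dat$age - 0.6 * dat$alb
dat$disease <- rbinom(n, size = 1, prob = plogis(eta))

# Binomial logistic regression: models P(disease = 1 | x) through the logit link.
fit <- glm(disease ~ age + alb, data = dat, family = binomial)

# Conditional probabilities, followed by the 0.5 decision rule.
p_hat <- predict(fit, newdata = dat, type = "response")
pred <- ifelse(p_hat > 0.5, 1L, 0L)

accuracy <- mean(pred == dat$disease)
```

On real data the threshold does not have to be 0.5; it is simply the default rule described in the text.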
A random forest builds a forest in a random way. The forest consists of many decision trees, and the trees are built independently of one another. Once the forest has been built, each new input sample is passed to every decision tree, and each tree judges which class the sample should belong to (for a classification task). The class chosen by the most trees becomes the prediction for that sample.
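The voting step can be sketched with the `randomForest` package, assuming it is installed. The data, the feature names `x1` to `x3` and the class labels are synthetic assumptions for illustration, not the article's own model.

```r
library(randomForest)  # assumes the randomForest package is installed

set.seed(7)
n <- 300
# Synthetic features; only x1 and x2 influence the class.
dat <- data.frame(x1 = rnorm(n), x2 = rnorm(n), x3 = rnorm(n))
dat$y <- factor(ifelse(dat$x1 + dat$x2 + rnorm(n, sd = 0.5) > 0, "disease", "healthy"))

# An odd number of trees avoids tied votes between the two classes.
rf <- randomForest(y ~ ., data = dat, ntree = 101)

new_obs <- data.frame(x1 = c(2, -2), x2 = c(1.5, -1.5), x3 = c(0, 0))

# Fraction of trees voting for each class, one row per new sample.
votes <- predict(rf, newdata = new_obs, type = "prob")
# Predicted class: the one that collects the most votes.
pred <- predict(rf, newdata = new_obs, type = "response")
```

Comparing `votes` with `pred` shows that the predicted class is the one with the largest share of tree votes.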
A decision tree is a supervised machine learning method that can be used for classification and regression. The model has a tree structure and classifies instances by testing their features one at a time. A decision tree consists of nodes and directed edges. Nodes are divided into internal nodes and leaf nodes: internal nodes represent features, and leaf nodes represent classes.
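The distinction between internal nodes and leaf nodes can be seen in a tree fitted with `rpart`, a recommended package that ships with standard R distributions. The sketch below uses synthetic data; the column names `tbil` and `alb` and the labelling rule are assumptions for illustration only.

```r
library(rpart)  # recommended package bundled with standard R distributions

set.seed(3)
n <- 300
dat <- data.frame(tbil = rlnorm(n, 0, 0.7), alb = rnorm(n, 3.2, 0.8))
# Synthetic labelling rule, an assumption made purely for illustration.
dat$status <- factor(ifelse(dat$tbil > 1.2 | dat$alb < 2.5, "disease", "healthy"))

# Classification tree: each internal node tests one feature.
tree <- rpart(status ~ tbil + alb, data = dat, method = "class")

# tree$frame has one row per node; leaves are marked "<leaf>",
# internal nodes carry the name of the feature they split on.
is_leaf <- tree$frame$var == "<leaf>"
split_features <- unique(as.character(tree$frame$var[!is_leaf]))

pred <- predict(tree, newdata = dat, type = "class")
accuracy <- mean(pred == dat$status)
```

Printing `tree` lists the split rule at each internal node and the predicted class at each leaf.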
Support Vector Regression (SVR)
Support vector machines (SVM) were introduced by Vapnik in 1979. In 1995, Vapnik proposed using support vector machines for regression and classification. A support vector machine is a supervised machine learning algorithm whose purpose is to find an optimal separating hyperplane that divides the data into different classes.
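A separating hyperplane can be sketched with `svm()` from the `e1071` package, assuming that package is installed. The two synthetic clusters and the names `x1`, `x2` and `y` are assumptions; the linear kernel is chosen here only because it makes the hyperplane idea most direct.

```r
library(e1071)  # assumes the e1071 package is installed

set.seed(11)
n <- 200
# Two well-separated synthetic clusters, one per class.
dat <- data.frame(
  x1 = c(rnorm(n / 2, mean = -2), rnorm(n / 2, mean = 2)),
  x2 = c(rnorm(n / 2, mean = -2), rnorm(n / 2, mean = 2)),
  y  = factor(rep(c("healthy", "disease"), each = n / 2))
)

# A linear kernel looks for a separating hyperplane (here, a line in the x1-x2 plane).
fit <- svm(y ~ x1 + x2, data = dat, kernel = "linear", cost = 1)

pred <- predict(fit, newdata = dat)
accuracy <- mean(pred == dat$y)
n_sv <- nrow(fit$SV)  # support vectors: the training points that determine the margin
```

Only the support vectors influence the fitted boundary, which is why `n_sv` is usually much smaller than the number of training samples when the classes separate cleanly.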
It can be seen from the model results that the likelihood ratio of the full model is 0.4928 and that many indicators are not significant. Subset selection using AIC and BIC is therefore considered, so that the resulting model is more accurate and more convincing.
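Subset selection by AIC and BIC can be sketched in base R with `step()`. This is a sketch on synthetic data, not the article's fitted model: the predictors `x1`, `x2`, `noise1` and `noise2` are hypothetical, and stepwise search is only one of several possible selection strategies.

```r
set.seed(5)
n <- 400
# Two informative predictors and two pure-noise predictors (synthetic).
dat <- data.frame(x1 = rnorm(n), x2 = rnorm(n), noise1 = rnorm(n), noise2 = rnorm(n))
dat$y <- rbinom(n, 1, plogis(-0.3 + 1.2 * dat$x1 - 1.0 * dat$x2))

# Full logistic model with every candidate predictor.
full <- glm(y ~ ., data = dat, family = binomial)

# AIC-based stepwise selection (penalty k = 2 per parameter, the default).
fit_aic <- step(full, direction = "both", trace = 0)

# BIC-based stepwise selection: same search with the heavier penalty k = log(n).
fit_bic <- step(full, direction = "both", k = log(n), trace = 0)
```

Because BIC penalizes each extra parameter more heavily than AIC when n is large, `fit_bic` tends to keep the same or fewer predictors than `fit_aic`.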
About the author
ShiChao Wu is a researcher at Tuoduan Research Laboratory (TRL).
Trained at the master's level in the statistics department of a Project 211 university, he fully understands the importance of data analysis in modern production, operations and maintenance. In the era of big data, the technical backbone of high-tech enterprises is getting younger, and the role of data analysts is becoming more and more important.