Tuoduan tecdat|Machine learning-based diagnosis and analysis of liver disease in India

Time:2022-8-8

Original link:http://tecdat.cn/?p=23534

Author: ShiChao Wu

Project Challenge

Liver disease may be asymptomatic in its early stages, hard to detect, or accompanied only by vague symptoms. The symptoms are closely related to the type and severity of the disease, which is generally diagnosed through liver function tests. Common liver function tests cover three main categories of indicators: serum enzymes, bilirubin, and serum proteins. The serum enzyme indicators mainly include alanine aminotransferase, aspartate aminotransferase, and alkaline phosphatase. When liver cells are destroyed, these enzymes are released into the blood in large quantities, causing the indicators to rise. The bilirubin indicators include total bilirubin, direct bilirubin, and indirect bilirubin, which reflect the metabolism of bilirubin. The serum protein indicators, including albumin, globulin, and total protein, reflect the synthetic function of the liver and can be used to detect chronic liver injury, immune status, and so on. Early diagnosis can improve the survival rate of patients with liver disease, so diagnosing liver disease from the blood levels of enzymes, bilirubin, and serum proteins is a very important method.

Solution

Data Source Preparation

The experimental dataset, the Indian Liver Patient Dataset (ILPD), comes from the UCI Machine Learning Repository, hosted by the University of California, Irvine, USA. ILPD was collected by three Indian professors from the northeastern part of Andhra Pradesh, India. The dataset contains 416 records of patients with liver disease and 167 records of patients without liver disease, including 441 male patient records and 142 female patient records. Patients older than 89 are all listed as age 90.
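To make the composition of the dataset concrete, here is a minimal Python sketch (the article's own modeling was done in R, and this is not the author's code). It only uses the counts quoted above and does not read the actual ILPD file. The helper names `share` and `top_code_age` are made up for illustration, and the age rule assumes that any age above 89 is recorded as 90.

```python
# Counts as reported in the text above; nothing here is read from the ILPD file.
N_DISEASE, N_NO_DISEASE = 416, 167
N_MALE, N_FEMALE = 441, 142


def share(part, total):
    """Return part/total as a percentage rounded to one decimal place."""
    return round(100.0 * part / total, 1)


def top_code_age(age, cap=90, threshold=89):
    """Assumed age rule: any age above `threshold` is recorded as `cap`."""
    return cap if age > threshold else age


total = N_DISEASE + N_NO_DISEASE
# Both breakdowns (by diagnosis and by gender) should cover the same records.
assert total == N_MALE + N_FEMALE

disease_share = share(N_DISEASE, total)
male_share = share(N_MALE, total)
print(total, disease_share, male_share)
print(top_code_age(95), top_code_age(45))
```

The classes are clearly imbalanced (roughly seven diseased records for every three non-diseased ones), which is worth keeping in mind when reading accuracy figures for any classifier trained on this data.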

Descriptive Statistical Analysis

The patients' condition is described below based on their physiological indicators and medical test indicators (in the following figures, 1 represents disease and 2 represents no disease):

Figure 1 Distribution of age and total protein

It can be seen from Figure 1 that the median age of people with liver disease is higher than that of people without liver disease, possibly because older people are more likely to suffer from liver disease under the pressure of life and work. The median level of total protein in the blood of people with liver disease is not significantly different from that of people without liver disease, which suggests that total protein carries little weight in determining whether someone has liver disease.

Figure 2 Distribution of albumin and globulin ratios

It can be seen from Figure 2 that the median level of albumin in the blood of people with liver disease is significantly lower than that of people without liver disease, which suggests that liver disease has a considerable effect on albumin. The median albumin-to-globulin ratio is also significantly lower in people with liver disease, so the albumin and globulin indicators may be more important for determining whether someone has liver disease.

Figure 3 Distribution of prevalence and gender

It can be seen from Figure 3 that the number of males in the sick population is about three times the number of females, which is slightly different from the real-world distribution of people with liver disease, possibly because fewer records were collected for women. The ratio of men without liver disease to men with liver disease was about 3:7, and the corresponding ratio for women was about 4:6, so gender may have a certain influence on the disease.

Figure 4 Distribution of medical indicators

Figure 4 shows that, in the diseased population, five features have a significantly right-skewed distribution: total bilirubin (TBIL), direct bilirubin (DBIL), alkaline phosphatase (ALP), alanine aminotransferase (ALT), and aspartate aminotransferase (AST). This may be because these medical indicators are higher in people with liver disease than in the general population.
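As a rough illustration of what "right-skewed" means numerically, the Python sketch below computes a moment-based skewness coefficient and compares the mean with the median. The readings in `alt_values` are invented for the example and are not taken from ILPD; the function name `skewness` is likewise just a label chosen here.

```python
import statistics


def skewness(values):
    """Moment-based (population) skewness; a positive value means a longer right tail."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n  # second central moment
    m3 = sum((v - mean) ** 3 for v in values) / n  # third central moment
    return m3 / m2 ** 1.5


# Hypothetical enzyme readings (made up): mostly moderate values, a few very high ones.
alt_values = [18, 22, 25, 30, 35, 40, 48, 60, 95, 250, 780]

raw_skew = skewness(alt_values)
mean_value = statistics.mean(alt_values)
median_value = statistics.median(alt_values)

# In a right-skewed sample the few large values pull the mean above the median.
print(round(raw_skew, 2), mean_value, median_value)
```

A handful of very high readings is enough to produce a strongly positive coefficient, which matches the pattern described for Figure 4.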

Some experimental data

R language modeling

Logistic regression

The binomial logistic regression model is a binary classification model based on the logistic distribution and is a supervised machine learning method. The basic idea is to compare the conditional probabilities P(Y = 1 | x) and P(Y = 0 | x): an instance whose probability of belonging to the positive class is greater than 0.5 is assigned to the positive class, and an instance whose probability is less than 0.5 is assigned to the negative class.
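The decision rule described here can be sketched in a few lines of Python (an illustration only, not the R model fitted in the article). The coefficients in `weights` and `bias` are invented, and the classes are coded 1 (positive) and 0 (negative) for convenience, unlike the 1/2 coding used in the figures.

```python
import math


def sigmoid(z):
    """Logistic function: maps any real number to a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))


def predict_proba(weights, bias, x):
    """Estimated probability of the positive class for feature vector x."""
    z = bias + sum(w * xi for w, xi in zip(weights, x))
    return sigmoid(z)


def classify(p, threshold=0.5):
    """Assign the positive class (1) when the probability exceeds the threshold."""
    return 1 if p > threshold else 0


# Hypothetical coefficients for two features (not fitted to any data).
weights = [0.8, -1.2]
bias = 0.1

p_a = predict_proba(weights, bias, [2.0, 0.5])  # linear score z = 1.1
p_b = predict_proba(weights, bias, [0.5, 1.5])  # linear score z = -1.3
print(classify(p_a), classify(p_b))
```

Because the two conditional probabilities sum to 1, comparing them is the same as checking whether the positive-class probability is above 0.5, which in turn is the same as checking whether the linear score is positive.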

Random forest

A random forest builds a forest in a randomized way. The forest consists of many decision trees, and the trees are built independently of one another. Once the forest is built, each decision tree judges every new input sample and votes for the class it thinks the sample belongs to (in the classification setting); the class with the most votes becomes the prediction for that sample.
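The voting step can be sketched as follows in Python. This covers only the final majority vote, not how the trees are grown; the three "trees" are hand-written one-rule stand-ins with invented thresholds (they are not clinical cut-offs and not the article's fitted model).

```python
from collections import Counter


def majority_vote(votes):
    """Return the class label that received the most votes."""
    return Counter(votes).most_common(1)[0][0]


# Three hypothetical stand-ins for trained trees; thresholds are made up.
def tree_a(sample):
    return "disease" if sample["tbil"] > 1.2 else "no disease"


def tree_b(sample):
    return "disease" if sample["albumin"] < 3.0 else "no disease"


def tree_c(sample):
    return "disease" if sample["age"] > 65 else "no disease"


forest = [tree_a, tree_b, tree_c]
sample = {"age": 60, "albumin": 2.5, "tbil": 3.0}

votes = [tree(sample) for tree in forest]
prediction = majority_vote(votes)
print(votes, prediction)
```

Using an odd number of trees avoids ties in a two-class problem; with an even number, a tie-breaking rule would be needed.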

Decision tree

A decision tree is a supervised machine learning method that can be used for classification and regression. The model has a tree structure and classifies an instance by testing its features one after another. A classification decision tree consists of nodes and directed edges. Nodes are divided into internal nodes and leaf nodes: internal nodes represent features, and leaf nodes represent classes.
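The node structure described above can be sketched in Python like this. The class names `Leaf` and `Internal`, the feature names, and the thresholds are all chosen for illustration; the tree is written by hand and is not learned from data.

```python
from dataclasses import dataclass
from typing import Union


@dataclass
class Leaf:
    label: str  # a leaf node stores a class


@dataclass
class Internal:
    feature: str      # an internal node tests one feature
    threshold: float
    left: "Node"      # followed when sample[feature] <= threshold
    right: "Node"     # followed when sample[feature] > threshold


Node = Union[Leaf, Internal]


def classify(node, sample):
    """Follow the directed edges from the root until a leaf is reached."""
    while isinstance(node, Internal):
        if sample[node.feature] <= node.threshold:
            node = node.left
        else:
            node = node.right
    return node.label


# A hand-written toy tree with invented thresholds.
root = Internal(
    feature="tbil",
    threshold=1.2,
    left=Internal("albumin", 3.0, Leaf("disease"), Leaf("no disease")),
    right=Leaf("disease"),
)

result_low = classify(root, {"tbil": 0.8, "albumin": 3.8})
result_high = classify(root, {"tbil": 2.5, "albumin": 3.8})
print(result_low, result_high)
```

Each path from the root to a leaf reads as one if-then rule, which is why decision trees are often considered easy to interpret.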

Support Vector Regression (SVR)

Support vector machines (SVM) were proposed by Vapnik in 1979. In 1995, Vapnik proposed using support vector machines for regression and classification. A support vector machine is a supervised machine learning algorithm whose purpose is to find an optimal hyperplane that divides the data into different classes.
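Once a separating hyperplane w·x + b = 0 is known, classifying a point only requires checking which side of the hyperplane it falls on. The Python sketch below shows that step for a linear SVM; the values of `w` and `b` are invented, and finding the optimal hyperplane (the actual training step) is not shown.

```python
import math


def decision_value(w, b, x):
    """Signed score w.x + b; its sign tells which side of the hyperplane x lies on."""
    return sum(wi * xi for wi, xi in zip(w, x)) + b


def predict(w, b, x):
    """Return +1 or -1 depending on the side of the hyperplane."""
    return 1 if decision_value(w, b, x) >= 0 else -1


def distance_to_hyperplane(w, b, x):
    """Geometric distance from x to the hyperplane: |w.x + b| / ||w||."""
    norm = math.sqrt(sum(wi * wi for wi in w))
    return abs(decision_value(w, b, x)) / norm


# Hypothetical hyperplane in two dimensions (not fitted to any data).
w, b = [3.0, 4.0], -5.0

point_pos = [3.0, 1.0]  # score: 9 + 4 - 5 = 8
point_neg = [0.0, 0.5]  # score: 0 + 2 - 5 = -3
print(predict(w, b, point_pos), predict(w, b, point_neg))
print(distance_to_hyperplane(w, b, point_pos))
```

The "optimal" hyperplane is the one that maximizes the smallest such distance over the training points, that is, the margin.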

Project results

It can be seen from the model results that the likelihood ratio of the full model is 0.4928 and that many indicators are not significant. We therefore consider subset selection using AIC and BIC, so that the resulting model is more accurate and more convincing.
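For reference, both criteria trade goodness of fit against model size: AIC = 2k - 2 ln L and BIC = k ln n - 2 ln L, where k is the number of estimated parameters, n the number of observations, and L the maximized likelihood. The Python sketch below compares two candidate models under these formulas. The log-likelihoods and parameter counts are invented for illustration and are not the article's results; only n = 583 comes from the dataset description.

```python
import math


def aic(log_likelihood, k):
    """Akaike information criterion: 2k - 2 ln L (lower is better)."""
    return 2 * k - 2 * log_likelihood


def bic(log_likelihood, k, n):
    """Bayesian information criterion: k ln n - 2 ln L (lower is better)."""
    return k * math.log(n) - 2 * log_likelihood


def select(candidates, criterion):
    """Return the name of the candidate with the smallest criterion value."""
    return min(candidates, key=lambda name: criterion(candidates[name]))


n = 583  # number of records in ILPD, as stated in the article

# Hypothetical fits: name -> (log-likelihood, number of parameters).
candidates = {
    "full": (-290.0, 11),
    "reduced": (-292.5, 6),
}

best_by_aic = select(candidates, lambda fit: aic(fit[0], fit[1]))
best_by_bic = select(candidates, lambda fit: bic(fit[0], fit[1], n))
print(best_by_aic, best_by_bic)
```

In this made-up case the reduced model fits slightly worse but uses far fewer parameters, so both criteria prefer it. BIC penalizes each extra parameter by ln(583), about 6.4, compared with 2 for AIC, so it tends to choose smaller models.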

About the author

ShiChao Wu is a researcher at Tuoduan Research Laboratory (TRL).

Holding a master's degree from the statistics department of a Project 211 university, he fully understands the importance of data analysis in modern production, operation, and maintenance. In the era of big data, the technical backbone of high-tech enterprises is getting younger and younger, and the status of data analysts is becoming more and more important.