R language nonparametric method: using kernel method and k-NN (k-nearest neighbor algorithm) to classify and predict heart disease data

Time: 2021-12-2

Original link:  http://tecdat.cn/?p=22181

This article considers classification prediction based on kernel methods. Note that we do not use standard logistic regression here, since it is a parametric model.

Nonparametric method

There are three main nonparametric methods for function estimation: the kernel method, the local polynomial method and the spline method.
The advantage of nonparametric function estimation is its robustness: no specific form is assumed for the model, only that the function is smooth, which avoids the risks that come with model selection. On the other hand, complicated expressions, results that are hard to interpret and a heavy computational load are real drawbacks of nonparametric methods. Using them therefore carries risks of its own, and the choice should be made with care.
The idea behind nonparametric estimation is very simple: the function is most likely to be close to its observed values at the observed points, so the value of f(x) is estimated by a weighted average of the observations near x.

Kernel method

When the weights are given by a kernel function, the method is called the kernel method. Common choices are the Nadaraya-Watson and Gasser-Müller kernel estimators, i.e. the NW and GM kernel estimators mentioned in many textbooks. We will not discuss the choice of kernel here; by default, all kernel estimates are computed with a Gaussian kernel.
The NW kernel estimator has the form

\hat{m}_{NW}(x) = \frac{\sum_{i=1}^n K_h(x - x_i)\, y_i}{\sum_{j=1}^n K_h(x - x_j)}

The GM kernel estimator has the form

\hat{m}_{GM}(x) = \sum_{i=1}^n y_i \int_{s_{i-1}}^{s_i} K_h(x - u)\, du

where s_i = (x_{(i)} + x_{(i+1)})/2 (with s_0 = -\infty and s_n = +\infty), and K_h(u) = K(u/h)/h for a kernel K and bandwidth h.
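As a quick illustration (a sketch of my own, not code from the article; the toy data and function names are mine), both estimators fit in a few lines of R, with the GM interval weights obtained from the Gaussian CDF:

```r
set.seed(1)
x = sort(runif(50, 0, 10)); y = sin(x) + rnorm(50, sd = .2); h = .8

nw = function(x0){                    # Nadaraya-Watson: pointwise kernel weights
  w = dnorm((x0 - x)/h)
  sum(w*y)/sum(w)}
gm = function(x0){                    # Gasser-Mueller: weights integrate K_h over intervals
  s = c(-Inf, (x[-1] + x[-length(x)])/2, Inf)
  w = pnorm((x0 - s[-length(s)])/h) - pnorm((x0 - s[-1])/h)
  sum(w*y)}                           # these weights sum exactly to 1
c(nw(5), gm(5))                       # both are local averages of y near x0 = 5
```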

Data

We use heart-disease data to predict myocardial infarction for emergency-room patients. The variables are:

Cardiac index
Stroke volume index
Diastolic pressure
Pulmonary artery pressure
Ventricular pressure
Pulmonary resistance
Survival (alive or not)
Now that we know what the kernel estimate is, assume that K is the density of the N(0,1) distribution. At a point x, with bandwidth h, we get the following code:

mean_x = function(x, bw){
  w = dnorm((stroke_index - x)/bw, mean = 0, sd = 1)  # Gaussian kernel weights
  weighted.mean(survival, w)}
u = seq(min(stroke_index), max(stroke_index), length = 251)
v = Vectorize(function(x) mean_x(x, h))(u)
plot(u, v, ylim = 0:1)


Of course, we can change the bandwidth.


v2 = Vectorize(function(x) mean_x(x, 2))(u)

We observe that the smaller the bandwidth, the larger the variance and the smaller the bias. "Larger variance" here means more variability (the smaller the neighbourhood, the fewer points enter the average and the less stable the estimate); "smaller bias" because the expected value should be computed at the point x itself, so the smaller the neighbourhood, the better.
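This trade-off is easy to see on simulated data (a sketch of my own, with a known regression function so that bias and variance can actually be measured):

```r
# Nadaraya-Watson estimate of m(5) = sin(5) with a small and a large bandwidth;
# over replications, the small bandwidth gives more variance but less bias.
set.seed(123)
nw_at = function(x, y, x0, h){ w = dnorm((x - x0)/h); sum(w*y)/sum(w) }
est = replicate(200, {
  x = runif(100, 0, 10); y = sin(x) + rnorm(100, sd = .5)
  c(small = nw_at(x, y, 5, .2), large = nw_at(x, y, 5, 2))
})
apply(est, 1, var)            # larger for "small"
abs(rowMeans(est) - sin(5))   # smaller for "small"
```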

Using a smoothing function

We can use the R function ksmooth to compute the kernel regression.

ksmooth(stroke_index, survival, kernel = "normal", bandwidth = 2*exp(1))

We can reproduce the previous estimate. However, the output is not a function but a pair of vectors. Moreover, as we can see, the bandwidth is not exactly the same as the one we used before.


equiv = function(bk){
  # find the bandwidth h for which our kernel estimate matches ksmooth's
  f = function(h) sum((Vectorize(function(x) mean_x(x, h))(u) -
    ksmooth(stroke_index, survival, "normal", bandwidth = bk, x.points = u)$y)^2)
  optim(bk, f)$par}
x = seq(1, 10, by = .1)
y = Vectorize(equiv)(x)
plot(x, y)
abline(0, exp(-1), col = "red")


The slope is about 0.37, which is essentially e^{-1} ≈ 0.368. (In fact, ksmooth scales its bandwidth so that the kernel quartiles sit at ±0.25·bandwidth, which makes the equivalent Gaussian standard deviation 0.25/qnorm(0.75) ≈ 0.3706 times the bandwidth.)
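This scaling factor can be checked directly from the definition of the "normal" kernel in ?ksmooth (a quick check I added; it is not in the original post):

```r
# ksmooth's "normal" kernel is scaled so that its quartiles sit at
# +/- 0.25*bandwidth, so the equivalent Gaussian sd is 0.25/qnorm(0.75).
factor = 0.25/qnorm(0.75)
round(factor, 4)        # 0.3706
round(exp(-1), 4)       # 0.3679, which is why the slope looks like e^-1
```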

High dimensional application

Now consider our bivariate data set, and take as kernel the product of two univariate (Gaussian) kernels:

p = function(x, y){
  w = dnorm((df$x1 - x)/bw1, mean = 0, sd = 1)*
      dnorm((df$x2 - y)/bw2, mean = 0, sd = 1)
  weighted.mean(df$y == "1", w)}
v = outer(u, u, Vectorize(p))
contour(u, u, v, levels = .5, add = TRUE)


We get the following predictions


Here, the colours represent the predicted probabilities.
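To make the idea reproducible, here is a self-contained sketch on simulated data (two Gaussian clusters); df, bw1 and bw2 are my own names, not the article's heart-disease data:

```r
# Product-kernel classifier on simulated bivariate data (illustrative only)
set.seed(42)
df = data.frame(x1 = c(rnorm(50, 0), rnorm(50, 3)),
                x2 = c(rnorm(50, 0), rnorm(50, 3)),
                y  = rep(c("0", "1"), each = 50))
bw1 = bw2 = 1
p = function(x, y){
  w = dnorm((df$x1 - x)/bw1)*dnorm((df$x2 - y)/bw2)
  weighted.mean(df$y == "1", w)}
c(p(0, 0), p(3, 3))   # low near the "0" cluster, high near the "1" cluster
```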

K-NN (k-nearest neighbor algorithm)

Another approach is to define the neighbourhood of a point not by a distance threshold, but by the k closest among our n observations (this is the k-nearest-neighbour algorithm).

Next, we write our own function to implement k-NN (k-nearest neighbor algorithm):

The difficulty is that we need a suitable distance.


If the components are measured in very different units, the Euclidean distance makes no sense, so we consider the Mahalanobis distance instead:

mahalanobis = function(x, y, Sinv){ d = as.numeric(x - y); d %*% Sinv %*% d }  # squared distance
mahalanobis(my[i, 1:7], my[j, 1:7], solve(var(my[, 1:7])))
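The point of the Mahalanobis distance is that it is unaffected by a change of units in one of the components. A quick self-contained check (my own toy data, not the heart data):

```r
# Rescaling one coordinate changes the Euclidean distance but leaves the
# Mahalanobis distance (computed with the rescaled covariance) unchanged.
set.seed(7)
X  = cbind(rnorm(200), rnorm(200, sd = 5))
d2 = function(x, y, Sinv) as.numeric(t(x - y) %*% Sinv %*% (x - y))
X2 = X; X2[, 2] = X2[, 2]/1000         # same data, second unit rescaled
m1 = d2(X[1, ], X[2, ], solve(var(X)))
m2 = d2(X2[1, ], X2[2, ], solve(var(X2)))
c(m1, m2)                              # identical (up to rounding)
```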


Here we have a function that finds the k nearest neighbours of a given observation. Two things can then be done to obtain a prediction. Our goal is to predict a class, so we can use a majority rule: the prediction for y_i is the majority class among its k nearest neighbours.

for(i in 1:length(y)) y[i] = sort(survival[k_closest(i, k)])[(k + 1)/2]
# for a 0/1 outcome and odd k, the middle order statistic is the majority vote


We can also compute the proportion of black points among the nearest neighbours; it can be interpreted as the probability of being black,

for(i in 1:length(y)) y[i] = mean(survival[k_closest(i, k)])  # proportion among the k neighbours


We can compare, on the data set, the observations, the majority-rule prediction, and the proportion of deaths among the seven nearest neighbours:

data.frame(OBSERVED = survival, MAJORITY = k_ma(7), PROPORTION = k_mean(7))
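The majority rule itself fits in a few lines; here is a minimal self-contained sketch on toy one-dimensional data (my own example, with Euclidean distance since the toy covariate is univariate):

```r
# Toy k-NN majority vote: two well-separated one-dimensional groups
xs  = c(1, 1.2, 1.4, 5, 5.2, 5.4)
cls = c(0, 0, 0, 1, 1, 1)
knn_predict = function(x0, k = 3){
  idx = order(abs(xs - x0))[1:k]       # indices of the k nearest neighbours
  as.numeric(mean(cls[idx]) > .5)}     # majority vote among their labels
c(knn_predict(1.1), knn_predict(5.1))  # 0 then 1
```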

Here we obtained a prediction at the observation points, but in fact we can find the k nearest neighbours of any point x. Going back to our univariate example (to get a plot), we have

knn_x = function(x){
  w = rank(abs(stroke_index - x), ties.method = "random")
  mean(survival[which(w <= 9)])}   # average over the 9 nearest neighbours



It’s not very smooth, but we don’t have many points.
If we use this method on two-dimensional data sets, we will get the following results.

p_knn = function(x, y){
  k = 6
  Sinv = solve(var(df[, c("x1", "x2")]))
  d = function(j) mahalanobis(c(x, y), df[j, c("x1", "x2")], Sinv)
  vect = Vectorize(d)(1:nrow(df))
  idx = which(rank(vect, ties.method = "random") <= k)
  mean(df$y[idx] == "1")}
v = outer(u, u, Vectorize(p_knn))
contour(u, u, v, levels = .5, add = TRUE)
 



This is the idea of local inference: use a kernel to define a neighbourhood of x, or use the k nearest neighbours.

