KNNImputer: a reliable method for imputing missing values

Time:2020-10-26

By Kaushik
Compile | VK
Source | analytics vidhya

Summary

  • Learn to use KNNImputer to fill in missing values in data

  • Understand missing values and their types

Introduction

scikit-learn's KNNImputer is a widely used method for imputing missing values. It is widely regarded as an alternative to traditional imputation techniques.

In today's world, data are collected from many sources for analysis, insight generation, theory validation, and so on. Data collected from different sources are often missing some information. This can be due to a problem in the data collection or extraction process, which may be a human error.

Dealing with these missing values is therefore an important step in data preprocessing. The choice of imputation method matters because it can have a significant impact on the results of the work.

Most statistical and machine learning algorithms work only on complete observations of a dataset. Therefore, it is essential to handle missing information.

A body of literature in statistics deals with the sources of missing values and ways to overcome the problem. The best way is to impute these missing observations with estimated values.

In this article, we introduce a guide for filling missing values in a dataset using the observations of neighboring data points. For this purpose, we use scikit-learn's KNNImputer implementation.

Contents

  • Degrees of freedom

  • Patterns of missing values

  • The essence of the KNN algorithm

  • Distance calculation with missing values

  • KNNImputer imputation method

Degrees of freedom

For any dataset, missing values can be a hornet's nest. Variables with missing values can be a serious problem because there is no easy way to deal with them.

In general, if the proportion of missing observations is small relative to the total number of observations, we can simply delete those observations.

However, this is rarely the case. Deleting rows that contain missing values may discard useful information.

From a statistical point of view, the degrees of freedom decrease as the number of independent pieces of information decreases.
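As a small illustration of how much information row deletion can discard, the sketch below uses a made-up five-row array (the numbers are purely illustrative and not from any real dataset). Only three of the fifteen cells are missing, yet dropping incomplete rows removes three of the five observations.

```python
import numpy as np

# Hypothetical toy data: columns could be age, income, properties owned
data = np.array([
    [25.0, 50000.0, 1.0],
    [32.0, np.nan, 2.0],
    [np.nan, 61000.0, 0.0],
    [41.0, 72000.0, np.nan],
    [29.0, 58000.0, 1.0],
])

# A row is complete only if none of its cells is NaN
complete_rows = ~np.isnan(data).any(axis=1)
kept = data[complete_rows]

missing_cells = int(np.isnan(data).sum())
print(f"missing cells: {missing_cells} of {data.size}")
print(f"rows kept after deletion: {kept.shape[0]} of {data.shape[0]}")
```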

Patterns of missing values

For real datasets, missing values are a cause for concern. When collecting observations on a variable, values may be missing for a variety of reasons:

  • Errors in machinery or equipment

  • Mistakes by researchers

  • Respondents who cannot be contacted

  • Accidental deletion

  • Forgetfulness on the part of respondents

  • Accounting errors, etc.

The types of missing values are generally divided into:

Missing completely at random (MCAR)

This happens when the missingness has no hidden dependency on any other variable or on any characteristic of the observations. If a doctor forgets to record the age of every tenth patient entering the ICU, the presence of missing values does not depend on the characteristics of the patients.

Missing at random (MAR)

In this case, the probability that a value is missing depends on characteristics of the observed data. In survey data, high-income respondents are less likely to tell researchers the number of properties they own. Missingness in the number-of-properties variable therefore depends on the income variable.

Missing not at random (MNAR)

This happens when the missingness depends both on characteristics of the data and on the missing values themselves. In this case, it is difficult to determine the mechanism that generates the missing values. For example, missing values of a variable such as blood pressure may depend in part on the blood pressure value itself, because patients with hypotension are less likely to have their blood pressure checked regularly.
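The three mechanisms can be made concrete with a small simulation. The sketch below is illustrative only: the variable names, the distributions, and the missingness probabilities (10%, 40%, 5%) are assumptions chosen for the demonstration, not values from the article.

```python
import numpy as np

rng = np.random.default_rng(42)
n = 10_000

# Hypothetical survey variables
income = rng.normal(50_000, 15_000, size=n)
properties = rng.poisson(2, size=n).astype(float)
high_income = income > np.median(income)

# MCAR: every value has the same 10% chance of being missing
mcar_mask = rng.random(n) < 0.10

# MAR: missingness of `properties` depends on the observed `income`
mar_mask = rng.random(n) < np.where(high_income, 0.40, 0.05)

# MNAR: missingness of `properties` depends on its own (unobserved) value
mnar_mask = rng.random(n) < np.where(properties >= 3, 0.40, 0.05)

# Apply one of the masks to obtain a column with missing values
properties_mar = np.where(mar_mask, np.nan, properties)

print("MCAR missing rate:", mcar_mask.mean())
print("MAR missing rate, high vs. low income:",
      mar_mask[high_income].mean(), mar_mask[~high_income].mean())
print("MNAR missing rate, many vs. few properties:",
      mnar_mask[properties >= 3].mean(), mnar_mask[properties < 3].mean())
```

Under MAR the difference in missing rates can be detected from the observed income column, whereas under MNAR it depends on values that are themselves missing, which is why that mechanism is hard to diagnose.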

The essence of the KNN algorithm

Univariate imputation of missing values is a simple way of estimating values and may not always provide an accurate picture.

For example, suppose we have variables for vehicle density on the roads and pollutant levels in the air, and a few observations of pollutant levels are missing. Imputing the pollutant levels with the mean or median pollutant level may not be a suitable strategy.

In such cases, algorithms like k-nearest neighbors (KNN) can help impute the missing values.
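For contrast, the univariate baseline described above can be sketched with scikit-learn's SimpleImputer. The vehicle density and pollutant numbers below are made up for illustration. The missing pollutant level belongs to a high-traffic observation, but mean imputation fills it with the overall column mean and ignores vehicle density entirely.

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Columns: vehicle density, pollutant level (hypothetical values)
X = np.array([
    [100.0, 20.0],
    [120.0, 24.0],
    [400.0, np.nan],   # high traffic, pollutant level missing
    [420.0, 82.0],
])

# Univariate imputation: each column is filled with its own mean
mean_filled = SimpleImputer(strategy="mean").fit_transform(X)

print(mean_filled[2])  # the missing pollutant level becomes (20 + 24 + 82) / 3 = 42
```

The most similar observation (density 420) has a pollutant level of 82, so the value of 42 is likely an underestimate here. This is the kind of case where a neighbor-based method can do better.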

Sociologists and community researchers suggest that human beings live in communities because neighbors provide a sense of security, attachment to the community, and relationships that create a community identity through participation in various activities.

An imputation method that works on data in a similar way is k-nearest neighbors (KNN), which identifies neighboring points through a distance measure and estimates missing values using the complete values of neighboring observations.

Example

Suppose you have run out of essential food supplies and, because of the lockdown, none of the nearby shops are open. So you ask your neighbor for help, and you end up accepting whatever they offer you.

This is an example of imputation from 1-NN.

If instead you identify three neighbors to ask for help and choose to combine the items offered by the three nearest neighbors, this is an example of imputation from 3-NN.

Similarly, missing values in a dataset can be imputed from the observations of the k nearest neighbors in the dataset. Neighboring points are identified by a distance measure, usually the Euclidean distance.

In the accompanying figure illustrating how KNN works, the elliptical area marks the neighbors of the green square data point. We use a distance measure to identify the neighbors.

The idea of the KNN method is to identify k samples in the dataset that are similar or close in space. We then use these k samples to estimate the values of the missing data points. Each sample's missing values are imputed using the mean value of the k neighbors found in the dataset.
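The idea can be sketched from scratch in a few lines of NumPy. This is a simplified illustration, not scikit-learn's implementation: the helper name `knn_fill` is hypothetical, donors are assumed to be complete rows, and distances are computed only over the coordinates observed in the row being filled.

```python
import numpy as np

def knn_fill(row, donors, k):
    """Fill NaNs in `row` with the mean of the k closest complete donor rows."""
    observed = ~np.isnan(row)
    # Euclidean distance using only the coordinates observed in `row`
    dists = np.sqrt(((donors[:, observed] - row[observed]) ** 2).sum(axis=1))
    nearest = np.argsort(dists)[:k]
    filled = row.copy()
    # Average the neighbors' values for each missing coordinate
    filled[~observed] = donors[nearest][:, ~observed].mean(axis=0)
    return filled

# Hypothetical complete observations
donors = np.array([
    [1.0, 2.0, 10.0],
    [2.0, 3.0, 20.0],
    [9.0, 9.0, 90.0],
])
row = np.array([1.5, 2.5, np.nan])

filled_row = knn_fill(row, donors, k=2)
print(filled_row)  # third value is the mean of 10 and 20
```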

Distance calculation with missing values

Let's look at an example to understand this. Consider three observations in two-dimensional space: A (2, 0), B (2, 2), and C (3, 3). The graphical representation of these points is as follows:

The point with the shortest Euclidean distance is considered the nearest neighbor. For example, the 1-nearest neighbor of point A is point B. For point B, the 1-nearest neighbor is point C.
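The nearest neighbors of these three points can be checked with a short NumPy sketch. The labels A, B, and C are assumed to correspond to (2, 0), (2, 2), and (3, 3) in that order.

```python
import numpy as np

labels = ["A", "B", "C"]
points = np.array([[2.0, 0.0], [2.0, 2.0], [3.0, 3.0]])

# Pairwise Euclidean distances via broadcasting
diff = points[:, None, :] - points[None, :, :]
dist = np.sqrt((diff ** 2).sum(axis=-1))

# Exclude each point's zero distance to itself before taking the minimum
np.fill_diagonal(dist, np.inf)
nearest = dist.argmin(axis=1)

for i, j in enumerate(nearest):
    print(f"nearest neighbor of {labels[i]} is {labels[j]} (distance {dist[i, j]:.2f})")
```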

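When coordinates are missing, the Euclidean distance is calculated by ignoring the missing values and scaling up the weight of the non-missing coordinates:

dist(x, y) = sqrt(weight × squared distance over the present coordinates)

where

weight = total number of coordinates / number of present coordinates

For example, the Euclidean distance between the two points (3, NaN, 5) and (1, 0, 0) is:

sqrt(3/2 × ((3 − 1)^2 + (5 − 0)^2)) = sqrt(43.5) ≈ 6.60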

Now we use the nan_euclidean_distances function from sklearn.metrics.pairwise to calculate the distance between two points with missing values.
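A minimal sketch of this call for the two points from the example, (3, NaN, 5) and (1, 0, 0). The manual calculation is included only to cross-check the result: two of the three coordinates are present, so the weight is 3/2.

```python
import numpy as np
from sklearn.metrics.pairwise import nan_euclidean_distances

X = [[3, np.nan, 5]]
Y = [[1, 0, 0]]

# Pairwise distance between the rows of X and the rows of Y
distance = nan_euclidean_distances(X, Y)

# Manual check: weight 3/2 times the squared distance over present coordinates
manual = np.sqrt(3 / 2 * ((3 - 1) ** 2 + (5 - 0) ** 2))

print(distance)  # a 1 x 1 array holding roughly 6.595
print(manual)
```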

Although nan_euclidean_distances works on two arrays passed as the X and Y arguments, it can also be applied to a single array that contains several observations.
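As a sketch of the single-array case, the two observations used so far are assumed to be stacked into one array. When only X is passed, the function computes the distances between all pairs of rows of X.

```python
import numpy as np
from sklearn.metrics.pairwise import nan_euclidean_distances

# Two observations stacked in a single array
X = [[3, np.nan, 5], [1, 0, 0]]

# With Y omitted, distances are computed between the rows of X
D = nan_euclidean_distances(X)
print(D)
```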

Therefore, the distance matrix is a 2 × 2 matrix that holds the Euclidean distance between each pair of observations. In addition, the diagonal elements of the resulting matrix are 0 because they represent the distance between an observation and itself.

KNNImputer imputation method

We will use the KNNImputer class from the impute module of sklearn. KNNImputer uses the Euclidean distance matrix to find the nearest neighbors, which are then used to estimate the missing values in an observation.
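A minimal sketch of the distance matrix behind this step, assuming the three observations discussed in this section: (3, NaN, 5), (1, 0, 0), and (3, 3, 3).

```python
import numpy as np
from sklearn.metrics.pairwise import nan_euclidean_distances

# Observation 1 has a missing value in its second dimension
X = [[3, np.nan, 5], [1, 0, 0], [3, 3, 3]]

# 3 x 3 matrix of pairwise distances that tolerate NaN entries
D = nan_euclidean_distances(X)
print(np.round(D, 2))

# Distances from observation 1 to observations 2 and 3
print(round(D[0, 1], 2), round(D[0, 2], 2))
```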

In this case, the code above shows that observation 1 (3, NaN, 5) and observation 3 (3, 3, 3) are closest in distance (~2.45).

Therefore, imputing the missing value in observation 1 (3, NaN, 5) with the 1-nearest neighbor gives an estimate of 3, which is the value of the second dimension of observation 3 (3, 3, 3).

In addition, imputing the missing value in observation 1 (3, NaN, 5) with the 2 nearest neighbors yields an estimate of 1.5, which is the mean of the second dimension of observations 2 and 3, i.e. (1, 0, 0) and (3, 3, 3).
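Both estimates can be reproduced with KNNImputer by changing its n_neighbors parameter. This is a minimal sketch on the same three observations, using the default uniform weighting of neighbors.

```python
import numpy as np
from sklearn.impute import KNNImputer

X = np.array([[3, np.nan, 5], [1, 0, 0], [3, 3, 3]])

# 1 nearest neighbor: the value is copied from observation 3
one_nn = KNNImputer(n_neighbors=1).fit_transform(X)

# 2 nearest neighbors: the value is the mean over observations 2 and 3
two_nn = KNNImputer(n_neighbors=2).fit_transform(X)

print(one_nn[0])  # second dimension imputed as 3
print(two_nn[0])  # second dimension imputed as 1.5
```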

So far, we have discussed using KNNImputer to handle missing values of continuous variables. Next, we create a data frame that contains missing values in a discrete variable.

To fill missing values in discrete variables, we first have to encode the discrete values as numbers, because KNNImputer works only on numeric variables. We can do this with a mapping from categories to numeric codes.
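The encode, impute, decode round trip might look like the sketch below. The data frame, the column names, and the category mapping are hypothetical. Using n_neighbors=1 means the imputed code is copied from a single neighbor and is therefore always a valid category. With larger k, the mean of the codes can be fractional, and averaging codes is questionable for unordered categories.

```python
import numpy as np
import pandas as pd
from sklearn.impute import KNNImputer

# Hypothetical data frame with a missing value in a discrete column
df = pd.DataFrame({
    "age": [25.0, 27.0, 52.0, 55.0, 24.0],
    "size_class": ["small", "small", "large", "large", np.nan],
})

# Map categories to numeric codes, and keep the inverse mapping
size_to_code = {"small": 0, "large": 1}
code_to_size = {code: label for label, code in size_to_code.items()}

encoded = df.copy()
encoded["size_class"] = encoded["size_class"].map(size_to_code)  # NaN stays NaN

# Impute on the numeric representation
imputer = KNNImputer(n_neighbors=1)
filled = pd.DataFrame(imputer.fit_transform(encoded), columns=encoded.columns)

# Decode the numeric codes back into category labels
filled["size_class"] = filled["size_class"].round().astype(int).map(code_to_size)
print(filled)
```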

Conclusion

In this article, we learned about missing values, why they occur, and how to use KNNImputer to fill them in. The choice of k when imputing missing values with the KNN algorithm can be a point of contention.

Furthermore, research suggests that it is essential to test the model using cross-validation after performing imputation with different values of k. Although imputation of missing values is a continuously evolving field of study, KNN is a simple and effective strategy.
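One possible way to compare values of k with cross-validation is sketched below. Everything in it is an assumption made for illustration: the synthetic regression data, the 10% missing rate, the candidate values of k, and the choice of a linear model. Placing the imputer inside a pipeline means it is refitted on each training fold, so no information leaks from the validation fold.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.impute import KNNImputer
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline

# Synthetic data with roughly 10% of the feature values removed at random
X, y = make_regression(n_samples=200, n_features=5, noise=10.0, random_state=0)
rng = np.random.default_rng(0)
X_missing = X.copy()
X_missing[rng.random(X.shape) < 0.1] = np.nan

# Mean cross-validated R^2 of the downstream model for each candidate k
cv_scores = {}
for k in (1, 3, 5, 10):
    model = make_pipeline(KNNImputer(n_neighbors=k), LinearRegression())
    cv_scores[k] = cross_val_score(model, X_missing, y, cv=5).mean()

best_k = max(cv_scores, key=cv_scores.get)
print(cv_scores)
print("best k on this synthetic data:", best_k)
```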

Link to the original text: https://www.analyticsvidhya.com/blog/2020/07/knnimputer-a-robust-way-to-impute-missing-values-using-scikit-learn/
