Detecting outliers in MATLAB with quantile random forest (QRF) regression trees

Time: 2021-12-5

This example shows how to detect outliers using a quantile random forest. Quantile random forests can detect outliers with respect to the conditional distribution of Y given X.

Outliers are observations that lie far enough from most other observations in a data set to be considered anomalous. Causes of outlying observations include inherent variability and measurement error. Outliers can significantly affect estimation and inference, so it is important to detect them and then decide whether to remove them or to use a robust analysis.

To demonstrate outlier detection, this example:
Generates data from a nonlinear model with heteroscedasticity and simulates a few outliers.
Grows a quantile random forest of regression trees.
Estimates the conditional quartiles (Q1, Q2 and Q3) and the interquartile range (IQR).
Compares the observations with the boundaries F1 = Q1 − 1.5·IQR and F2 = Q3 + 1.5·IQR. Any observation less than F1 or greater than F2 is an outlier.
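
The boundary rule in the last step can be tried on an ordinary sample before moving to conditional quartiles. The following is a minimal sketch, assuming a small made-up vector x (not part of this example's data) and unconditional quartiles from quantile; in the example itself, Q1 and Q3 are conditional on the predictor rather than computed from the whole sample.

```matlab
% Toy sample (illustrative values only); the last entry is far from the rest
x = [1 2 3 4 5 6 7 8 9 100]';

% Unconditional first and third quartiles of the sample
q = quantile(x,[0.25 0.75]);
iqrX = q(2) - q(1);

% Boundaries F1 = Q1 - 1.5*IQR and F2 = Q3 + 1.5*IQR
k = 1.5;
F1 = q(1) - k*iqrX;
F2 = q(2) + k*iqrX;

% An observation is an outlier if it is below F1 or above F2
isOutlier = (x < F1) | (x > F2);
disp(x(isOutlier))
```

Only the value 100 is flagged here. The conditional version used in the rest of this example replaces the two scalars F1 and F2 with curves that depend on the predictor.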

Generate data

Generate 500 observations from the model

y_t = 10 + 3t + t sin(2t) + ε_t,

where t is uniformly distributed between 0 and 4π and ε_t ~ N(0, t + 0.01). Store the data in a table.

n = 500;
rng('default'); % For reproducibility
t = randsample(linspace(0,4*pi,1e6),n,true)';
epsilon = randn(n,1).*sqrt((t+0.01)); % Noise variance grows with t
y = 10 + 3*t + t.*sin(2*t) + epsilon;
Tbl = table(t,y);
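
A quick numerical check of the heteroscedasticity is to compare the noise variance at small and large t. This is only a sketch under assumptions: it draws t with rand instead of randsample, uses a larger sample (m = 5000, a hypothetical choice) so that the comparison is stable, and all variable names are illustrative.

```matlab
rng('default');                          % For reproducibility
m = 5000;                                % Hypothetical sample size for this check
tChk = 4*pi*rand(m,1);                   % Uniform on [0, 4*pi]
epsChk = randn(m,1).*sqrt(tChk+0.01);    % Noise with variance t + 0.01

% Sample variance of the noise in the first and last quarter of the range
varLow  = var(epsChk(tChk < pi));
varHigh = var(epsChk(tChk > 3*pi));
fprintf('variance for t < pi: %.2f, for t > 3*pi: %.2f\n',varLow,varHigh);
```

The second variance should come out clearly larger than the first, which is why a single global threshold on y would be a poor outlier rule for these data.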

Move five randomly selected observations vertically, in a random direction, by 90% of the value of the response.

numOut = 5;
[~,idx] = datasample(Tbl,numOut); % Row indices of the observations to move
Tbl.y(idx) = Tbl.y(idx) + randsample([-1 1],numOut,true)'.*(0.9*Tbl.y(idx));

Draw a scatter plot of the data and mark the simulated outliers.

figure;
plot(Tbl.t,Tbl.y,'.');
hold on
plot(Tbl.t(idx),Tbl.y(idx),'*');
title('Scatter plot of data');
legend('Data','Simulated outliers','Location','NorthWest');

Grow a quantile random forest

Grow a bag of 200 regression trees.

Mdl = TreeBagger(200,Tbl,'y','Method','regression');

The returned object is a TreeBagger ensemble.

Predict conditional quartiles and interquartile ranges

Using quantile regression, estimate the conditional quartiles at 50 equally spaced values within the range of t.

tau = [0.25 0.5 0.75]; % Quartile probabilities
pred = linspace(0,4*pi,50)';
quartiles = quantilePredict(Mdl,pred,'Quantile',tau);

quartiles is a 50 × 3 matrix of conditional quartiles. Rows correspond to the values in pred, and columns correspond to the probabilities 0.25, 0.5 and 0.75.

Plot the conditional mean and median responses on the scatter plot of the data.

meanY = predict(Mdl,pred); % Conditional mean response
plot(pred,[quartiles(:,2) meanY]);
legend('Data','Simulated outliers','Median response','Mean response');
hold off

Although the conditional mean curve is close to the median curve, the simulated outliers can affect the mean curve.
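
The different sensitivity of the mean and the median to a single outlier can be seen on a handful of numbers. A minimal sketch with made-up values (not taken from this example's data):

```matlab
yClean = [10 11 12 13 14];
yOut   = [10 11 12 13 140];   % Same values, but the last one is moved far away

fprintf('clean:        mean = %.1f, median = %g\n',mean(yClean),median(yClean));
fprintf('with outlier: mean = %.1f, median = %g\n',mean(yOut),median(yOut));
```

The median stays at 12 while the mean moves from 12 to 37.2, which is the reason the boundaries in this example are built from quartiles rather than from the mean.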

Compute the conditional IQR, F1 and F2.

iqr = quartiles(:,3) - quartiles(:,1);
k = 1.5;
f1 = quartiles(:,1) - k*iqr; % Lower boundary
f2 = quartiles(:,3) + k*iqr; % Upper boundary

k = 1.5 means that all observations less than f1 or greater than f2 are considered outliers, but this threshold does not separate ordinary outliers from extreme ones. Setting k to 3 identifies the extreme outliers.
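
The two choices of k can be combined to label each observation as a mild or an extreme outlier. The following sketch uses hypothetical conditional quartiles at a single value of t and a few made-up responses; the names Q1, Q3, yObs, inner and outer are illustrative and do not appear in the example.

```matlab
% Hypothetical conditional quartiles at one value of t
Q1 = 10;
Q3 = 14;
iqrT = Q3 - Q1;                            % IQR = 4

% Hypothetical responses observed at that t
yObs = [12 18 21 27 3 -3];

% Boundaries for k = 1.5 and for k = 3
inner = [Q1 - 1.5*iqrT, Q3 + 1.5*iqrT];    % [4 20]
outer = [Q1 - 3*iqrT,   Q3 + 3*iqrT];      % [-2 26]

isOutlier = (yObs < inner(1)) | (yObs > inner(2));  % Outside the k = 1.5 boundaries
isExtreme = (yObs < outer(1)) | (yObs > outer(2));  % Outside the k = 3 boundaries
isMild    = isOutlier & ~isExtreme;                 % Outlier, but not extreme
```

Here 21 and 3 are mild outliers, 27 and -3 are extreme outliers, and 12 and 18 are not flagged.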

Compare the observations with the boundaries

Plot the observations and the boundaries.

figure;
plot(Tbl.t,Tbl.y,'.'); hold on
plot(Tbl.t(idx),Tbl.y(idx),'*');
plot(pred,[f1 f2]);
legend('Data','Simulated outliers','F_1','F_2');
title('Outlier detection using quantile regression'); hold off

All simulated outliers fall outside [F1, F2], and some other observations also lie outside this interval.
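
The comparison can also be done numerically instead of by eye. Because the boundaries are computed on a grid of 50 points rather than at the observed t values, one option is to interpolate them linearly at each observation. The sketch below shows the idea on small made-up arrays: tGrid, loBound and hiBound are hypothetical stand-ins for pred, f1 and f2, and tObs and yObs for Tbl.t and Tbl.y. Linear interpolation between grid points is an approximation chosen for illustration.

```matlab
% Hypothetical boundaries on a coarse grid (stand-ins for pred, f1 and f2)
tGrid   = (0:4)';
loBound = 2*tGrid - 1;
hiBound = 2*tGrid + 1;

% Hypothetical observations (stand-ins for Tbl.t and Tbl.y)
tObs = [0.5 1.5 2.5 3.5]';
yObs = [1.2 5.0 4.9 7.5]';

% Linearly interpolate both boundaries at the observed t values
loAtObs = interp1(tGrid,loBound,tObs);
hiAtObs = interp1(tGrid,hiBound,tObs);

% Flag observations below the lower or above the upper boundary
isOutlier  = (yObs < loAtObs) | (yObs > hiAtObs);
outlierIdx = find(isOutlier);
disp(outlierIdx)
```

In this toy data only the second observation is flagged: its boundaries are [2, 4] and its response is 5.0.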
