Detecting outliers in MATLAB with quantile random forest (QRF) regression trees



This example shows how to use a quantile random forest to detect outliers. A quantile random forest can detect outliers with respect to the conditional distribution of Y given X.

Outliers are observations that lie far enough from most other observations in a data set to be considered anomalous. Causes include inherent variability and measurement error. Outliers can significantly affect estimation and inference, so it is important to detect them and then decide whether to remove them or to use a robust analysis.

To demonstrate outlier detection, this example:
Generates data from a nonlinear model with heteroscedasticity and simulates a few outliers.
Grows a quantile random forest of regression trees.
Estimates the conditional quartiles (Q1, Q2 and Q3) and the interquartile range (IQR).
Compares the observations with the fences F1 = Q1 − 1.5 × IQR and F2 = Q3 + 1.5 × IQR. Any observation less than F1 or greater than F2 is an outlier.
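
As a point of reference, the same fence rule applied to a plain (unconditional) sample can be sketched as follows. The vector y is made-up data for illustration, and quantile is assumed to be available from the Statistics and Machine Learning Toolbox. The quantile random forest replaces these fixed quartiles with quartiles that vary with X.

```matlab
% Made-up sample for illustration only; 50 is the planted outlier.
y = [2 4 4 5 6 7 8 9 50]';

q = quantile(y,[0.25 0.75]);   % sample quartiles Q1 and Q3
iqrY = q(2) - q(1);            % interquartile range
F1 = q(1) - 1.5*iqrY;          % lower fence
F2 = q(2) + 1.5*iqrY;          % upper fence

isOut = y < F1 | y > F2;       % logical flag per observation
disp(y(isOut))                 % values flagged as outliers
```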

Generate data

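Generate 500 observations from a nonlinear model with predictor t and response y, where t is uniformly distributed between 0 and 4π and the noise term is ε_t ~ N(0, t + 0.01), so the noise variance grows with t. Store the data in a table.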

n = 500;
rng('default'); % for reproducibility
t = 4*pi*rand(n,1); % uniform on [0, 4*pi]
epsilon = randn(n,1).*sqrt(t+0.01); % noise with variance t + 0.01
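
The fragment above stops before the response is formed. A complete version of this step might look as follows. It is a sketch under assumptions: the mean function 10 + 3*t + t.*sin(2*t) is a stand-in chosen for illustration, since the text only says that the model is nonlinear. The sample size, the range of t and the noise variance follow the description.

```matlab
n = 500;
rng('default');                       % for reproducibility
t = 4*pi*rand(n,1);                   % uniform on [0, 4*pi]
epsilon = randn(n,1).*sqrt(t+0.01);   % heteroscedastic noise, variance t + 0.01
y = 10 + 3*t + t.*sin(2*t) + epsilon; % ASSUMED mean function (illustrative only)
Tbl = table(t,y);                     % predictor and response in one table
```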

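Move five observations in a random vertical direction by 90% of their response value.

numOut = 5;
idx = randperm(height(Tbl),numOut)'; % five distinct rows chosen at random
Tbl.y(idx) = Tbl.y(idx) + randsample([-1 1],numOut,true)'.*(0.9*Tbl.y(idx));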

Draw a scatter plot of the data and mark the simulated outliers.

plot(Tbl.t,Tbl.y,'.'); hold on
plot(Tbl.t(idx),Tbl.y(idx),'*');
title('Scatter plot of data');
legend('Data','Simulated outliers','Location','NorthWest');

Grow the quantile random forest

Grow a bag of 200 regression trees.

The result is a TreeBagger ensemble.
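
Growing the forest can be sketched as follows, assuming the TreeBagger function from the Statistics and Machine Learning Toolbox and a table Tbl with predictor t and response y. The data-generating formula is a stand-in assumption, included only so the block runs on its own.

```matlab
% Stand-in data: assumed nonlinear mean with noise variance t + 0.01.
rng('default');
t = 4*pi*rand(500,1);
y = 10 + 3*t + t.*sin(2*t) + randn(500,1).*sqrt(t+0.01);
Tbl = table(t,y);

% Grow 200 regression trees; 'y' names the response column in Tbl.
Mdl = TreeBagger(200,Tbl,'y','Method','regression');
disp(class(Mdl))      % TreeBagger
disp(Mdl.NumTrees)    % number of trees in the ensemble
```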

Predict conditional quartiles and interquartile ranges

Using quantile regression, estimate the conditional quartiles at 50 equally spaced values within the range of t.
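
A minimal sketch of this step, assuming the quantilePredict method of TreeBagger. The name tau is introduced here for illustration, pred is used for the query points to match the plotting code further down, and the data-generating formula is again a stand-in assumption.

```matlab
% Stand-in data and forest (assumed model, for illustration only).
rng('default');
t = 4*pi*rand(500,1);
y = 10 + 3*t + t.*sin(2*t) + randn(500,1).*sqrt(t+0.01);
Mdl = TreeBagger(200,table(t,y),'y','Method','regression');

tau = [0.25 0.5 0.75];                 % quartile probabilities
pred = linspace(0,4*pi,50)';           % 50 equally spaced query values of t
quartiles = quantilePredict(Mdl,pred,'Quantile',tau);
size(quartiles)                        % one row per query value, one column per tau
```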


quartiles is a 50 × 3 matrix of conditional quartiles. Rows correspond to the 50 query values of t and columns correspond to the probabilities 0.25, 0.5 and 0.75.

On the scatter plot of the data, plot the conditional mean and median responses.

meanY = predict(Mdl,pred); % conditional mean at the query points in pred
plot(pred,[quartiles(:,2) meanY]);
legend('Data','Simulated outliers','Median response','Mean response');

Although the conditional mean and median curves are close, the simulated outliers can pull the mean curve away from the median.

Compute the conditional IQR, F1 and F2.

iqr = quartiles(:,3) - quartiles(:,1);
k = 1.5;
f1 = quartiles(:,1) - k*iqr;
f2 = quartiles(:,3) + k*iqr;

With k = 1.5, every observation less than F1 or greater than F2 is considered an outlier, but this threshold does not separate mild outliers from extreme ones. Setting k = 3 identifies extreme outliers.
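
The difference between the two thresholds can be illustrated with a small worked example. The quartile values and responses below are hypothetical numbers at a single value of t, not output from the forest.

```matlab
% Hypothetical conditional quartiles at one value of t, plus four responses.
Q1 = 10;  Q3 = 14;
y  = [3 20.5 27 12];

iqrQ = Q3 - Q1;                                      % IQR = 4
isOut     = y < Q1 - 1.5*iqrQ | y > Q3 + 1.5*iqrQ;   % outside [4, 20]
isExtreme = y < Q1 - 3*iqrQ   | y > Q3 + 3*iqrQ;     % outside [-2, 26]
isMild    = isOut & ~isExtreme;                      % flagged at k = 1.5 only

disp([isOut; isExtreme; isMild])
```

Here 3 and 20.5 are flagged only at k = 1.5, 27 is flagged at both thresholds, and 12 is not flagged at all.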

Compare the observations with the fences

Plot the observations and the fences.

figure; plot(Tbl.t,Tbl.y,'.',Tbl.t(idx),Tbl.y(idx),'*'); hold on
plot(pred,[f1 f2]);
legend('Data','Simulated outliers','F_1','F_2');
title('Outlier detection using quantile regression')

All simulated outliers fall outside [F1, F2], and a few other observations fall outside this interval as well.
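
The visual check can also be done programmatically by estimating the quartiles at each observed value of t instead of on a grid. The following is a rough sketch, not the procedure used for the plot: the data-generating formula is a stand-in assumption, and predicting at the training points reuses the observations that trained the trees, so a stricter check could rely on out-of-bag predictions instead.

```matlab
% Stand-in data with five planted outliers (assumed model, illustration only).
rng('default');
t = 4*pi*rand(500,1);
y = 10 + 3*t + t.*sin(2*t) + randn(500,1).*sqrt(t+0.01);
idx = randperm(500,5)';
y(idx) = y(idx) + randsample([-1 1],5,true)'.*(0.9*y(idx));
Tbl = table(t,y);

Mdl = TreeBagger(200,Tbl,'y','Method','regression');

% Conditional quartiles at each observed t, then the k = 1.5 fences.
q = quantilePredict(Mdl,Tbl.t,'Quantile',[0.25 0.75]);
iqrT = q(:,2) - q(:,1);
isOut = Tbl.y < q(:,1) - 1.5*iqrT | Tbl.y > q(:,2) + 1.5*iqrT;

fprintf('%d of %d observations flagged\n',sum(isOut),height(Tbl));
disp(ismember(idx,find(isOut))')   % which planted outliers were caught
```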
