This example shows how to use a quantile random forest to detect outliers. A quantile random forest can detect outliers with respect to the conditional distribution of Y given X.
Outliers are observations that lie far enough from most other observations in a data set to be considered anomalous. Causes of outlying observations include inherent variability and measurement error. Outliers can significantly affect estimates and inference, so it is important to detect them and then decide whether to remove them or to use a robust analysis.
To demonstrate outlier detection, this example:
Generates data from a nonlinear model with heteroscedasticity and simulates a few outliers.
Grows a quantile random forest of regression trees.
Estimates the conditional quartiles (Q1, Q2, and Q3) and the interquartile range (IQR).
Compares the observations with the fences, F1 = Q1 − 1.5 IQR and F2 = Q3 + 1.5 IQR. Any observation that is less than F1 or greater than F2 is an outlier.
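As a small illustration of the fence rule in the last step, the sketch below applies it to a made-up sample, using ordinary sample quartiles instead of conditional ones. The data vector and the variable names are illustrative assumptions; quantile is the Statistics and Machine Learning Toolbox function.

```matlab
% Toy sample with one obviously extreme value (illustrative data only)
x = [1 2 3 4 5 6 7 8 9 100];

% Sample quartiles Q1 and Q3
q = quantile(x,[0.25 0.75]);
iqrX = q(2) - q(1);            % interquartile range

% Fences F1 and F2 with multiplier 1.5
f1 = q(1) - 1.5*iqrX;
f2 = q(2) + 1.5*iqrX;

% Logical mask of observations outside [F1, F2]
isOut = (x < f1) | (x > f2);
outliers = x(isOut)            % only the value 100 is flagged
```

The rest of the example follows the same rule, except that Q1 and Q3 are estimated conditionally on the predictor, so the fences vary with it.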
Generate 500 observations from a nonlinear, heteroscedastic model in which t is uniformly distributed between 0 and 4π and ε_t ~ N(0, t + 0.01). Store the data in a table.
n = 500;
rng('default'); % For reproducibility
t = randsample(linspace(0,4*pi,1e6),n,true)';
epsilon = randn(n,1).*sqrt((t+0.01));
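The text above does not reproduce the model equation, so the following sketch fills it in with an assumed nonlinear mean function purely for illustration. Only the distributions of t and ε_t come from the description; the formula for y is an assumption, and the table variable names t and y are chosen to match the later code.

```matlab
n = 500;
rng('default');                                  % for reproducibility
t = randsample(linspace(0,4*pi,1e6),n,true)';    % uniform draws on [0, 4*pi]
epsilon = randn(n,1).*sqrt(t + 0.01);            % noise with variance t + 0.01

% ASSUMPTION: illustrative nonlinear mean function, not taken from the text
y = 10 + 3*t + t.*sin(2*t) + epsilon;

% Store predictor and response in a table with variables t and y
Tbl = table(t,y);
```

Because the noise standard deviation grows with sqrt(t), the spread of y widens as t increases, which is the heteroscedasticity the example relies on.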
Move five observations in a random vertical direction by 90% of the response value.
numOut = 5;
[~,idx] = datasample(Tbl,numOut); % Indices of the observations to move
Tbl.y(idx) = Tbl.y(idx) + randsample([-1 1],numOut,true)'.*(0.9*Tbl.y(idx));
Draw a scatter plot of the data and mark the simulated outliers.
figure;
plot(Tbl.t,Tbl.y,'.');
hold on % Keep the data points when adding the outlier markers
plot(Tbl.t(idx),Tbl.y(idx),'*');
title('Scatter Plot of Data');
legend('Data','Simulated outliers','Location','NorthWest');
Grow a quantile random forest
Grow a bag of 200 regression trees.
The result is a TreeBagger ensemble.
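The growing step might be sketched as follows. The setup lines regenerate a data table so that the block runs on its own; the response formula in them is an illustrative assumption, and the name Mdl is likewise chosen here. The TreeBagger call itself uses the documented syntax with the number of trees, the table, the response variable name, and the regression method.

```matlab
% Setup: synthetic data (the response formula is an assumption)
rng('default');
n = 500;
t = randsample(linspace(0,4*pi,1e6),n,true)';
y = 10 + 3*t + t.*sin(2*t) + randn(n,1).*sqrt(t + 0.01);
Tbl = table(t,y);

% Grow 200 bagged regression trees with y as the response
Mdl = TreeBagger(200,Tbl,'y','Method','regression');

class(Mdl)        % 'TreeBagger'
Mdl.NumTrees      % 200
```

A bag of regression trees grown this way can be queried for conditional quantiles as well as for the conditional mean.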
Predict conditional quartiles and interquartile ranges
Using quantile regression, estimate the conditional quartiles at 50 equally spaced values within the range of t.
quartiles is a 50 × 3 matrix of conditional quartiles. Rows correspond to the 50 evaluation points in t, and columns correspond to the probabilities 0.25, 0.5, and 0.75.
On the scatter plot of the data, plot the conditional mean and median responses.
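The quartile estimates and the mean response described above could be computed roughly as below. The setup lines repeat the data and forest so the block is runnable by itself, with an assumed response formula; tau is a name introduced here, while pred, quartiles, and meanY match the names used in the plotting code.

```matlab
% Setup: synthetic data and forest (the response formula is an assumption)
rng('default');
n = 500;
t = randsample(linspace(0,4*pi,1e6),n,true)';
y = 10 + 3*t + t.*sin(2*t) + randn(n,1).*sqrt(t + 0.01);
Tbl = table(t,y);
Mdl = TreeBagger(200,Tbl,'y','Method','regression');

% Quartile probabilities and 50 equally spaced evaluation points
tau = [0.25 0.5 0.75];
pred = linspace(0,4*pi,50)';

% Conditional quartiles: one row per evaluation point, one column per tau
quartiles = quantilePredict(Mdl,pred,'Quantile',tau);

% Conditional mean response at the same points
meanY = predict(Mdl,pred);

size(quartiles)   % [50 3]
```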
plot(pred,[quartiles(:,2) meanY]);
legend('Data','Simulated outliers','Median response','Mean response');
Although the conditional mean and median curves are close, the simulated outliers can affect the mean curve.
Compute the conditional IQR, F1, and F2.
iqr = quartiles(:,3) - quartiles(:,1);
k = 1.5; % Fence multiplier
f1 = quartiles(:,1) - k*iqr;
f2 = quartiles(:,3) + k*iqr;
k = 1.5 means that all observations less than f1 or greater than f2 are considered outliers, but this threshold does not distinguish ordinary outliers from extreme ones. A k of 3 identifies extreme outliers.
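To make the difference between the two multipliers concrete, the short sketch below classifies a few made-up values against both sets of fences. The quartile values, the data, and all variable names are hypothetical and chosen only so the arithmetic is easy to follow.

```matlab
% Hypothetical quartiles at a single predictor value
q1 = 10;
q3 = 14;
iqrVal = q3 - q1;                              % 4

% Made-up responses to classify
y = [3 9 12 21 30];

% Fences for k = 1.5 and for k = 3
inner = [q1 - 1.5*iqrVal, q3 + 1.5*iqrVal];    % [4 20]
outer = [q1 - 3*iqrVal,   q3 + 3*iqrVal];      % [-2 26]

isOutlier = (y < inner(1)) | (y > inner(2))    % [1 0 0 1 1]
isExtreme = (y < outer(1)) | (y > outer(2))    % [0 0 0 0 1]
```

Every extreme outlier is also an outlier under k = 1.5, so the two masks together separate mild outliers (3 and 21 here) from extreme ones (30).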
Compare observations with the fences
Plot the observations and the fences F1 and F2.
figure;
plot(Tbl.t,Tbl.y,'.');
hold on
plot(Tbl.t(idx),Tbl.y(idx),'*');
plot(pred,[f1 f2]); % Lower and upper fences
legend('Data','Simulated outliers','F_1','F_2');
title('Outlier Detection Using Quantile Regression')
All simulated outliers fall outside [F1, F2], and some other observations fall outside this interval as well.
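The plot makes the comparison visually. One way to obtain the flagged observations programmatically is to estimate the conditional quartiles at each observed t and test every response against its own fences. This is a sketch under assumptions, not the example's own code: the response formula in the setup is assumed, and qObs, iqrObs, isOutlier, and flagged are names introduced here.

```matlab
% Setup: synthetic data and forest (the response formula is an assumption)
rng('default');
n = 500;
t = randsample(linspace(0,4*pi,1e6),n,true)';
y = 10 + 3*t + t.*sin(2*t) + randn(n,1).*sqrt(t + 0.01);
Tbl = table(t,y);
Mdl = TreeBagger(200,Tbl,'y','Method','regression');

% Conditional quartiles evaluated at every observed t (n-by-3 matrix)
qObs = quantilePredict(Mdl,Tbl.t,'Quantile',[0.25 0.5 0.75]);
iqrObs = qObs(:,3) - qObs(:,1);

% Per-observation fences with multiplier k
k = 1.5;
f1Obs = qObs(:,1) - k*iqrObs;
f2Obs = qObs(:,3) + k*iqrObs;

% Observations outside their own [F1, F2] interval
isOutlier = (Tbl.y < f1Obs) | (Tbl.y > f2Obs);
flagged = Tbl(isOutlier,:);
```

These quartiles are in-sample estimates, because each observation also helped grow some of the trees. If that is a concern, growing the forest with 'OOBPrediction','on' and using oobQuantilePredict is an alternative worth considering.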