R language nonparametric model to determine premium rate: local regression, generalized additive model GAM, spline regression

Time:2022-5-31

Original link:http://tecdat.cn/?p=14121

This paper will analyze several smoothing techniques used to formulate insurance premium rates.

The price should be related to the pure premium, which is proportional to the claim frequency: if the claim count $N$ and the claim amounts $Y_i$ are independent, then

$$\mathbb{E}\left(\sum_{i=1}^{N} Y_i\right) = \mathbb{E}(N)\cdot\mathbb{E}(Y_i).$$

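Without covariates, the expected frequency is estimated by an intercept-only Poisson regression (the object regglm0 used below), whose summary is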

Deviance Residuals: 
    Min       1Q   Median       3Q      Max  
-0.5033  -0.3719  -0.2588  -0.1376  13.2700  

Coefficients:
            Estimate Std. Error z value Pr(>|z|)    
(Intercept)  -2.6201     0.0228  -114.9   <2e-16 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1 

(Dispersion parameter for poisson family taken to be 1)

    Null deviance: 12680  on 49999  degrees of freedom
Residual deviance: 12680  on 49999  degrees of freedom
AIC: 16353

Number of Fisher Scoring iterations: 6
> exp(coefficients(regglm0))
(Intercept) 
 0.07279295
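The call that produced regglm0 is not shown above. As a minimal sketch on simulated data (the data frame `d`, its columns `claims` and `expo`, and the use of a log-exposure offset are assumptions, not the article's dataset or code), an intercept-only Poisson regression can be fitted as follows. Under that assumption the exponentiated intercept is simply the total number of claims divided by the total exposure.

```r
# Simulated portfolio (hypothetical data, for illustration only)
set.seed(1)
d <- data.frame(expo = runif(5000, 0.1, 1))       # exposure in years
d$claims <- rpois(nrow(d), lambda = 0.07 * d$expo)

# Intercept-only Poisson regression, with log(exposure) as offset
fit0 <- glm(claims ~ 1 + offset(log(expo)), family = poisson, data = d)

lambda_hat <- unname(exp(coef(fit0)))             # annualized frequency
empirical  <- sum(d$claims) / sum(d$expo)         # claims per year of exposure
print(c(glm = lambda_hat, empirical = empirical))
```

The two printed values should agree up to the convergence tolerance of `glm`.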

Therefore, if we do not want to account for potential heterogeneity, we usually read $\widehat{\lambda}$ as a percentage, i.e. as a probability, because for a Poisson count with small $\lambda$

$$\mathbb{P}(N>0) = 1-e^{-\lambda} \approx \lambda.$$

That is, $1-\widehat{\lambda}$ can be interpreted as the probability of having no claim. Let’s visualize it as a function of driver age,



> plot(a,yp0,type="l",ylim=c(.03,.12))
> segments(a[k],yp1[k],a[k],yp2[k],col="red",lwd=3)


We predict the same frequency for all drivers; for example, for 40-year-olds,

> cat("Frequency =",yp0[k]," confidence interval",yp1[k],yp2[k])
Frequency = 0.07279295  confidence interval 0.07611196 0.06947393
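The printed bounds are consistent with the point prediction plus or minus two standard errors: 0.07279295 × 0.0228 ≈ 0.00166, and each bound sits about 0.0033 from the estimate. The code that built `yp0`, `yp1` and `yp2` is not shown, so the following is only a sketch of one way to obtain such bounds, assuming `predict()` with `se.fit = TRUE` on the response scale. The data frame `d` and the model `fit0` are hypothetical.

```r
# Simulated portfolio (hypothetical data, for illustration only)
set.seed(1)
d <- data.frame(age  = sample(18:90, 5000, replace = TRUE),
                expo = runif(5000, 0.1, 1))
d$claims <- rpois(nrow(d), lambda = 0.07 * d$expo)

fit0 <- glm(claims ~ 1 + offset(log(expo)), family = poisson, data = d)

# Annualized frequency (exposure = 1) for ages 18 to 90
a   <- 18:90
new <- data.frame(age = a, expo = 1)
yp  <- predict(fit0, newdata = new, type = "response", se.fit = TRUE)
yp0 <- yp$fit                    # point prediction
yp1 <- yp$fit + 2 * yp$se.fit    # upper bound
yp2 <- yp$fit - 2 * yp$se.fit    # lower bound

k <- which(a == 40)
cat("Frequency =", yp0[k], " confidence interval", yp1[k], yp2[k], "\n")
```

Because the model has no covariate, the prediction and its bounds are the same at every age.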

Now let’s try to account for heterogeneity, for example by age,

  • Poisson regression

The idea of (log-)Poisson regression is to assume that, rather than $N\sim\mathcal{P}(\lambda)$, we have $N\mid\boldsymbol{X}\sim\mathcal{P}(\lambda_{\boldsymbol{X}})$, where

$$\lambda_{\boldsymbol{X}} = \exp(\boldsymbol{X}'\boldsymbol{\beta}).$$

Here, let us consider only one explanatory variable, the driver's age $X$, namely

$$\lambda_{X} = \exp(\beta_0+\beta_1 X).$$

We have

> plot(a,yp0,type="l",ylim=c(.03,.12))
> abline(v=40,col="grey")
> lines(a,yp1,lty=2)
> lines(a,yp2,lty=2)
> points(a[k],yp0[k],pch=3,lwd=3,col="red")
> segments(a[k],yp1[k],a[k],yp2[k],col="red",lwd=3)


The predicted annualized claim frequency for our 40-year-old driver is now 7.74% (slightly higher than the previous 7.28%),

> cat("Frequency =",yp0[k]," confidence interval",yp1[k],yp2[k])
Frequency = 0.07740574  confidence interval 0.08117512 0.07363636
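The fitting step behind this prediction is not shown. A minimal sketch of a log-link Poisson regression on age, using simulated data, could look like this. The data frame `d`, the column names, the exposure offset and the assumed true frequency are all hypothetical. It also checks that the prediction at age 40 is exactly $\exp(\beta_0+40\,\beta_1)$.

```r
# Simulated portfolio (hypothetical data, for illustration only)
set.seed(1)
d <- data.frame(age  = sample(18:90, 5000, replace = TRUE),
                expo = runif(5000, 0.1, 1))
# Assumed "true" frequency, decaying exponentially with age
d$claims <- rpois(nrow(d), lambda = d$expo * exp(-2 - 0.015 * d$age))

# log E[N | age] = log(exposure) + beta0 + beta1 * age
fit1 <- glm(claims ~ age + offset(log(expo)), family = poisson, data = d)

# Annualized frequency for a 40-year-old, by hand and with predict()
b <- coef(fit1)
by_hand <- unname(exp(b["(Intercept)"] + b["age"] * 40))
by_pred <- unname(predict(fit1, newdata = data.frame(age = 40, expo = 1),
                          type = "response"))
print(c(by_hand = by_hand, by_pred = by_pred))
```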

Instead of the expected frequency, we can look at the ratio of the age-specific frequency to the unsegmented one, $\mathbb{E}(N\mid X)/\mathbb{E}(N)$.

[Figure: ratio $\mathbb{E}(N\mid X)/\mathbb{E}(N)$ as a function of driver age, with a horizontal blue line at the unsegmented level]

Above the horizontal blue line, the premium is higher than the unsegmented premium; below it, the premium is lower. Here, drivers younger than 44 will pay more, while drivers older than 44 will pay less.

In the introduction, we discussed the necessity of segmentation. Consider two companies, one that segments and one that charges a flat premium: older drivers will go to the first company (because its insurance is cheaper for them) and younger drivers will go to the second (again, because it is cheaper). The problem is that the second company was implicitly counting on the older drivers to offset the risk of the younger ones. Since they have left, its premium is too low, and the company will lose money (if it does not go bankrupt). Therefore, companies must segment to survive.

Now, the problem is that we cannot be sure that this exponential decay is the right way for the premium to change with age. An alternative is to use nonparametric techniques to visualize the _true_ impact of age on claim frequency.
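The competition argument above can be illustrated with a toy calculation. The sketch below assumes a hypothetical exponentially decaying true frequency, one driver per age and a unit claim cost. None of these come from the article's data. Each driver simply buys from the cheaper company.

```r
age       <- 18:90
true_freq <- exp(-2 - 0.015 * age)   # assumed true claim frequency by age
flat      <- mean(true_freq)         # unsegmented premium

# Company A charges the age-specific frequency, company B the flat rate.
# A driver goes to B only when the flat rate is the cheaper offer.
goes_to_B <- flat < true_freq

premium_B <- flat                          # what B collects per driver
cost_B    <- mean(true_freq[goes_to_B])    # what B expects to pay per driver

cat("B charges", round(premium_B, 4),
    "but expects claims of", round(cost_B, 4), "per driver\n")
cat("B keeps drivers aged", range(age[goes_to_B]), "\n")
```

In this setting company B keeps only the younger drivers, and its expected claims per driver exceed the flat premium it collects.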

  • Pure nonparametric model

A first model is to compute a premium for each age, treating the driver's age as a _factor_ in the regression,

> plot(a0,yp0,type="l",ylim=c(.03,.12))
> abline(v=40,col="grey")


Here, the prediction for our 40-year-old driver is slightly lower than the previous one, but the confidence interval is much wider (because we focus on a very small class in the portfolio: drivers whose age is _exactly_ 40),

Frequency = 0.06686658  confidence interval 0.08750205 0.0462311

Here, the class is too small and the premium too unstable: it decreases by about 21% from age 40 to 41, and then increases by about 69% from age 41 to 42 (log-differences of -0.23 and 0.52).

> diff(log(yp0[23:25]))
        24         25 
-0.2330241  0.5223478
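A log-difference $d$ between two consecutive premiums corresponds to a relative change of $e^{d}-1$. As a short sketch, the two values printed above can be converted into percentage changes as follows (the variable names are illustrative):

```r
# Log-differences printed above (age 40 -> 41, then 41 -> 42)
log_diff <- c(-0.2330241, 0.5223478)

# Relative change implied by a log-difference d: exp(d) - 1
pct_change <- 100 * (exp(log_diff) - 1)
print(round(pct_change, 1))
```

For small values the log-difference is close to the relative change, but at 0.52 the two differ noticeably.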

No company could realistically sell such a tariff to its policyholders. This _discontinuity_ of the premium is an important issue here.

  • Using age classes

Another option is to group ages into classes, from very young drivers to senior drivers.

> summary(regglmc1)

Coefficients:
                                   Estimate Std. Error z value Pr(>|z|)    
(Intercept)                         -1.6036     0.1741  -9.212  < 2e-16 ***
cut(ageconducteur, level1)(20,25]   -0.4200     0.1948  -2.157   0.0310 *  
cut(ageconducteur, level1)(25,30]   -0.9378     0.1903  -4.927 8.33e-07 ***
cut(ageconducteur, level1)(30,35]   -1.0030     0.1869  -5.367 8.02e-08 ***
cut(ageconducteur, level1)(35,40]   -1.0779     0.1866  -5.776 7.65e-09 ***
cut(ageconducteur, level1)(40,45]   -1.0264     0.1858  -5.526 3.28e-08 ***
cut(ageconducteur, level1)(45,50]   -0.9978     0.1856  -5.377 7.58e-08 ***
cut(ageconducteur, level1)(50,55]   -1.0137     0.1855  -5.464 4.65e-08 ***
cut(ageconducteur, level1)(55,60]   -1.2036     0.1939  -6.207 5.40e-10 ***
cut(ageconducteur, level1)(60,65]   -1.1411     0.2008  -5.684 1.31e-08 ***
cut(ageconducteur, level1)(65,70]   -1.2114     0.2085  -5.811 6.22e-09 ***
cut(ageconducteur, level1)(70,75]   -1.3285     0.2210  -6.012 1.83e-09 ***
cut(ageconducteur, level1)(75,80]   -0.9814     0.2271  -4.321 1.55e-05 ***
cut(ageconducteur, level1)(80,85]   -1.4782     0.3371  -4.385 1.16e-05 ***
cut(ageconducteur, level1)(85,90]   -1.2120     0.5294  -2.289   0.0221 *  
cut(ageconducteur, level1)(90,95]   -0.9728     1.0150  -0.958   0.3379    
cut(ageconducteur, level1)(95,100] -11.4694   144.2817  -0.079   0.9366    
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1 








> lines(a,yp1,lty=2,type="s")
> lines(a,yp2,lty=2,type="s")

Here, we get the following predictions,

[Figure: step-function predictions of claim frequency by age class, with upper and lower bounds]

For our 40-year-old driver, the predicted frequency is now 6.85%.

Frequency = 0.0684573  confidence interval 0.07766717 0.05924742
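This value can be recovered, up to rounding, from the coefficient table: a 40-year-old falls in the right-closed class (35,40], so the predicted frequency is the exponential of the intercept plus that class coefficient. In the sketch below the two coefficients are copied from the summary, while the break points passed to `cut()` are an assumption, since `level1` is not shown in the article.

```r
# Coefficients copied from the summary above
intercept  <- -1.6036     # reference (youngest) age class
coef_35_40 <- -1.0779     # class (35,40]

# Hypothetical breaks; the article's `level1` is not shown
breaks <- seq(15, 100, by = 5)
cls <- as.character(cut(40, breaks = breaks))   # class containing age 40

freq_40 <- exp(intercept + coef_35_40)
cat("class:", cls, " frequency:", round(freq_40, 4), "\n")
```

Every driver in the same class receives the same prediction, which is why the fitted curve is a step function.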

We should also try other class boundaries to see whether the prediction is sensitive to that choice,

For our 40 year old driver, the following values are obtained:

Frequency = 0.07050614  confidence interval 0.07980422 0.06120807

So we have not eliminated the _discontinuity_ problem. One idea is to consider _moving regions_: if the goal is to predict the frequency for 40-year-olds, the interval should be centered on 40; for 35-year-old drivers, it should be centered on 35.

  • Moving average

So it is natural to consider a _local_ regression, in which only drivers whose age is _close_ to 40 are used. How close is _close_ depends on a _bandwidth_. For example, drivers between 35 and 45 may be considered close to 40. In practice, we can either fit on a subset of the data or use weights in the regression,

> value=40
> h=5
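The regression call that uses `value` and `h` is not shown. As a minimal sketch on simulated data (the data frame `d`, its columns and the local model `claims ~ age` are assumptions), the two options mentioned above, a subset and 0/1 weights, give the same local fit:

```r
# Simulated portfolio (hypothetical data, for illustration only)
set.seed(1)
d <- data.frame(age  = sample(18:90, 5000, replace = TRUE),
                expo = runif(5000, 0.1, 1))
d$claims <- rpois(nrow(d), lambda = d$expo * exp(-2 - 0.015 * d$age))

value <- 40   # age of interest
h     <- 5    # bandwidth: ages within h years count as "close"

# Option 1: fit only on drivers aged value - h to value + h
fit_sub <- glm(claims ~ age + offset(log(expo)), family = poisson,
               data = d, subset = abs(age - value) <= h)

# Option 2: the same fit expressed with 0/1 weights
d$w <- as.numeric(abs(d$age - value) <= h)
fit_w <- glm(claims ~ age + offset(log(expo)), family = poisson,
             data = d, weights = w)

new   <- data.frame(age = value, expo = 1)
p_sub <- unname(predict(fit_sub, newdata = new, type = "response"))
p_w   <- unname(predict(fit_w,   newdata = new, type = "response"))
print(c(subset = p_sub, weights = p_w))
```

Repeating the fit for each value of `value` and keeping the prediction at that age traces out the moving-window curve.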

To see what happens, let’s consider an animation in which the age of interest changes,

[Animation: local regression fit as the age of interest varies]

Here, for our 40 year olds,

Frequency = 0.06913391  confidence interval 0.07535564 0.06291218

We obtain a curve that can be interpreted as a _local_ regression. But here, we do not take into account that 35 is not as close to 40 as 39 is, while 34 is treated as being far from 40. Clearly, we can improve the technique by using a kernel function, so that the closer an age is to 40, the greater its weight.

> value=40
> h=5
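The weighted regression itself is not shown. One possible sketch, assuming a Gaussian kernel and simulated data (the kernel choice, the data frame `d` and the local model are all assumptions), replaces the 0/1 weights with weights that decay smoothly with the distance to `value`:

```r
# Simulated portfolio (hypothetical data, for illustration only)
set.seed(1)
d <- data.frame(age  = sample(18:90, 5000, replace = TRUE),
                expo = runif(5000, 0.1, 1))
d$claims <- rpois(nrow(d), lambda = d$expo * exp(-2 - 0.015 * d$age))

value <- 40   # age of interest
h     <- 5    # bandwidth of the kernel

# Gaussian kernel weights: largest at `value`, decaying with distance
d$w <- dnorm((d$age - value) / h)

fit_k <- glm(claims ~ age + offset(log(expo)), family = poisson,
             data = d, weights = w)

p_k <- unname(predict(fit_k, newdata = data.frame(age = value, expo = 1),
                      type = "response"))
cat("Frequency at age", value, "=", p_k, "\n")
```

With this kernel a 39-year-old driver contributes more to the fit at 40 than a 35-year-old, and no driver is excluded outright.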

The resulting curve is drawn below,

[Figure: kernel-weighted local regression of claim frequency on driver age]

Here, our prediction for 40 is

Frequency = 0.07040464  confidence interval 0.07981521 0.06099408

This is the idea behind _kernel regression_. However, as mentioned in the slides, other nonparametric techniques, such as splines, can also be considered.

  • Smoothing with splines

In R, using splines is simple (somewhat simpler than kernel smoothers),

> library(splines)
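The model call that follows `library(splines)` is not shown. A minimal sketch, assuming a cubic B-spline basis from `bs()` inside an ordinary Poisson GLM and simulated data (the data frame `d`, its columns and the default basis are assumptions), could be:

```r
library(splines)

# Simulated portfolio (hypothetical data, for illustration only)
set.seed(1)
d <- data.frame(age  = sample(18:90, 5000, replace = TRUE),
                expo = runif(5000, 0.1, 1))
d$claims <- rpois(nrow(d), lambda = d$expo * exp(-2 - 0.015 * d$age))

# Cubic B-spline basis in age, inside a Poisson GLM
fit_bs <- glm(claims ~ bs(age) + offset(log(expo)),
              family = poisson, data = d)

a  <- 18:90
yp <- predict(fit_bs, newdata = data.frame(age = a, expo = 1),
              type = "response", se.fit = TRUE)
k  <- which(a == 40)
cat("Frequency =", yp$fit[k],
    " bounds", yp$fit[k] + 2 * yp$se.fit[k], yp$fit[k] - 2 * yp$se.fit[k], "\n")
```

Since the data here are simulated, the printed numbers will not match those reported in the article.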


Now the prediction for our 40 year old driver is

Frequency = 0.06928169  confidence interval 0.07397124 0.06459215

Note that this technique is closely related to another class of models, the so-called generalized additive models, or GAM.

This prediction is very close to the one obtained above (the main difference is for very old drivers),

Frequency = 0.06912683  confidence interval 0.07501663 0.06323702
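The GAM fit is not shown either, and the article does not say which package was used. As a sketch only, assuming the `mgcv` package and simulated data (the data frame `d` and its columns are hypothetical), a Poisson GAM with a smooth term in age can be fitted as follows:

```r
library(mgcv)   # assumption: the article does not name the package it used

# Simulated portfolio (hypothetical data, for illustration only)
set.seed(1)
d <- data.frame(age  = sample(18:90, 5000, replace = TRUE),
                expo = runif(5000, 0.1, 1))
d$claims <- rpois(nrow(d), lambda = d$expo * exp(-2 - 0.015 * d$age))

# s(age) is a penalized smooth; its wiggliness is chosen by the fitting routine
fit_gam <- gam(claims ~ s(age) + offset(log(expo)),
               family = poisson, data = d)

p_gam <- as.numeric(predict(fit_gam,
                            newdata = data.frame(age = 40, expo = 1),
                            type = "response"))
cat("Frequency at age 40 =", p_gam, "\n")
```

Unlike the `bs()` fit, where the basis dimension is fixed by the user, the smoothness here is estimated from the data.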

  • Comparison of different models

Either way, all these models are valid. So maybe we should compare them,

[Figure: upper and lower prediction bounds of the nine models as functions of driver age]

In the figure above, we can visualize the upper and lower prediction bounds of the nine models. The horizontal line is the prediction that ignores heterogeneity.

