Original link: http://tecdat.cn/?p=14121
This paper will analyze several smoothing techniques used to formulate insurance premium rates.
- Premium not broken down
The price should be related to the pure premium, which is proportional to the claim frequency, since the pure premium factors as E(S) = E(N) x E(Y). With no covariates, the expected frequency is simply the overall claims rate,

lambda = (total number of claims) / (total exposure).
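A minimal sketch of the model whose summary follows, assuming a data frame `sinistres` with a claim count `nbre` and an exposure `exposition` (hypothetical names, since the post does not show its data set):

```r
# Intercept-only Poisson regression with log-exposure offset:
# the fitted intercept is log(lambda), the overall claim frequency.
regglm0 <- glm(nbre ~ 1 + offset(log(exposition)),
               family = poisson(link = "log"), data = sinistres)
summary(regglm0)
```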
Deviance Residuals:
Min 1Q Median 3Q Max
-0.5033 -0.3719 -0.2588 -0.1376 13.2700
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -2.6201 0.0228 -114.9 <2e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
(Dispersion parameter for poisson family taken to be 1)
Null deviance: 12680 on 49999 degrees of freedom
Residual deviance: 12680 on 49999 degrees of freedom
AIC: 16353
Number of Fisher Scoring iterations: 6
> exp(coefficients(regglm0))
(Intercept)
0.07279295
Therefore, if we do not want to take potential heterogeneity into account, we predict the same expected annual frequency for everyone, about 7.28%. Since this frequency is small, it can also be read as a probability of claiming: P(N = 0) = exp(-lambda) is close to 1 - lambda. Let's visualize this prediction as a function of driver age,
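One possible construction of the objects plotted below (a hypothetical reconstruction: `a` is a grid of ages, `yp0` the point prediction, `yp1`/`yp2` an approximate 95% band, `k` the index of age 40):

```r
a  <- 18:80                       # grid of driver ages (assumed range)
k  <- which(a == 40)              # index of the 40-year-old driver
pr <- predict(regglm0,
              newdata = data.frame(ageconducteur = a, exposition = 1),
              type = "response", se.fit = TRUE)
yp0 <- pr$fit                     # point prediction (flat, by construction)
yp1 <- pr$fit + 1.96 * pr$se.fit  # approximate upper 95% bound
yp2 <- pr$fit - 1.96 * pr$se.fit  # approximate lower 95% bound
```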
> plot(a,yp0,type="l",ylim=c(.03,.12))
> segments(a[k],yp1[k],a[k],yp2[k],col="red",lwd=3)
We predict the same frequency for all drivers; for example, for a 40-year-old,

> cat("Frequency =",yp0[k]," confidence interval",yp1[k],yp2[k])
Frequency = 0.07279295 confidence interval 0.07611196 0.06947393
Now let's try to take heterogeneity into account, for example through age,
- Poisson regression
The idea of (log-)Poisson regression is to assume that, instead of a constant frequency, the frequency depends on covariates,

N | X ~ Poisson(lambda(X)), where log lambda(X) = X' beta.

Here, let us consider only one explanatory variable, namely the age of the driver, so that

log lambda(age) = beta0 + beta1 * age.

We then have
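A hedged sketch of the age model behind the plots below (same hypothetical column names as before; the band construction mirrors the flat model):

```r
# Log-Poisson regression of claim counts on driver age, with exposure offset.
regglm1 <- glm(nbre ~ ageconducteur + offset(log(exposition)),
               family = poisson(link = "log"), data = sinistres)
a   <- 18:80                      # grid of ages (assumed)
pr  <- predict(regglm1,
               newdata = data.frame(ageconducteur = a, exposition = 1),
               type = "response", se.fit = TRUE)
yp0 <- pr$fit
yp1 <- pr$fit + 1.96 * pr$se.fit
yp2 <- pr$fit - 1.96 * pr$se.fit
```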
> plot(a,yp0,type="l",ylim=c(.03,.12))
> abline(v=40,col="grey")
> lines(a,yp1,lty=2)
> lines(a,yp2,lty=2)
> points(a[k],yp0[k],pch=3,lwd=3,col="red")
> segments(a[k],yp1[k],a[k],yp2[k],col="red",lwd=3)
The forecast of the annualized claim frequency for our 40-year-old driver is now 7.74% (slightly higher than our previous 7.28%),
> cat("Frequency =",yp0[k]," confidence interval",yp1[k],yp2[k])
Frequency = 0.07740574 confidence interval 0.08117512 0.07363636
Instead of plotting the expected frequency, we can plot the ratio of the age-specific premium to the flat, non-segmented premium.
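A minimal sketch of that ratio (hypothetical reconstruction, reusing the age grid `a` and predictions `yp0` from the age model, and the flat frequency 0.07279295 obtained earlier):

```r
ratio <- yp0 / 0.07279295     # age-specific premium over flat premium
plot(a, ratio, type = "l")
abline(h = 1, col = "blue")   # the horizontal reference line
```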
Above the horizontal blue line, the premium is higher than the non-segmented premium; below it, lower. Here, drivers younger than 44 will pay more, while drivers older than 44 will pay less.

In the introduction, we discussed the necessity of segmentation. Consider two companies, one that segments and one that charges a flat premium: older drivers will go to the first company (because insurance is cheaper there) and younger drivers will go to the second (again, because it is cheaper for them). The problem is that the second company implicitly counted on the older drivers to subsidize the risk. Since they are gone, its price is too low and the company loses money (if it does not go bankrupt outright). Therefore, companies must use segmentation techniques to survive. Now, the question is whether this exponential decay is the correct way for the premium to change with age. An alternative is to use nonparametric techniques to visualize the _true_ effect of age on claim frequency.
- Pure nonparametric model
The first model considers one premium per age: take the age of the driver as a _factor_ in the regression,
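A hedged sketch of this fully nonparametric model (one coefficient per observed age; same hypothetical column names as before):

```r
# One Poisson frequency per age: age enters as a factor, no intercept,
# so each coefficient is directly the log-frequency of that age.
regglm2 <- glm(nbre ~ 0 + as.factor(ageconducteur) + offset(log(exposition)),
               family = poisson(link = "log"), data = sinistres)
```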
> plot(a0,yp0,type="l",ylim=c(.03,.12))
> abline(v=40,col="grey")
Here, the prediction for our 40-year-old driver is slightly lower than before, but the confidence interval is much larger (because we focus on a very small class of the portfolio: drivers whose age is exactly 40),
Frequency = 0.06686658 confidence interval 0.08750205 0.0462311
Here, the classes are too small and the premium is too unstable: the premium decreases by about 20% from age 40 to 41, and then increases by more than 50% from age 41 to 42,
> diff(log(yp0[23:25]))
24 25
-0.2330241 0.5223478
No company could adopt such a pricing strategy and keep its policyholders. This _discontinuity_ of the premium is an important issue here.
- Using age classes

Another option is to consider age classes, from very young drivers to senior drivers.
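A plausible sketch of the banded model summarized below (`level1` being the vector of cut points; the 5-year bands are read off the output, the rest is a hypothetical reconstruction):

```r
level1 <- seq(15, 100, by = 5)   # cut points: (15,20], (20,25], ..., (95,100]
regglmc1 <- glm(nbre ~ cut(ageconducteur, level1) + offset(log(exposition)),
                family = poisson(link = "log"), data = sinistres)
summary(regglmc1)
```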
> summary(regglmc1)
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -1.6036 0.1741 -9.212 < 2e-16 ***
cut(ageconducteur, level1)(20,25] -0.4200 0.1948 -2.157 0.0310 *
cut(ageconducteur, level1)(25,30] -0.9378 0.1903 -4.927 8.33e-07 ***
cut(ageconducteur, level1)(30,35] -1.0030 0.1869 -5.367 8.02e-08 ***
cut(ageconducteur, level1)(35,40] -1.0779 0.1866 -5.776 7.65e-09 ***
cut(ageconducteur, level1)(40,45] -1.0264 0.1858 -5.526 3.28e-08 ***
cut(ageconducteur, level1)(45,50] -0.9978 0.1856 -5.377 7.58e-08 ***
cut(ageconducteur, level1)(50,55] -1.0137 0.1855 -5.464 4.65e-08 ***
cut(ageconducteur, level1)(55,60] -1.2036 0.1939 -6.207 5.40e-10 ***
cut(ageconducteur, level1)(60,65] -1.1411 0.2008 -5.684 1.31e-08 ***
cut(ageconducteur, level1)(65,70] -1.2114 0.2085 -5.811 6.22e-09 ***
cut(ageconducteur, level1)(70,75] -1.3285 0.2210 -6.012 1.83e-09 ***
cut(ageconducteur, level1)(75,80] -0.9814 0.2271 -4.321 1.55e-05 ***
cut(ageconducteur, level1)(80,85] -1.4782 0.3371 -4.385 1.16e-05 ***
cut(ageconducteur, level1)(85,90] -1.2120 0.5294 -2.289 0.0221 *
cut(ageconducteur, level1)(90,95] -0.9728 1.0150 -0.958 0.3379
cut(ageconducteur, level1)(95,100] -11.4694 144.2817 -0.079 0.9366
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
> lines(a,yp1,lty=2,type="s")
> lines(a,yp2,lty=2,type="s")
Here, we get the following predictions,
For our 40-year-old driver, the predicted frequency is now 6.84%,
Frequency = 0.0684573 confidence interval 0.07766717 0.05924742
We should try other cut points, to see whether the prediction is sensitive to the choice of classes. For our 40-year-old driver, we obtain the following values:
Frequency = 0.07050614 confidence interval 0.07980422 0.06120807
So here, we have not eliminated the _discontinuity_ problem. One idea is to use a _moving window_: if the goal is to predict the frequency at age 40, the interval should be centered on 40; for a 35-year-old driver, it should be centered on 35.
- Moving average

It is therefore natural to consider a _local_ regression, in which only drivers whose age is _close_ to 40 are used to predict for a 40-year-old driver. How _close_ is governed by a _bandwidth_: for example, drivers between 35 and 45 may be considered close to 40. In practice, we can either subset the data or, equivalently, use 0/1 weights in the regression,
> value=40
> h=5
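One way to implement the 0/1-weighted local fit described above (a hypothetical sketch; the post only shows `value` and `h`):

```r
value <- 40
h <- 5
# Keep only drivers within the bandwidth h of the target age: observations
# outside [value - h, value + h] get weight 0 and drop out of the fit.
regloc <- glm(nbre ~ ageconducteur + offset(log(exposition)),
              family = poisson(link = "log"), data = sinistres,
              weights = (abs(ageconducteur - value) <= h) * 1)
```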
To see what happens, let’s consider an animation in which the age of interest is changing,
Here, for our 40 year olds,
Frequency = 0.06913391 confidence interval 0.07535564 0.06291218
We obtain a curve that can be interpreted as a _local_ regression. But this weighting ignores the fact that 35 is not as close to 40 as 39 is, while 34 is treated as infinitely far from 40. Obviously, we can improve the technique by using a kernel function: the closer an age is to 40, the larger its weight.
> value=40
> h=5
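A hedged sketch with Gaussian kernel weights replacing the 0/1 window (hypothetical reconstruction; `dnorm` is one common kernel choice):

```r
value <- 40
h <- 5
# Smoothly downweight ages far from the target instead of cutting them off.
# Non-integer prior weights trigger a warning with family = poisson, but the
# point estimates are the intended kernel-weighted fit.
w <- dnorm(sinistres$ageconducteur, mean = value, sd = h)
regk <- glm(nbre ~ ageconducteur + offset(log(exposition)),
            family = poisson(link = "log"), data = sinistres,
            weights = w)
```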
Plotting this below,
Here, our prediction for 40 is
Frequency = 0.07040464 confidence interval 0.07981521 0.06099408
This is the idea behind _kernel regression_. However, as mentioned in the slides, other nonparametric techniques can be considered, such as spline functions.
- Smooth with spline
In R, using splines is simple (somewhat simpler than kernel smoothers),
> library(splines)
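A hedged sketch of a spline fit (cubic B-splines via `bs()`; the degrees of freedom are an assumption, since the post does not show the call):

```r
library(splines)
# Replace the linear age effect with a flexible B-spline basis in age.
regspline <- glm(nbre ~ bs(ageconducteur, df = 5) + offset(log(exposition)),
                 family = poisson(link = "log"), data = sinistres)
```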
Now the prediction for our 40 year old driver is
Frequency = 0.06928169 confidence interval 0.07397124 0.06459215
Note that this technique is different from another, _related model_: the so-called generalized additive model, or GAM.
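The GAM alternative can be sketched as follows (using the `mgcv` package, an assumption since the post does not name the package; `s()` is its smooth-term constructor):

```r
library(mgcv)
# Penalized smooth of age, with smoothness chosen by the fitting procedure
# rather than fixed in advance as with bs().
reggam <- gam(nbre ~ s(ageconducteur) + offset(log(exposition)),
              family = poisson(link = "log"), data = sinistres)
```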
This prediction is very close to the one obtained above (the main difference concerns very old drivers),
Frequency = 0.06912683 confidence interval 0.07501663 0.06323702
- Comparison of different models
In a sense, all of these models are valid. So perhaps we should compare them.

In the figure above, we can visualize the upper and lower prediction bounds of the nine models. The horizontal line is the prediction obtained when heterogeneity is not taken into account.