# Nonparametric models in R for premium rating: local regression, generalized additive models (GAM), and spline regression

Time: 2022-05-31

• ### Premium without segmentation

The price should be related to the pure premium, which is proportional to the expected claim frequency (the expected annual cost being the expected frequency times the expected claim severity).

With no covariates, the expected frequency is constant over the whole portfolio, and can be estimated with an intercept-only Poisson regression,

```
Deviance Residuals:
     Min       1Q   Median       3Q      Max
 -0.5033  -0.3719  -0.2588  -0.1376  13.2700

Coefficients:
            Estimate Std. Error z value Pr(>|z|)
(Intercept)  -2.6201     0.0228  -114.9   <2e-16 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

(Dispersion parameter for poisson family taken to be 1)

    Null deviance: 12680  on 49999  degrees of freedom
Residual deviance: 12680  on 49999  degrees of freedom
AIC: 16353

Number of Fisher Scoring iterations: 6

> exp(coefficients(regglm0))
(Intercept)
 0.07279295
```
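The output above presumably comes from an intercept-only Poisson fit with the exposure as an offset; a minimal sketch (the data frame `db` and its columns `nbre`, the claim count, and `exposition`, the exposure, are assumed names, not shown in the output):

```r
# Intercept-only Poisson regression: one frequency for the whole portfolio
# (assumes a data frame `db` with claim counts `nbre` and exposures `exposition`)
regglm0 <- glm(nbre ~ 1, offset = log(exposition),
               family = poisson(link = "log"), data = db)
summary(regglm0)
# the exponential of the intercept is the portfolio-wide annual claim frequency
exp(coefficients(regglm0))
```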

Therefore, if we do not want to take the potential heterogeneity into account, we usually express this estimate as a percentage, i.e. an annual claim probability,

namely exp(−2.6201) ≈ 7.28% per year which, the frequency being small, can roughly be interpreted as the probability of filing a claim in a given year. Let’s visualize it as a function of the driver’s age,
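The vectors plotted below are not constructed in the code shown; a sketch of how they could be built with `predict()` (the age grid `a` and the ±2 standard-error band are assumptions):

```r
# Predictions (with a Gaussian confidence band) over a grid of driver ages;
# the model has no covariate, so the curve is flat
a <- 18:100                               # assumed age grid
p <- predict(regglm0,
             newdata = data.frame(exposition = 1, ageconducteur = a),
             type = "response", se.fit = TRUE)
yp0 <- p$fit                              # point prediction
yp1 <- p$fit + 2 * p$se.fit               # upper bound
yp2 <- p$fit - 2 * p$se.fit               # lower bound
k <- which(a == 40)                       # index of the 40-year-old driver
```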

```
> plot(a,yp0,type="l",ylim=c(.03,.12))
> segments(a[k],yp1[k],a[k],yp2[k],col="red",lwd=3)
```

We predict the same frequency for all drivers, for example for a 40-year-old,

```
> cat("Frequency =",yp0[k]," confidence interval",yp1[k],yp2[k])
Frequency = 0.07279295  confidence interval 0.07611196 0.06947393
```

Now let’s consider a situation where we try to consider heterogeneity, for example, by age,

• Poisson regression

The idea of the (log-)Poisson regression is to assume that the expected frequency is no longer constant but a function of covariates, of the form λ(x) = exp(β₀ + β₁x).

Here, let us consider only one explanatory variable, namely the age of the driver.
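The fit itself is not shown; presumably something along these lines (same assumed data frame `db` and column names as before):

```r
# (log-)Poisson regression of the claim count on driver age,
# with the exposure as an offset
regglm1 <- glm(nbre ~ ageconducteur, offset = log(exposition),
               family = poisson(link = "log"), data = db)
p <- predict(regglm1,
             newdata = data.frame(ageconducteur = a, exposition = 1),
             type = "response", se.fit = TRUE)
yp0 <- p$fit
yp1 <- p$fit + 2 * p$se.fit
yp2 <- p$fit - 2 * p$se.fit
```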

We have

```
> plot(a,yp0,type="l",ylim=c(.03,.12))
> abline(v=40,col="grey")
> lines(a,yp1,lty=2)
> lines(a,yp2,lty=2)
> points(a[k],yp0[k],pch=3,lwd=3,col="red")
> segments(a[k],yp1[k],a[k],yp2[k],col="red",lwd=3)
```

The predicted annualized claim frequency for our 40-year-old drivers is now 7.74% (slightly higher than the previous 7.28%),

```
> cat("Frequency =",yp0[k]," confidence interval",yp1[k],yp2[k])
Frequency = 0.07740574  confidence interval 0.08117512 0.07363636
```

Instead of plotting the expected frequency itself, we can plot its ratio to the non-segmented frequency,
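As a sketch, that ratio can be obtained by dividing the age-based frequencies by the flat one:

```r
# Ratio of the age-based frequency to the flat (non-segmented) frequency;
# above 1, the segmented premium exceeds the non-segmented one
flat <- exp(coefficients(regglm0))
plot(a, yp0 / flat, type = "l")
abline(h = 1, col = "blue")   # the horizontal blue reference line
```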

Above the horizontal blue line, the premium is higher than the non-segmented premium; below it, lower. Here, drivers younger than 44 would pay more, and drivers older than 44 would pay less. In the introduction we discussed the necessity of segmentation. Consider two companies, one that segments and one with a flat rate: older drivers will go to the first company (because insurance is cheaper there) and younger drivers to the second (again, because it is cheaper). The problem is that the second company implicitly relies on the older drivers to subsidize that risk; since they are no longer in its portfolio, its price will be too cheap, and the company will burn through its capital (if it does not go bankrupt). So companies must use segmentation techniques to survive. Now, the question is whether this exponential decay of the premium with age is really the right pattern. An alternative is to use nonparametric techniques to visualize the *true* effect of age on claim frequency.

• Pure nonparametric model

A first model could consider one premium per age: treat the driver’s age as a *factor* in the regression,
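A hedged sketch of such a fit (same assumed data frame and column names):

```r
# One parameter per observed age: driver age as a categorical regressor
regglmF <- glm(nbre ~ as.factor(ageconducteur), offset = log(exposition),
               family = poisson, data = db)
a0 <- sort(unique(db$ageconducteur))      # ages actually observed
p  <- predict(regglmF,
              newdata = data.frame(ageconducteur = a0, exposition = 1),
              type = "response", se.fit = TRUE)
yp0 <- p$fit
```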

```
> plot(a0,yp0,type="l",ylim=c(.03,.12))
> abline(v=40,col="grey")
```

Here, the prediction for our 40-year-old driver is slightly lower than before, but the confidence interval is much wider (because we focus on a very small class of the portfolio: drivers whose age is *exactly* 40),

```
Frequency = 0.06686658  confidence interval 0.08750205 0.0462311
```

Here, the classes are too small and the premium is too unstable: the premium decreases by about 20% from age 40 to 41, then increases by about 50% from age 41 to 42.

```
> diff(log(yp0[23:25]))
        24         25
-0.2330241  0.5223478
```

No insurance company could reasonably adopt such a strategy to price its policyholders. This *discontinuity* of the premium is an important issue here.

• Using age classes

Another option is to consider the age range, from very young drivers to senior drivers.
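The model `regglmc1` summarized below is presumably built with `cut()`; a sketch (the cut points in `level1` are an assumption consistent with the bands in the output):

```r
# Poisson regression on 5-year age bands, built with cut()
level1   <- seq(15, 100, by = 5)
regglmc1 <- glm(nbre ~ cut(ageconducteur, level1),
                offset = log(exposition), family = poisson, data = db)
```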

```
> summary(regglmc1)

Coefficients:
                                   Estimate Std. Error z value Pr(>|z|)
(Intercept)                         -1.6036     0.1741  -9.212  < 2e-16 ***
cut(ageconducteur, level1)(20,25]   -0.4200     0.1948  -2.157   0.0310 *
cut(ageconducteur, level1)(25,30]   -0.9378     0.1903  -4.927 8.33e-07 ***
cut(ageconducteur, level1)(30,35]   -1.0030     0.1869  -5.367 8.02e-08 ***
cut(ageconducteur, level1)(35,40]   -1.0779     0.1866  -5.776 7.65e-09 ***
cut(ageconducteur, level1)(40,45]   -1.0264     0.1858  -5.526 3.28e-08 ***
cut(ageconducteur, level1)(45,50]   -0.9978     0.1856  -5.377 7.58e-08 ***
cut(ageconducteur, level1)(50,55]   -1.0137     0.1855  -5.464 4.65e-08 ***
cut(ageconducteur, level1)(55,60]   -1.2036     0.1939  -6.207 5.40e-10 ***
cut(ageconducteur, level1)(60,65]   -1.1411     0.2008  -5.684 1.31e-08 ***
cut(ageconducteur, level1)(65,70]   -1.2114     0.2085  -5.811 6.22e-09 ***
cut(ageconducteur, level1)(70,75]   -1.3285     0.2210  -6.012 1.83e-09 ***
cut(ageconducteur, level1)(75,80]   -0.9814     0.2271  -4.321 1.55e-05 ***
cut(ageconducteur, level1)(80,85]   -1.4782     0.3371  -4.385 1.16e-05 ***
cut(ageconducteur, level1)(85,90]   -1.2120     0.5294  -2.289   0.0221 *
cut(ageconducteur, level1)(90,95]   -0.9728     1.0150  -0.958   0.3379
cut(ageconducteur, level1)(95,100] -11.4694   144.2817  -0.079   0.9366
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

> lines(a,yp1,lty=2,type="s")
> lines(a,yp2,lty=2,type="s")
```

Here, we get the following predictions,

For our 40-year-old drivers, the predicted frequency is now 6.84%.

```
Frequency = 0.0684573  confidence interval 0.07766717 0.05924742
```

We should try other cut-points, to see whether the prediction is sensitive to that choice,

For our 40 year old driver, the following values are obtained:

```
Frequency = 0.07050614  confidence interval 0.07980422 0.06120807
```

So the *discontinuity* problem has not been eliminated. One idea is to use a *moving window*: if the goal is to predict the frequency of 40-year-olds, the interval should be centered at 40; for 35-year-old drivers, the interval should be centered at 35.

• Moving average

It is therefore natural to consider some *local* regression, where only drivers whose age is *close* to 40 are taken into account. This notion of *closeness* is related to the *bandwidth*. For example, drivers between 35 and 45 can be considered close to 40. In practice, we can either take a subset, or use weights in the regression,

```
> value=40
> h=5
```
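With these two values, a *local* fit with a uniform window can be sketched by restricting the regression to drivers within `h` years of `value` (assumed data frame `db` as before):

```r
# Local Poisson regression with a uniform window: only drivers whose age
# is within h years of `value` enter the fit
regloc <- glm(nbre ~ ageconducteur, offset = log(exposition),
              family = poisson, data = db,
              subset = abs(ageconducteur - value) <= h)
```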

To see what happens, let’s consider an animation in which the age of interest is changing,

Here, for our 40 year olds,

```
Frequency = 0.06913391  confidence interval 0.07535564 0.06291218
```

We obtain a curve that can be interpreted as a *local* regression. But here we do not take into account that 35 is not as close to 40 as 39 is, while 34 is treated as infinitely far from 40. Clearly, we can improve on this technique: we can consider kernel functions, i.e. the closer an age is to 40, the greater its weight.

```
> value=40
> h=5
```
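A sketch of a kernel-weighted fit (the Gaussian kernel is an assumption; the weights act as prior likelihood weights in the regression):

```r
# Kernel-weighted Poisson regression: the closer the age is to `value`,
# the larger the weight (Gaussian kernel with bandwidth h, assumed)
w <- dnorm(db$ageconducteur, mean = value, sd = h)
regker <- glm(nbre ~ ageconducteur, offset = log(exposition),
              family = poisson, data = db, weights = w)
```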

The resulting plot is shown below.

Here, our prediction for 40 is

```
Frequency = 0.07040464  confidence interval 0.07981521 0.06099408
```

This is the idea behind *kernel regression* techniques. However, as mentioned in the slides, other nonparametric techniques can be considered, such as spline functions.

• Smoothing with splines

In R, using splines is simple (somewhat simpler than kernel smoothers),

```
> library(splines)
```
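A hedged sketch of a spline fit inside the Poisson regression (the cubic B-spline basis from the `splines` package is an assumed choice):

```r
# Poisson regression with a cubic B-spline basis in driver age
library(splines)
regbs <- glm(nbre ~ bs(ageconducteur), offset = log(exposition),
             family = poisson, data = db)
```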

Now the prediction for our 40 year old driver is

```
Frequency = 0.06928169  confidence interval 0.07397124 0.06459215
```

Note that this technique is related to another *model*, the so-called generalized additive model, or GAM.
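A GAM version can be sketched with the `mgcv` package (a sketch under the same data assumptions; the penalized smoother `s()` replaces the parametric age term):

```r
# Generalized additive model: a penalized smooth effect of driver age
library(mgcv)
reggam <- gam(nbre ~ s(ageconducteur), offset = log(exposition),
              family = poisson, data = db)
```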

This prediction is very close to the one obtained above (the main difference being for very old drivers),

```
Frequency = 0.06912683  confidence interval 0.07501663 0.06323702
```

• Comparison of different models

One way or another, all these models are valid. So perhaps we should compare them,

In the figure above, we can visualize the upper and lower prediction bounds of the nine models. The horizontal line is the prediction obtained without taking heterogeneity into account.


Experiment 1  vlanCreation and division of 1、 Experiment purpose: 1. Understand the working principle of VLAN; 2. Learn the method of dividing VLANs based on ports; 3. Understand the communication between the same VLANs across switches; 4. Further learn the configuration commands of switch ports. 2、 Experimental principle: VLAN (virtual local area network), that is, […]