Probabilistic linear regression with uncertain weights


By Ruben Winastwan
Compiled by VK
Source: Towards Data Science

When you study data science and machine learning, linear regression is probably one of the first statistical methods you come across, and this is likely not the first time you have used it. In this article, therefore, I want to discuss probabilistic linear regression rather than the typical, deterministic kind.

But before that, let’s briefly review deterministic linear regression so that the main points of this article are easier to follow.

Linear regression is a basic statistical method used to model the linear relationship between one or more input variables (independent variables) and one or more output variables (dependent variables). With a single input variable, the model is

y = a + bx

where a is the intercept and b is the slope. x is the independent variable, and y is the dependent variable, the value we want to predict.

The values of a and b are optimized, for example with the gradient descent algorithm, which gives us the best-fitting regression line between the independent and dependent variables. With this regression line, we can predict the value of y for any input x. These are the steps for building a typical, deterministic linear regression model.
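As a rough illustration of the optimization step described above, the following sketch fits the intercept and slope by plain batch gradient descent on the mean squared error. The toy data, the `fit_line` helper, the learning rate, and the step count are all assumptions made for this example; they are not taken from the article’s dataset.

```python
# Toy data generated from y = 1.0 + 2.0 * x without noise (illustration only).
xs = [0.0, 0.25, 0.5, 0.75, 1.0]
ys = [1.0 + 2.0 * x for x in xs]


def fit_line(xs, ys, learning_rate=0.1, steps=5000):
    """Fit y = a + b * x by batch gradient descent on the mean squared error."""
    a, b = 0.0, 0.0
    n = len(xs)
    for _ in range(steps):
        # Residuals of the current line on every data point.
        residuals = [(a + b * x) - y for x, y in zip(xs, ys)]
        # Partial derivatives of the mean squared error with respect to a and b.
        grad_a = 2.0 / n * sum(residuals)
        grad_b = 2.0 / n * sum(r * x for r, x in zip(residuals, xs))
        a -= learning_rate * grad_a
        b -= learning_rate * grad_b
    return a, b


a, b = fit_line(xs, ys)
print(f"intercept a = {a:.3f}, slope b = {b:.3f}")
```

Because the toy data is noise-free, the fitted values should end up very close to the intercept and slope used to generate it.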

However, deterministic linear regression cannot fully describe the data. Why?

When we do a linear regression analysis, there are in fact two kinds of uncertainty:

  • Aleatoric uncertainty, which comes from the data itself.
  • Epistemic uncertainty, which comes from the regression model.

I will elaborate on both in this article. To take these uncertainties into account, we should use probabilistic linear regression instead of deterministic linear regression.

In this article, we will discuss probabilistic linear regression and how it differs from deterministic linear regression. We will first see how to build a deterministic linear regression model in TensorFlow, and then build a probabilistic linear regression model with TensorFlow Probability.

First, let’s start by loading the dataset we’ll use in this article.

Loading and preprocessing data

The dataset used in this article is the Auto MPG car dataset. As usual, we can load the data with pandas.

import pandas as pd

auto_data = pd.read_csv('auto-mpg.csv')

The following is a statistical summary of the data.

Next, we can use the following code to see the correlation between variables in the dataset.

import matplotlib.pyplot as plt
import seaborn as sns

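# numeric_only=True (pandas 1.5 or later) leaves any non-numeric columns out of the correlation.
corr_df = auto_data.corr(numeric_only=True)

sns.heatmap(corr_df, cmap="YlGnBu", annot=True)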

Looking at the correlations, we see a strong negative correlation between mpg and weight.

In this article, to keep things easy to visualize, I will do a simple linear regression analysis. The independent variable is the weight of the car and the dependent variable is the mpg of the car.

Now, let’s use scikit-learn to split the data into training data and test data. After splitting the data, we can scale the dependent and independent variables. This ensures that the two variables are in the same range, which also improves the convergence speed of the linear regression model.

from sklearn import preprocessing
from sklearn.model_selection import train_test_split

x = auto_data['weight']
y = auto_data['mpg']

x_train, x_test, y_train, y_test = train_test_split(x, y, test_size = 0.2, random_state=5)

# Use one scaler per variable, fitted on the training data only,
# and reuse it for the test data to avoid leaking test-set statistics.
x_scaler = preprocessing.MinMaxScaler()
y_scaler = preprocessing.MinMaxScaler()

x_train_minmax = x_scaler.fit_transform(x_train.values.reshape(-1, 1))
y_train_minmax = y_scaler.fit_transform(y_train.values.reshape(-1, 1))
x_test_minmax = x_scaler.transform(x_test.values.reshape(-1, 1))
y_test_minmax = y_scaler.transform(y_test.values.reshape(-1, 1))
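To make the scaling step concrete, here is a small pure-Python sketch of what min-max scaling computes: each value is mapped to (value - min) / (max - min). The weights below are made-up numbers, and the helper names are hypothetical. The sketch follows one common convention, which is to compute the minimum and maximum on the training data only and reuse them for the test data.

```python
def min_max_fit(values):
    """Return the (min, max) pair that defines the scaling."""
    return min(values), max(values)


def min_max_transform(values, lo, hi):
    """Map each value to (value - lo) / (hi - lo)."""
    return [(v - lo) / (hi - lo) for v in values]


# Hypothetical car weights, for illustration only.
train_weights = [1800.0, 2500.0, 3200.0, 4600.0]
test_weights = [2000.0, 5000.0]

lo, hi = min_max_fit(train_weights)          # statistics from training data only
train_scaled = min_max_transform(train_weights, lo, hi)
test_scaled = min_max_transform(test_weights, lo, hi)

print(train_scaled)
print(test_scaled)
```

Under this convention, a test value outside the training range lands outside [0, 1], which is expected and usually harmless.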

Visualizing the training data gives the following plot:

Fantastic! Next, let’s use TensorFlow to build our deterministic linear regression model.

Deterministic linear regression with TensorFlow

It is very easy to build a simple linear regression model with TensorFlow. All we need is a model with a single fully connected layer and no activation function. For the cost function, the mean squared error is the usual choice. Here I will use RMSprop as the optimizer and train the model for 100 epochs. We can build and train the model with the following lines of code.

from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense
from tensorflow.keras.optimizers import RMSprop
from tensorflow.keras.losses import MeanSquaredError

model = Sequential([
    Dense(units=1, input_shape=(1,))
])

model.compile(loss=MeanSquaredError(), optimizer=RMSprop(learning_rate=0.01))
history = model.fit(x_train_minmax, y_train_minmax, epochs=100, verbose=False)

After training the model, let’s plot the loss to check that it has converged.

def plot_loss(history):
  plt.plot(history.history['loss'], label='loss')
  plt.legend()

plot_loss(history)

The loss appears to have converged. If we now use the trained model to predict on the test set, we get the regression line below.

That’s it. We’re done!

As I mentioned earlier, it is very easy to build a simple linear regression model with TensorFlow. With the regression line, we can now approximate the mpg of a car for any given car weight. For example, let’s assume that the car’s weight after feature scaling is 0.64. By passing this value to the trained model, we get the corresponding mpg value of the car, as shown below.

Now you can see that the mpg predicted by the model is 0.21 (on the scaled range). Simply put, for any given car weight, we get a single, deterministic mpg value.
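Note that a value like 0.21 lives on the scaled range, not in miles per gallon. A minimal sketch of how such a value could be mapped back to the original units is shown below. The minimum and maximum used here are placeholders rather than the actual extremes of the article’s training split, and `inverse_min_max` is a hypothetical helper (with scikit-learn, the fitted scaler’s `inverse_transform` method plays the same role).

```python
def inverse_min_max(scaled, lo, hi):
    """Undo min-max scaling: map a value in scaled units back to original units."""
    return scaled * (hi - lo) + lo


# Placeholder extremes of the mpg column; replace with the real training values.
mpg_min, mpg_max = 9.0, 46.6

scaled_prediction = 0.21
mpg = inverse_min_max(scaled_prediction, mpg_min, mpg_max)
print(f"{scaled_prediction} on the scaled range is about {mpg:.1f} mpg")
```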

However, this output does not tell the whole story. We should pay attention to two things here. First, we have only a limited number of data points. Second, as we can see from the regression plot, most of the data points do not lie exactly on the regression line.

Although we get an output value of 0.21, we know that the car’s actual mpg is not exactly 0.21. It can be a little lower or a little higher. In other words, there is uncertainty that needs to be taken into account. This kind of uncertainty is called aleatoric uncertainty.

Deterministic linear regression cannot capture any uncertainty in the data. To capture this aleatoric uncertainty, we can use probabilistic linear regression instead.

Probabilistic linear regression with TensorFlow Probability

Thanks to TensorFlow Probability, it is very easy to build a probabilistic linear regression model. However, you need to install the tensorflow_probability library first. You can install it with pip, as follows:

pip install tensorflow_probability

This library requires TensorFlow version 2.3.0, so make sure to upgrade your version of TensorFlow before installing TensorFlow Probability.

Building a probabilistic linear regression model for aleatoric uncertainty

In this section, we will build a probabilistic linear regression model that takes aleatoric uncertainty into account.

This model is very similar to deterministic linear regression. However, instead of using only a single fully connected layer, we add another layer at the end. This last layer turns the final output from a deterministic value into a probability distribution.

In this case, the last layer converts the outputs of the fully connected layer into a normal distribution. Here is the implementation.

import tensorflow_probability as tfp
import tensorflow as tf
tfd = tfp.distributions
tfpl = tfp.layers

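model = Sequential([
  Dense(units=1+1, input_shape=(1,)),
  tfpl.DistributionLambda(
      # The scale argument was cut off in the source; softplus is assumed here
      # because it keeps the standard deviation positive.
      lambda t: tfd.Normal(loc=t[..., :1],
                           scale=tf.math.softplus(t[..., 1:]))),
])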

Note that we added an extra TensorFlow Probability layer at the end. This layer takes the two outputs of the preceding fully connected layer (one for the mean, one for the standard deviation) and turns them into a normal distribution with a trainable mean (loc) and standard deviation (scale).
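The idea of splitting the two outputs into a mean and a standard deviation can be sketched without TensorFlow. The example below assumes that the second output is passed through a softplus function so that the scale is always positive, as the posterior code later in this article does; `to_normal_params` and the input numbers are hypothetical.

```python
import math


def softplus(z):
    """Numerically stable log(1 + exp(z)); the result is always positive."""
    return max(z, 0.0) + math.log1p(math.exp(-abs(z)))


def to_normal_params(dense_output):
    """Turn the two raw outputs of the dense layer into (loc, scale)."""
    raw_loc, raw_scale = dense_output
    return raw_loc, softplus(raw_scale)


# Made-up raw outputs of the fully connected layer for one input.
loc, scale = to_normal_params((0.21, -3.0))
print(f"loc = {loc}, scale = {scale:.4f}")
```

Even though the raw second output is negative here, the resulting scale is a small positive number, which is what a normal distribution requires.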

We can use RMSprop as the optimizer again, but you can use another optimizer if you like. For the loss function, we need to use the negative log-likelihood.

But why do we use the negative log-likelihood as the loss function?

Negative log-likelihood as a cost function

To fit a distribution to some data, we use the likelihood function. With the likelihood function, we try to estimate unknown parameters from the given data (for example, the mean and standard deviation of normally distributed data).

In our probabilistic regression model, the optimizer’s job is to find the maximum likelihood estimate of the unknown parameters. In other words, we train the model to find the parameter values that make our data most likely.

Maximizing the likelihood is equivalent to minimizing the negative log-likelihood. In optimization, the goal is usually to minimize a cost rather than to maximize an objective, which is why we use the negative log-likelihood as the cost function.

Here is the implementation of the negative log-likelihood as our custom loss function.

def negative_log_likelihood(y_true, y_pred):
    return -y_pred.log_prob(y_true)
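To see why minimizing the negative log-likelihood recovers the maximum likelihood estimate, the sketch below evaluates the normal negative log-likelihood by hand on a few made-up values. For a normal distribution, the sample mean and the (biased) sample standard deviation are the maximum likelihood estimates, so any other parameter values should give a higher total. The data and helper names are assumptions for illustration.

```python
import math


def normal_nll(y, loc, scale):
    """Negative log-density of y under Normal(loc, scale)."""
    return 0.5 * math.log(2.0 * math.pi * scale ** 2) + (y - loc) ** 2 / (2.0 * scale ** 2)


def total_nll(data, loc, scale):
    """Sum of the negative log-likelihood over all data points."""
    return sum(normal_nll(y, loc, scale) for y in data)


# Hypothetical scaled mpg values observed for cars of similar weight.
data = [0.18, 0.22, 0.25, 0.19, 0.21]

# Closed-form maximum likelihood estimates for a normal distribution.
mle_loc = sum(data) / len(data)
mle_scale = math.sqrt(sum((y - mle_loc) ** 2 for y in data) / len(data))

best = total_nll(data, mle_loc, mle_scale)
worse_loc = total_nll(data, mle_loc + 0.05, mle_scale)     # shifted mean
worse_scale = total_nll(data, mle_loc, mle_scale * 2.0)    # inflated scale

print(f"NLL at the MLE:          {best:.3f}")
print(f"NLL with shifted mean:   {worse_loc:.3f}")
print(f"NLL with inflated scale: {worse_scale:.3f}")
```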

Training and prediction results of the probabilistic linear regression model with aleatoric uncertainty

Now that we have built the model and defined the optimizer and loss function, let’s compile and train the model.

model.compile(optimizer=RMSprop(learning_rate=0.01), loss=negative_log_likelihood)
history = model.fit(x_train_minmax, y_train_minmax, epochs=200, verbose=False)

Now we can draw samples from the trained model. The following code compares the test set with samples generated by the model.

y_model = model(x_test_minmax)
y_sample = y_model.sample()

plt.scatter(x_test_minmax, y_test_minmax, alpha=0.5, label='test data')
plt.scatter(x_test_minmax, y_sample, alpha=0.5, color='green', label='model sample')

As you can see from the visualization above, the model no longer returns a deterministic value for a given input. Instead, it returns a distribution, and we draw a sample from that distribution.

If you compare the test set data points (blue dots) with the samples drawn from the trained model (green dots), you could believe that the green dots and the blue dots come from the same distribution.

Next, we can also visualize the mean and standard deviation of the distribution produced by the trained model, given the inputs in the test set. We can do this with the following code.

y_mean = y_model.mean()
y_sd = y_model.stddev()
y_mean_m2sd = y_mean - 2 * y_sd
y_mean_p2sd = y_mean + 2 * y_sd

plt.scatter(x_test_minmax, y_test_minmax, alpha=0.4, label='data')
# Raw strings keep the LaTeX backslashes from being read as escape sequences.
plt.plot(x_test_minmax, y_mean, color='red', alpha=0.8, label=r'model $\mu$')
plt.plot(x_test_minmax, y_mean_m2sd, color='green', alpha=0.8, label=r'model $\mu \pm 2 \sigma$')
plt.plot(x_test_minmax, y_mean_p2sd, color='green', alpha=0.8)
plt.legend()

We can see that the probabilistic linear regression model gives us more than a regression line. It also gives an estimate of the standard deviation of the data. About 95% of the test set data points lie within two standard deviations of the mean.
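The two-standard-deviation claim can be checked numerically. The sketch below is an assumption-laden stand-in for the trained model: it draws synthetic targets from a normal distribution with a made-up mean and standard deviation, and then counts the fraction that falls inside the mean plus or minus two standard deviations. With real predictions, `y_mean` and `y_sd` would come from the model instead.

```python
import random


def coverage_within_2sd(y_true, y_mean, y_sd):
    """Fraction of targets within two standard deviations of the predicted mean."""
    inside = sum(
        1 for y, m, s in zip(y_true, y_mean, y_sd) if abs(y - m) <= 2.0 * s
    )
    return inside / len(y_true)


rng = random.Random(0)   # fixed seed so the sketch is repeatable
n = 10_000

# Made-up predictive mean and standard deviation, identical for every point.
y_mean = [0.5] * n
y_sd = [0.1] * n

# Synthetic targets drawn from exactly that predictive distribution.
y_true = [rng.gauss(m, s) for m, s in zip(y_mean, y_sd)]

coverage = coverage_within_2sd(y_true, y_mean, y_sd)
print(f"fraction within two standard deviations: {coverage:.3f}")
```

If the predicted distribution matches the data well, this fraction should be close to 95%; a much lower or higher value suggests that the predicted standard deviation is too small or too large.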

Building a probabilistic linear regression model for aleatoric and epistemic uncertainty

So far, we have built a probabilistic regression model that accounts for the uncertainty that comes from the data, which we call aleatoric uncertainty.

In reality, however, we also need to deal with the uncertainty that comes from the regression model itself. Because the data is imperfect, the regression weights, such as the slope, are uncertain as well. This kind of uncertainty is called epistemic uncertainty.

So far, our probabilistic model has used a single, fixed set of weights. As you can see from the visualization, the model generates only one regression line, which is usually not completely accurate.

In this section, we will improve the probabilistic regression model so that it accounts for both aleatoric and epistemic uncertainty. We can take a Bayesian view to introduce uncertainty over the regression weights.

First, before we see any data, we need to define our prior belief about the distribution of the weights. Usually, we have no idea what the weights should be, do we? For simplicity, we assume that the weights follow a normal distribution with a mean of 0 and a standard deviation of 1.

def prior(kernel_size, bias_size, dtype=None):
  n = kernel_size + bias_size
  return Sequential([
      tfpl.DistributionLambda(lambda t: tfd.Independent(
          tfd.Normal(loc=tf.zeros(n), scale=tf.ones(n))))
  ])

Because we hard-code the mean and standard deviation, this prior is not trainable.

Next, we need to define the posterior distribution of the regression weights. The posterior distribution shows how our beliefs change once we have seen the patterns in the data. The parameters of the posterior distribution are therefore trainable. Here is the code that defines the posterior distribution.

def posterior(kernel_size, bias_size, dtype=None):
  n = kernel_size + bias_size
  return Sequential([
      tfpl.VariableLayer(2 * n, dtype=dtype),
      tfpl.DistributionLambda(lambda t: tfd.Independent(
          tfd.Normal(loc=t[..., :n],
                     scale=tf.nn.softplus(t[..., n:]))))
  ])

Now the question is: what is the VariableLayer in this posterior function for? The idea behind this variable layer is that we try to approximate the true posterior distribution. In general, the true posterior distribution cannot be derived in closed form, so we need to approximate it.
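A rough picture of what the variable layer holds: it stores 2 * n trainable numbers, the first n of which act as means and the last n as raw scales that are made positive with softplus. The pure-Python sketch below mimics that split and draws two sets of weights; the raw values are made up, and `split_posterior_params` and `sample_weights` are hypothetical helpers, not TensorFlow Probability functions.

```python
import math
import random


def softplus(z):
    """Numerically stable log(1 + exp(z)); the result is always positive."""
    return max(z, 0.0) + math.log1p(math.exp(-abs(z)))


def split_posterior_params(raw, n):
    """Split 2 * n raw values into n means and n positive standard deviations."""
    if len(raw) != 2 * n:
        raise ValueError("expected exactly 2 * n raw values")
    locs = list(raw[:n])
    scales = [softplus(v) for v in raw[n:]]
    return locs, scales


def sample_weights(locs, scales, rng):
    """Draw one independent normal sample per weight."""
    return [rng.gauss(m, s) for m, s in zip(locs, scales)]


# One input feature and two output units give 2 kernel weights and 2 biases.
kernel_size, bias_size = 2, 2
n = kernel_size + bias_size

# Made-up values standing in for the trained contents of the variable layer.
raw = [0.5, -0.2, 0.1, 0.0, -3.0, -3.0, -2.0, -2.0]
locs, scales = split_posterior_params(raw, n)

rng = random.Random(0)
draw_1 = sample_weights(locs, scales, rng)
draw_2 = sample_weights(locs, scales, rng)
print(draw_1)
print(draw_2)
```

Each draw gives a slightly different set of weights, which is why a model with uncertain weights produces a different regression line on every forward pass.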

After defining the prior and posterior functions, we can build a probabilistic linear regression model with weight uncertainty. Here is the code implementation.

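model = Sequential([
    tfpl.DenseVariational(units=1 + 1,
                          make_prior_fn=prior,
                          make_posterior_fn=posterior,
                          kl_weight=1 / x_train.shape[0]),  # assumed: this argument was cut off in the source
    tfpl.DistributionLambda(
        lambda t: tfd.Normal(loc=t[..., :1],
                             scale=tf.math.softplus(t[..., 1:]))),  # assumed: scale was cut off in the source
])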

As you may notice, the only difference between this model and the previous probabilistic regression model is the first layer. We use a DenseVariational layer instead of an ordinary fully connected (Dense) layer, and we pass the prior and posterior functions to it as arguments. The second layer is exactly the same as in the previous model.

Training and prediction results of the probabilistic linear regression model with aleatoric and epistemic uncertainty

Now it’s time to compile and train the model.

The optimizer and cost function remain the same as in the previous model: RMSprop as the optimizer and the negative log-likelihood as the cost function. Let’s compile and train.

model.compile(optimizer=RMSprop(learning_rate=0.01), loss=negative_log_likelihood)
history = model.fit(x_train_minmax, y_train_minmax, epochs=500, verbose=False)

It’s time to visualize the uncertainty in the weight, or slope, of the regression model. The following code produces the visualization.

plt.scatter(x_test_minmax, y_test_minmax, marker='.', alpha=0.8, label='data')
for i in range(10):
    y_model = model(x_test_minmax)
    y_mean = y_model.mean()
    y_mean_m2sd = y_mean - 2 * y_model.stddev()
    y_mean_p2sd = y_mean + 2 * y_model.stddev()
    if i == 0:
        # Label the lines only once so the legend has no duplicate entries.
        plt.plot(x_test_minmax, y_mean, color='red', alpha=0.8, label=r'model $\mu$')
        plt.plot(x_test_minmax, y_mean_m2sd, color='green', alpha=0.8, label=r'model $\mu \pm 2 \sigma$')
        plt.plot(x_test_minmax, y_mean_p2sd, color='green', alpha=0.8)
    else:
        plt.plot(x_test_minmax, y_mean, color='red', alpha=0.8)
        plt.plot(x_test_minmax, y_mean_m2sd, color='green', alpha=0.8)
        plt.plot(x_test_minmax, y_mean_p2sd, color='green', alpha=0.8)

In the visualization above, you can see that the line (mean) and standard deviation produced by the trained model differ from one iteration to the next, because each forward pass draws new weights from the posterior distribution. All of these lines are reasonable fits to the data points in the test set, but because of epistemic uncertainty, we do not know which line is the best.

Generally, the more data points we have, the less uncertainty we see in the regression line.
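A simplified way to see this effect is the textbook conjugate update for the mean of a normal distribution with known noise. This is a swapped-in analogy, not the variational procedure used by the model in this article, and the prior and noise standard deviations are made-up numbers. The posterior standard deviation of the mean is 1 / sqrt(1 / prior_std^2 + n / noise_std^2), which shrinks as the number of data points n grows.

```python
import math


def posterior_std(n_points, prior_std=1.0, noise_std=0.1):
    """Posterior standard deviation of a normal mean after n_points observations.

    Assumes a Normal(0, prior_std) prior on the mean and observations with
    known noise standard deviation noise_std (standard conjugate result).
    """
    precision = 1.0 / prior_std ** 2 + n_points / noise_std ** 2
    return 1.0 / math.sqrt(precision)


sizes = (1, 10, 100, 1000)
stds = [posterior_std(n) for n in sizes]
for n, s in zip(sizes, stds):
    print(f"n = {n:4d}  posterior std = {s:.5f}")
```

With no data at all, the posterior is just the prior; every additional observation adds precision, so the remaining uncertainty about the parameter keeps shrinking.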


Now you have seen the difference between probabilistic and deterministic linear regression. Probabilistic linear regression can account for two kinds of uncertainty: the uncertainty that comes from the data (aleatoric) and the uncertainty that comes from the regression model (epistemic).

If we build a deep learning model whose inaccurate predictions could have very serious negative consequences, for example in autonomous driving or medical diagnosis, it is very important to take these uncertainties into account.

Generally, the more data points we have, the lower the epistemic uncertainty of the model.
