Bayesian Neural Networks: Bayes’ Theorem Applied to Deep Learning


The article was written by Amber Zhou, a Financial Analyst at I Know First.




  • Vanilla Deep Learning Method: Multilayer Perceptron (MLP)
  • Introduction to Bayes’ Theorem
  • Application of Bayes’ Theorem: Bayesian Inference
  • Combination: Bayesian Neural Networks
  • How I Know First Utilizes Bayesian Neural Networks for Forecasting

Vanilla Deep Learning Method

Deep learning has become a buzzword in recent years. In fact, it first attracted considerable attention and excitement under the name neural networks as early as the 1980s. However, owing to a lack of sufficient computing power and training examples, interest declined over the following decade. Now that computing power has grown dramatically and we have entered the era of big data, deep learning has seen a revival. Other major differences between now and then are architectural and algorithmic innovations, significantly improved software tools, and vastly increased industry investment.

As a starting point, we discuss a vanilla deep learning method called the multilayer perceptron (MLP), also known as a feed-forward neural network. The MLP serves as a basic tool for both classification and regression, two fundamental tasks of machine learning. In other words, the model can be used to predict either a discrete class label or a continuous quantity.

Let’s begin with a review of the simple linear model, which computes its prediction as the weighted sum of a series of input features. The process can be visualized as a graph of nodes and connecting lines.

Here, each node on the left represents an input feature, the connecting lines represent the learned coefficients, and the node on the right represents the output, which is a weighted sum of the inputs.
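The weighted sum described above can be sketched in a few lines of Python. This is a minimal illustration only: the function name `linear_model` and the numbers are made up for the example, and the bias term that linear models usually include is omitted for simplicity.

```python
def linear_model(features, weights):
    """Return the weighted sum of the input features."""
    if len(features) != len(weights):
        raise ValueError("features and weights must have the same length")
    return sum(w * x for w, x in zip(weights, features))


# Hypothetical example: three input features and three learned coefficients.
features = [1.0, 2.0, 3.0]
weights = [0.5, -1.0, 2.0]

prediction = linear_model(features, weights)
print(prediction)
```

Each product `w * x` corresponds to one connecting line in the graph, and the sum corresponds to the output node.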

Moving forward, an MLP can be thought of as repeating the weighted-sum computation multiple times to arrive at a final decision. More specifically, hidden units serve as intermediate processing steps on the way to the final result. While a model can contain multiple hidden layers, the simplest case is an MLP with a single hidden layer.

Mathematically, a simple combination of a series of weighted sums is equivalent to one single weighted sum. Therefore, besides the hidden layers, a nonlinear function, also known as an activation or squashing function, is applied to each hidden unit after its weighted sum is computed, which makes the model more powerful than a linear one. These functions can take many forms; two of the most widely used are the rectified linear unit (ReLU) and the hyperbolic tangent (tanh). The results of these nonlinear functions then serve as inputs to the next layer’s weighted sum. Adding them allows the neural network to learn much more complex functions.
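To make this concrete, here is a minimal sketch of a forward pass through an MLP with one hidden layer. The weights, the input and the helper names (`layer`, `mlp_forward`) are hypothetical and bias terms are left out. The last lines check the claim that, without an activation function, two stacked layers collapse into one weighted sum.

```python
import math


def relu(z):
    """Rectified linear unit: pass positive values, clip negatives to zero."""
    return max(0.0, z)


def layer(inputs, weights, activation=None):
    """One layer: a weighted sum per unit, then an optional nonlinearity.

    weights[j][i] is the coefficient from input i to unit j.
    """
    sums = [sum(w * x for w, x in zip(row, inputs)) for row in weights]
    if activation is None:
        return sums
    return [activation(s) for s in sums]


def mlp_forward(inputs, hidden_weights, output_weights, activation=math.tanh):
    """Hidden layer with a nonlinearity, followed by a linear output layer."""
    hidden = layer(inputs, hidden_weights, activation)
    return layer(hidden, output_weights)


# Hypothetical weights: 2 inputs -> 3 hidden units -> 1 output.
W_hidden = [[0.5, -0.5], [0.1, 0.4], [-0.3, 0.8]]
W_output = [[1.0, -1.0, 0.5]]
x = [1.0, 2.0]

y_tanh = mlp_forward(x, W_hidden, W_output, activation=math.tanh)
y_relu = mlp_forward(x, W_hidden, W_output, activation=relu)

# Without an activation, the two layers are equivalent to a single layer
# whose weights are the product of the two weight matrices.
y_stacked = layer(layer(x, W_hidden), W_output)
W_combined = [[sum(W_output[0][j] * W_hidden[j][i] for j in range(3))
               for i in range(2)]]
y_single = layer(x, W_combined)

print(y_tanh, y_relu, y_stacked, y_single)
```

Here `y_stacked` and `y_single` agree, while `y_relu` and `y_tanh` differ from them: the nonlinearity is what lets the extra layer add anything.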

With the extra layers and increased predictive power come many more weights (coefficients) to be learned: there is one set of weights between the input and the first hidden layer, one between each pair of adjacent hidden layers, and one between the last hidden layer and the final output.
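As a rough illustration of how quickly the number of weights grows, the sketch below counts the connection weights of a fully connected network from its layer sizes. The helper `count_weights` and the layer sizes are hypothetical, and bias terms are ignored.

```python
def count_weights(layer_sizes):
    """Count connection weights in a fully connected network (biases ignored).

    layer_sizes lists the number of units per layer, from input to output.
    """
    return sum(n_in * n_out for n_in, n_out in zip(layer_sizes, layer_sizes[1:]))


linear = count_weights([10, 1])              # 10 inputs -> 1 output
one_hidden = count_weights([10, 20, 1])      # one hidden layer of 20 units
two_hidden = count_weights([10, 20, 20, 1])  # two hidden layers of 20 units

print(linear, one_hidden, two_hidden)  # 10 220 620
```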

Introduction to Bayes’ Theorem

Leaving deep learning aside, let us dive into the world of statistics for a moment. One of the most basic rules in probability theory is Bayes’ Theorem (or Bayes’ Rule), which expresses the probability of an event given prior knowledge of related conditions. With H representing the hypothesis and E representing the evidence, the theorem states:

P(H | E) = P(E | H) P(H) / P(E)

In plain English, it is essentially a tool for updating existing beliefs or predictions in light of new evidence. While it is simply an application of the conditional probability formula, it is a powerful way to reason about a wide range of problems involving belief updates.

Naïve Bayes classifiers, which are built on this theorem, are widely used in modern machine learning. As a quick example, Bayes’ Theorem can be used to calculate the probability that an email is spam given the words it contains, P(H | E), from past knowledge: the probability that those words appear in any email, P(E), the probability that an email is spam, P(H), and the probability that those words appear in past spam emails, P(E | H). Most importantly, this can be a continuous learning process, updated and improved with every new email received.

In Bayesian terms, P(H), the prior, is the initial degree of belief in H. P(E | H), the likelihood, is the probability of observing the evidence if H is true. P(H | E), the posterior, is the degree of belief in H after accounting for the evidence.
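The spam example can be worked through numerically. The sketch below uses made-up probabilities purely for illustration, and the variable names are hypothetical. P(E) is obtained from the law of total probability, which requires one extra assumed quantity: the probability that the word appears in a non-spam email.

```python
def posterior(prior, likelihood, evidence):
    """Bayes' theorem: P(H | E) = P(E | H) * P(H) / P(E)."""
    return likelihood * prior / evidence


# Hypothetical numbers, chosen only for illustration.
p_spam = 0.2              # P(H): prior probability that an email is spam
p_word_given_spam = 0.6   # P(E | H): the word appears in a spam email
p_word_given_ham = 0.05   # P(E | not H): the word appears in a non-spam email

# P(E): probability that the word appears in any email (total probability).
p_word = p_word_given_spam * p_spam + p_word_given_ham * (1 - p_spam)

p_spam_given_word = posterior(p_spam, p_word_given_spam, p_word)
print(round(p_spam_given_word, 3))
```

With these numbers, seeing the word raises the belief that the email is spam from the prior of 0.2 to a posterior of 0.75.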

Application of Bayes’ Theorem: Bayesian Inference

One of the most significant applications of Bayes’ Theorem is Bayesian inference, which can make predictions more accurate and lets us quantify our confidence in them by incorporating what we already know about the results beforehand.

In more mathematical terms, Bayes’ formula can be restated in terms of distributions rather than single numbers. In this context, the initial beliefs are represented by the prior distribution and the final beliefs by the posterior distribution. With the symbols renamed accordingly, the model form of Bayes’ Theorem is:

P(θ | data) = P(data | θ) P(θ) / P(data)

Here θ (theta) represents the set of parameters that we are interested in, and data represents the set of observations that we have. Each P represents a distribution rather than a single probability value.

By folding in data that make some parameter values more likely than others, Bayesian inference is well positioned to deal with uncertainty while making all assumptions and prior beliefs explicit.
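One of the simplest cases in which the prior and posterior are full distributions is the Beta-Binomial model, sketched below. It is not taken from the discussion above; it is a standard textbook example chosen because the update has a closed form. The parameter θ is an unknown success probability, the prior is Beta(alpha, beta), and the observation counts are hypothetical.

```python
def update_beta(alpha, beta, successes, failures):
    """Posterior of a Beta(alpha, beta) prior after binomial observations."""
    return alpha + successes, beta + failures


def beta_mean(alpha, beta):
    """Mean of a Beta(alpha, beta) distribution."""
    return alpha / (alpha + beta)


def beta_variance(alpha, beta):
    """Variance of a Beta(alpha, beta) distribution."""
    total = alpha + beta
    return alpha * beta / (total ** 2 * (total + 1))


# Prior belief about theta: centered on 0.5 but quite uncertain.
prior = (2.0, 2.0)

# First batch of hypothetical data: 7 successes, 3 failures.
posterior_1 = update_beta(*prior, successes=7, failures=3)

# Second batch: yesterday's posterior serves as today's prior.
posterior_2 = update_beta(*posterior_1, successes=12, failures=8)

for name, params in [("prior", prior), ("after batch 1", posterior_1),
                     ("after batch 2", posterior_2)]:
    print(name, round(beta_mean(*params), 3), round(beta_variance(*params), 4))
```

The variance shrinks with each batch: the more data are folded in, the narrower the distribution over θ becomes.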

Combination: Bayesian Neural Networks

Now it’s time to bring together everything discussed above. Bayesian neural networks follow the probabilistic modeling approach, which has a long history and is undergoing a strong revival. Since most real-world problems have a particular structure, machine learning packages are better and more powerful when they are customized to the problem, with that structure embedded. The structure, or the mathematical model of the situation being predicted, is commonly expressed using probability theory, of which Bayes’ Theorem is one of the most famous and widely used results.

It is worth mentioning that a neural network is a framework for constructing flexible models, while Bayesian inference is not a model but a framework for inference and decision making. Combining the two concepts means treating the neural network model in a Bayesian manner. More specifically, it means performing Bayesian inference over every parameter, each treated as a random variable with a distribution, instead of simply fitting a single best value for each parameter.

In a neural network model, the parameters we care about are the large number of uncertain network weights between layers. With Bayesian inference, we can derive a probability distribution over the weights given the training data, rather than a single value for each weight. As a result, we obtain an entire distribution over the network output, which can improve predictions and lets us quantify our confidence in them. At the same time, issues of regularization, overfitting and model comparison that arise in plain neural networks can be addressed naturally. It is a natural marriage of the two frameworks.
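The idea of a distribution over the network output can be sketched by sampling. The example below does not perform Bayesian inference: it assumes a posterior over the weights is already available, here as independent Gaussians with made-up means and a shared standard deviation. A real Bayesian neural network would have to infer that posterior from training data, typically with an approximate method such as Markov chain Monte Carlo or variational inference. All names and numbers are hypothetical.

```python
import math
import random
import statistics


def forward(x, hidden_w, output_w):
    """One-hidden-layer MLP with tanh hidden units and a linear output."""
    hidden = [math.tanh(sum(w * xi for w, xi in zip(row, x))) for row in hidden_w]
    return sum(w * h for w, h in zip(output_w, hidden))


def sample_weights(means, std, rng):
    """Draw each weight independently from Normal(mean, std)."""
    return [[rng.gauss(m, std) for m in row] for row in means]


# Assumed (not inferred) posterior over the weights.
hidden_means = [[0.5, -0.5], [0.1, 0.4], [-0.3, 0.8]]
output_means = [[1.0, -1.0, 0.5]]
weight_std = 0.1

rng = random.Random(0)  # fixed seed so the run is repeatable
x = [1.0, 2.0]

# Each sampled set of weights is one plausible network; together the
# outputs approximate the predictive distribution for the input x.
outputs = []
for _ in range(2000):
    hidden_w = sample_weights(hidden_means, weight_std, rng)
    output_w = sample_weights(output_means, weight_std, rng)[0]
    outputs.append(forward(x, hidden_w, output_w))

predictive_mean = statistics.mean(outputs)
predictive_std = statistics.stdev(outputs)
print(f"prediction: {predictive_mean:.3f} +/- {predictive_std:.3f}")
```

The spread of `outputs` is what a network with fixed weights cannot provide: a measure of how uncertain the prediction is, given the uncertainty in the weights.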

How I Know First Utilizes Bayesian Neural Networks for Forecasting

From the founding of the company, I Know First has used a probabilistic approach to forecasting. In Bayesian terms, the historical data represent our prior knowledge of market behavior, with each variable (asset) having its own statistical properties, which vary with time. Each asset interacts with other assets in a complex way. Through machine learning, multiple competing models of such interactions are created. Each model has a certain view of the markets based on what it has learned from past history. Combining that view with the present values of the market gives a conditional distribution of future values. Together, these independent competing models create a “cloud” of predictions with measurable statistical parameters. The end result is a signal that is a weighted average of these predictions, weighted by each model’s past performance (~predictability). After each trading day, new data are added to the data set and learned, resulting in an updated view of the markets. Thus, each new forecast is based on the latest updated model of the market (the Bayesian posterior).
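The final step, a weighted average of competing predictions, can be illustrated generically. The sketch below is not I Know First’s actual algorithm, whose details are not given here. It only shows one simple way such a weighting could work, with a hypothetical function name, made-up forecasts and made-up performance scores.

```python
def combine_forecasts(predictions, past_scores):
    """Weighted average of model predictions, weighted by past performance."""
    if len(predictions) != len(past_scores):
        raise ValueError("one score is needed per prediction")
    total = sum(past_scores)
    if total <= 0:
        raise ValueError("scores must sum to a positive number")
    return sum(p * s for p, s in zip(predictions, past_scores)) / total


# Hypothetical forecasts of an asset's return from three competing models,
# and a hypothetical score for how well each has predicted in the past.
predictions = [0.02, -0.01, 0.03]
past_scores = [0.6, 0.1, 0.3]

signal = combine_forecasts(predictions, past_scores)
print(round(signal, 4))
```

Models that have predicted well in the past pull the signal toward their view, while a poorly performing model, such as the second one here, has little influence.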
