KL Divergence Intuition

Is this picture a dog or a panda?

Ashley <3
6 min read · Oct 18, 2023

Is my dog a panda according to my computer?

my doggie < 3

This is my dog Lucy, isn’t she adorable? I can look at this photo and say, that’s a dog! It may be more difficult for my computer to detect what type of animal she is, though. Lucy also sort of looks like a panda…

panda image from google :o

Which might confuse the computer. Let’s say we have a neural network and it is given images of my dog and of pandas. I want to see how accurately my computer can say “hey, that’s a dog!” or “hey, that’s a panda!”. This is where KL divergence comes into play.

What is KL divergence?

KL divergence calculates the difference between two different probability distributions. The first probability distribution is a “true” distribution, and the second probability distribution is an approximation of the “true” distribution. Some people describe KL divergence as the “distance” between two probability distributions, but it is important to remember that this is not strictly true. Let D_KL represent the KL divergence function: D_KL(P || Q) is NOT always the same as D_KL(Q || P), so KL divergence is not symmetric, and therefore it cannot be a distance.

It is defined as:

D_KL(P || Q) = Σₓ P(x) · log( P(x) / Q(x) )

The criterion for KL divergence is that we must have two distributions, P(x) and Q(x), over the same random variable. This means P and Q are defined on the same sample space of possible outcomes for some random variable x.
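As a quick sketch of this idea, here is how KL divergence between two discrete distributions could be computed in Python (this is my own illustration; it uses base-2 logs, to match the worked example later in this post):

```python
import math

def kl_divergence(p, q):
    """D_KL(P || Q) for two discrete distributions over the same
    outcomes, given as equal-length lists of probabilities."""
    total = 0.0
    for p_x, q_x in zip(p, q):
        if p_x > 0:  # convention: 0 * log(0 / q) = 0
            total += p_x * math.log2(p_x / q_x)
    return total

# KL divergence is not symmetric: swapping P and Q changes the answer.
P = [0.8, 0.2]
Q = [0.5, 0.5]
print(kl_divergence(P, Q))  # ≈ 0.278
print(kl_divergence(Q, P))  # ≈ 0.322
```

The two print statements give different values, which is exactly why KL divergence cannot be called a distance.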

In the context of determining whether my dog is a dog or a panda, we have a biased point of view (the neural network) and the reality (the actual truth of what an image represents). Let P(x) represent the probability distribution of what an image actually is and Q(x) represent the probability distribution of what the neural network thinks the image is. The θ subscript on P(x) just means we have perfect parameters; the w subscript on Q(x) represents the weights of the neural network.

Log Likelihood

Let’s quickly talk about what log likelihood means, since this is significant to understanding the KL divergence formula.

Breaking down log likelihood (literally into the two different words), what is likelihood?

Likelihood is a measure of how well a particular model explains the data we have observed. If I show my neural network a picture of my dog, I want it to fit the image to a label of a dog and NOT a panda.

Let L(θ | x) represent the likelihood of the model parameters θ given observed data x.

θ: These are the parameters of the model. Think of them as the knobs you can adjust to change the behavior of your model.

x: This represents the data you have observed or collected at some specific point in time.

L(θ∣x): This is the likelihood. It tells you how well the model, with parameters θ, fits the observed data x.

Typically we would be given multiple data points, so L(θ | x) actually looks like L(θ | x1, x2, x3…). We also assume that the data points are all independent of each other. Due to independence, p(x1, x2, x3… | θ) is just the product of the probabilities of the individual data points: p(x1 | θ) · p(x2 | θ) · p(x3 | θ) … The same property applies to the likelihood of the model, so L(θ | x1, x2, x3…) is equivalent to L(θ | x1) · L(θ | x2) · L(θ | x3)…
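To make the independence property concrete, here is a toy sketch with coin flips (the Bernoulli coin model and the observed flips are made up purely for illustration):

```python
def likelihood(theta, flips):
    """L(theta | x1, x2, ..., xn): the likelihood of bias parameter
    theta for a coin, given independent flips (1 = heads, 0 = tails)."""
    L = 1.0
    for x in flips:
        L *= theta if x == 1 else (1 - theta)  # multiply in p(x_i | theta)
    return L

flips = [1, 1, 0, 1]  # observed data: three heads, one tail
print(likelihood(0.75, flips))  # 0.75^3 * 0.25 ≈ 0.105
print(likelihood(0.50, flips))  # 0.5^4 = 0.0625
```

The likelihood of the whole dataset is just the product of the per-point likelihoods, exactly as in L(θ | x1) · L(θ | x2) · L(θ | x3)…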

Why do we take the log of the likelihood?

That just looks like → log(L(θ | x)).

We use the log to obtain some special properties. Instead of multiplying all the likelihoods, we can simply take the sum of their logs. When multiplying likelihoods, there is a large possibility of arithmetic underflow occurring. This is when a number gets so small that it cannot be represented properly by the computer.

Log is also monotonically increasing, which is an important fact. It allows us to maximize the likelihood, meaning we can find the parameters of a model that make the observed data as “likely” as possible. If one value is greater than another, its logarithm will also be greater, so the value of the parameters that maximizes the likelihood will also maximize the log-likelihood, and vice versa. Maximizing the log-likelihood is equivalent to maximizing the likelihood itself.
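Here is a small sketch of the underflow problem and how summing logs avoids it (the per-point likelihood of 0.01 is an arbitrary made-up value):

```python
import math

# 1000 data points, each with a (made-up) likelihood of 0.01.
likelihoods = [0.01] * 1000

# Multiplying them directly underflows: 0.01^1000 = 1e-2000 is far below
# the smallest positive double (~5e-324), so the product collapses to 0.0.
product = 1.0
for L in likelihoods:
    product *= L
print(product)  # 0.0

# Summing the logs keeps the same information without underflow.
log_likelihood = sum(math.log(L) for L in likelihoods)
print(log_likelihood)  # 1000 * ln(0.01) ≈ -4605.17
```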

Putting it all together

Let’s look at the definition again:

D_KL(P || Q) = Σₓ P(x) · log( P(x) / Q(x) )

In probability, we have two different types of sample spaces: discrete sample spaces and continuous sample spaces. A discrete space has a finite number of outcomes; for instance, with a die you can only roll the values 1, 2, 3, 4, 5, 6. A continuous space has outcomes within an interval, for example the amount of rainfall in a forest in a year. You could have 0 mm of rain or 500 mm of rain; the set of possible outcomes lies in the range [0, ∞).

I mention the difference between these sample spaces because you calculate the expectation differently based on whether the space is discrete or continuous. The above formula shows KL divergence for discrete sample spaces, since we are taking a summation. If we wanted to calculate KL divergence for a continuous sample space, we would instead take the integral over (-∞, +∞). Either way, KL divergence is just the expected value, under P, of the log of the ratio P(x)/Q(x).
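For the continuous case, the sum becomes an integral. As a sketch, here is a numerical approximation of the KL divergence between two Gaussians (the specific choice of N(0, 1) and N(1, 1) is my own example; natural log is used here, so the result is in nats):

```python
import math

def normal_pdf(x, mu, sigma):
    """Density of the normal distribution N(mu, sigma^2) at x."""
    return math.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * math.sqrt(2 * math.pi))

# Riemann-sum approximation of the integral of p(x) * ln(p(x) / q(x)),
# with P = N(0, 1) and Q = N(1, 1), over a range wide enough that the
# tails contribute essentially nothing.
dx = 0.001
kl = 0.0
for i in range(-10000, 10000):
    x = i * dx
    p = normal_pdf(x, 0, 1)
    q = normal_pdf(x, 1, 1)
    kl += p * math.log(p / q) * dx

print(kl)  # ≈ 0.5
```

For two Gaussians with equal variance σ² = 1, the integral has the known closed form (μ₁ − μ₂)² / 2 = 0.5 nats, and the numerical approximation lands right on it.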

Computing KL Divergence

This is another picture of my silly doggo. Let’s say we feed this picture to a neural network and ask if it is a dog or a panda.

Our truth distribution is that:

P(dog) = 1

P(panda) = 0

Let’s say our neural network is just really bad and it estimates:

Q(dog) = 0.3

Q(panda) = 0.7

Now let’s do the math (using log base 2):

D_KL(P || Q) = P(dog) · log₂( P(dog) / Q(dog) ) + P(panda) · log₂( P(panda) / Q(panda) )
= 1 · log₂(1 / 0.3) + 0 · log₂(0 / 0.7)
≈ 1.73

Note that log₂(0 / 0.7) = log₂(0) is undefined (it heads to −∞), but since that term is multiplied by P(panda) = 0, we use the standard convention 0 · log(0) = 0 and the term drops out. This 1.73 isn’t a great result since it’s not very close to 0.

Let’s say we now have a better neural network and it estimates the following:

Q(dog) = 0.9

Q(panda) = 0.1

Which gives us:

D_KL(P || Q) = 1 · log₂(1 / 0.9) + 0 · log₂(0 / 0.1) ≈ 0.15

This is a way better result, since 0.15 is much closer to 0 than 1.73.

Note that if P = Q, the KL divergence would be 0, meaning that our neural network’s estimation was perfect!
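The whole worked example can be reproduced in a few lines of Python (a sketch using the same numbers as above, with base-2 logs):

```python
import math

def kl_divergence(p, q):
    # D_KL(P || Q) with base-2 logs; terms with P(x) = 0 contribute 0.
    return sum(p_x * math.log2(p_x / q_x) for p_x, q_x in zip(p, q) if p_x > 0)

truth = [1.0, 0.0]     # [P(dog), P(panda)]
bad_net = [0.3, 0.7]   # the really bad network's estimates
good_net = [0.9, 0.1]  # the better network's estimates
perfect = [1.0, 0.0]   # a network that matches the truth exactly

print(kl_divergence(truth, bad_net))   # ≈ 1.737
print(kl_divergence(truth, good_net))  # ≈ 0.152
print(kl_divergence(truth, perfect))   # 0.0
```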

A quick note✨

Hi, I’m Ashley! I’m a sophomore at Carnegie Mellon University studying computer science and I LOVE writing about things that interest me.

My email is ashleycinquires@gmail.com, and if you have any questions about anything, I’ll try to answer them. I do appreciate all the positive messages I’ve been sent and I’m trying to be more active via email. I love talking to people and I also appreciate feedback and suggestions about what to write.

Hopefully more reinforcement learning/ mathy content soon? Have lots of drafts I’m going through to publish! Let’s see where my interests take me :P

I’ll see you soon :)

Ashley ❤
