## KL-Divergence

KL divergence, short for Kullback-Leibler divergence, is a measure of how one probability distribution differs from a second, reference distribution. It is a concept from information theory and statistics that is widely applied in fields such as machine learning, statistics, and signal processing.

## Mathematically

Given two probability distributions P(x) and Q(x) over the same domain x, the KL divergence from Q to P, denoted $D_{KL}(P \parallel Q)$, is defined for discrete distributions as:

$$D_{KL}(P \parallel Q) = \sum_{x} P(x) \log \frac{P(x)}{Q(x)}$$

Or, for continuous distributions:

$$D_{KL}(P \parallel Q) = \int P(x) \log \frac{P(x)}{Q(x)} \, dx$$

KL divergence measures the information lost when Q is used to approximate P. It is not symmetric, meaning $D_{KL}(P \parallel Q) \neq D_{KL}(Q \parallel P)$ in general.

To further understand KL-Divergence, we will attempt to approximate a distribution P (the sum of two Gaussians) with another Gaussian distribution Q by minimizing the KL divergence between them.
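Before moving on to that experiment, the discrete formula and its asymmetry can be checked on a small example. The sketch below uses only the standard library; the function name `kl_divergence` and the two three-outcome distributions are made up for illustration.

```python
import math

def kl_divergence(p, q):
    """Discrete KL divergence D_KL(P || Q), in nats.

    p and q are sequences of probabilities over the same outcomes.
    Terms with p[i] == 0 contribute nothing, by the usual convention.
    """
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

# Two hypothetical distributions over three outcomes.
p = [0.5, 0.4, 0.1]
q = [0.3, 0.3, 0.4]

kl_pq = kl_divergence(p, q)  # information lost when Q approximates P
kl_qp = kl_divergence(q, p)  # information lost when P approximates Q

print(f"D_KL(P || Q) = {kl_pq:.4f}")
print(f"D_KL(Q || P) = {kl_qp:.4f}")
```

The two printed values differ, which is the asymmetry described above, and `kl_divergence(p, p)` is zero because every log term vanishes.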
## Loading Libraries

## Gaussians Construction

PyTorch simplifies the process of obtaining samples from a given distribution. The torch.distributions module provides a wide range of commonly used distributions. First, let us create two Gaussians.

## Checking Sanity

Let's sample the distributions at certain locations to check that each is a Gaussian with the expected parameters.
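A minimal sketch of the construction and the sanity check might look like the following. The parameter values are assumptions: the text only states later that the second Gaussian has a mean of 10, so the first mean (-5) and both standard deviations (1) are chosen here for illustration, as are the variable names.

```python
import torch
from torch.distributions import Normal

# Assumed parameters: only the second mean (10) is stated in the text.
mu1, sigma1 = -5.0, 1.0
mu2, sigma2 = 10.0, 1.0

gaussian1 = Normal(torch.tensor(mu1), torch.tensor(sigma1))
gaussian2 = Normal(torch.tensor(mu2), torch.tensor(sigma2))

# Check 1: the empirical mean and std of many samples should be close
# to the parameters used to build each distribution.
torch.manual_seed(0)
samples1 = gaussian1.sample((10_000,))
samples2 = gaussian2.sample((10_000,))
print(f"gaussian1: mean={samples1.mean().item():.2f}, std={samples1.std().item():.2f}")
print(f"gaussian2: mean={samples2.mean().item():.2f}, std={samples2.std().item():.2f}")

# Check 2: the log-density evaluated on a grid should peak at the mean.
x = torch.linspace(-20.0, 20.0, 1001)
peak1 = x[gaussian1.log_prob(x).argmax()].item()
peak2 = x[gaussian2.log_prob(x).argmax()].item()
print(f"density peaks at x={peak1:.2f} and x={peak2:.2f}")
```

Plotting `gaussian1.log_prob(x).exp()` and `gaussian2.log_prob(x).exp()` against `x` with any plotting library gives the same check visually.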
The above figure shows that the distributions have been correctly constructed. Let's add the two Gaussians to create a new distribution, P(x). Our goal is to approximate this new distribution with another Gaussian, Q(x). We will find the parameters μQ and σQ by minimizing the KL divergence between P(x) and Q(x).
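One way to build P(x) is to evaluate both densities on a shared grid and add them point by point. This is only a sketch: the Gaussian parameters (other than the mean of 10) and the grid limits are assumptions.

```python
import torch
from torch.distributions import Normal

# Assumed parameters; only the second mean (10) is stated in the text.
gaussian1 = Normal(torch.tensor(-5.0), torch.tensor(1.0))
gaussian2 = Normal(torch.tensor(10.0), torch.tensor(1.0))

# Evaluation grid wide enough to cover both bumps (assumed limits).
x = torch.linspace(-20.0, 25.0, 2000)

# P(x) as the point-wise sum of the two densities.
px = gaussian1.log_prob(x).exp() + gaussian2.log_prob(x).exp()

# A plain sum of two densities integrates to about 2, not 1.
dx = x[1] - x[0]
area = (px * dx).sum()
print(f"area under P(x) on the grid: {area.item():.3f}")
```

Because the two densities are simply added, P(x) is not normalized. Dividing `px` by 2 would turn it into a proper mixture density, if that is needed.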
## Constructing Q(X)

We will use a Gaussian distribution to approximate P(X). We do not know in advance which parameters will best represent the distribution P(x), so let's start with μ=0 and σ=1. We could have picked better starting values because we already know the distribution we are attempting to approximate (P(x)). However, this is typically not the case in real-world settings.
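A possible way to set up Q so that its parameters can later be learned by gradient descent is sketched below. The names `mu_q` and `sigma_q` and the grid are assumptions; the starting values μ=0 and σ=1 come from the text.

```python
import torch
from torch.distributions import Normal

# Learnable parameters of Q, initialised at mu=0 and sigma=1.
mu_q = torch.tensor(0.0, requires_grad=True)
sigma_q = torch.tensor(1.0, requires_grad=True)
q = Normal(mu_q, sigma_q)

# Log-density of Q on an assumed evaluation grid.
x = torch.linspace(-20.0, 25.0, 2000)
log_qx = q.log_prob(x)

# Gradients flow from the log-density back to both parameters.
log_qx.sum().backward()
print(mu_q.grad, sigma_q.grad)
```

Since both parameters are created with `requires_grad=True`, any loss computed from `log_qx` can be differentiated with respect to them.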
## KL Divergence

PyTorch has a function for calculating KL divergence. It is important to remember that the input is expected to contain log-probabilities, while the targets are expected to be probabilities (without the logarithm applied). So, the function's first argument will be the log-probabilities of Q, and the second will be P (the target distribution). We must also be careful about numerical stability.
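The argument order described above can be illustrated with `torch.nn.functional.kl_div`. The sketch assumes Gaussian parameters of (-5, 1) and (10, 1) for P, a fixed grid, and `reduction="sum"`. It also compares the result with the formula written out by hand.

```python
import torch
import torch.nn.functional as F
from torch.distributions import Normal

x = torch.linspace(-20.0, 25.0, 2000)  # assumed grid

# Target P(x): plain probabilities (assumed Gaussian parameters).
px = Normal(-5.0, 1.0).log_prob(x).exp() + Normal(10.0, 1.0).log_prob(x).exp()

# Input Q(x): log-probabilities, taken straight from log_prob.
log_qx = Normal(0.0, 1.0).log_prob(x)

# First argument: log Q, second argument: P.
kl = F.kl_div(log_qx, px, reduction="sum")

# Same quantity by hand: sum of P * (log P - log Q), skipping P == 0.
manual = torch.where(
    px > 0, px * (px.log() - log_qx), torch.zeros_like(px)
).sum()

print(f"F.kl_div: {kl.item():.2f}, manual: {manual.item():.2f}")
```

With `reduction="sum"` the result is a sum over grid points rather than an integral, so its absolute value depends on the grid spacing. That does not matter when the value is only used as a loss to minimize.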
We obtain an infinite divergence when we exponentiate and then take the log. Using the log values directly appears to work.
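The stability issue can be reproduced with a short experiment. In float32, the density of Q underflows to exactly zero far from its mean, so taking the log afterwards yields negative infinity. The Gaussian parameters and the grid below are assumptions.

```python
import torch
import torch.nn.functional as F
from torch.distributions import Normal

x = torch.linspace(-20.0, 25.0, 2000)  # assumed grid
px = Normal(-5.0, 1.0).log_prob(x).exp() + Normal(10.0, 1.0).log_prob(x).exp()
q = Normal(0.0, 1.0)

# Stable: use the log-probabilities directly.
log_qx_direct = q.log_prob(x)

# Unstable: exp() underflows to 0 in the tails, and log(0) is -inf.
log_qx_roundtrip = q.log_prob(x).exp().log()

kl_direct = F.kl_div(log_qx_direct, px, reduction="sum")
kl_roundtrip = F.kl_div(log_qx_roundtrip, px, reduction="sum")

print(f"direct log-probabilities: {kl_direct.item()}")
print(f"exp followed by log:      {kl_roundtrip.item()}")
```

The second value is not finite. Whether it shows up as `inf` or `nan` can depend on the grid and the PyTorch version, but in either case it cannot be used as a loss.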
We will now define the function optimize_loss, which minimizes a given loss function (loss_fn) with respect to the parameters of a Gaussian distribution: its mean (mu) and standard deviation (sigma).
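The original definition is not shown here, so the following is only one plausible shape for optimize_loss. Everything beyond the names `optimize_loss`, `loss_fn`, `mu` and `sigma` is an assumption: the Adam optimizer, the learning rate, the step count, the `loss_fn(log_qx, px)` signature, and the choice to optimize log(sigma) so that sigma stays positive.

```python
import torch
import torch.nn.functional as F
from torch.distributions import Normal

def optimize_loss(px, x, loss_fn, mu=0.0, sigma=1.0, steps=2000, lr=0.05):
    """Fit Normal(mu, sigma) to target values px on grid x by minimising loss_fn.

    loss_fn(log_qx, px) must return a scalar tensor (assumed signature).
    """
    mu_q = torch.tensor(float(mu), requires_grad=True)
    # Optimise log(sigma) instead of sigma so the scale stays positive.
    log_sigma_q = torch.tensor(float(sigma)).log().requires_grad_()
    optimizer = torch.optim.Adam([mu_q, log_sigma_q], lr=lr)

    for _ in range(steps):
        optimizer.zero_grad()
        q = Normal(mu_q, log_sigma_q.exp())
        loss = loss_fn(q.log_prob(x), px)
        loss.backward()
        optimizer.step()

    # The returned loss is the one evaluated just before the final update.
    return mu_q.item(), log_sigma_q.exp().item(), loss.item()

def kl_loss(log_qx, px):
    # Input: log-probabilities of Q. Target: probabilities of P.
    return F.kl_div(log_qx, px, reduction="sum")

# Target P(x) with assumed parameters (only the mean of 10 is from the text).
x = torch.linspace(-20.0, 25.0, 2000)
px = Normal(-5.0, 1.0).log_prob(x).exp() + Normal(10.0, 1.0).log_prob(x).exp()

mu_fit, sigma_fit, final_loss = optimize_loss(px, x, kl_loss)
print(f"mu={mu_fit:.2f}, sigma={sigma_fit:.2f}")
```

Under these assumptions, minimizing the KL divergence amounts to matching the mean and variance of P, so the fit should settle between the two bumps, near μ = 2.5 with a wide σ of about 7.6 (variance 1 + 7.5²).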
Let's examine what happens when we instead minimize the mean squared error (MSE) between P and Q.
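One way to see why the two objectives can prefer different solutions is to score two candidate Gaussians under each loss, without running any optimization. The candidates are assumptions derived from assumed parameters for P: a "mode" candidate sitting on the left bump (μ=-5, σ=1) and a "broad" candidate matching the mean and variance of P (μ=2.5, σ≈7.57).

```python
import torch
import torch.nn.functional as F
from torch.distributions import Normal

x = torch.linspace(-20.0, 25.0, 2000)  # assumed grid
px = Normal(-5.0, 1.0).log_prob(x).exp() + Normal(10.0, 1.0).log_prob(x).exp()

def losses(mu, sigma):
    """Return (KL, MSE) between P and Normal(mu, sigma) on the grid."""
    log_qx = Normal(mu, sigma).log_prob(x)
    kl = F.kl_div(log_qx, px, reduction="sum").item()
    mse = F.mse_loss(log_qx.exp(), px).item()
    return kl, mse

kl_mode, mse_mode = losses(-5.0, 1.0)     # Q sits on the left Gaussian
kl_broad, mse_broad = losses(2.5, 7.57)   # Q spans both Gaussians

print(f"KL : mode={kl_mode:.1f}  broad={kl_broad:.1f}")
print(f"MSE: mode={mse_mode:.5f}  broad={mse_broad:.5f}")
```

KL is lower for the broad candidate, because a narrow Q assigns almost no probability to the second bump and is penalized heavily for it. MSE is lower for the mode candidate, because matching one bump exactly leaves only the other bump as error.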
We can observe that the result differs significantly from the KL divergence example. There is no middle ground: Q converges to one of the two Gaussian curves. You can experiment with different initial values for μQ. If you select a value closer to 10 (the mean of the second Gaussian), Q will converge to that Gaussian instead.
The same likely applies to L1 loss as well. Now, let's examine what happens when we try to maximize the cosine similarity between the two distributions.
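Cosine similarity and L1 loss can be probed in the same spirit by scoring two hand-picked candidates for Q instead of running an optimizer. As before, the parameters of P, the grid, and the two candidates (one on the left bump, one spanning both bumps) are assumptions.

```python
import torch
import torch.nn.functional as F
from torch.distributions import Normal

x = torch.linspace(-20.0, 25.0, 2000)  # assumed grid
px = Normal(-5.0, 1.0).log_prob(x).exp() + Normal(10.0, 1.0).log_prob(x).exp()

def scores(mu, sigma):
    """Return (cosine similarity, L1 loss) between P and Normal(mu, sigma)."""
    qx = Normal(mu, sigma).log_prob(x).exp()
    cos = F.cosine_similarity(qx, px, dim=0).item()
    l1 = F.l1_loss(qx, px).item()
    return cos, l1

cos_mode, l1_mode = scores(-5.0, 1.0)     # Q sits on the left Gaussian
cos_broad, l1_broad = scores(2.5, 7.57)   # Q spans both Gaussians

print(f"cosine similarity: mode={cos_mode:.3f}  broad={cos_broad:.3f}")
print(f"L1 loss:           mode={l1_mode:.5f}  broad={l1_broad:.5f}")
```

Both measures favor the candidate that sits on a single bump: its cosine similarity is higher (close to 1/√2, since it matches one of two equal bumps) and its L1 loss is lower.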
As seen above in the 1D example, we converge to the closest mean. In a high-dimensional setting with many valleys, minimizing the MSE or L1 loss may give different results from run to run. In deep learning, we randomly initialize the weights of neural networks. As a result, it stands to reason that different runs of the same neural network converge to distinct local minima. Techniques such as stochastic weight averaging may improve generalization by averaging the weights from distinct local minima. Different local minima may encode crucial information about the dataset.