Fun with ML on Splatoon 2

VAE-GAN Part 7: Batch KLD

In the previous post, I hypothesised that the explosion of KL-Divergence when optimizing feature matching occurs because the output from the feature extraction part of the discriminator is not regulated, so the gradient values can become large. I then thought that simply adding Batch Normalization at the end of the feature extraction would suffice. It did not, however. With Batch Normalization, the feature matching error is regulated, but it no longer appears to pass a useful gradient back to the encoder/decoder.

While looking for a way to suppress the explosion of the feature matching error so that KL-Divergence behaves better, I also came to wonder whether computing KLD over the batch could help. To investigate this, I ran a couple of experiments.

Batch KLD

Firstly, in addition to the previous definition of Batch KLD, I incorporated a moving average.

# Get mu and logvar from encoder
z_mu, z_logvar = encoder(input_image)

# Generate latent sample via the reparameterization trick
z_std = torch.exp(0.5 * z_logvar)
sample = z_mu + z_std * torch.randn_like(z_std)

# Compute mean and variance of samples over the batch
sample_mean = torch.mean(sample, dim=0)
sample_var = torch.var(sample, dim=0)

# Apply moving average
# (current_mean / current_var are the running statistics,
#  initialized to zeros / ones before training)
mean = momentum * current_mean + (1 - momentum) * sample_mean
var = momentum * current_var + (1 - momentum) * sample_var

# Cache moving stats
current_mean = mean.detach()
current_var = var.detach()

# Compute KLD (clamp the variance to keep the log finite)
logvar = torch.log(var.clamp(min=1e-12))
kld = - 0.5 * (1 + logvar - mean.pow(2) - var)

In contrast, the typical KLD computation (referred to as Single Point KLD hereafter) is as follows:

mu, logvar = encoder(input_image)
var = logvar.exp()
kld = - 0.5 * (1 + logvar - mu.pow(2) - var)
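To make the contrast concrete, here is a small self-contained sketch that computes both quantities on random tensors. The batch size, latent dimension, and variable names are illustrative assumptions, not the post's actual code; the Batch KLD here uses momentum 0, i.e. no moving average.

```python
import torch

torch.manual_seed(0)

# Hypothetical latent parameters: batch of 8 images, 4-dim latent space
mu = torch.randn(8, 4)
logvar = torch.randn(8, 4) * 0.1

# Single Point KLD: per-sample, per-dimension KLD against N(0, 1)
var = logvar.exp()
single_kld = -0.5 * (1 + logvar - mu.pow(2) - var)

# Batch KLD: compare the empirical distribution of latent samples
# over the batch against N(0, 1), one value per latent dimension
std = torch.exp(0.5 * logvar)
sample = mu + std * torch.randn_like(std)
batch_mean = sample.mean(dim=0)
batch_var = sample.var(dim=0)
batch_kld = -0.5 * (1 + batch_var.clamp(min=1e-12).log()
                    - batch_mean.pow(2) - batch_var)

per_sample_kld = single_kld.sum(dim=1)  # one value per image
total_batch_kld = batch_kld.sum()       # one value for the whole batch
```

Single Point KLD pushes each individual posterior toward the prior, while Batch KLD only constrains the aggregate distribution of samples, which is why the two behave differently below.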

Experiment 1

I ran two sets of experiments with different parameter sets. In the first experiment, the parameters for the adjustment of $\beta$ are as follows, the same as in the previous post.

beta_step = 0.1
initial_beta = 10.0

In this experiment, I changed how KLD is computed: Single Point KLD, and Batch KLD with momentum 0.9, 0.1, and 0.0.

Observations

  1. The reconstruction error improved with the Batch KLD constraint.
  2. When the momentum value for Batch KLD is small, the KLD for test cases deviates from the target KLD.
  3. The feature matching error became lower with Batch KLD.
  4. $\beta$ grows larger with Single Point KLD and with Batch KLD at momentum=0.9.

In addition to the above, I recorded some statistics of the decoder output, Z_MEAN and Z_STDDEV. For Z_MEAN, Batch KLD has a broader tail than Single Point KLD. For Single Point KLD, the standard deviation of the latent points (Z_STDDEV) is distributed near 1.0, but that does not happen for Batch KLD. This is expected because, with Batch KLD, the distribution of latent samples as a whole is optimized towards a normal distribution. However, the Z_STDDEV values are too small, which makes virtually no difference when sampling from the latent distribution.

Experiment 2

For the second set of experiments, the parameters for the adjustment of $\beta$ are as follows:

beta_step = 0.01
initial_beta = 1.0

I changed the KLD computation in the same manner as in Experiment 1: Single Point KLD, and Batch KLD with momentum 0.9, 0.1, and 0.0.

In these experiments, observations 1 - 4 from Experiment 1 were also observed:

  1. The reconstruction error improved with the Batch KLD constraint.
  2. When the momentum value for Batch KLD is small, the KLD for test cases deviates from the target KLD.
  3. The feature matching error became lower with Batch KLD.
  4. $\beta$ grows larger with Single Point KLD and with Batch KLD at momentum=0.9.

For this experiment, the distribution of the latent parameters shows a somewhat different trend: Z_STDDEV at the beginning of training (before the fake samples start to collapse) is more spread out.

Read more

VAE-GAN Part 6: VAE-GAN with adaptive β 2

After running more experiments, I found that the conventional computation of KLD as described in the previous post works fine.

new_beta = beta - beta_step * (target_kl - avg_kl)
new_beta = max(new_beta, 0)

In the following experiments, I fixed the following parameters and changed the target KLD. The code and model for the experiments are found here.

beta_step = 0.1
initial_beta = 10.0
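The update rule above can be wrapped in a small helper for illustration. This is a sketch only; the function name and the example values are assumptions, not the post's actual training code.

```python
def update_beta(beta, avg_kld, target_kld, beta_step=0.1):
    """Dual-gradient-style update: raise beta when the measured KLD
    exceeds the target, lower it when below, and clamp at zero."""
    new_beta = beta - beta_step * (target_kld - avg_kld)
    return max(new_beta, 0.0)

# Measured KLD above target: beta grows to constrain the latent harder
beta = update_beta(10.0, avg_kld=150.0, target_kld=100.0)  # -> 15.0

# Measured KLD well below target: beta decays and clamps at zero
beta = update_beta(0.3, avg_kld=10.0, target_kld=100.0)    # -> 0.0
```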
Read more

VAE-GAN Part 5: VAE-GAN with adaptive β

In the previous post, I successfully mapped the latent distribution around the origin while reducing the reconstruction error, and as a result I could create fake images from randomly sampled latent vectors. However, the adoption of Batch Normalization still felt like cheating, so I looked for a way to achieve similar performance without Batch Normalization. Along the way I found this paper by Peng et al. via this medium article. The model discussed in the paper is Variational GAN, which is different from VAE-GAN, but it has the exact same requirement I am handling, which is enforcing the distribution of an intermediate output to be Gaussian, and the paper proposes adjusting $\beta$ so that the KL Divergence stays close to a target value.

... enforcing a specific mutual information budget between $\mathbf{x}$ and $\mathbf{z}$ is critical for good performance. We therefore adaptively update $\beta$ via dual gradient descent to enforce a specific constraint $I_{c}$ on the mutual information. $$ \beta \leftarrow \max(0, \beta + \alpha_{\beta} (\mathbb{E}_{\mathbf{x} \sim \tilde{p}(\mathbf{x})}[KL[E(\mathbf{z}|\mathbf{x})||r(\mathbf{z})]] - I_c )) $$
Read more

βVAE-GAN?

Looking at Frame by Frame Reconstruction, and reading through papers on VAE/GAN, I noticed that the quality of my reconstructions was terrible. The fact that the KL-divergence term was not converging was also still bothering me. So I added a β multiplier to the KL-divergence term and observed how it affects the behavior of the latent samples. The previously trained model used $\beta=1$, so I changed β to 0.1, 2.0, 4.0, and 8.0.
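In loss terms, the β multiplier simply rescales the KLD term against the reconstruction term in the VAE part of the objective. A minimal sketch under assumed names follows; the actual model also includes the GAN terms on top of this.

```python
import torch
import torch.nn.functional as F

def beta_vae_loss(recon, target, mu, logvar, beta):
    # Pixel-wise reconstruction error
    recon_loss = F.mse_loss(recon, target, reduction="mean")
    # KL divergence of N(mu, var) against N(0, 1), averaged over the batch
    kld = -0.5 * torch.mean(1 + logvar - mu.pow(2) - logvar.exp())
    return recon_loss + beta * kld

x = torch.rand(4, 3, 8, 8)
mu, logvar = torch.zeros(4, 16), torch.zeros(4, 16)

# With mu = 0 and logvar = 0 the KLD term is exactly zero, so for a
# perfect reconstruction the loss is zero regardless of beta
loss = beta_vae_loss(x, x, mu, logvar, beta=4.0)
```

Raising β trades reconstruction quality for a latent distribution that sits closer to the prior, which is what the experiments below probe.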

Let's see how KL divergence and pixel error change.

Read more

Frame by Frame Reconstruction

I ran the current VAE-GAN model over some battle scenes to see what they look like.

Read more

VAE-GAN Part 3

While I was trying another method to improve my VAE-GAN implementation, I encountered OOM, so I refactored my code around the forward/backward computation, following PyTorch's DC-GAN implementation. In that implementation, the input batch is fed to the generator and the discriminator separately, and loss and gradients are computed each time. Meanwhile, my implementation fed the input batch through the network, computed all the losses first, and then updated each component one by one. It turned out that not only does this produce gradients for an already updated model, it is also not as fast in PyTorch.
Read more

VAE-GAN Part 2

From the observation made in the previous post, I ran the same training, but without fake images generated from the random sample $z \sim \mathcal{N}(\mathbf{0}, \mathbf{1})$.

Read more

VAE-GAN

I built VAE-GAN and trained models with Splatoon 2 video screens. Code and model are available here.

Read more