
VAE-GAN

I built a VAE-GAN and trained models on Splatoon 2 video frames. The code and models are available here.

I started from an AE, then a VAE, and finally moved on to a VAE-GAN. Introducing the GAN part was tricky. Unlike an AE or VAE, which at least produce something resembling the input images, a GAN does not produce anything if the training is not set up correctly. I referred to the Torch blog, lucabergamini's implementation, and Jonathan Hui's post and tried various tricks from them, but luckily I could make it work in an almost vanilla form. The loss function at this point is as follows;

$$ \begin{align*} \mathcal{L} &= KLD + \mathcal{L}_{D} + \mathcal{L}_{G} + \mathcal{L}_{F} \end{align*} $$ $$ \begin{align*} KLD &= \mathbb{E}[D_{KL}(\mathcal{N} \| P(z | x) )] \\ \mathcal{L}_{D} &= \mathbb{E}[\log(D(x))] + \mathbb{E}[\log(1 - D(G(z|x)))] + \mathbb{E}[\log(1 - D(G(z)))] \\ \mathcal{L}_{G} &= \mathbb{E}[\log(D(G(z|x)))] + \mathbb{E}[\log(D(G(z)))] \\ \mathcal{L}_{F} &= \mathbb{E}[{\| \phi(G(z|x)) - \phi(x) \|}^2] + \mathbb{E}[{\| \phi(G(z)) - \phi(x) \|}^2] \end{align*} $$
which is the sum of the KL divergence, the discriminator loss, the generator loss, and the feature-matching error.
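To make the terms concrete, here is a minimal sketch of how these four losses could be computed in PyTorch, with the adversarial terms written in the standard binary cross-entropy form. The names `encoder`, `G`, `D`, and `phi` are placeholders (a sigmoid-output discriminator and an intermediate-feature extractor are assumed), not the actual modules in the repository.

```python
import torch
import torch.nn.functional as F

def vae_gan_losses(x, encoder, G, D, phi):
    # Encode the input and sample z|x with the reparameterization trick.
    mu, logvar = encoder(x)
    std = torch.exp(0.5 * logvar)
    z_x = mu + std * torch.randn_like(std)   # z | x
    z_p = torch.randn_like(z_x)              # z ~ N(0, I)

    x_rec = G(z_x)    # reconstruction G(z|x)
    x_fake = G(z_p)   # "fake" image G(z) sampled from the prior

    # KL divergence between the encoder distribution and the unit Gaussian.
    kld = -0.5 * torch.mean(1 + logvar - mu.pow(2) - logvar.exp())

    d_real = D(x)
    ones, zeros = torch.ones_like(d_real), torch.zeros_like(d_real)

    # Discriminator loss: real images vs. reconstructions and prior samples.
    loss_d = (F.binary_cross_entropy(d_real, ones)
              + F.binary_cross_entropy(D(x_rec.detach()), zeros)
              + F.binary_cross_entropy(D(x_fake.detach()), zeros))

    # Generator loss: try to make the discriminator accept both kinds of samples.
    loss_g = (F.binary_cross_entropy(D(x_rec), ones)
              + F.binary_cross_entropy(D(x_fake), ones))

    # Feature-matching loss on discriminator features phi(.).
    feat_x = phi(x)
    loss_f = F.mse_loss(phi(x_rec), feat_x) + F.mse_loss(phi(x_fake), feat_x)

    return kld, loss_d, loss_g, loss_f
```

In practice the discriminator and the encoder/generator are updated with separate optimizers, which is why the terms are returned separately rather than as a single sum.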

In addition to these losses, I also tracked the pixel error between input images and reconstructed images.

$$ \mathcal{L}_{pixel} = \mathbb{E}[{\|x - G(z|x)\|}^2] $$
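A tiny sketch of how this metric can be tracked without affecting the gradients; `x` and `x_rec` are assumed to be the input batch and its reconstruction, not names from the actual code.

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def pixel_error(x, x_rec):
    # Mean squared pixel error between input and reconstruction,
    # logged for monitoring only; no gradient flows through it.
    return F.mse_loss(x_rec, x).item()
```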

Let's see how these metrics change as training progresses. Firstly, the pixel loss. Though it is not directly optimized, it shows a nice learning curve.

Next are the discriminator and generator losses. It is not easy to get insights from these alone, but in combination with other observations, there are some hypotheses we can make.

  1. Fake images $G(z), z \sim \mathcal{N}$ do not contribute to the optimization.
  2. The discriminator loss is always smaller than the other terms by orders of magnitude, occasionally reaching zero, meaning it is not providing good feedback to the generator (see the monitoring sketch after this list). In fact, the fake images do not resemble real images at all.
  3. In training, the discriminator found original images and reconstructed images equally difficult to tell apart.
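As a diagnostic for the second point, one could log the average discriminator outputs on each kind of image. The helper below is hypothetical, assuming a sigmoid-output discriminator `D` and the batch names from the earlier sketch.

```python
import torch

@torch.no_grad()
def discriminator_stats(D, x, x_rec, x_fake):
    # Average discriminator outputs for real images, reconstructions, and prior
    # samples. Values stuck near 1 and 0 mean the discriminator has "won" and
    # its loss (hence the signal it passes to the generator) is close to zero.
    return {
        "D(x)": D(x).mean().item(),
        "D(G(z|x))": D(x_rec).mean().item(),
        "D(G(z))": D(x_fake).mean().item(),
    }
```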

The following figure shows the same GAN losses with the fake-image terms removed.

Now, the interesting part. The feature-matching loss increases as training progresses while the pixel loss decreases. In the current implementation, the features are computed by several convolutions without any non-linear activation. (I do not know whether the lack of non-linearity is an issue here.) This suggests the discriminator is working in a way that amplifies the difference between original images and reconstructed images.
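For illustration, here is roughly what such a feature extractor and the corresponding loss look like; the channel sizes are made up, and the point is that without non-linearities the stack amounts to a single linear map.

```python
import torch
import torch.nn as nn

class ConvFeatures(nn.Module):
    # Stand-in for phi: stacked convolutions with no activation in between,
    # as described above. Layer sizes are placeholders, not the real model.
    def __init__(self):
        super().__init__()
        self.convs = nn.Sequential(
            nn.Conv2d(3, 32, 4, stride=2, padding=1),
            nn.Conv2d(32, 64, 4, stride=2, padding=1),
            nn.Conv2d(64, 128, 4, stride=2, padding=1),
        )

    def forward(self, x):
        return self.convs(x)

def feature_matching_loss(phi, x, x_rec):
    # Squared distance between features of the original and the reconstruction.
    return (phi(x) - phi(x_rec)).pow(2).mean()
```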

Finally, something that puzzles me: the KL divergence of the encoded vectors increases. The latent representations of the input images are not being pushed toward the Gaussian prior; the generator-discriminator scheme is pushing the latent vectors away from the normal distribution. This explains why fake images generated from random samples drawn from the normal distribution do not resemble anything real. This is strange, since the model is supposed to be both a VAE and a GAN, and both should allow sampling from the normal distribution. The increasing-KLD behavior might be somewhat connected to the Disentangling β-VAE, in which the KL divergence is constrained to stay close to a capacity constant.
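For reference, the KL term being plotted is the usual closed form for a diagonal Gaussian encoder, and the Disentangling β-VAE variant mentioned above replaces it with a penalty toward a target capacity. Both are sketched below with placeholder names.

```python
import torch

def kl_divergence(mu, logvar):
    # Closed-form KL divergence between N(mu, diag(sigma^2)) and N(0, I),
    # averaged over the batch.
    return -0.5 * torch.mean(torch.sum(1 + logvar - mu.pow(2) - logvar.exp(), dim=1))

def capacity_constrained_kld(mu, logvar, capacity, gamma=30.0):
    # Disentangling beta-VAE style term: penalize |KL - C| so the KL divergence
    # is kept close to a capacity constant C rather than pushed toward zero.
    return gamma * (kl_divergence(mu, logvar) - capacity).abs()
```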

Maybe the use of instance normalization, instead of batch normalization, is causing this.
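Testing that guess would amount to a swap like the following; this is a hypothetical helper, and the real model's layers may differ.

```python
import torch.nn as nn

def norm_layer(channels, use_batch_norm=False):
    # InstanceNorm2d normalizes each sample on its own, so generated samples never
    # see batch-level statistics; BatchNorm2d shares statistics across the batch.
    return nn.BatchNorm2d(channels) if use_batch_norm else nn.InstanceNorm2d(channels)
```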

References