VAE-GAN Part 6: VAE-GAN with adaptive β 2
After running more experiments, I found that the conventional computation of KLD as described in the previous post works fine.
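For reference, by the conventional computation I mean the closed-form KL divergence between the encoder's diagonal Gaussian and the standard normal prior, summed over the latent dimensions (and averaged over the batch), assuming this matches the formulation in the previous post:

$$
D_{\mathrm{KL}}\big(\mathcal{N}(\mu, \sigma^2)\,\|\,\mathcal{N}(0, I)\big) = \frac{1}{2}\sum_{j}\left(\mu_j^2 + \sigma_j^2 - \log \sigma_j^2 - 1\right)
$$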
```python
beta_step = 0.1
initial_beta = 10.0

# increase beta when the measured KLD exceeds the target, decrease it otherwise
new_beta = beta - beta_step * (target_kl - avg_kl)
new_beta = max(new_beta, 0)  # keep beta non-negative
```
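To make the setup concrete, here is a stripped-down sketch of how the update fits into one training epoch. This is not the actual training code: the adversarial and feature matching terms of the VAE-GAN objective are omitted, the pixel-wise MSE stands in for the real reconstruction term, and updating $\beta$ once per epoch is an assumption about the schedule:

```python
import torch
import torch.nn.functional as F

def train_epoch_with_adaptive_beta(encoder, decoder, optimizer, dataloader,
                                   beta, target_kl, beta_step=0.1):
    """One simplified VAE epoch with the adaptive beta update; GAN terms omitted."""
    kld_values = []
    for x in dataloader:
        mu, logvar = encoder(x)
        z = mu + torch.randn_like(mu) * (0.5 * logvar).exp()  # reparameterization trick
        recon = decoder(z)
        # closed-form KLD against N(0, I), summed over latent dims, averaged over the batch
        kld = 0.5 * (mu.pow(2) + logvar.exp() - logvar - 1).sum(dim=1).mean()
        loss = F.mse_loss(recon, x) + beta * kld
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
        kld_values.append(kld.item())
    # adjust beta from the average measured KLD of this epoch
    avg_kl = sum(kld_values) / len(kld_values)
    return max(beta - beta_step * (target_kl - avg_kl), 0)
```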
It seems that when the target KLD is too small (e.g. 0.05), the projected samples are too concentrated and the details of the input images cannot be recovered.
On the larger side of the target KLD (0.2, 0.5), larger values seem to yield better reconstruction. However, at target KLD = 0.1, the convergence of the measured KLD shows a very different tendency from the rest. Without repeating the same experiments, it is hard to say whether there is something special about target KLD = 0.1, or whether it is an outlier caused by randomness.
The peculiar tendency of target KLD = 0.1 can also be seen in how $\beta$ evolves. The next figure shows how $\beta$ changes over training. In the case of target KLD = 0.1, the value of $\beta$ is an order of magnitude larger than for the rest. The encoder is trying so hard to move the latent samples farther from the origin to reduce the reconstruction error that $\beta$ has to grow by that much to keep the measured KLD near the target.
At this point, I realized that this behavior actually comes down to the scale of the feature matching error.
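To recall where the scale comes from: the feature matching reconstruction error compares intermediate discriminator features of the input $x$ and its reconstruction $\tilde{x}$; written as a squared error (the exact form is an assumption here),

$$
\mathcal{L}_{\mathrm{FM}} = \big\| D_l(x) - D_l(\tilde{x}) \big\|^2,
$$

where $D_l(\cdot)$ is the output of the discriminator's feature extractor. Nothing in the extractor constrains the magnitude of $D_l$, so this term can sit on a completely different scale than the KLD term.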
Right now, the feature extraction in the discriminator is just a stack of convolutions without any non-linear activation:
```python
# discriminator feature extractor: plain convolutions, no activations
nn.Sequential(
    nn.ReflectionPad2d(2),
    nn.Conv2d(3, 32, kernel_size=5),
    nn.ReflectionPad2d(2),
    nn.Conv2d(32, 128, kernel_size=5, stride=2),
    nn.ReflectionPad2d(2),
    nn.Conv2d(128, 256, kernel_size=5, stride=2),
    nn.ReflectionPad2d(2),
    nn.Conv2d(256, 256, kernel_size=5, stride=2),
)
```
What can I do to circumvent this? Applying Batch Normalization is a simple way to regulate the output value range. A quick search turned up some implementations:
- The official implementation of Improved Techniques for Training GANs uses (Convolution -> Batch Normalization -> Leaky ReLU) blocks followed by a linear transform (see the sketch after this list).
- This implementation uses Leaky ReLU and dropout.
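A minimal sketch of the first option applied to the feature extractor above, keeping my current channel sizes and reflection padding; the LeakyReLU slope of 0.2 is just a common default, and I leave out the final linear transform:

```python
import torch.nn as nn

# feature extractor rebuilt as (Convolution -> Batch Normalization -> Leaky ReLU) blocks
feature_extractor = nn.Sequential(
    nn.ReflectionPad2d(2),
    nn.Conv2d(3, 32, kernel_size=5),
    nn.BatchNorm2d(32),
    nn.LeakyReLU(0.2),
    nn.ReflectionPad2d(2),
    nn.Conv2d(32, 128, kernel_size=5, stride=2),
    nn.BatchNorm2d(128),
    nn.LeakyReLU(0.2),
    nn.ReflectionPad2d(2),
    nn.Conv2d(128, 256, kernel_size=5, stride=2),
    nn.BatchNorm2d(256),
    nn.LeakyReLU(0.2),
    nn.ReflectionPad2d(2),
    nn.Conv2d(256, 256, kernel_size=5, stride=2),
    nn.BatchNorm2d(256),
    nn.LeakyReLU(0.2),
)
```

With Batch Normalization after each convolution, the intermediate features stay in a roughly fixed range, so the feature matching error should no longer dominate the objective the way it does now.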