Weight normalization implementation options for Keras and Tensorflow projects

November 10, 2018

At the time of writing this post, no official implementation of weight normalization is available for Tensorflow and Keras. I therefore want to present a few options for implementing weight normalization in Tensorflow and Keras projects. The Keras options only work in combination with the Tensorflow backend or with the Keras version bundled with Tensorflow.

Tim Salimans, the author of the weight normalization paper, published code for Keras 1.x and an older version of Tensorflow about two years ago. If you’re still using Keras 1.x or an older Tensorflow version, I recommend using one of these implementations. The Keras-based implementation extends the Keras SGD and Adam optimizers with weight normalization functionality in a generic way, so that different Keras layers (convolutional, dense, …) can be trained with weight normalization. A procedure for data-based initialization is also provided.
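For reference, weight normalization reparameterizes a weight vector w as w = g * v / ||v||, with a learned scalar g and a learned vector v, and gradients are then computed w.r.t. g and v instead of w. The following NumPy snippet is only a minimal sketch of that reparameterization and is not taken from any of the implementations mentioned in this post:

```python
import numpy as np

def weight_norm(v, g):
    """Reparameterize a weight vector: w = g * v / ||v||."""
    return g * v / np.linalg.norm(v)

v = np.array([3.0, 0.0, 4.0])  # unnormalized direction vector
w = weight_norm(v, g=2.0)      # [1.2, 0.0, 1.6], its norm equals g
```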

A Keras 2.x port of that code is available here. I tested the Adam-based weight normalization implementation in another project (WDSR for Keras) and got almost identical results to the PyTorch-based reference implementation (PyTorch already contains an official implementation of weight normalization). The authors of the WDSR PyTorch implementation recently published a Tensorflow port containing a 2D convolutional layer that supports weight normalization (Conv2DWeightNorm). This layer can also be used as a Keras layer with the Keras version bundled with Tensorflow 1.11 or higher, and it works in combination with any optimizer. Extending other layer types to support weight normalization should be easy using this layer as a template (but less elegant than a generic wrapper as described further below).
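As an illustration, here is a minimal sketch of how such a layer can be dropped into a tf.keras model. It assumes that Conv2DWeightNorm accepts the same constructor arguments as a regular Conv2D layer; the import path is hypothetical, so check the linked repository for the actual module and signature:

```python
import tensorflow as tf
from conv2d_weight_norm import Conv2DWeightNorm  # hypothetical import path

inputs = tf.keras.Input(shape=(None, None, 3))
# drop-in replacement for tf.keras.layers.Conv2D
x = Conv2DWeightNorm(filters=32, kernel_size=3, padding='same', activation='relu')(inputs)
outputs = Conv2DWeightNorm(filters=3, kernel_size=3, padding='same')(x)

model = tf.keras.Model(inputs, outputs)
model.compile(optimizer='adam', loss='mae')  # works with any optimizer
```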

The Conv2DWeightNorm code doesn’t provide a procedure for data-based initialization though. For the special case of WDSR this is not needed, as data-based initialization has a similar effect to batch normalization, which is known to harm single-image super-resolution training. I also tested the Conv2DWeightNorm layer in the WDSR for Keras project (see section Weight normalization) and convergence is similar to that of Adam-based weight normalization. The latter shows slightly better initial convergence, but training with Conv2DWeightNorm in combination with a default Adam optimizer quickly catches up. The difference is probably due to different weight initializations.
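For context, data-based initialization (as described in the weight normalization paper) passes one batch of data through the layer and chooses the initial scale g and bias b such that pre-activations have zero mean and unit variance. The snippet below is only a rough sketch of that idea for a dense layer and is not part of any of the implementations discussed here:

```python
import numpy as np

def data_based_init(x_batch, v):
    """Compute initial scale g and bias b from one data batch (dense layer)."""
    t = x_batch @ (v / np.linalg.norm(v, axis=0))  # pre-activations with w = v / ||v||
    mean, std = t.mean(axis=0), t.std(axis=0)
    return 1.0 / std, -mean / std                  # g and b that standardize t
```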

There’s also a long-open pull request for adding weight normalization to Tensorflow, which also supports the bundled Keras version, but review is still pending. It implements a generic wrapper layer that works for several types of Tensorflow and Keras layers. Data-based initialization is also supported, but only in eager mode. I haven’t tested it yet, but the implementation looks promising and I’ll give it a try soon.
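Since I haven’t tested it yet, the following is only a sketch of how such a generic wrapper is typically applied. The WeightNorm class name and import are assumptions based on the pull request description, not a tested API:

```python
import tensorflow as tf
from weight_norm import WeightNorm  # hypothetical import, see the pull request

# the same wrapper works for different layer types
dense = WeightNorm(tf.keras.layers.Dense(64, activation='relu'))
conv = WeightNorm(tf.keras.layers.Conv2D(32, kernel_size=3, padding='same'))
```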

Googling will show you further implementations, but most of them are slight variations of what is presented here. I hope this post saves you some time finding a weight normalization implementation for your project. If I missed something or an official implementation has been published in the meantime, please let me know and I’ll update this blog post.


