About a week ago, an interesting research paper appeared on arXiv that can be applied to music style transfer using GANs, which has also been my main topic for the past few months. There is already a lot of research on style transfer for images, and one of my main projects now is style transfer in music. If the quality of this WaveGAN is good enough, we will soon see style transfer in music in industry.

The authors have also released the audio samples they generated at this link. I think the quality is quite good for a CNN-based music generator.

The paper has three authors, and impressively two of them are from a department of music. I have spent much time figuring out the properties of sound and music, and discussing and researching the topic with music experts is a good approach.

The authors skipped most of the detailed parameters and released only the resulting spectrogram images and audio files, so it is somewhat difficult to understand the full details of their model. Therefore, I want to discuss the paper focusing on the differences between image and audio data in GANs. The paper introduces two kinds of GAN models, and WaveGAN seems more interesting to me, so I will proceed in terms of WaveGAN.


As we discussed in the previous post introducing GANs, deep convolutional networks have enhanced GANs numerically. The parameters of the convolutional neural network (CNN) determine the character and quality of a GAN model. CNN structures for images have been researched for years, but those for music are not well known. Sound generators have instead been built with RNNs or other tools such as WaveNet.


There is another issue: a spectrogram is not invertible. A spectrogram is a visualization of sound. The following image consists of 8x2 spectrograms from different genres of music.


So these look like 2D images with one channel (i.e., grayscale 2D images). Briefly, a spectrogram is built from the short-time Fourier transform (STFT) and some band filter. The STFT is a kind of Fourier transform and is invertible. The filter plays the role of compressing the signal data. We humans have such a filter in the ear, too: we don't receive the whole data from a sound; it goes through the filter in the ear, which makes it audible more efficiently. However, these band filters are not invertible operations. Thus, one needs to design an invertible filter that operates on the STFT-ed signal.
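To make the invertibility point concrete, here is a small sketch (using `scipy.signal`, not the authors' pipeline) showing that the STFT round-trip recovers the waveform almost exactly, while the magnitude spectrogram keeps only |Z| and throws the phase away:

```python
import numpy as np
from scipy.signal import stft, istft

# A toy signal: one second of a 440 Hz sine at 16 kHz.
fs = 16000
t = np.arange(fs) / fs
x = np.sin(2 * np.pi * 440 * t)

# The STFT is invertible: istft(stft(x)) recovers the signal
# up to numerical precision.
f, frames, Z = stft(x, fs=fs, nperseg=512)
_, x_rec = istft(Z, fs=fs, nperseg=512)
err = np.max(np.abs(x - x_rec[: len(x)]))  # near machine precision

# A magnitude spectrogram keeps |Z| but discards the phase,
# so it has no well-defined inverse without phase recovery.
S = np.abs(Z)
```

This is why "spectrogram GANs" need an extra phase-recovery step before the output can be heard, whereas WaveGAN avoids the problem by generating the raw waveform directly.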


In images, a small deviation would not disturb recognition of the content, such as faces. In a spectrogram, however, time-translational deviation should not be allowed: a small deviation along the time axis causes quite a lot of noise in the inverted sound signal. In the paper, this problem is resolved by upsampling and interpolation.
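As a rough illustration of the upsampling-with-interpolation idea (this is a generic 1D linear interpolation sketch, not the paper's exact operation), one can stretch a signal along the time axis smoothly instead of repeating samples:

```python
import numpy as np

def upsample_linear(x, factor):
    """Upsample a 1-D signal by linear interpolation.
    A generic sketch; the paper's exact upsampling op may differ."""
    n = len(x)
    old_idx = np.arange(n)
    new_idx = np.linspace(0, n - 1, n * factor)
    return np.interp(new_idx, old_idx, x)

x = np.array([0.0, 1.0, 0.0, -1.0])
y = upsample_linear(x, 4)
print(len(y))  # 16
```

Interpolation produces a smooth result, which matters for audio: abrupt sample-level jumps introduced by naive upsampling become audible noise after inversion.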

Checkerboard artifact and phase shuffle

GAN models that upsample by convolution are known to produce artifacts in images. Roughly, 2D transposed convolution makes checkerboard artifacts because the convolutional filter overlaps in a checkerboard pattern. An accessible explanation with good figures can be found in the article by A. Odena et al. The following figure from the article shows the checkerboard artifacts in four images.
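The overlap argument is easy to verify numerically. The sketch below (my own illustration, in 1D for simplicity) counts how many kernel taps write to each output position of a transposed convolution; when the stride does not divide the kernel size evenly, the counts alternate, which is exactly the checkerboard pattern:

```python
import numpy as np

def overlap_map_1d(out_len, kernel, stride):
    """Count how many kernel taps contribute to each output position
    of a 1-D transposed convolution (all-ones kernel, no padding)."""
    counts = np.zeros(out_len, dtype=int)
    pos = 0
    while pos + kernel <= out_len:
        counts[pos : pos + kernel] += 1
        pos += stride
    return counts

# kernel=3, stride=2: interior positions alternate between
# 1 and 2 contributions, giving an uneven "checkerboard" pattern.
print(overlap_map_1d(9, kernel=3, stride=2))  # [1 1 2 1 2 1 2 1 1]
```

With kernel=4 and stride=2 (kernel size divisible by stride) every interior position gets the same count, which is one standard way to mitigate the artifact.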


Similarly, WaveGAN and its 1D convolutions produce regular noise. As explained above, this small noise must be removed. The authors of WaveGAN used phase shuffle for this.


At each layer of the WaveGAN discriminator, the phase shuffle operation randomizes the phase of each channel by [−n, n] samples, filling in the missing samples (dashed outlines) by reflection. This figure shows possible outcomes for 4 channels (n = 1).1
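Based on that caption, a phase shuffle layer can be sketched in numpy as follows. This is my own reading of the description, not the authors' code: each channel is shifted by a random offset drawn from [-n, n], and the samples that fall off the edge are filled in by reflection.

```python
import numpy as np

def phase_shuffle(x, n, seed=None):
    """Sketch of phase shuffle per the WaveGAN figure caption:
    shift each channel by a random offset in [-n, n] samples,
    filling missing samples by reflection.
    x has shape (channels, time)."""
    rng = np.random.default_rng(seed)
    out = np.empty_like(x)
    for c in range(x.shape[0]):
        k = rng.integers(-n, n + 1)
        # Reflect-pad by n on both sides, then take a window
        # shifted by k samples.
        padded = np.pad(x[c], n, mode="reflect")
        out[c] = padded[n + k : n + k + x.shape[1]]
    return out

x = np.arange(20, dtype=float).reshape(4, 5)  # 4 channels, 5 samples
y = phase_shuffle(x, n=1, seed=0)
print(y.shape)  # (4, 5)
```

Because each channel receives an independent random shift at every discriminator layer, the discriminator cannot latch onto the fixed-period noise left by the upsampling convolutions.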

  1. I copied figure 5 and its caption from the WaveGAN paper. [return]