When your boss says “Make a noise suppression system”
In this article, we explore and compare several simple models for noise suppression.
First, search for existing examples
Then, find data
Let’s make something ourselves
Basic idea
We want to create a model that predicts the gain to apply to the noisy frequencies so that they get close to the clear frequencies.
We use the STFT to obtain a spectrogram of the input, then we predict and apply the gain to it, and finally we use the inverse STFT to reconstruct a waveform.
- https://www.tensorflow.org/api_docs/python/tf/signal/stft
- https://www.tensorflow.org/api_docs/python/tf/signal/inverse_stft
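Putting the pieces together, the whole chain looks roughly like this (a minimal sketch: audioToTensor and spectToOscillo are defined in the next two sections, and the file path and trained model are placeholders):

spect_real, spect_image, spect_sign, audioSR = audioToTensor('data/noisy/example.wav')
gain = model.predict(spect_real)  # predicted per-bin gain in [0, 1]
denoised = spectToOscillo(spect_real * gain, spect_sign, spect_image, audioSR)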
Waveform to Spectrogram
import tensorflow as tf

frame_length = 512  # STFT window size -> int(frame_length/2+1) = 257 frequency bins

def audioToTensor(filepath: str):
    audio_binary = tf.io.read_file(filepath)
    audio, audioSR = tf.audio.decode_wav(audio_binary)
    audioSR = tf.get_static_value(audioSR)
    audio = tf.squeeze(audio, axis=-1)
    frame_step = int(audioSR * 0.008)  # 16000*0.008 = 128 samples between frames
    # 512-point frames at 16 kHz give a frequency resolution of about 31 Hz per bin
    spectrogram = tf.signal.stft(audio, frame_length=frame_length, frame_step=frame_step)
    spect_image = tf.math.imag(spectrogram)
    spect_real = tf.math.real(spectrogram)
    spect_sign = tf.sign(spect_real)  # keep the sign to restore the real part later
    # store the magnitude in dB; spectToOscillo undoes this with pow(10, x/20)
    spect_real = 20 * tf.math.log(tf.abs(spect_real) + 1e-7) / tf.math.log(10.0)
    return spect_real, spect_image, spect_sign, audioSR
Spectrogram to Waveform
def spectToOscillo(spect_real, spect_sign, spect_image, audioSR):
    frame_step = int(audioSR * 0.008)
    spect_real = tf.pow(10.0, spect_real / 20)  # dB back to linear magnitude
    spect_real *= spect_sign  # restore the sign of the real part
    spect_all = tf.complex(spect_real, spect_image)
    inverse_stft = tf.signal.inverse_stft(spect_all, frame_length=frame_length, frame_step=frame_step, window_fn=tf.signal.inverse_stft_window_fn(frame_step))
    return inverse_stft
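A quick round-trip check helps validate the two functions (the file path is illustrative; tf.audio.encode_wav expects a [samples, channels] tensor):

spect_real, spect_image, spect_sign, audioSR = audioToTensor('data/clear/example.wav')
waveform = spectToOscillo(spect_real, spect_sign, spect_image, audioSR)
tf.io.write_file('reconstructed.wav', tf.audio.encode_wav(tf.expand_dims(waveform, -1), audioSR))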
Create data
To create the training data, we add noise to the clear recordings.
Source: https://github.com/aruno14/noiseSupression/blob/main/create_noisy.py
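The linked script performs the mixing; a minimal sketch of the idea (the file paths, the 0.3 noise gain, and the length handling are illustrative assumptions):

import tensorflow as tf

def makeNoisy(clear_path, noise_path, out_path, noise_gain=0.3):
    clear, sr = tf.audio.decode_wav(tf.io.read_file(clear_path))
    noise, _ = tf.audio.decode_wav(tf.io.read_file(noise_path))
    # assumes the noise clip is at least as long as the clear clip
    noisy = clear + noise_gain * noise[:tf.shape(clear)[0]]
    noisy = tf.clip_by_value(noisy, -1.0, 1.0)  # keep samples in valid wav range
    tf.io.write_file(out_path, tf.audio.encode_wav(noisy, sr))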
Load data
Since the loss function needs the original noisy input during fitting, we append the noisy frame to the clear target frame.
import glob
import numpy as np

clear_files = sorted(glob.glob('data/clear/*.wav'))  # clear/noisy pairs share file names
voice_max_length = 32  # context window in frames (illustrative value; see the linked scripts)
testSpect, _, _, _ = audioToTensor('data/noisy/book_00000_chp_0009_reader_06709_0_---1_cCGK4M.wav')
print("testSpect.shape:", testSpect.shape)
max_count = 3000  # number of training windows to build
x_train = np.zeros((max_count, voice_max_length, testSpect.shape[1]))
y_train = np.zeros((max_count, testSpect.shape[1]*2))  # clear frame followed by noisy frame
data_count = 0
for i, path_clear in enumerate(clear_files):
    path_noisy = path_clear.replace("clear", "noisy")
    spectNoisy, _, _, _ = audioToTensor(path_noisy)
    spectClear, _, _, _ = audioToTensor(path_clear)
    for k in range(0, len(spectNoisy)-voice_max_length):
        x_train[data_count] = spectNoisy[k:k+voice_max_length]
        y_train[data_count] = np.append(spectClear[k+voice_max_length], spectNoisy[k+voice_max_length])
        data_count += 1
        if data_count >= max_count:
            break
    if data_count >= max_count:
        break
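Why keep the noisy frame in y_train? Because the loss function can then unpack both halves of the target. A hedged sketch of such a loss (the split is the obvious reading of y_train above; the exact formula in the linked scripts may differ):

n_freq = testSpect.shape[1]

def customLoss(y_true, y_pred):
    clear_part = y_true[:, :n_freq]  # the clear target frame
    noisy_part = y_true[:, n_freq:]  # the original noisy frame, available to the loss
    return tf.reduce_mean(tf.square(clear_part - y_pred))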
Prepare some models
First, as a baseline, we create a single Dense layer that predicts the clear frequencies directly from the noisy ones. This is not a good approach, since it can completely change the output and lose meaningful frequencies.
from tensorflow.keras.layers import (Input, Dense, Multiply, Reshape, Conv2D,
    MaxPooling2D, Dropout, Flatten, LSTM, TimeDistributed)
from tensorflow.keras.layers.experimental import preprocessing  # imports cover all model variants below
from tensorflow.keras.models import Model

main_input = Input(shape=(int(frame_length/2+1),), name='main_input')
x = Dense(int(frame_length/2+1), activation='linear')(main_input)  # predicts the clear bins directly
model = Model(inputs=main_input, outputs=x)
Source: https://github.com/aruno14/noiseSupression/blob/main/test_dense.py
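Training is a plain Keras fit; a sketch with illustrative hyperparameters (x_frames and y_frames are hypothetical per-frame arrays of noisy inputs and clear targets):

model.compile(optimizer='adam', loss='mse')
model.fit(x_frames, y_frames, epochs=10, batch_size=32, validation_split=0.1)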
In the screenshot below, we can see the clear, noisy, and predicted sound waveforms.
Second, we create a single Dense Gain layer that predicts the gain/coefficient to apply to the input noisy frequencies to bring them close to the clear ones.
main_input = Input(shape=(int(frame_length/2+1),), name='main_input')
x = Dense(int(frame_length/2+1), activation='sigmoid')(main_input)  # per-bin gain in [0, 1]
x = Multiply()([x, main_input])  # apply the gain to the noisy input
model = Model(inputs=main_input, outputs=x)
Source: https://github.com/aruno14/noiseSupression/blob/main/test_dense_gain.py
Third, we create a 2-dims Dense Gain model that predicts the gain to apply to the frequencies over a period of time. The input x-axis represents frequencies and the y-axis represents time.
main_input = Input(shape=(image_width, int(frame_length/2+1)), name='main_input')
x = Reshape((image_width, int(frame_length/2+1), 1))(main_input)  # add a channel axis
x = preprocessing.Resizing(image_width//2, int(frame_length/2+1)//2)(x)  # downsample to save compute
x = Conv2D(34, 3, activation='relu')(x)
x = Conv2D(64, 3, activation='relu')(x)
x = MaxPooling2D()(x)
x = Dropout(0.1)(x)
x = Flatten()(x)
x = Dense(int(frame_length/2+1), activation='sigmoid')(x)  # one gain per frequency bin
x = Multiply()([x, main_input])  # the gain is broadcast over the time axis
model = Model(inputs=main_input, outputs=x)
Source: https://github.com/aruno14/noiseSupression/blob/main/test_dense_image.py
Fourth, we use an LSTM to predict the gain.
main_input = Input(shape=(None, int(frame_length/2+1)), name='main_input')  # variable-length sequence of frames
x = LSTM(256, activation='tanh', recurrent_activation='sigmoid', return_sequences=False)(main_input)
x = Dense(int(frame_length/2+1), activation='sigmoid')(x)  # gain for the most recent frame only
x = Multiply()([x, main_input[:, -1]])  # apply it to the last noisy frame
model = Model(inputs=main_input, outputs=x)
Source: https://github.com/aruno14/noiseSupression/blob/main/test_lstm_simple.py
We also try an LSTM variant that returns the full sequence, predicting a gain for every time step.
main_input = Input(shape=(None, int(frame_length/2+1)), name='main_input')
x = LSTM(256, activation='tanh', recurrent_activation='sigmoid', return_sequences=True)(main_input)
x = Dense(int(frame_length/2+1), activation='sigmoid')(x)  # one gain vector per time step
x = Multiply()([x, main_input])  # apply the gains to the whole sequence
model = Model(inputs=main_input, outputs=x)
Source: https://github.com/aruno14/noiseSupression/blob/main/test_lstm_sequence.py
Fifth, we combine everything into an LSTM 2-dims model: a TimeDistributed CNN summarizes each spectrogram chunk, then an LSTM predicts the gain sequence.
main_input = Input(shape=(voice_max_length, parts.shape[1], parts.shape[2]), name='main_input')  # sequence of spectrogram chunks
x = TimeDistributed(Reshape((image_width, int(frame_length/2+1), 1)))(main_input)
x = TimeDistributed(preprocessing.Resizing(image_width//4, parts.shape[2]//4))(x)
x = TimeDistributed(Conv2D(34, 3, activation='relu'))(x)
x = TimeDistributed(Conv2D(64, 3, activation='relu'))(x)
x = TimeDistributed(MaxPooling2D())(x)
x = TimeDistributed(Dropout(0.1))(x)
x = TimeDistributed(Flatten())(x)
x = LSTM(256, activation='tanh', recurrent_activation='sigmoid', return_sequences=True)(x)
x = Dense(int(frame_length/2+1), activation='sigmoid')(x)
x = Reshape((voice_max_length, 1, 257))(x)  # 257 = int(frame_length/2+1)
x = Multiply()([x, main_input])  # broadcast the gain over each chunk
model = Model(inputs=main_input, outputs=x)
Source: https://github.com/aruno14/noiseSupression/blob/main/test_lstm_image.py
Overall loss calculation
print('Evaluate...')
total_loss = 0
for i, path_clear in enumerate(clear_files):
    path_noisy = path_clear.replace("clear", "noisy")
    spectNoisy, _, _, _ = audioToTensor(path_noisy)
    spectClear, _, _, _ = audioToTensor(path_clear)
    result = model.predict(spectNoisy)
    loss = np.mean(tf.keras.losses.mean_squared_error(spectClear, result).numpy())
    total_loss += loss
    print(path_noisy, "->", loss)
print("total_loss:", total_loss/len(clear_files))
Example of a spectrogram
Results comparison
Possible improvements
As is done in RNNoise, we could use the Bark scale to reduce the input size: instead of feeding all 257 linear-frequency bins to the model, we would group them into a couple dozen perceptually spaced bands.
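A hedged sketch of that idea (the band edges follow the standard Bark scale up to the 8 kHz Nyquist frequency of our 16 kHz audio; RNNoise’s exact band layout differs):

import numpy as np

# Bark band edges in Hz (Zwicker's scale), truncated at 8 kHz
BARK_EDGES = [0, 100, 200, 300, 400, 510, 630, 770, 920, 1080, 1270,
              1480, 1720, 2000, 2320, 2700, 3150, 3700, 4400, 5300, 6400, 7700]

def toBarkBands(frame, sr=16000, n_fft=512):
    # average the 257 linear STFT bins into 21 Bark-like bands
    bin_hz = sr / n_fft  # width of one STFT bin in Hz
    bands = []
    for lo, hi in zip(BARK_EDGES[:-1], BARK_EDGES[1:]):
        lo_bin = int(lo / bin_hz)
        hi_bin = max(lo_bin + 1, int(hi / bin_hz))
        bands.append(frame[lo_bin:hi_bin].mean())
    return np.array(bands)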