When your boss says “Make a noise suppression system”
In this article, we explore and compare several simple models for noise suppression.
First, search for existing examples
Then, find data
Let’s make something ourselves
Basic idea
We want to create a model that predicts the gain to apply to the noisy frequencies so that they get close to the clear frequencies.
We use the STFT to obtain a spectrogram of the input, then we predict and apply the gain to it, and finally we use the inverse STFT to reconstruct a waveform.
- https://www.tensorflow.org/api_docs/python/tf/signal/stft
- https://www.tensorflow.org/api_docs/python/tf/signal/inverse_stft
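Putting the pieces together, the whole chain looks roughly like this (a minimal sketch: audioToTensor and spectToOscillo are defined in the next two sections, and the file path and trained model are placeholders):

spect_real, spect_image, spect_sign, audioSR = audioToTensor('data/noisy/example.wav')
gain = model.predict(spect_real)  # predicted per-bin gain in [0, 1]
denoised = spectToOscillo(spect_real * gain, spect_sign, spect_image, audioSR)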
Waveform to Spectrogram
import tensorflow as tf

frame_length = 512  # STFT window size -> int(frame_length/2+1) = 257 frequency bins

def audioToTensor(filepath: str):
    audio_binary = tf.io.read_file(filepath)
    audio, audioSR = tf.audio.decode_wav(audio_binary)
    audioSR = tf.get_static_value(audioSR)
    audio = tf.squeeze(audio, axis=-1)
    frame_step = int(audioSR * 0.008)  # 16000*0.008 = 128 samples between frames
    # 512-point frames at 16 kHz give a frequency resolution of about 31 Hz per bin
    spectrogram = tf.signal.stft(audio, frame_length=frame_length, frame_step=frame_step)
    spect_image = tf.math.imag(spectrogram)
    spect_real = tf.math.real(spectrogram)
    spect_sign = tf.sign(spect_real)  # keep the sign to restore the real part later
    # store the magnitude in dB; spectToOscillo undoes this with pow(10, x/20)
    spect_real = 20 * tf.math.log(tf.abs(spect_real) + 1e-7) / tf.math.log(10.0)
    return spect_real, spect_image, spect_sign, audioSR
Spectrogram to Waveform
def spectToOscillo(spect_real, spect_sign, spect_image, audioSR):
    frame_step = int(audioSR * 0.008)
    spect_real = tf.pow(10.0, spect_real / 20)  # dB back to linear magnitude
    spect_real *= spect_sign  # restore the sign of the real part
    spect_all = tf.complex(spect_real, spect_image)
    inverse_stft = tf.signal.inverse_stft(spect_all, frame_length=frame_length, frame_step=frame_step, window_fn=tf.signal.inverse_stft_window_fn(frame_step))
    return inverse_stft
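A quick round-trip check helps validate the two functions (the file path is illustrative; tf.audio.encode_wav expects a [samples, channels] tensor):

spect_real, spect_image, spect_sign, audioSR = audioToTensor('data/clear/example.wav')
waveform = spectToOscillo(spect_real, spect_sign, spect_image, audioSR)
tf.io.write_file('reconstructed.wav', tf.audio.encode_wav(tf.expand_dims(waveform, -1), audioSR))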
Create data
To create the training data, we add noise to the clear recordings.
Source: https://github.com/aruno14/noiseSupression/blob/main/create_noisy.py
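The linked script performs the mixing; a minimal sketch of the idea (the file paths, the 0.3 noise gain, and the length handling are illustrative assumptions):

import tensorflow as tf

def makeNoisy(clear_path, noise_path, out_path, noise_gain=0.3):
    clear, sr = tf.audio.decode_wav(tf.io.read_file(clear_path))
    noise, _ = tf.audio.decode_wav(tf.io.read_file(noise_path))
    # assumes the noise clip is at least as long as the clear clip
    noisy = clear + noise_gain * noise[:tf.shape(clear)[0]]
    noisy = tf.clip_by_value(noisy, -1.0, 1.0)  # keep samples in valid wav range
    tf.io.write_file(out_path, tf.audio.encode_wav(noisy, sr))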
Load data
Since the loss function needs the original noisy input during fitting, we append the noisy frame to the clear target frame.
import glob
import numpy as np

clear_files = sorted(glob.glob('data/clear/*.wav'))  # clear/noisy pairs share file names
voice_max_length = 32  # context window in frames (illustrative value; see the linked scripts)
testSpect, _, _, _ = audioToTensor('data/noisy/book_00000_chp_0009_reader_06709_0_---1_cCGK4M.wav')
print("testSpect.shape:", testSpect.shape)
max_count = 3000  # number of training windows to build
x_train = np.zeros((max_count, voice_max_length, testSpect.shape[1]))
y_train = np.zeros((max_count, testSpect.shape[1]*2))  # clear frame followed by noisy frame
data_count = 0
for i, path_clear in enumerate(clear_files):
    path_noisy = path_clear.replace("clear", "noisy")
    spectNoisy, _, _, _ = audioToTensor(path_noisy)
    spectClear, _, _, _ = audioToTensor(path_clear)
    for k in range(0, len(spectNoisy)-voice_max_length):
        x_train[data_count] = spectNoisy[k:k+voice_max_length]
        y_train[data_count] = np.append(spectClear[k+voice_max_length], spectNoisy[k+voice_max_length])
        data_count += 1
        if data_count >= max_count:
            break
    if data_count >= max_count:
        break
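Why keep the noisy frame in y_train? Because the loss function can then unpack both halves of the target. A hedged sketch of such a loss (the split is the obvious reading of y_train above; the exact formula in the linked scripts may differ):

n_freq = testSpect.shape[1]

def customLoss(y_true, y_pred):
    clear_part = y_true[:, :n_freq]  # the clear target frame
    noisy_part = y_true[:, n_freq:]  # the original noisy frame, available to the loss
    return tf.reduce_mean(tf.square(clear_part - y_pred))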
Prepare some models
First, as a baseline, we create a single Dense layer that predicts the clear frequencies directly from the noisy ones. This is not a good approach, since it can completely change the output and lose meaningful frequencies.
from tensorflow.keras.layers import (Input, Dense, Multiply, Reshape, Conv2D,
    MaxPooling2D, Dropout, Flatten, LSTM, TimeDistributed)
from tensorflow.keras.layers.experimental import preprocessing  # imports cover all model variants below
from tensorflow.keras.models import Model

main_input = Input(shape=(int(frame_length/2+1),), name='main_input')
x = Dense(int(frame_length/2+1), activation='linear')(main_input)  # predicts the clear bins directly
model = Model(inputs=main_input, outputs=x)
Source: https://github.com/aruno14/noiseSupression/blob/main/test_dense.py
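Training is a plain Keras fit; a sketch with illustrative hyperparameters (x_frames and y_frames are hypothetical per-frame arrays of noisy inputs and clear targets):

model.compile(optimizer='adam', loss='mse')
model.fit(x_frames, y_frames, epochs=10, batch_size=32, validation_split=0.1)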
In the screenshot below, we can see the clear, noisy, and predicted sound waveforms.
Second, we create a single Dense Gain layer that predicts the gain/coefficient to apply to the input noisy frequencies to bring them close to the clear ones.
main_input = Input(shape=(int(frame_length/2+1),), name='main_input')
x = Dense(int(frame_length/2+1), activation='sigmoid')(main_input)  # per-bin gain in [0, 1]
x = Multiply()([x, main_input])  # apply the gain to the noisy input
model = Model(inputs=main_input, outputs=x)
Source: https://github.com/aruno14/noiseSupression/blob/main/test_dense_gain.py
Third, we create a 2-dims Dense Gain model that predicts the gain to apply to the frequencies over a period of time. The input x-axis represents frequencies and the y-axis represents time.
main_input = Input(shape=(image_width, int(frame_length/2+1)), name='main_input')
x = Reshape((image_width, int(frame_length/2+1), 1))(main_input)  # add a channel axis
x = preprocessing.Resizing(image_width//2, int(frame_length/2+1)//2)(x)  # downsample to save compute
x = Conv2D(34, 3, activation='relu')(x)
x = Conv2D(64, 3, activation='relu')(x)
x = MaxPooling2D()(x)
x = Dropout(0.1)(x)
x = Flatten()(x)
x = Dense(int(frame_length/2+1), activation='sigmoid')(x)  # one gain per frequency bin
x = Multiply()([x, main_input])  # the gain is broadcast over the time axis
model = Model(inputs=main_input, outputs=x)
Source: https://github.com/aruno14/noiseSupression/blob/main/test_dense_image.py
Fourth, we use an LSTM to predict the gain.
main_input = Input(shape=(None, int(frame_length/2+1)), name='main_input')  # variable-length sequence of frames
x = LSTM(256, activation='tanh', recurrent_activation='sigmoid', return_sequences=False)(main_input)
x = Dense(int(frame_length/2+1), activation='sigmoid')(x)  # gain for the most recent frame only
x = Multiply()([x, main_input[:, -1]])  # apply it to the last noisy frame
model = Model(inputs=main_input, outputs=x)
Source: https://github.com/aruno14/noiseSupression/blob/main/test_lstm_simple.py
We also try an LSTM variant that returns the full sequence, predicting a gain for every time step.
main_input = Input(shape=(None, int(frame_length/2+1)), name='main_input')
x = LSTM(256, activation='tanh', recurrent_activation='sigmoid', return_sequences=True)(main_input)
x = Dense(int(frame_length/2+1), activation='sigmoid')(x)  # one gain vector per time step
x = Multiply()([x, main_input])  # apply the gains to the whole sequence
model = Model(inputs=main_input, outputs=x)
Source: https://github.com/aruno14/noiseSupression/blob/main/test_lstm_sequence.py
Fifth, we combine everything into an LSTM 2-dims model: a TimeDistributed CNN summarizes each spectrogram chunk, then an LSTM predicts the gain sequence.
main_input = Input(shape=(voice_max_length, parts.shape[1], parts.shape[2]), name='main_input')  # sequence of spectrogram chunks
x = TimeDistributed(Reshape((image_width, int(frame_length/2+1), 1)))(main_input)
x = TimeDistributed(preprocessing.Resizing(image_width//4, parts.shape[2]//4))(x)
x = TimeDistributed(Conv2D(34, 3, activation='relu'))(x)
x = TimeDistributed(Conv2D(64, 3, activation='relu'))(x)
x = TimeDistributed(MaxPooling2D())(x)
x = TimeDistributed(Dropout(0.1))(x)
x = TimeDistributed(Flatten())(x)
x = LSTM(256, activation='tanh', recurrent_activation='sigmoid', return_sequences=True)(x)
x = Dense(int(frame_length/2+1), activation='sigmoid')(x)
x = Reshape((voice_max_length, 1, 257))(x)  # 257 = int(frame_length/2+1)
x = Multiply()([x, main_input])  # broadcast the gain over each chunk
model = Model(inputs=main_input, outputs=x)
Source: https://github.com/aruno14/noiseSupression/blob/main/test_lstm_image.py
Overall loss calculation
print('Evaluate...')
total_loss = 0
for i, path_clear in enumerate(clear_files):
    path_noisy = path_clear.replace("clear", "noisy")
    spectNoisy, _, _, _ = audioToTensor(path_noisy)
    spectClear, _, _, _ = audioToTensor(path_clear)
    result = model.predict(spectNoisy)
    loss = np.mean(tf.keras.losses.mean_squared_error(spectClear, result).numpy())
    total_loss += loss
    print(path_noisy, "->", loss)
print("total_loss:", total_loss/len(clear_files))
Example of a spectrogram
Results comparison
Possible improvements
As is done in RNNoise, we could use the Bark scale to reduce the input size: instead of feeding all 257 linear-frequency bins to the model, we would group them into a couple dozen perceptually spaced bands.
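A hedged sketch of that idea (the band edges follow the standard Bark scale up to the 8 kHz Nyquist frequency of our 16 kHz audio; RNNoise’s exact band layout differs):

import numpy as np

# Bark band edges in Hz (Zwicker's scale), truncated at 8 kHz
BARK_EDGES = [0, 100, 200, 300, 400, 510, 630, 770, 920, 1080, 1270,
              1480, 1720, 2000, 2320, 2700, 3150, 3700, 4400, 5300, 6400, 7700]

def toBarkBands(frame, sr=16000, n_fft=512):
    # average the 257 linear STFT bins into 21 Bark-like bands
    bin_hz = sr / n_fft  # width of one STFT bin in Hz
    bands = []
    for lo, hi in zip(BARK_EDGES[:-1], BARK_EDGES[1:]):
        lo_bin = int(lo / bin_hz)
        hi_bin = max(lo_bin + 1, int(hi / bin_hz))
        bands.append(frame[lo_bin:hi_bin].mean())
    return np.array(bands)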