Speaker Recognition using TensorFlow

Arnaud
Mar 29, 2021

Recently, while working on some audio processing, I found many examples of word recognition; however, there was no recent example of speaker recognition. So here is my implementation.

User voice to user identity

Dataset

Code

import csv
import io
import os
import numpy as np
import random
import glob
import tensorflow as tf
from tensorflow.keras.models import Model, Sequential, load_model
from tensorflow.keras.layers import Input, Dense, BatchNormalization, Conv2D, MaxPooling2D, Dropout, Flatten, Reshape, AveragePooling2D
from tensorflow.keras.layers.experimental import preprocessing

model_name = "model_speaker_reco"
frame_length = 512  # 1024 -> minimal freq ~86 Hz at 44100 Hz; 512 -> minimal freq ~62.5 Hz at 16000 Hz
step_length = 0.008  # hop between STFT frames, in seconds
image_width = 64    # 0.008 * 64 = 512 ms of audio per training sample
batch_size = 32
epochs = 25
def audioToTensor(filepath):
    audio_binary = tf.io.read_file(filepath)
    audio, audioSR = tf.audio.decode_wav(audio_binary)
    audioSR = tf.get_static_value(audioSR)
    audio = tf.squeeze(audio, axis=-1)
    frame_step = int(audioSR * step_length)  # 16000 * 0.008 = 128 samples
    spectrogram = tf.signal.stft(audio, frame_length=frame_length, frame_step=frame_step)
    spect_real = tf.abs(spectrogram)  # magnitude of the complex STFT
    # split the spectrogram into non-overlapping windows of image_width frames
    partsCount = len(range(0, len(spectrogram)-image_width, image_width))
    parts = np.zeros((partsCount, image_width, int(frame_length/2+1)))
    for i, p in enumerate(range(0, len(spectrogram)-image_width, image_width)):
        parts[i] = spect_real[p:p+image_width]
    return parts, audioSR
X_data = []
Y_data = []
counts = {}  # number of training windows per speaker
for i, userFolder in enumerate(glob.glob("SRD/*")):  # one folder per speaker (SRD dataset)
    print(i, userFolder)
    for audiofile in glob.glob(userFolder+"/*.wav"):
        #print("Analyse audiofile:", audiofile)
        parts, SR = audioToTensor(audiofile)
        for part in parts:
            X_data.append(part)
            Y_data.append(i)
            if i not in counts:
                counts[i] = 0
            counts[i] += 1
Y_data = tf.keras.utils.to_categorical(Y_data)
print("len(X_data):", len(X_data))
print("counts:", counts)
print("nb cat:", len(counts))
# class_weight expects a weight per class, not a raw count:
# give under-represented speakers a proportionally larger weight
weights = {k: max(counts.values()) / v for k, v in counts.items()}
temp = list(zip(X_data, Y_data))
random.shuffle(temp)
X_data, Y_data = zip(*temp)
X_data = np.asarray(X_data)
Y_data = np.asarray(Y_data)
print(X_data.shape)
print(Y_data.shape)
if os.path.exists(model_name):
    print("Load: " + model_name)
    model = load_model(model_name)
else:
    main_input = Input(shape=(image_width, int(frame_length/2+1)), name='main_input')
    x = main_input
    x = Reshape((image_width, int(frame_length/2+1), 1))(x)
    x = preprocessing.Resizing(image_width//2, int(frame_length/2+1)//2)(x)
    x = Conv2D(8, 3, activation='relu', padding="same")(x)
    x = Conv2D(16, 3, activation='relu', padding="same")(x)
    x = MaxPooling2D()(x)
    x = Conv2D(16, 3, activation='relu', padding="same")(x)
    x = Conv2D(32, 3, activation='relu', padding="same")(x)
    x = MaxPooling2D()(x)
    x = Conv2D(32, 3, activation='relu', padding="same")(x)
    x = Conv2D(64, 3, activation='relu', padding="same")(x)
    x = AveragePooling2D(pool_size=3, strides=3)(x)
    x = Dropout(0.1)(x)
    x = Flatten(name="flatten")(x)
    x = Dense(16, name="identification")(x)  # 16-d speaker embedding
    # softmax for single-label multi-class classification with categorical_crossentropy
    x = Dense(len(counts), activation='softmax')(x)
    model = Model(inputs=main_input, outputs=x)
tf.keras.utils.plot_model(model, to_file='model_id.png', show_shapes=True)
model.summary()
model.compile(optimizer='adam', loss='categorical_crossentropy', metrics=['accuracy'])
split = int(0.8 * len(X_data))  # 80/20 train/validation split
model.fit(X_data[:split], Y_data[:split], epochs=epochs, batch_size=batch_size,
          class_weight=weights,
          validation_data=(X_data[split:], Y_data[split:]))
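
To use the trained model on a new recording, you can predict each 512 ms window and average the probabilities over the whole file. A minimal sketch (the file path is only an example):

# Hypothetical inference on a new recording
parts, SR = audioToTensor('SRD/user0/sample.wav')  # example path
predictions = model.predict(np.asarray(parts))
mean_prediction = np.mean(predictions, axis=0)  # average over all windows
print("Predicted speaker:", np.argmax(mean_prediction))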

More explanation about the audio processing is available here: https://medium.com/swlh/a-journey-to-speech-recognition-using-tensorflow-1fc1169fef99

Result

A random choice would have an accuracy of 0.2 (i.e., 1 out of 5 speakers). After 25 epochs, we obtained a validation accuracy of 0.9913.
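
If you want to reproduce this number, the held-out 20% used for validation can be evaluated directly; a minimal sketch reusing the split index from the training code above:

loss, accuracy = model.evaluate(X_data[split:], Y_data[split:], batch_size=batch_size)
print("validation accuracy:", accuracy)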

Improvement

With the previous model, we can only predict a user from among those seen during training. We cannot identify an unknown user.

In order to compare and identify a new user, we can use the penultimate layer of our model: its output is a feature vector (an embedding) of the input. Similar feature vectors mean the same speaker; different ones mean different speakers, as sketched below.
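
Here is a minimal sketch of this idea, reusing the "identification" layer defined above; the file paths and the 0.8 decision threshold are assumptions to tune on your own data:

# Sub-model that outputs the 16-d penultimate ("identification") layer
embedding_model = Model(inputs=model.input, outputs=model.get_layer("identification").output)

def speaker_embedding(filepath):
    # average the per-window embeddings over the whole file
    parts, SR = audioToTensor(filepath)
    return np.mean(embedding_model.predict(np.asarray(parts)), axis=0)

def cosine_similarity(a, b):
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

emb_a = speaker_embedding('SRD/user0/sample1.wav')  # example path
emb_b = speaker_embedding('unknown/sample.wav')     # example path
similarity = cosine_similarity(emb_a, emb_b)
print("same speaker" if similarity > 0.8 else "different speaker")  # 0.8: arbitrary threshold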

Azure API

Recently, Microsoft released an API for speaker recognition. It can only identify registered (enrolled) users.

https://azure.microsoft.com/en-us/services/cognitive-services/speaker-recognition/
