Literature Review for Speaker Change Detection

This is a draft version, so there may be many typos and unreferenced quotes. Also, please reach out to me if you would like a different paper added; feel free to send me an e-mail.

UPDATE(17 October 2018): After a conversation with Quan Wang, I am trying to keep my blog post up to date. I am very grateful for his help and effort. I have added FULLY SUPERVISED SPEAKER DIARIZATION and will add Herve Bredin's new paper as soon as possible.

UPDATE(29 October 2018): I have added Neural speech turn segmentation and affinity propagation for speaker diarization.

UPDATE(12 November 2018): Quan Wang and his team published the source code of FULLY SUPERVISED SPEAKER DIARIZATION.

UPDATE(21 February 2019): Quan Wang released a lecture about UIS-RNN. I highly recommend it.

Speaker diarization is the task of determining “who spoke when” in an audio stream that usually contains an unknown amount of speech from an unknown number of speakers. Speaker change detection is an important part of speaker diarization systems. It aims at finding the boundaries between speech turns of two different speakers.

alt text

This slide belongs to this video. I highly recommend it. :)

Before the papers, I just want to share some useful datasets.

1) Multimodal Speaker Segmentation and Diarization using Lexical and Acoustic Cues via Sequence to Sequence Neural Networks

  • “In our work we propose a system that incorporates both lexical cues and acoustic cues to build a system closer to how humans employ information.” They use lexical information to improve results: if the script of the recording is available, they use it directly; if not, ASR is used to extract the lexical cues.

  • In this work, the main architecture is a sequence-to-sequence (seq2seq) model which summarizes the whole sequence into an embedding. Moreover, it can integrate information and process variable-length sequences. Thus, the model can capture temporally encoded information from both before and after the speaker change points. It also uses an attention mechanism, so the system can learn which information is most important to characterize the speaker.

alt text

alt text

  • The encoder takes the sequence of word representations and MFCCs (13-dimensional, extracted with a 25ms window and 10ms shift). The decoder produces a sequence of words with speaker IDs. Thus, the system can learn speaker change points.

    • The source sequence is 32 words (one-hot word vectors) coming from the reference script or ASR output. The target sequence is the same 32 words with speaker turn tokens added.
  • To maximize the accuracy of speaker turn detection, they use a shift-and-overlap scheme to predict the speaker turns.

alt text

2) Speaker2Vec: Unsupervised Learning and Adaptation of a Speaker Manifold using Deep Neural Networks with an Evaluation on Speaker Segmentation

Their aim is to derive a speaker-characteristic manifold learned in an unsupervised manner.

Note: State-of-the-art unsupervised speaker segmentation approaches are based on measuring the statistical distance between two consecutive windows of the audio signal, for instance BIC or KL divergence. These methods use low-level features like MFCCs for signal parameterization.

  • They assume that temporally-near speech segments belong to the same speaker, so a joint representation connecting these nearby segments can encode their common information. Thus, this bottleneck representation will capture mainly speaker-specific information. At test time, a simple distance metric is applied to detect speaker change points.

  • Given any small segment (say 1 second) of speech, a trained Speaker2Vec model can find its latent “representation vector” or “embedding” which contains mostly speaker-specific information.

  • They train a DNN on unlabeled data to learn a speaker-characteristics manifold, use the trained model to generate embeddings for the test audio, and use those embeddings to find the speaker change points.

Methodology of this work

  • They try to learn a speaker-characteristics manifold with an autoencoder.

    • They do not try to reconstruct the input itself. Instead, they try to reconstruct a small window of speech from a temporally nearby window, under the hypothesis that the two windows belong to the same speaker. (In some cases this assumption does not hold, but such cases are rare enough to be negligible. Also, after the first training pass, they retrain the system on homogeneous segments of speech.) With this reconstruction, the system can discard unnecessary features and capture the information common to the two windows. Thus, it can learn a speaker-characteristic manifold. A minimal sketch of this training objective is given after the figure below.

alt text
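
To make the idea concrete, here is a minimal PyTorch sketch of this training objective. The MSE reconstruction loss, tanh activations and 1-second windows of stacked 13-dimensional MFCCs are my assumptions for illustration; the hidden-layer sizes follow the ones reported further below.

```python
import torch
import torch.nn as nn

class Speaker2VecAE(nn.Module):
    """Bottleneck autoencoder: encode the current window of speech and try to
    reconstruct a temporally *nearby* window rather than the input itself."""
    def __init__(self, input_dim, hidden=(4000, 2000), bottleneck=40):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Linear(input_dim, hidden[0]), nn.Tanh(),
            nn.Linear(hidden[0], hidden[1]), nn.Tanh(),
            nn.Linear(hidden[1], bottleneck),
        )
        self.decoder = nn.Sequential(
            nn.Linear(bottleneck, hidden[1]), nn.Tanh(),
            nn.Linear(hidden[1], hidden[0]), nn.Tanh(),
            nn.Linear(hidden[0], input_dim),
        )

    def forward(self, current_window):
        embedding = self.encoder(current_window)     # speaker-specific bottleneck
        return self.decoder(embedding), embedding

# One training step: x_t is the current window, x_near a temporally nearby window
model = Speaker2VecAE(input_dim=13 * 100)            # e.g. ~1 s of 13-dim MFCC frames
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)
x_t, x_near = torch.randn(32, 13 * 100), torch.randn(32, 13 * 100)  # placeholder batch
reconstruction, _ = model(x_t)
loss = nn.functional.mse_loss(reconstruction, x_near)  # reconstruct the *nearby* window
optimizer.zero_grad(); loss.backward(); optimizer.step()
```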

  • For segmentation, the system uses the embeddings instead of the original MFCC features, with the asymmetric KL divergence as the distance measure (see the sketch after the list below).

  • They use a two-pass algorithm:

    • Find the speaker change points with the trained DNN model.
    • Get all possible speaker-homogeneous regions.
    • Retrain the same DNN on these homogeneous segments of speech.
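
A rough sketch of the segmentation step, assuming each window of embeddings is modeled as a diagonal Gaussian (the window and hop sizes below are placeholders, not the paper's values):

```python
import numpy as np

def kl_divergence_diag(mu_p, var_p, mu_q, var_q):
    """Asymmetric KL divergence KL(P || Q) between two diagonal Gaussians."""
    return 0.5 * np.sum(
        np.log(var_q / var_p) + (var_p + (mu_p - mu_q) ** 2) / var_q - 1.0
    )

def change_point_scores(embeddings, window=100, hop=10):
    """Slide two adjacent windows over the embedding sequence and score each
    candidate boundary by the KL divergence between their Gaussian models."""
    scores = []
    for start in range(0, len(embeddings) - 2 * window, hop):
        left = embeddings[start:start + window]
        right = embeddings[start + window:start + 2 * window]
        score = kl_divergence_diag(left.mean(0), left.var(0) + 1e-6,
                                   right.mean(0), right.var(0) + 1e-6)
        scores.append((start + window, score))   # (candidate boundary, score)
    return scores  # peaks above a tuned threshold are speaker change points
```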

Experiment

  • They use data from TED-LIUM and YouTube for training. To compare against baseline methods, they use the TED-LIUM evaluation data.

alt text

  • There are 2 different architectures: one for TED-LIUM and YouTube (4000 - 2000 - 40 - 2000 - 4000) and another for YouTubeLarge (6000 - 2000 - 40 - 2000 - 6000 - 4000). The embedding layer is always 40, because they want this layer to match the MFCC dimension.

  • They compare their results with state-of-the-art methods on an artificially created TIMIT dataset.

alt text

3) TRISTOUNET: TRIPLET LOSS FOR SPEAKER TURN EMBEDDING

“TristouNet is a neural network architecture based on Long Short-Term Memory recurrent networks, meant to project speech sequences into a fixed-dimensional euclidean space. Thanks to the triplet loss paradigm used for training, the resulting sequence embeddings can be compared directly with the euclidean distance, for speaker comparison purposes.”

alt text

This figure summarizes the main idea. During training, the system takes three different sequences (the anchor and the positive belong to the same speaker, while the negative comes from a different speaker) and converts each into an embedding. The triplet loss is then applied to these embeddings: its aim is to minimize the distance between the embeddings of the anchor and the positive and to maximize the distance between the embeddings of the anchor and the negative.

alt text

This figure depicts how the embedding is created from a sequence. A minimal sketch of the triplet loss follows.
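
Here is a minimal sketch of the triplet loss on Euclidean distances (the margin value is a placeholder, not necessarily the one used in the paper):

```python
import torch
import torch.nn.functional as F

def triplet_loss(anchor, positive, negative, margin=0.2):
    """Pull the anchor towards the positive (same speaker) and push it away
    from the negative (different speaker) in Euclidean space."""
    d_ap = F.pairwise_distance(anchor, positive)   # anchor-positive distances
    d_an = F.pairwise_distance(anchor, negative)   # anchor-negative distances
    return torch.clamp(d_ap - d_an + margin, min=0.0).mean()

# Example with a batch of 16 random 32-dimensional embeddings (placeholder sizes)
loss = triplet_loss(torch.randn(16, 32), torch.randn(16, 32), torch.randn(16, 32))
```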

4) Speaker Change Detection in Broadcast TV using Bidirectional Long Short-Term Memory Networks

This project is open source.

Speaker change detection is cast as a binary sequence labelling task and addressed with bidirectional long short-term memory networks (Bi-LSTMs).

Previously, the writers proposed TristouNet, in which the Euclidean distance is used. However, that system tends to miss boundaries in fast speaker interactions because of its relatively long adjacent sliding windows (2 seconds or more).

Note: “In particular, our proposed approach is the direct translation of the work by Gelly et al. where they applied Bi-LSTMs on overlapping audio sequences to predict whether each frame corresponds to a speech region or a non-speech one.” - Gelly et al.’s paper

alt text

They use MFCCs from overlapping sliding windows as input, and the output is a binary class per frame. The system is trained with the binary cross-entropy loss. A minimal sketch of such a model appears after the list below.

alt text

  • Bi-LSTMs allow processing sequences in forward and backward directions, making use of both past and future contexts.

  • To address class imbalance, the number of positive labels is increased artificially by labelling as positive every frame in the direct neighborhood of a manually annotated change point. A positive neighborhood of 100ms (50ms on each side) is used around each change point.

  • Long audio sequences are split into short fixed-length overlapping sequences. These are 3.2s long with a step of 800 ms.
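
A minimal sketch of such a Bi-LSTM sequence labeller and of the 100ms positive-neighborhood labelling trick; the layer sizes and feature dimension are placeholders, not the paper's exact configuration:

```python
import torch
import torch.nn as nn

class BiLSTMChangeDetector(nn.Module):
    """Bi-LSTM sequence labeller: one speaker-change probability per MFCC frame."""
    def __init__(self, n_features=35, hidden=64):
        super().__init__()
        self.lstm = nn.LSTM(n_features, hidden, num_layers=2,
                            bidirectional=True, batch_first=True)
        self.classifier = nn.Sequential(
            nn.Linear(2 * hidden, 32), nn.Tanh(), nn.Linear(32, 1))

    def forward(self, x):                 # x: (batch, frames, features)
        out, _ = self.lstm(x)
        return torch.sigmoid(self.classifier(out)).squeeze(-1)

def frame_labels(n_frames, change_frames, neighborhood=5):
    """Label every frame within +-neighborhood of an annotated change point as
    positive (the 100 ms trick used to fight class imbalance)."""
    y = torch.zeros(n_frames)
    for c in change_frames:
        y[max(0, c - neighborhood):c + neighborhood + 1] = 1.0
    return y

# Training uses binary cross-entropy on the per-frame outputs, e.g.:
# loss = nn.functional.binary_cross_entropy(model(x), labels)
```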

Experiment

  • They use the ETAPE TV subset.
  • MFCC as input.
  • Baselines are BIC, Gaussian divergence (both using 2s adjacent windows), and TristouNet.

alt text

  • “We have developed a speaker change detection approach using bidirectional long short-term memory networks. Experimental results on the ETAPE dataset led to significant improvements over conventional methods (e.g., based on Gaussian divergence) and recent state-of-the-art results based on TristouNet embeddings.”

5) Neural speech turn segmentation and affinity propagation for speaker diarization

They divide the speaker diarization system into 4 sub-tasks:

  • Speech Activity Detection (SAD)
  • Speaker Change Detection (SCD)
  • Speech Turn Clustering
  • Re-segmentation

Herve Bredin’s previous paper explains how they solve Speech Activity Detection (SAD) and Speaker Change Detection (SCD) via recurrent neural networks; however, in that paper they used traditional methods for the other 2 sub-tasks. With this paper, they develop a new approach to solve the speaker diarization problem jointly.

alt text

  • Use an LSTM for re-segmentation
  • Use affinity propagation for speech turn clustering

We can list the contributions of this paper as:

  • Adapt LSTM-based SAD and SCD with unsupervised re-segmentation. Previously, a GMM was trained for each cluster (speech segments that contain the same speaker) and the segments were re-segmented with Viterbi decoding.

  • Use affinity propagation clustering on top of neural speaker embeddings. (In the context of neural networks, embeddings are low-dimensional, learned continuous vector representations of discrete variables. Neural network embeddings are useful because they can reduce the dimensionality of categorical variables and meaningfully represent categories in the transformed space. Source)

  • Jointly optimize all the steps: LSTM-based SAD, LSTM-based SCD, LSTM-based speaker embeddings and LSTM-based re-segmentation (only the speech turn clustering is not based on RNNs).

Now, let’s dive deeper into these contributions.

Sequence Labeling based on LSTM

In the previous paper, they used sequence labeling based on LSTMs for speaker change detection and speech activity detection. With these modules, the DNN creates the initial segmentation. (For more info, please check the previous summary.)

In this paper, they use the same LSTM-based method for re-segmentation. Previously, re-segmentation was usually solved with GMMs and Viterbi decoding. At test time, using the output of the clustering step (initial segmentation) as its unique training file, the neural network is trained for a tunable number of epochs E and applied to the very same test file it has been trained on. The resulting sequence of K-dimensional scores is then post-processed to determine the new speech segments.

A drawback of this re-segmentation is that it increases false alarms.

Clustering

Speech turn clustering is solved with a combination of neural embeddings and affinity propagation.

At the neural embedding stage, we are trying to embed speech sequences into a D-dimensional space. When we embed whole sequences into this space, we expect that if two sequences come from the same speaker, they will be close in this space (their angular distance will be small). To embed one segment, we need to process a variable-length segment while producing a fixed-length embedding. To solve this problem (see the sketch after the list below):

  • Slide a fixed-length window
  • Embed each of these subsequences
  • Sum these embeddings
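
A minimal sketch of this windowing-and-summing trick (the window size, hop size and the final unit-normalization are my assumptions; embed_fn stands for the trained neural embedding):

```python
import numpy as np

def fixed_length_embedding(features, embed_fn, window=200, hop=100):
    """Embed overlapping fixed-length subsequences of a variable-length segment
    and sum them into one D-dimensional vector."""
    chunks = [features[s:s + window]
              for s in range(0, max(1, len(features) - window + 1), hop)]
    embeddings = np.stack([embed_fn(c) for c in chunks])
    summed = embeddings.sum(axis=0)
    return summed / (np.linalg.norm(summed) + 1e-8)  # unit norm for angular distance
```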

alt text

The goal of SAD and SCD is to produce pure speaker segments containing a single speaker. The clustering stage is then responsible for grouping these segments based on speaker identities.

Herve Bredin and his team choose the affinity propagation (AP) algorithm for clustering. AP does not require choosing the number of clusters in advance, which means we do not have to specify how many speakers there are in the whole recording. All segments are potential cluster centers (these centers represent different speakers). When the algorithm selects exemplars (cluster centers), it uses the negative angular distance between embeddings as the similarity measure. (I do not want to go through the whole mathematics behind this algorithm; please check this wonderful blogpost.)
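
A minimal sketch of this clustering step with scikit-learn, using the negative angular distance as a precomputed similarity (the damping value is a placeholder):

```python
import numpy as np
from sklearn.cluster import AffinityPropagation

def cluster_speech_turns(embeddings):
    """Cluster unit-normalized speech-turn embeddings with affinity propagation,
    using the negative angular distance as the similarity measure."""
    X = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    cosine = np.clip(X @ X.T, -1.0, 1.0)
    similarity = -np.arccos(cosine)          # negative angular distance
    ap = AffinityPropagation(affinity="precomputed", damping=0.7)
    return ap.fit_predict(similarity)        # one cluster per detected speaker
```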

Joint Optimization

Usually, speaker diarization modules are tuned empirically (trial and error), and each module is tuned independently. The researchers use the Tree-structured Parzen Estimator for hyper-parameter optimization of the whole pipeline; this method is available in hyperopt.
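
A minimal sketch of TPE-based tuning with hyperopt. The hyper-parameter names and the run_pipeline_on_dev helper are hypothetical stand-ins for the real pipeline, not the paper's actual search space:

```python
from hyperopt import fmin, tpe, hp

def run_pipeline_on_dev(sad_onset, scd_threshold, ap_damping):
    """Hypothetical stand-in: run SAD -> SCD -> clustering -> re-segmentation on
    the development set and return the diarization error rate to minimize."""
    return abs(sad_onset - 0.7) + abs(scd_threshold - 0.5) + abs(ap_damping - 0.8)

space = {
    "sad_onset": hp.uniform("sad_onset", 0.5, 0.9),
    "scd_threshold": hp.uniform("scd_threshold", 0.1, 0.9),
    "ap_damping": hp.uniform("ap_damping", 0.5, 0.95),
}

best = fmin(fn=lambda p: run_pipeline_on_dev(**p),
            space=space, algo=tpe.suggest, max_evals=100)
print(best)
```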

Experiments

This project is open-source, so you can reproduce the results. Herve Bredin and his team deserve kudos. :)

For feature extraction, they use the Yaafe toolkit with 19 MFCCs, their first and second derivatives, and the first and second derivatives of the energy. This means the input is 59-dimensional.

For sequence labeling, the SAD, SCD and re-segmentation modules share a similar network architecture.

alt text

As the dataset and evaluation benchmark, they use French TV broadcast recordings.

alt text

They compare their results with two alternative approaches:

  • A variant of the proposed approach that uses standard hierarchical agglomerative clustering instead of affinity propagation
  • The S4D system developed by LIUM. This method uses the following approach: “Segmentation based on Gaussian divergence first generates (short) pure segments. Adjacent segments from the same speaker are then fused based on the Bayesian Information Criterion (BIC), leading to (longer) speech turns. Hierarchical clustering based on Cross-Likelihood Ratio then groups them into pure clusters, further grouped into larger clusters using another i-vector-based clustering.”

alt text

As we know from the re-segmentation step, we need to determine the number of epochs E that gives the best score. Looking at this figure, we can see that the same number of epochs works for both the development and the test set, which means the LSTM-based re-segmentation is stable.

alt text

Results and Conclusion

  • This pipeline is a big step towards an integrated end-to-end neural approach to speaker diarization, because the researchers show that initial segmentation and re-segmentation can both be formulated as LSTM-based sequence labeling.

  • Affinity propagation outperforms standard agglomerative clustering with complete-link.

Future Direction

  • “However, in re-segmentation step, finding the best epoch E relies on a development set. We plan to investigate a way to automatically select the best epoch for each file.”
  • “In addition, though neural networks can be used to embed and compare pairs of speech segments, it remains unclear how to do also cluster them in a differentiable manner.”

6) Speaker Diarization using Deep Recurrent Convolutional Neural Networks for Speaker Embeddings

They try to solve the speaker diarization problem with a 2-step approach.

  • To classify speakers, train a NN in a supervised manner. During training, a weighted spectrogram is used as input and cross-entropy as the loss function.

    • “Weighting STFT with proper perceptual weighting filters may overcome noise and pitch variability.” They also apply some pre-processing such as downsampling and a Hamming window.
  • Use this pretrained NN to extract speaker embeddings, which are time-dependent speaker characteristics.

After that, the system compares embeddings via cosine similarity. If the difference is bigger than a chosen threshold, the system decides the speech comes from a different speaker. A minimal sketch of this decision rule follows.
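
A minimal sketch of thresholded cosine similarity between consecutive segment embeddings (the threshold value is a placeholder that would have to be tuned):

```python
import numpy as np

def detect_changes(embeddings, threshold=0.5):
    """Compare consecutive segment embeddings with cosine similarity; flag a
    speaker change when the similarity drops below a tuned threshold."""
    changes = []
    for i in range(1, len(embeddings)):
        a, b = embeddings[i - 1], embeddings[i]
        cos = np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-8)
        if cos < threshold:
            changes.append(i)      # boundary between segment i-1 and segment i
    return changes
```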

Experiments

  • “To evaluate our method and compare it with the state of the art, we use the following publicly available datasets: AMI meeting corpus [39] (100 hours, 150 speakers), ICSI meeting corpus [40] (72 hours, 50 speakers), and YouTube (YT) speakers corpus [41] (550 hours, 998 speakers).” They also release an open-source dataset of broadcast material from major news stations.

alt text

  • They split the data into training and validation sets with proportions of 70% and 30%.
  • Their baseline is the state-of-the-art LIUM Speaker Diarization System, which is based on a GMM classifier and uses 13 MFCC audio features as input. They also compare R-CNN with CNN using different features to understand the effect of feature extraction.
  • Their evaluation metric is the Diarization Error Rate (DER).

alt text

  • “The results of the evaluation can be seen in Tab. 2. Our proposed deep learning architecture based on recurrent convolutional neural network and applied to CQT-grams outperforms the other methods across all datasets with a large margin. Its improvement reaches over 30% with respect to the baseline LIUM speaker diarization method with default set of parameters.”

7) SPEAKER DIARIZATION WITH LSTM

“In this paper, we build on the success of d-vector based speaker verification systems to develop a new d-vector based approach to speaker diarization. Specifically, we combine LSTM-based d-vector audio embeddings with recent work in non-parametric clustering to obtain a state-of-the-art speaker diarization system.”

Their system is a combination of:

  • An LSTM-based speaker verification model to extract speaker embeddings
  • Non-parametric spectral clustering. They apply the clustering algorithm to these embeddings to perform speaker diarization.

They obtain a state-of-the-art speaker diarization system with this combination.

alt text

They try 4 different clustering algorithms in the paper. Two of them are online clustering (the system labels a segment as soon as it is available, without seeing future segments) and two are offline clustering (the system labels segments once all segments are available). Offline clustering outperforms online clustering.

  • Naive online clustering, Links online clustering
  • K-means offline clustering, Spectral offline clustering

The spectral offline clustering algorithm consists of the following steps (a rough sketch follows the list):

  • Construct the affinity matrix. This matrix’s elements represent the cosine similarity between segment embeddings.

  • Apply some refinement operations on the affinity matrix:
    • Gaussian blur to smooth the data and reduce the effect of outliers.
    • Row-wise thresholding (for each row, elements smaller than some threshold are set to 0).
    • Symmetrization to restore matrix symmetry, which is crucial for the algorithm.
    • Diffusion to sharpen the matrix, giving clearer boundaries between speakers.
    • Row-wise max normalization to get rid of undesirable scale effects.
  • Perform eigen decomposition.
For more info, please check the paper.
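
A rough sketch of these refinement steps and of the eigen-decomposition with an eigen-gap heuristic for the number of speakers. This is only an illustration of the idea; parameter values and several details (the exact thresholding rule, the clustering in the spectral space) are my assumptions, not the paper's implementation.

```python
import numpy as np
from scipy.ndimage import gaussian_filter
from sklearn.cluster import KMeans

def refine_affinity(A, sigma=1.0, percentile=90):
    """Apply the refinement operations listed above (parameter values are placeholders)."""
    A = gaussian_filter(A, sigma)                       # Gaussian blur: smooth, reduce outliers
    row_thresh = np.percentile(A, percentile, axis=1)[:, None]
    A = np.where(A < row_thresh, 0.0, A)                # row-wise thresholding
    A = np.maximum(A, A.T)                              # symmetrization
    A = A @ A.T                                         # diffusion: sharpen boundaries
    return A / A.max(axis=1, keepdims=True)             # row-wise max normalization

def spectral_cluster(embeddings):
    """Eigen-decompose the refined cosine affinity, pick the number of speakers
    with the largest eigen-gap, then run k-means in the spectral space."""
    X = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    A = refine_affinity((1.0 + X @ X.T) / 2.0)          # cosine affinity mapped to [0, 1]
    sym = (A + A.T) / 2.0                               # eigh expects a symmetric matrix
    eigvals, eigvecs = np.linalg.eigh(sym)
    eigvals, eigvecs = eigvals[::-1], eigvecs[:, ::-1]  # sort eigenvalues descending
    k = int(np.argmax(eigvals[:-1] / (eigvals[1:] + 1e-8))) + 1   # eigen-gap heuristic
    return KMeans(n_clusters=k, n_init=10).fit_predict(eigvecs[:, :k])
```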

alt text

The writers discuss why we cannot use conventional clustering algorithms like K-means; the problem comes from the properties of speech data.

  • Non-Gaussian Distribution: Speech data are often non-Gaussian.
  • Cluster Imbalance: In most recordings, one speaker dominates. If we use K-means, it may unfortunately split this large cluster into smaller clusters.
  • Hierarchical Structure: The difference between a male and a female speaker is larger than the difference between two male speakers. This property often causes K-means to cluster all male embeddings into one cluster and all female embeddings into another.

Therefore, they propose a novel non-parametric spectral clustering to solve these problems.

Experiment

  • VAD is used.
  • They use the pyannote.metrics library for evaluation.
  • They fine-tune parameters for each dataset.
  • For the CALLHOME dataset, they tolerate errors of less than 250 ms in locating segment boundaries.
  • Exclude overlapped speech.
  • In general, they observed that d-vector based systems outperform i-vector based systems.
  • They compare their results with state-of-the-art algorithms on the CALLHOME dataset.

alt text

Poster of the paper

Also, I highly recommend the ICASSP lecture given by Quan Wang, the writer of this excellent paper.

IMAGE ALT TEXT

I can also give brief information about the lecture. Some of it is not directly related to speaker change detection; however, it gives excellent insight into how to handle such problems.

  • At Google, they use 2-stage speaker recognition: enroll and verify. Before verification, the user enrolls her voice by speaking “OK Google” and “Hey Google”. After that, they store the averaged embedding vector.

  • Generalized end-to-end loss: For verification, they create an embedding from the input via an LSTM. They then compare the embeddings with cosine similarity; if the similarity is bigger than a threshold, the system verifies the user. To extract the speaker embedding, we need to define a loss function.
    • Most papers use the triplet loss. It is very simple and can correctly model the embedding space; however, it cannot simulate runtime behavior, i.e., it cannot model the averaging process. Therefore, it is not end-to-end.
    • In 2016, the writers proposed the tuple end-to-end (TE2E) loss. It can model the averaging process; however, most tuples are very easy to train, so it is not very efficient.

    alt text

    • To tackle this problem, they propose the generalized end-to-end (GE2E) loss. To train with this loss, they construct a similarity matrix for each batch. In the video, you can also see an efficiency comparison between TE2E and GE2E.

    alt text

  • Single Speaker Recognition Model For Multi-Keyword: Their dataset has 150M “OK Google” utterances and 1.2M “Hey Google” utterances. To tackle this class imbalance, they propose Multi-Reader, which combines the losses from batches of different data sources. It acts like regularization.

alt text

  • Text-Independent Verification: The challenge is that the length of the utterance can vary. The naive solution is full-sequence training; however, it can be very slow. They propose sliding-window inference, and during training they use batches whose utterances share the same length.

alt text

Please check the video and paper for results. Unfortunately, I cannot cover all of them in this blog post.
For the ICASSP presentation of the paper, you can check this video. I highly recommend it. :)

8) FULLY SUPERVISED SPEAKER DIARIZATION

This project is open-source. Please check the source code.

This paper comes from the same writer as the previous paper. The previous paper used an unsupervised method for clustering; this paper uses a supervised method instead, so the approach is fully supervised.

They call this system unbounded interleaved-state recurrent neural networks (UIS-RNN). They use the same baseline as the previous paper to extract d-vectors. After the extraction, each individual speaker is modeled by a parameter-sharing RNN, while the RNN states for different speakers interleave in the time domain. With this method, the system decodes in an online fashion. Their method also integrates naturally with ddCRP, so the system can learn how many speakers are in the recording.

The UIS-RNN method is based on three facts:

  • We can model each speaker as an instance of an RNN (these instances share the same parameters).
  • We do not have to specify the number of speakers; the system can infer how many speakers there are.
  • The states of different RNN instances correspond to different speakers, and these speakers are interleaved in the time domain.

Overview of Approach

I do not want to give all the mathematical background; I am trying to simplify it.

  • For the sequence of embeddings, we will use X. Each element is the d-vector of a segment.
  • For the ground-truth labels, we will use Y. For instance, Y = (1, 2, 2, 3, 3); these numbers represent speaker IDs.

UIS-RNN is a generative process.

alt text

In that formula, we do not have speaker change information, so we define a new variable Z to represent speaker changes. Now we have an augmented representation.

alt text

For instance, the Z corresponding to Y = (1, 1, 2, 3, 2, 2) is Z = (0, 1, 1, 1, 0): looking at Y, there is a speaker change at the second, third and fourth transitions, so we write 1 at the corresponding positions of Z.

Note that we can directly determine Z from Y; however, we cannot uniquely determine Y from Z, because we cannot know which speaker comes next when there is a speaker change. A tiny example follows.
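
A tiny Python illustration of deriving Z from Y:

```python
def speaker_change_indicators(y):
    """Derive the binary change sequence Z from the label sequence Y."""
    return [1 if cur != prev else 0 for prev, cur in zip(y, y[1:])]

print(speaker_change_indicators([1, 1, 2, 3, 2, 2]))   # -> [0, 1, 1, 1, 0]
```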

We can factorize the augmented representation.

alt text

Now we have

  • Sequence Generation
  • Speaker Assignment
  • Speaker Change

Speaker Change

z_t indicates a speaker change at step t; its probability p(z_t = 1) lies between 0 and 1.

This probability can be parameterized by any function; however, the writers use a constant value for simplicity, so z_t becomes a simple binary variable.

alt text

Speaker Assignment Process

For speaker diarization, one of the main challenges is determining the total number of speakers. For this, the researchers use the distance-dependent Chinese restaurant process (ddCRP), which is a Bayesian non-parametric model.

When z_t is 1, we know there is a speaker change. At that point, there are 2 options: the process can go back to a previously appeared speaker or switch to a new speaker.

  • The probability of switching back to a previously appeared speaker is proportional to the number of continuous speech blocks she/he has spoken.
  • There is also a chance of switching to a new speaker, with a probability proportional to a constant α. A small sketch of these probabilities is given after the figure below.

alt text
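
A small sketch of these assignment probabilities (a CRP-style rule; the exact ddCRP formulation in the paper is richer than this):

```python
import numpy as np

def assignment_probabilities(block_counts, alpha=1.0):
    """Given, for each previously seen speaker, the number of continuous speech
    blocks they have produced, return the probability of switching back to each
    of them and of introducing a new speaker (proportional to alpha)."""
    counts = np.asarray(list(block_counts.values()), dtype=float)
    weights = np.append(counts, alpha)            # last entry = new speaker
    probs = weights / weights.sum()
    speakers = list(block_counts.keys()) + ["new"]
    return dict(zip(speakers, probs))

# Example: speaker 1 has spoken 2 blocks so far, speakers 2 and 3 one block each
print(assignment_probabilities({1: 2, 2: 1, 3: 1}, alpha=1.0))
```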

Sequence Generation

“Our basic assumption is that, the observation sequence of speaker embeddings X is generated by distributions that are parameterized by the output of an RNN. This RNN has multiple instantiations, corresponding to different speakers, and they share the same set of RNN parameters θ.”

They use a GRU as the RNN architecture to capture long-term dependencies.

The state of the GRU corresponding to speaker y_t is updated as follows. Let t' be the last time we saw speaker y_t before t:

t' := max{0, s < t : y_s = y_t}

h_t = GRU(x_{t'}, h_{t'} | θ)

The output of the entire network is m_t = f(h_t | θ).

Summary of the Model

alt text

The researchers omit Z and λ for simplicity.
  • At the current stage, y[6] = (1, 1, 2, 3, 2, 2).

  • There are four options: it can continue with the same speaker (2), it can go back to an existing speaker (1 or 3), or it can switch to a new speaker (4). This decision is based on the previous label assignment y[6] and the previous observation sequence x[6].

I will skip the details of MLE and MAP for the sake of simplicity; for the details, please check the excellent paper.
  • For Training

    The system maximizes the likelihood (MLE estimate).

    alt text

  • For Testing

    The system decodes; the ideal goal is to find:

    alt text

Experiments and Results

  • Speaker Recognition Model

    They use three different models.

    • “d-vector V1”. This model is trained with 36M utterances from 18K US English speakers, which are all mobile phone data based on anonymized voice query logs
    • “d-vector V2”. More training data has been added to V1.
    • “d-vector V3”. Retrained using variable-length windows, where the window size is drawn from a uniform distribution within [240ms, 1600ms] during training.

    Results for speaker verification task.

    alt text

  • UIS-RNN Setup

    • One layer of 512 GRU cells with a tanh activation
    • Followed by two fully-connected layers, each with 512 nodes and a ReLU activation (a rough sketch of this network follows).
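
A rough PyTorch sketch of this setup; the d-vector dimension and the final projection back to it are my assumptions, not details from the paper:

```python
import torch.nn as nn

class UISRNNCore(nn.Module):
    """One layer of 512 GRU cells (tanh is the default GRU activation) followed
    by two 512-unit fully-connected layers with ReLU, as described above."""
    def __init__(self, d_vector_dim=256, hidden=512):
        super().__init__()
        self.gru = nn.GRU(d_vector_dim, hidden, num_layers=1, batch_first=True)
        self.head = nn.Sequential(
            nn.Linear(hidden, 512), nn.ReLU(),
            nn.Linear(512, 512), nn.ReLU(),
            nn.Linear(512, d_vector_dim),   # assumed projection producing m_t
        )

    def forward(self, x, h=None):
        # x: (batch, time, d_vector_dim) -- one sequence per speaker instance
        out, h = self.gru(x, h)
        return self.head(out), h
```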

For evaluation, they use pyannote.metrics for the metrics and the NIST Speaker Recognition Evaluation data as the dataset.

alt text

As we can see, when they use V3 their results improve significantly, because it is trained with variable-length windows.

UIS-RNN can beat offline clustering methods even though it produces speaker labels in an online fashion.

Conclusions

“Since all components of this system can be learned in a supervised manner, it is preferred over unsupervised systems in scenarios where training data with high quality time-stamped speaker labels are available.”

9) Deep Speaker: an End-to-End Neural Speaker Embedding System

“We present Deep Speaker, a neural speaker embedding system that maps utterances to a hypersphere where speaker similarity is measured by cosine similarity. The embeddings generated by Deep Speaker can be used for many tasks, including speaker identification, verification, and clustering.”

  • They use ResCNN and GRU architectures to extract acoustic features.
  • Mean pooling produces utterance-level speaker embeddings.
  • Train using triplet loss based on cosine similarity.

Note: They use pre-training with a softmax layer and cross-entropy over a fixed list of speakers. Thus, they get better generalization and a smaller loss.

Architecture

alt text

  • First, preprocess the input and convert it to 64-dimensional Fbank coefficients. After that, normalize to zero mean and unit variance.
  • Use a feed-forward DNN to extract features via ResCNN or GRU.

alt text

  • An averaging layer converts the frame-level input to an utterance-level speaker representation.
  • Affine and length-normalization layers map it to the speaker embedding.
  • After that, they train with the triplet loss. “We seek to make updates such that the cosine similarity between the anchor and the positive example is larger than the cosine similarity between the anchor and the negative example.” Thus, they can avoid suboptimal local minima.

alt text

For more info, please check the paper.

Experiment

  • Outperforms a DNN-based i-vector baseline. Both methods use VAD processing.
  • They evaluate their method on three different datasets for the speaker recognition task (both text-independent and text-dependent) in both Mandarin and English.
  • “Speaker verification and identification trials were constructed by randomly picking one anchor positive sample (AP) and 99 anchor negative samples (AN) for each anchor utterance. Then, we computed the cosine similarity between the anchor sample and each of the non-anchor samples. EER and ACC are used for speaker verification and identification, respectively.”

Text-Independent Results

alt text

“In this paper we present a novel end-to-end speaker embedding scheme, called Deep Speaker. The proposed system directly learns a mapping from speaker utterances to a hypersphere where cosine similarities directly correspond to a measure of speaker similarity. We experiment with two different neural network architectures (ResCNN and GRU) to extract the frame-level acoustic features. A triplet loss layer based on cosine similarities is proposed for metric learning, along with a batch-global negative selection across GPUs. Softmax pre-training is used for achieving better performance.”

10) Unspeech: Unsupervised Speech Context Embeddings

Their method is based on unsupervised learning. They train the system on up to 9500 hours of English speech data with a negative sampling method, using a Siamese convolutional neural network architecture to train the Unspeech embeddings. For clustering, their system is based on a TDNN (Time-Delay Neural Network)-HMM acoustic model.

Their idea comes from negative sampling in word2vec.

Check this blogpost to understand negative sampling.
  • The system takes the current segment as the target and the target’s left and right segments as the true context. In addition, it randomly samples four negative contexts.

alt text

  • The system uses the VGG16A network to convert these segments into embeddings.

alt text

  • The system is trained as a binary classification task with a logistic loss; a minimal sketch follows.
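
A minimal sketch of this negative-sampling logistic loss (the embedding dimension and the dot-product scoring are my assumptions):

```python
import torch
import torch.nn.functional as F

def unspeech_loss(target_emb, context_embs, negative_embs):
    """Binary-classification sketch: dot products between the target embedding
    and its true left/right contexts should be classified as 1, dot products
    with the randomly sampled negative contexts as 0."""
    pos_logits = (context_embs * target_emb).sum(dim=-1)     # true contexts
    neg_logits = (negative_embs * target_emb).sum(dim=-1)    # sampled negatives
    pos_loss = F.binary_cross_entropy_with_logits(pos_logits, torch.ones_like(pos_logits))
    neg_loss = F.binary_cross_entropy_with_logits(neg_logits, torch.zeros_like(neg_logits))
    return pos_loss + neg_loss

# Example with a 128-dimensional embedding (dimension is a placeholder):
loss = unspeech_loss(torch.randn(128), torch.randn(2, 128), torch.randn(4, 128))
```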

Experiment

For the same/different speaker experiment:

alt text

Preview of the paper from the writer

11) VoxCeleb2: Deep Speaker Recognition

“In this paper, we present a deep CNN based neural speaker embedding system, named VGGVox, trained to map voice spectrograms to a compact Euclidean space where distances directly correspond to a measure of speaker similarity.”

This paper is about speaker verification; however, their method and dataset can be useful for detecting speaker change points.

Their deep learning architecture consists of:

  • A deep CNN trunk architecture to extract features.
    • VGG-M and ResNet are their trunk architectures for this work. These work very well for image classification; they modified some parts to make them suitable for the speech case.
  • A pooling layer to aggregate features into a single embedding.
  • Pairwise Loss

They train VGGVox, their neural embedding system, on short-term magnitude spectrograms (a Hamming window of width 25ms and step 10ms, without other pre-processing) to learn speaker-discriminative embeddings in 2 steps:

  • Pre-training for identification using a softmax loss. With this pre-training, they can initialize their system weights.
  • Fine-Tuning with the contrastive loss.

Their dataset includes both audio and video.

alt text

Experiment

  • They train the system on VoxCeleb2 and test on VoxCeleb1. They use the Equal Error Rate (EER) and their cost function for evaluation.
  • “During training, we randomly sample 3-second segments from each utterance.”

alt text

“In this paper, we have introduced new architectures and training strategies for the task of speaker verification, and demonstrated state-of-the-art performance on the VoxCeleb1 dataset. Our learnt identity embeddings are compact (512D) and hence easy to store and useful for other tasks such as diarisation and retrieval.”

12) TEXT-INDEPENDENT SPEAKER VERIFICATION USING 3D CONVOLUTIONAL NEURAL NETWORKS

This paper is about speaker verification; however, it gives some idea of how we can use 3D CNNs to create speaker models that represent different speakers.

This project is open source. Check it.

This work’s novelty comes from using a 3D CNN to capture speaker variations and extract spatial and temporal information. “The main idea is to use a DNN architecture as a speaker feature extractor operating at frame and utterance-level for speaker classification.”

They also propose one-shot learning to capture utterances from the same speaker, instead of averaging all the d-vectors of the utterances of the target speaker. “Our proposed method is, in essence, a one-shot representation method for which the background speaker model is created simultaneously with learning speaker characteristics.”

They compare their method with a Locally-Connected Network (LCN) as a baseline.

  • This network uses locally-connected layers to extract low-level features and fully-connected layers to extract high-level features.
  • The loss function for training is cross-entropy.
  • During the evaluation phase, cosine similarity is used.

According to the writers, this baseline method cannot extract enough context of speaker-related information, and it is also affected by non-speaker-related information. To tackle these issues, they propose a new model. Let’s look at the proposed architecture.

  • A 3D CNN architecture is suitable for capturing both spatial and temporal information.
  • Their input is at the utterance level.

alt text

  • “Our proposed method is to stack the feature maps for several different utterances spoken by the same speaker when used as the input to the CNN. So, instead of utilizing single utterance (in the development phase) and building speaker model based on the averaged representative features of different utterances from the same speaker (d-vector system)”
  • They apply the pooling operation only in the frequency domain to keep the useful information in the time domain.

Experiment

  • They evaluate the model using the ROC (receiver operating characteristics) and PR (precision and recall) curves.
For more info, check the paper
  • They use WVU-Multimodal 2013 Dataset. “The audio part of WVU-Multimodal dataset consists of up to 4 sessions of interviews for each of the 1083 different speakers.”

  • They use modified MFCCs as the data representation. MFCCs have a drawback: the final DCT operation makes them non-local, and non-local input is not suitable for convolutional NNs, so they simply discard the last DCT operation. Thus, they produce Mel-frequency energy coefficients (MFECs). The window size is 20ms with a 10ms stride.

  • Their model outperforms the end-to-end training fashion.

alt text

13) Deep Learning Approaches for Online Speaker Diarization

Recently, there has been more work applying deep learning to the speaker diarization problem.

  • Learn speaker embeddings and use these embeddings to classify.
  • Represent speaker identity using i-vectors.
  • Bi-LSTM RNN

In this paper, researchers have tried various strategies to tackle this problem.

  • Speaker embeddings using the triplet loss (inspired by FaceNet). Their model attempts to train an LSTM that can effectively encode embeddings for speech segments using the triplet loss.

    • “In training, we take segments of audio from different speakers, and construct a triple that consists of (an anchor, a positive example, a negative example) where the anchor and positive example both come from the same speaker but the negative example is from a different speaker. We then want to generate embeddings such that the embedding for the anchor is closer to the positive example embedding by some margin greater than the distance from the embedding for the anchor to the negative example embedding.”

    • “Then, when performing online diarization, we will run windows of speech through the LSTM to create an embedding for this window. If the produced vector is within some distance (using the L2 distance and a tuned threshold) of the stored current speaker vector, we deem that it is the same speaker. Otherwise, we detect that the speaker has changed, and compare the vector with the stored vector for each of the past speakers.”

alt text

Their proposed system cannot capture some speaker changes involving short segments. Let’s look at their results for speaker change detection.

  • “Even in this task, we found that our models had difficulty capturing speaker change. As Figure 5 indicates, speakers are mostly speaking for few seconds each time they speak in a conversation – for example, we can imagine a lot of back-and-forth consisting of short segments: “(sentence)” “yeah” “(sentence)” “sure.” As a human listener, however, often these short snippets are looked over. This makes the problem of speaker detection very challenging because the model needs to rapidly identify that the speaker has changed and must also do this often.”

14) Blind Speaker Clustering Using Phonetic and Spectral Features in Simulated and Realistic Police Interviews

This paper is related to a product of Oxford Wave Research called Cleaver. They focus on pitch tracking. According to them, any significant discontinuity either in time or frequency is used to define a candidate transition between speakers and clusters. Let’s look at their proposed method step by step.

  • Take the original speech and extract the pitch track with an autocorrelation-based pitch tracker.
  • Perform clustering based on pitch track continuities.
  • Select the most similar (divergent) cluster.
  • Apply agglomerative clustering to improve the speaker clustering result.