Dual-Sequences Gated Attention Unit Architecture for Speaker Verification

Abstract:

Speaker verification (SV) is the process of verifying a person's claimed identity from voice characteristics recorded by a device such as a microphone. A speaker verification system can be text-dependent or text-independent. In the former case, the utterances consist of fixed text, such as a fixed password or certain words, which is used at every stage of the verification process. In the latter case, the system is more flexible: speakers can say whatever they want. This paper focuses on the text-independent scenario. In recent years, deep learning approaches have achieved great success across machine learning, and in speaker verification tasks in particular. We therefore propose a variant of the gated recurrent unit (GRU), denoted the Dual-Sequences Gated Attention Unit (DS-GAU), whose goal is to improve the information captured by the x-vector at different levels. We explore a speaker-embedding model that integrates a time-delay neural network (TDNN) with the DS-GAU structure, and compare it to the x-vector baseline for different hidden state sizes. The motivation for using a GRU is to capture speaker characteristics from the speech signal better than the TDNN alone, which is what the x-vector baseline system uses. The proposed system is evaluated on the VoxCeleb1 dataset and the Speakers in the Wild (SITW) dataset, where different scenarios with different numbers of utterance pairs are considered.
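To make the verification task concrete, the following is a minimal sketch of how a trial (a pair of utterances) could be scored once each utterance has been mapped to a fixed-length speaker embedding. It is an illustration only, not the scoring back end used in this work: cosine scoring, the threshold value of 0.5, the toy two-dimensional embeddings, and the function names cosine_score, verify and error_rates are all assumptions made for the example.

```python
import math


def cosine_score(enrol, test):
    """Cosine similarity between two speaker embeddings (assumed scoring rule)."""
    dot = sum(a * b for a, b in zip(enrol, test))
    norm = math.sqrt(sum(a * a for a in enrol)) * math.sqrt(sum(b * b for b in test))
    return dot / norm


def verify(enrol, test, threshold=0.5):
    """Accept the claimed identity if the score reaches the (assumed) threshold."""
    return cosine_score(enrol, test) >= threshold


def error_rates(trials, threshold=0.5):
    """Return (false-accept rate, false-reject rate) over labelled trial pairs.

    trials: list of (enrol_embedding, test_embedding, is_same_speaker).
    Assumes at least one target and one non-target trial.
    """
    false_accepts = false_rejects = targets = nontargets = 0
    for enrol, test, same in trials:
        accepted = verify(enrol, test, threshold)
        if same:
            targets += 1
            false_rejects += not accepted
        else:
            nontargets += 1
            false_accepts += accepted
    return false_accepts / nontargets, false_rejects / targets


# Toy trial list with hypothetical 2-D embeddings (real embeddings are much larger).
trials = [
    ([1.0, 0.0], [0.9, 0.1], True),   # same speaker, similar embeddings
    ([1.0, 0.0], [0.0, 1.0], False),  # different speakers, orthogonal embeddings
    ([0.6, 0.8], [0.8, 0.6], False),  # different speakers, but similar embeddings
]
far, frr = error_rates(trials)
```

In this sketch the third trial is a false accept, because two different speakers happen to have similar embeddings. Reducing that kind of confusion is what a better embedding model aims for.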

 

Network Architecture:

Made by 陳登國