Speech Densely Connected Convolutional Networks
for Small Footprint Keyword Spotting
Abstract:
As human-computer interaction grows in importance, voice assistants that use speech recognition to drive or control devices are becoming more common. However, many voice assistant users are concerned about personal privacy, because most voice assistants on the market capture the spoken command and upload it to the cloud for processing, where the voice clips are temporarily stored by the company providing the service. Performing keyword spotting on the edge device avoids this upload, which makes it an important task in voice-based human-computer interaction. Because recognition must run at the edge to preserve privacy, the goal of this task is to maximize accuracy within a limited computational cost.
This paper discusses the application of Densely Connected Convolutional Networks (DenseNet) to the keyword spotting task. To make the model smaller, we replace standard convolution with group convolution and depthwise separable convolution. To increase accuracy, we add squeeze-and-excitation networks (SENet) to increase the weight given to important features. To investigate the effect of the different convolutions on DenseNet, we built three models, SpDenseNet-G, SpDenseNet-D, and SpDenseNet-L, and generated compact variants of each.
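To make the size argument concrete, the sketch below counts the weights of a single layer under each convolution type, plus the extra weights of a squeeze-and-excitation block. It is only an illustration: the function names, the 64-channel 3x3 layer, the group count of 4, and the reduction ratio of 16 are assumptions chosen for the example, not the SpDenseNet configuration, and biases are omitted.

```python
def conv_params(c_in, c_out, k, groups=1):
    """Weights in a k x k convolution with the given number of groups (no bias)."""
    if c_in % groups or c_out % groups:
        raise ValueError("groups must divide both channel counts")
    # Each output channel only sees c_in / groups input channels.
    return (c_in // groups) * c_out * k * k


def depthwise_separable_params(c_in, c_out, k):
    """Depthwise k x k convolution followed by a 1 x 1 pointwise convolution."""
    depthwise = conv_params(c_in, c_in, k, groups=c_in)  # one k x k filter per channel
    pointwise = conv_params(c_in, c_out, 1)              # mixes channels
    return depthwise + pointwise


def se_params(channels, reduction):
    """Two fully connected layers of a squeeze-and-excitation block (no bias)."""
    hidden = channels // reduction
    return 2 * channels * hidden


# Hypothetical layer sizes, chosen only for illustration.
c_in, c_out, k = 64, 64, 3

standard = conv_params(c_in, c_out, k)
grouped = conv_params(c_in, c_out, k, groups=4)
separable = depthwise_separable_params(c_in, c_out, k)
se_extra = se_params(c_out, reduction=16)

print(f"standard convolution:  {standard}")
print(f"group convolution (4): {grouped}")
print(f"depthwise separable:   {separable}")
print(f"SE block overhead:     {se_extra}")
```

For this assumed layer, group convolution with 4 groups needs a quarter of the weights of the standard convolution, the depthwise separable version needs roughly an eighth, and the squeeze-and-excitation block adds only a few hundred weights on top.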
We validated the networks on the Google Speech Commands Dataset. The proposed networks achieve higher accuracy than the other networks while using fewer parameters and floating-point operations (FLOPs). SpDenseNet-D achieves an accuracy of 96.3% with 122.63K trainable parameters and 142.7M FLOPs; compared to the benchmark paper, it uses only about 52% of the parameters and about 12% of the FLOPs. In addition, we varied the depth and width of the network to build compact variants, which also outperform other compact variants. SpDenseNet-L-narrow achieves an accuracy of 93.6% with 9.27K trainable parameters and 3.47M FLOPs; compared to the benchmark paper, its accuracy is 3.5% higher while it uses only about 47% of the parameters and about 48% of the FLOPs.
Network Architecture:
Made
by ªLªY¼z