Fusion of bottleneck, spectral and modulation spectral features for improved speaker verification of neutral and whispered speech
Sarria Paja, Milton
Falk, Tiago H.
Speech-based biometrics is becoming a preferred method of identity management amongst users and companies. Current state-of-the-art speaker verification (SV) systems, however, are known to depend strongly on the condition of the speech material provided as input, and can be affected by unexpected variability presented during testing, such as environmental noise or changes in vocal effort. In this paper, SV using whispered speech is explored, as whispered speech is known to be a natural speaking style with reduced perceptibility that still contains relevant information regarding speaker identity and gender. We propose to fuse information from spectral, modulation spectral and so-called bottleneck features computed via deep neural networks, at both the feature and score levels. Bottleneck features have recently been shown to provide robustness against train/test mismatch conditions but have yet to be tested for whispered speech. Experimental results showed that improvements as high as 79% and 60% could be achieved for neutral and whispered speech, respectively, relative to a baseline system trained with i-vectors extracted from mel-frequency cepstral coefficients. Results from our fusion experiments show that the proposed strategies make efficient use of the limited resources available and yield whispered speech performance in line with that obtained with normal speech.
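
The two fusion levels named in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function names, the equal default weights, and the assumption that performance is measured with a lower-is-better error metric are all hypothetical and chosen only for illustration. Feature-level fusion is shown as simple concatenation of feature vectors, and score-level fusion as a weighted sum of per-system verification scores.

```python
import numpy as np


def feature_level_fusion(*feature_sets):
    """Concatenate feature matrices along the feature axis.

    Each input has shape (n_utterances, n_dims_i); the number of
    utterances must match across inputs (illustrative assumption).
    """
    return np.concatenate(feature_sets, axis=-1)


def score_level_fusion(scores, weights=None):
    """Combine per-system scores with a weighted sum.

    scores: array-like of shape (n_systems, n_trials).
    weights: one weight per system; equal weights if omitted
    (a hypothetical default, not taken from the paper).
    """
    scores = np.asarray(scores, dtype=float)
    if weights is None:
        weights = np.full(scores.shape[0], 1.0 / scores.shape[0])
    weights = np.asarray(weights, dtype=float)
    return weights @ scores


def relative_improvement(baseline_error, new_error):
    """Relative improvement in percent, assuming lower error is better."""
    return 100.0 * (baseline_error - new_error) / baseline_error


# Toy data standing in for three feature streams (random placeholders).
rng = np.random.default_rng(0)
spectral = rng.normal(size=(4, 20))
modulation = rng.normal(size=(4, 30))
bottleneck = rng.normal(size=(4, 80))

fused_features = feature_level_fusion(spectral, modulation, bottleneck)
fused_scores = score_level_fusion([[1.0, 2.0], [3.0, 4.0]])
```

In this sketch, `fused_features` has one row per utterance and as many columns as the three streams combined, while `fused_scores` holds one fused score per trial. In practice the fusion weights would typically be tuned on held-out data rather than fixed to equal values.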