NOTE: The following materials are presented for timely dissemination of academic and technical work. Copyright and all other rights therein are reserved by the authors and/or other copyright holders. Personal use of the following materials is permitted; however, anyone using the materials or information is expected to adhere to the terms and constraints invoked by the relevant copyright.


Learning Speaker-Specific Characteristics with A Deep Neural Architecture


ABSTRACT

Speech signals convey various yet mixed information, ranging from linguistic to speaker-specific information. However, most acoustic representations characterize all of these different kinds of information as a whole, which could hinder either a speech or a speaker recognition system from achieving better performance. In this paper, we propose a novel deep neural architecture for learning speaker-specific characteristics from Mel-frequency cepstral coefficients, an acoustic representation commonly used in both speech and speaker recognition, which results in a speaker-specific overcomplete representation. In order to learn intrinsic speaker-specific characteristics, we design an objective function consisting of contrastive losses in terms of speaker similarity/dissimilarity and data reconstruction losses used as regularization to mitigate the interference of non-speaker-related information. Moreover, we employ a hybrid learning strategy for learning the parameters of the deep neural networks: local yet greedy layer-wise unsupervised pretraining for initialization, followed by global supervised learning for the ultimate discriminative goal. With four LDC corpora and two non-English corpora, we demonstrate that our overcomplete representation is robust in characterizing various speakers, regardless of whether their utterances were used in training our deep neural architecture, and is highly insensitive to the text and languages spoken. Extensive comparative studies suggest that our approach yields favorable results in speaker verification and segmentation. Finally, we discuss several issues concerning our proposed approach.
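To make the flavor of the objective concrete, the sketch below is a minimal, hypothetical illustration (not the authors' code) of combining a contrastive loss over pairs of MFCC frames from the same or different speakers with a reconstruction loss acting as regularization. The encoder/decoder layout, code dimensionality, margin, and weighting factor are all illustrative assumptions, and the layer-wise unsupervised pretraining step is omitted for brevity.

import torch
import torch.nn as nn
import torch.nn.functional as F

class SpeakerEncoder(nn.Module):
    """Hypothetical encoder-decoder over MFCC frames (dimensions are assumptions)."""
    def __init__(self, mfcc_dim=39, hidden_dim=256, code_dim=512):
        super().__init__()
        # Overcomplete code: code_dim > mfcc_dim, in the spirit of the abstract.
        self.encoder = nn.Sequential(
            nn.Linear(mfcc_dim, hidden_dim), nn.Sigmoid(),
            nn.Linear(hidden_dim, code_dim), nn.Sigmoid(),
        )
        self.decoder = nn.Sequential(
            nn.Linear(code_dim, hidden_dim), nn.Sigmoid(),
            nn.Linear(hidden_dim, mfcc_dim),
        )

    def forward(self, x):
        code = self.encoder(x)
        recon = self.decoder(code)
        return code, recon

def hybrid_loss(model, x1, x2, same_speaker, margin=1.0, recon_weight=0.1):
    """Contrastive loss on speaker similarity plus reconstruction regularization."""
    c1, r1 = model(x1)
    c2, r2 = model(x2)
    dist = F.pairwise_distance(c1, c2)
    # Pull codes of the same speaker together, push different speakers apart.
    contrastive = torch.where(
        same_speaker.bool(),
        dist.pow(2),
        F.relu(margin - dist).pow(2),
    ).mean()
    # Reconstruction terms regularize the code against discarding the input signal.
    reconstruction = F.mse_loss(r1, x1) + F.mse_loss(r2, x2)
    return contrastive + recon_weight * reconstruction

# Example usage with random MFCC-like vectors and random pair labels.
model = SpeakerEncoder()
x1, x2 = torch.randn(8, 39), torch.randn(8, 39)
same = torch.randint(0, 2, (8,))
loss = hybrid_loss(model, x1, x2, same)
loss.backward()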


Click tnn2011.pdf for full text