NOTE: The following materials are presented for timely
dissemination of academic and technical work. Copyright and all other rights
therein are reserved by authors and/or other copyright holders. Personal
use of the following materials is permitted; however, people using
the materials or information are expected to adhere to the terms and
constraints invoked by the related copyright.
Learning Speaker-Specific Characteristics with A Deep Neural Architecture
ABSTRACT
Speech signals convey various yet mixed information ranging from linguistic to speaker-specific
information. However, most acoustic representations characterize
all these kinds of information as a whole, which could hinder either a speech or a speaker recognition system
from achieving better performance. In this paper, we propose a novel deep neural architecture especially for
learning speaker-specific characteristics from Mel-frequency cepstral coefficients,
an acoustic representation commonly used in both speech and speaker recognition, which results
in a speaker-specific overcomplete representation.
In order to learn intrinsic speaker-specific characteristics, we devise an objective
function consisting of contrastive losses in terms of speaker similarity/dissimilarity
and data reconstruction losses used as regularization to normalize the interference of non-speaker
related information. Moreover, we employ a hybrid learning strategy for learning parameters of the deep neural
networks; i.e., local yet greedy layer-wise unsupervised pretraining for initialization and global
supervised learning for the ultimate discriminative goal. With four LDC and two non-English corpora,
we demonstrate that our overcomplete representation is robust in characterizing various speakers, no matter whether
their utterances have been used in training our deep neural architecture,
and highly insensitive to text and languages spoken. Extensive comparative studies
suggest that our approach yields favorable results in speaker verification
and segmentation. Finally, we discuss several issues concerning our proposed approach.
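
The objective described above, contrastive losses on speaker similarity/dissimilarity plus data reconstruction losses used as regularization, can be illustrated with a minimal sketch. This is not the paper's exact formulation, which the abstract does not give: the margin-based contrastive term, the mean-squared reconstruction term, the function names, and the `margin` and `weight` parameters are all assumptions chosen for illustration.

```python
import math


def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def contrastive_loss(code_a, code_b, same_speaker, margin=1.0):
    """Generic margin-based contrastive loss (assumed form, not the paper's).

    Codes from the same speaker are pulled together; codes from different
    speakers are pushed apart until they are at least `margin` away.
    """
    d = math.sqrt(squared_distance(code_a, code_b))
    if same_speaker:
        return d ** 2
    return max(0.0, margin - d) ** 2


def reconstruction_loss(frame, reconstruction):
    """Mean squared error between an input frame and its reconstruction."""
    return squared_distance(frame, reconstruction) / len(frame)


def combined_loss(code_a, code_b, same_speaker,
                  frame_a, recon_a, frame_b, recon_b,
                  margin=1.0, weight=0.5):
    """Contrastive term plus weighted reconstruction regularizer.

    `weight` (hypothetical) trades off speaker discrimination against
    faithful reconstruction of the input features.
    """
    contrast = contrastive_loss(code_a, code_b, same_speaker, margin)
    recon = (reconstruction_loss(frame_a, recon_a)
             + reconstruction_loss(frame_b, recon_b))
    return contrast + weight * recon
```

In this sketch, `code_a` and `code_b` stand in for the learned speaker-specific representations of two inputs, while the `frame_*` and `recon_*` arguments stand in for the input features (for example, MFCC vectors) and their reconstructions by the network.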
Click tnn2011.pdf for full text