The sound of a person’s voice is an important factor in human communication. Voice Conversion (VC) is a technology that modifies a source speaker’s speech utterance to sound as if it had been spoken by a target speaker. VC offers a number of useful applications, such as personalizing a text-to-speech system to speak with a new voice using a minimal amount of data, or mimicking another individual’s voice when dubbing a film into another language.
In this dissertation, we consider new approaches to the design of VC systems. We propose techniques for learning speech representations with characteristics that facilitate building VC systems. In a first approach, we propose to learn representations from the source and target speakers’ speech features that are artificially constrained to be similar. This allows the source speaker’s features to be encoded into a representation from which the target speech features can be decoded. We name this architecture the joint autoencoder. We investigate the behavior of this model through objective and subjective evaluations.
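The joint-autoencoder idea above can be illustrated with a minimal sketch. This is a hypothetical toy implementation, not the dissertation’s actual model: all names, weight initializations, and dimensions are illustrative. It shows the essential training objective (reconstruction of target features plus a penalty tying the two speakers’ latent codes together) and the conversion path (encode source, decode with the target decoder), using plain Python for clarity.

```python
# Toy sketch of a joint-autoencoder objective (illustrative only; the
# real system would use learned neural encoders/decoders and alignment).
import random

random.seed(0)

DIM_IN, DIM_Z = 4, 2  # toy feature / latent dimensions


def rand_matrix(rows, cols):
    return [[random.uniform(-0.5, 0.5) for _ in range(cols)]
            for _ in range(rows)]


def matvec(m, v):
    return [sum(w * x for w, x in zip(row, v)) for row in m]


def mse(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)


# One linear "encoder" per speaker, plus a decoder for the target speaker.
enc_src = rand_matrix(DIM_Z, DIM_IN)
enc_tgt = rand_matrix(DIM_Z, DIM_IN)
dec_tgt = rand_matrix(DIM_IN, DIM_Z)


def joint_loss(x_src, x_tgt, lam=1.0):
    """Target reconstruction loss plus a similarity penalty that
    artificially forces the source and target latent codes to agree."""
    z_src = matvec(enc_src, x_src)
    z_tgt = matvec(enc_tgt, x_tgt)
    x_hat = matvec(dec_tgt, z_tgt)      # reconstruct target features
    return mse(x_hat, x_tgt) + lam * mse(z_src, z_tgt)


def convert(x_src):
    """Conversion: encode source features, decode with target decoder.
    This works because training pushed the two latent spaces together."""
    return matvec(dec_tgt, matvec(enc_src, x_src))


# Toy aligned source/target frames.
x_src = [0.1, -0.2, 0.3, 0.0]
x_tgt = [0.2, -0.1, 0.4, 0.1]
loss = joint_loss(x_src, x_tgt)
converted = convert(x_src)
```

In practice the encoders and decoder would be neural networks trained jointly by minimizing a loss of this form over aligned source-target frame pairs; the similarity term is what lets the source encoder be paired with the target decoder at conversion time.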