Speech Representation Learning for Voice Conversion

The sound of a person’s voice is an important factor in human communication. Voice conversion (VC) is a technology that modifies a source speaker’s utterance so that it sounds as if it had been spoken by a target speaker. VC has a number of useful applications, for example personalizing a text-to-speech system to speak with a new voice from a minimal amount of data, or mimicking the voice of another individual when dubbing a movie into another language.

In this dissertation, we consider new approaches to the design of VC systems. We propose techniques for learning speech representations whose characteristics facilitate building VC systems. In a first approach, we propose to learn representations of the source and target speakers’ speech features that are artificially constrained to be similar. This allows source speaker features to be encoded into a representation that can then be decoded into target speech features. We name this architecture the joint autoencoder. We investigate the behavior of this model through objective and subjective evaluations.
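
The joint autoencoder idea can be illustrated with a minimal NumPy sketch. This is not the dissertation's implementation. It assumes single-layer encoders and decoders, time-aligned source and target frames, mean-squared-error terms, and a hypothetical weight `sim_weight` on the code-similarity penalty; all function and parameter names are illustrative only.

```python
import numpy as np


def init_params(feat_dim, code_dim, rng):
    """Random weights for one single-layer autoencoder (hypothetical layout)."""
    return {
        "W_enc": rng.normal(0.0, 0.1, size=(feat_dim, code_dim)),
        "b_enc": np.zeros(code_dim),
        "W_dec": rng.normal(0.0, 0.1, size=(code_dim, feat_dim)),
        "b_dec": np.zeros(feat_dim),
    }


def encode(params, x):
    """Map speech feature frames (n_frames, feat_dim) to codes (n_frames, code_dim)."""
    return np.tanh(x @ params["W_enc"] + params["b_enc"])


def decode(params, h):
    """Map codes back to speech feature frames."""
    return h @ params["W_dec"] + params["b_dec"]


def joint_loss(src_params, tgt_params, x_src, x_tgt, sim_weight=1.0):
    """Two reconstruction errors plus a penalty pulling the two codes together.

    Assumes x_src and x_tgt are time-aligned frame by frame (an assumption of
    this sketch, not a detail stated in the abstract).
    """
    h_src = encode(src_params, x_src)
    h_tgt = encode(tgt_params, x_tgt)
    rec_src = np.mean((decode(src_params, h_src) - x_src) ** 2)
    rec_tgt = np.mean((decode(tgt_params, h_tgt) - x_tgt) ** 2)
    similarity = np.mean((h_src - h_tgt) ** 2)
    return rec_src + rec_tgt + sim_weight * similarity


def convert(src_params, tgt_params, x_src):
    """Conversion path: source encoder followed by target decoder."""
    return decode(tgt_params, encode(src_params, x_src))


# Toy data standing in for aligned spectral features (5 frames, 24 dimensions).
rng = np.random.default_rng(0)
src = init_params(24, 8, rng)
tgt = init_params(24, 8, rng)
x_src = rng.normal(size=(5, 24))
x_tgt = rng.normal(size=(5, 24))

loss = joint_loss(src, tgt, x_src, x_tgt)
converted = convert(src, tgt, x_src)
```

Minimizing such a loss with any gradient-based optimizer would push the two encoders toward a shared code space, which is what makes the cross path in `convert` meaningful. The sketch only evaluates the loss and does not train the model.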

  • https://doi.org/10.6083/z316q221w
  • mohammadi.hamidreza.2019.pdf
Publication Date
  • 2019
Document type