ToucanTTS is a text-to-speech (TTS) toolkit developed by the Institute for Natural Language Processing (IMS) at the University of Stuttgart, Germany [2]. It supports speech synthesis in over 7,000 languages and is designed for teaching, training, and using state-of-the-art speech synthesis models [2]. The toolkit is written in Python on top of PyTorch, which keeps it approachable for beginners without sacrificing functionality or performance [2].
At its core, ToucanTTS uses the FastSpeech 2 architecture, extended with several improvements, including a normalizing-flow-based PostNet inspired by PortaSpeech [1]. This design is intended to produce natural-sounding, high-quality speech.
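To make the idea of a normalizing-flow PostNet more concrete, the following is a minimal sketch of one common flow building block, an affine coupling layer, in plain Python. It is not taken from the ToucanTTS code base: the function names, the `toy_conditioner`, and the four-value "frame" are all hypothetical, and a real PostNet would use learned neural networks and many stacked layers. The point it illustrates is only that a flow is an invertible transform with a cheaply computable log-determinant.

```python
import math


def coupling_forward(x, conditioner):
    """Affine coupling: keep the first half, transform the second half.

    The scale and shift applied to the second half depend only on the
    first half, which is what makes the layer exactly invertible.
    Assumes len(x) is even.
    """
    half = len(x) // 2
    xa, xb = x[:half], x[half:]
    log_s, t = conditioner(xa)
    yb = [b * math.exp(ls) + ti for b, ls, ti in zip(xb, log_s, t)]
    log_det = sum(log_s)  # log |det Jacobian| of the transform
    return xa + yb, log_det


def coupling_inverse(y, conditioner):
    """Undo coupling_forward using the same conditioner."""
    half = len(y) // 2
    ya, yb = y[:half], y[half:]
    log_s, t = conditioner(ya)
    xb = [(b - ti) * math.exp(-ls) for b, ls, ti in zip(yb, log_s, t)]
    return ya + xb


def toy_conditioner(xa):
    # Hypothetical stand-in for a small neural network.
    log_s = [0.5 * math.tanh(a) for a in xa]
    t = [0.1 * a for a in xa]
    return log_s, t


frame = [0.2, -1.0, 0.7, 1.5]  # toy 4-bin "spectrogram frame"
latent, log_det = coupling_forward(frame, toy_conditioner)
restored = coupling_inverse(latent, toy_conditioner)
```

Because the inverse is exact, such a layer can be trained by maximum likelihood in one direction and used for generation in the other, which is the general property that makes flows attractive for refining spectrograms.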
ToucanTTS supports multi-speaker synthesis through trainable speaker embeddings, which let a single model generate different voices [1]. Users can thereby reproduce the prosody (rhythm, stress, and intonation) of different speakers, which is useful for applications that need stylistic diversity and voice customization [3].
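The general mechanism behind speaker embeddings can be sketched in a few lines of plain Python. This is an illustrative assumption, not ToucanTTS's actual implementation: the class `SpeakerEmbeddingTable`, the function `condition_on_speaker`, and the choice to add the speaker vector to every encoder frame are hypothetical, and a real system might instead concatenate and project the vector or derive it from reference audio.

```python
import random


class SpeakerEmbeddingTable:
    """One vector per speaker; in a real model these would be trained."""

    def __init__(self, num_speakers, dim, seed=0):
        rng = random.Random(seed)
        self.vectors = [
            [rng.gauss(0.0, 0.1) for _ in range(dim)]
            for _ in range(num_speakers)
        ]

    def lookup(self, speaker_id):
        return self.vectors[speaker_id]


def condition_on_speaker(encoder_frames, speaker_vector):
    """Add the same speaker vector to every encoder frame."""
    return [
        [h + s for h, s in zip(frame, speaker_vector)]
        for frame in encoder_frames
    ]


table = SpeakerEmbeddingTable(num_speakers=2, dim=3)

# Toy "text encoding": two frames of three features each.
text_encoding = [[0.0, 0.0, 0.0], [1.0, 1.0, 1.0]]

voice_a = condition_on_speaker(text_encoding, table.lookup(0))
voice_b = condition_on_speaker(text_encoding, table.lookup(1))
```

The same text encoding yields different decoder inputs for speaker 0 and speaker 1, which is how one set of model weights can produce several voices.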