Important dates

  • 26th March 2021: Paper submission at INTERSPEECH’21
  • 2nd June 2021: Notification of acceptance
  • 15th June 2021: Camera-ready paper

While learned representations from the speech signal are often used in speech processing and synthesis tasks, these latent spaces are not well understood. This makes it hard for researchers and developers to interpret, transfer and modify them in a meaningful way. This special session at INTERSPEECH’21 aims to bring together researchers investigating automatically learned representations. We specifically encourage interdisciplinary submissions, to tackle this research goal from different angles.


What the special session is about

Learning prosodic representations from the speech signal has become common practice in emotion classification and often yields better performance than the respective baselines. Likewise, novel TTS systems learn prosodic embeddings that allow them to produce prosodically varied, expressive speech. However, it often remains opaque how to interpret these learned prosodic representations, let alone how to modify them meaningfully or use them to perform specific tasks.

For example, much research has gone into developing TTS systems that create realistic prosodic variation, yet it remains largely unclear how to modify prosody to elicit emotional speech. Such control is a central requirement for authentic human-machine interaction. In the field of emotion recognition, modern approaches focus on learning features in an end-to-end fashion. These features yield good classification performance, but they cannot be interpreted directly, which often makes training a matter of trial and error. Tools to explore these feature spaces are needed to infer what the networks have learned.

This special session focuses on the interpretation, modification and application of learned prosodic representations in emotional speech classification and synthesis. It aims to bring together the different communities of explainable artificial intelligence (XAI), synthesis and classification to tackle a common problem.


Meet the organizers

Pol van Rijn

Pol van Rijn is a PhD candidate in the Department of Neuroscience at the Max Planck Institute for Empirical Aesthetics. By training, he is a computer scientist and linguist (bachelor’s and master’s degrees, focus on phonology). During his PhD, he investigates the mapping between emotional speech and its acoustic content. He explores this mapping in existing corpora (classification) and addresses it experimentally (synthesis, see recent paper).

Dominik Schiller

Dominik Schiller received his M.Sc. degree in Computer Science from the University of Augsburg. He is currently pursuing his PhD under the supervision of Prof. Dr. Elisabeth André at the Lab for Human-Centered Artificial Intelligence. His research interests include multimodal emotion recognition as well as deep learning and transfer learning.

Silvan Mertes

Silvan Mertes is a PhD candidate at the Lab for Human-Centered Artificial Intelligence at the University of Augsburg. His research focuses on Generative Adversarial Learning for audio and image synthesis. Specifically, he explores how adversarial learning approaches can enhance datasets and explainability for different deep learning tasks. Furthermore, he works on adversarial speech conversion.

Call for papers

The special session 'Learned Prosodic Representations in Emotional Speech Classification and Synthesis' is part of the main INTERSPEECH 2021 conference in Brno, Czechia. It focuses on the interpretation, modification and application of learned prosodic representations in emotional speech classification and synthesis.

Topics of interest include, but are not limited to:

  • Latent space exploration in emotional speech
  • Interpretation of learned representations using methods from Explainable Artificial Intelligence (XAI)
  • Controlled embedding modification in expressive speech synthesis
  • Application and comparison of learned representations for classification

The session consists of oral presentations and aims to bring together the different communities of XAI, synthesis and classification to tackle a common problem. We strongly encourage interdisciplinary submissions that combine methods from synthesis, analysis and explainable artificial intelligence. Papers submitted to this special session follow the same schedule and procedure as regular INTERSPEECH’21 papers. All papers will undergo the same review process by anonymous and independent reviewers. The special session organizers will be involved in the final decisions. Please keep the important dates in mind. If you have questions, contact Pol van Rijn.