Peer-reviewed audio paper

The generation of a [multi’vocal] voice

Multi’vocal Collective

6 April 2021

Focus: Sounds of Science

DOI https://doi.org/10.48233/SEISMOGRAF2612

We live in a world where machines talk to us with synthetic voices, which makes it important to discuss questions of representation and aesthetics. Today, most voices in devices and systems are designed to have binary vocal identities. This could be different. Our project aims to inspire a reimagination of the paralinguistics of synthesized voices. We explore how to train and develop pitch, timbre, pace, and other vocal features beyond speech, based on vocal data from many different people. In doing so, we present the idea of a diverse and collective voice and initiate a reflection on the sonic appearance of future synthesized speech that goes beyond the binary. In this contribution we present a first-step approach to generating a multivocal synthesized voice, listening to each stage of the training process to show how the voice develops over time with many different voices in the training pool. We describe our technical approach to training and reflect on its effectiveness in making a more diverse vocal representation audible. In the audio paper we reflect on whether current deep learning methods are suitable for our aim of generating a multivocal voice, and we discuss whether bias within both the dataset and the network itself becomes prominent in the resulting voice. The generation itself perhaps offers an audible example of bias in AI. Our sonic exploration of the multivocal synthetic voice points to the difficulties of applying conventional machine learning approaches, which may be mono-domain focused, when aiming to make a diverse vocal representation audible.

Bibliography

Juutilainen, F. T. (2019) Multivocal - Creating Synthetic Voices with Non-Singular Identities. Thesis, Department of Nordic Studies and Linguistics, University of Copenhagen.

Jørgensen, S. H., Baird, A., Juutilainen, F. T., Pelt, M. and Højholdt, N. C. (2018) ‘[multi’vocal]: Reflections on Engaging Everyday People in the Development of a Collective Non-Binary Synthesized Voice’. In proceedings of the EVA conference, no pagination.

LaBelle, B. (2014) Lexicon of the Mouth: Poetics and Politics of Voice and the Oral Imaginary. London: Bloomsbury Publishing.

Shen, J., Pang, R., Weiss, R. J., Schuster, M., Jaitly, N., Yang, Z., Chen, Z., Zhang, Y., Wang, Y., Skerry-Ryan, R. J., Saurous, R. A., Agiomyrgiannakis, Y. and Wu, Y. (2018) ‘Natural TTS Synthesis by Conditioning WaveNet on Mel Spectrogram Predictions’, ICASSP, IEEE International Conference on Acoustics, Speech and Signal Processing - Proceedings, pp. 4779–4783.

Schuller, B. and Batliner, A. (2014) Computational Paralinguistics – Emotion, Affect, and Personality in Speech and Language Processing. Chichester: Wiley.

Veaux, C., Yamagishi, J. and MacDonald, K. (2017) CSTR VCTK Corpus: English Multi-speaker Corpus for CSTR Voice Cloning Toolkit [sound]. University of Edinburgh. The Centre for Speech Technology Research (CSTR). DOI: 10.7488/ds/1994 [Accessed 1 September 2020].