© Pier Paulo Zimmermann

Whose voice is your voice?

Control, identity and metahuman sounds in AI audio compression codecs
By
18 September 2025
Focus: Sound and the More-Than-Human Worlds
DOI: https://seismograf.org/node/20824

Abstract

MP3 changed the aural world as the first digital technology to compress audio, interpreting it through perceptual coding that models the standard human, hearing standard sounds, in order to sell and share them. Today, in the era of streaming and virtual meetings, bandwidth has become an even scarcer commodity. Artificial intelligence comes to help, but current codecs, such as Lyra by Google, or Opus, used by WhatsApp and Discord, present a sound that is reconstructed on the basis of their reference corpora and fine-tuned for speech: »good enough«, as defined by Sterne. A model is trained to efficiently distinguish noise from voiced information and to rebuild a synthetic simulacrum of them, neutering context. When this happens, whose voice is your voice? Whose ears are your ears? And if, following Marius Schneider, sounds create the world, what is the world those models remember?

The research stems from a technical analysis of these codecs, their biases and their proven limits and virtues, and focuses on the consequences of the widespread acceptance of apparently transparent audio transmission: a social, efficiency-driven homogenization of the soundscape; the elimination of more-than-human sounds, leading to a deeper dissonance in the authenticity of the acoustic ecology; and the creative possibilities hidden in abusing the emergence of metahuman debris.

Special thanks to Luigi Monteanni, Sissj Bassani and Pier Paolo Zimmermann.

This work is dedicated to the memory of Jonathan Sterne.


NOTES

0’00’’ music: Maguire, 2014

0’42’’ music: Bienoise, 2018

0’58’’ »just good enough« (Sterne, 2012, p. 167)

1’00’’ »commercially valuable« (Sterne, 2003, p. 230); foreground/background: see Sterne, 2003, pp. 25 and 259

1’03’’ music: Cooke, 2024

1’10’’ for a definition of more-than-human, see Price and Chao, 2023

1’40’’ »the myth of transparency« (Brooks, 2015, p. 40, and Kelly, 2009, p. 172)

2’40’’ see Opus, 2024

3’05’’ sexual biases in the Opus algorithm: see Bolton, 2022

4’40’’ »audiovideo quality becomes a cause of rejection« and 5’40’’ »the widening of the digital divide«: see Fiechter, 2018; »a deeper dissonance in authenticity of the acoustic ecology«: see Karpel, 2025

5’08’’ about Lyra, see Skoglund, 2023

10’08’’ emotional cues: see Pramod, 2023, and Ren, 2024

10’34’’ »MPEG audio is processed sound for listeners living in a processed world« (Sterne, 2012, p. 159)

10’46’’ »a neutered non-place«: see Kromhout, 2009; »preset reactions«: see Han, 2017

12’49’’ metahuman sounds: the definition is inspired by Gourgouli, 2023

13’46’’ music: Moore, 1976

13’54’’ »striving for the authenticity of a garage recording, with an anti commercial attitude«: see Kromhout, 2009, and Harper, 2014.

14’04’’ music: Oval, 1995

14’07’’ »amplifies the errors« ... »shatters the myth of their transparency« (Brooks, 2015, p. 40, and Kelly, 2009, pp. 33, 172)

14’22’’ »connections with punk and free jazz«: see Kelly, 2009, pp. 180, 279; »while destabilizing the centrality of the author« (Brooks, 2015, p. 38)

14’34’’ music: Bienoise, unreleased

15’11’’ »close-up on the face«...»smoothness«: see Han, 2017

15’39’’ »visual counterparts in AI art«: see ragnar_meta, 2025


 


 

Bibliography

1 Attali, J. (1977). Noise: the political economy of music (B. Massumi, Trans.). Minneapolis: University of Minnesota Press.

2 Bolton, M.L. (2022). Preliminary Evidence of Sexual Bias in Voice over Internet Protocol Audio Compression. Lecture Notes in Computer Science, pp. 227–237. doi:https://doi.org/10.1007/978-3-031-05409-9_17.

3 Brooks, A. (2015). Glitch/Failure: Constructing a Queer Politics of Listening. Leonardo Music Journal, 25(25), pp. 37–40. doi:https://doi.org/10.1162/lmj_a_00932.

4 Cascone, K. (2000). The Aesthetics of Failure: »Post-Digital« Tendencies in Contemporary Computer Music. Computer Music Journal, 24(4), pp. 12–18. doi:https://doi.org/10.1162/014892600559489.

5 Défossez, A., Copet, J., Synnaeve, G. and Adi, Y. (2022). High Fidelity Neural Audio Compression. arXiv (Cornell University). doi:https://doi.org/10.48550/arxiv.2210.13438.

6 Denton, T., Luebs, A., Lim, F.S.C., Storus, A., Yeh, H., Kleijn, W.B. and Skoglund, J. (2021). Handling Background Noise in Neural Speech Generation. arXiv (Cornell University). doi:https://doi.org/10.48550/arxiv.2102.11906.

7 Fiechter, J.L. et al. (2018). Audiovisual quality impacts assessments of job candidates in video interviews: Evidence for an AV quality bias. Cognitive Research: Principles and Implications, 3(1). doi:https://doi.org/10.1186/s41235-018-0139-y.

8 Gourgouli, N. (2023). Mutation in Human Nature. The Doll as a Posthuman Being and the Formless Metahuman as ‘Other’. Journal of Posthumanism, 3(2), pp. 163–180. doi:https://doi.org/10.33182/joph.v3i2.2952.

9 Karpel, H. (2025). Audiologists raise concern over headphone use in young people. [online] 16 Feb. Available at: https://www.bbc.com/news/articles/cgkjvr7x5x6o.

10 Kromhout, M. (2009). As distant and close as can be: lo-fi recording: site-specificity and (in)authenticity. dare.uva.nl. [online] Available at: https://hdl.handle.net/11245/1.350118.

11 Han, B.C. (2017). Saving beauty. Cambridge: Polity Press.

12 Hardwick, J. (2023). Audio Codecs and the AI Revolution. [online] The MCT Blog. Available at: https://mct-master.github.io/networked-music/2023/04/23/jackeh-audio-codecs-ml.html [Accessed 15 Sep. 2024].

13 Harper, A. (2014). Lo-Fi aesthetics in popular music discourse [PhD thesis]. Oxford University, UK.

14 Kelly, C. (2009). Cracked media: the sound of malfunction. Cambridge, Mass.: MIT Press.

15 MacPhail, A.G., Yip, D.A., Knight, E.C., Hedley, R., Knaggs, M., Shonfield, J., Upham-Mills, E. and Bayne, E.M. (2023). Audio data compression affects acoustic indices and reduces detections of birds by human listening and automated recognisers. Bioacoustics: The International Journal of Animal Sound and its Recording, pp. 1–17. doi:https://doi.org/10.1080/09524622.2023.2290718.

16 Nogales, A., Caracuel-Cayuela, J. and García-Tejedor, Á.J. (2024). Analyzing the Influence of Diverse Background Noises on Voice Transmission: A Deep Learning Approach to Noise Suppression. Applied Sciences, 14(2), p. 740. doi:https://doi.org/10.3390/app14020740.

17 Opus Development Team. (2024). Opus 1.5 Released. [online] Available at: https://opus-codec.org/demo/opus-1.5/ [Accessed 15 Dec. 2024].

18 Perepelytsia, V. and Dellwo, V. (2023). Acoustic compression in Zoom audio does not compromise voice recognition performance. Scientific Reports, [online] 13(1), p. 18742. doi:https://doi.org/10.1038/s41598-023-45971-x.

19 A. Pramod Reddy et al. (2023). Estimating the Effects of Voice Quality and Speech Intelligibility of Audio Compression in Automatic Emotion Recognition. International Journal of Image, Graphics and Signal Processing, 15(3), pp. 69–80. doi:https://doi.org/10.5815/ijigsp.2023.03.06.

20 Price, C. and Chao, S. (2023). Multispecies, More-Than-Human, Nonhuman, Other-Than-Human. Exchanges: The Interdisciplinary Research Journal, 10(2), pp. 177–193. doi:https://doi.org/10.31273/eirj.v10i2.1166.

21 ragnar_meta (2025). intro_to_ai_art. [online] Available at: https://deca.art/ragnar_meta/intro_to_ai_art [Accessed 10 May 2025].

22 Ren, W. et al. (2024). EMO-Codec: An In-Depth Look at Emotion Preservation capacity of Legacy and Neural Codec Models With Subjective and Objective Evaluations. [online] arXiv.org. Available at: https://arxiv.org/abs/2407.15458 [Accessed 15 Sep. 2024].

23 Schneider, M. (1957). Primitive Music. In: E. Wellesz (ed.), New Oxford History of Music, vol. 1: Ancient and Oriental Music, pp. 1–82.

24 Skoglund, J. (2023). Can AI Disrupt Speech Compression? | Jan Skoglund. [online, @scale] YouTube. Available at: https://www.youtube.com/watch?v=dAiHuHApEs8 [Accessed 15 Sep. 2024].

25 Skoglund, J. (2023). Speech and Audio Compression in the Neural Era: Jan Skoglund. [online, Stanford Research Talks] YouTube. Available at: https://www.youtube.com/watch?v=Eolt9j8vvjw [Accessed 15 Sep. 2024].

26 Sterne, J. (2003). The audible past: cultural origins of sound reproduction. Durham: Duke University Press.

27 Sterne, J. (2006). The Mp3 as Cultural Artifact. New Media & Society, 8(5), pp. 825–842. doi:https://doi.org/10.1177/1461444806067737.

28 Sterne, J. (2012). MP3: The Meaning of a Format. Durham: Duke University Press.

29 Sterne, J. (2015). Compression: A Loose History. In: Parks, L. and Starosielski, N. (eds.), Signal Traffic: Critical Studies of Media Infrastructures. Champaign: University of Illinois Press, pp. 31–52.

30 Zeghidour, N. and Tagliasacchi, M. (2021). SoundStream: An End-to-End Neural Audio Codec. [online] Available at: https://research.google/blog/soundstream-an-end-to-end-neural-audio-codec/ [Accessed 15 Sep. 2024].

REFERENCED MUSIC

Cooke, S. (2024). Music Speech Crowd Noise Wind Nature. Available at: https://sethcooke.bandcamp.com/album/music-speech-crowd-noise-wind-nature [Accessed 18 May 2025].

Lucier, A. (1969). I Am Sitting in a Room.

Maguire, R. (2014). moDernisT. Available at: https://rpm7.bandcamp.com/album/ghost-in-the-mp3 [Accessed 15 Sep. 2024].

Moore, R.S. (1976). Goodbye Piano. Nashville: Vital Records.

Oval (1995). Store Check. Frankfurt: Mille Plateaux.

Ricca, A. (2018, as Bienoise). To Save, To Share. Frankfurt: Mille Plateaux.

Keywords

metahuman
audio compression
artificial intelligence
low fidelity
glitch
