
Whose voice is your voice?
Abstract
MP3 changed the aural world as the first widely adopted digital technology to compress audio, interpreting it through perceptual coding that models the standard human, hearing standard sounds, in order to sell and share them. Today, in the era of streaming and virtual meetings, bandwidth has become an even scarcer commodity: Artificial Intelligence comes to help, but the current codecs, such as Lyra by Google, or Opus, used by WhatsApp and Discord, present a sound reconstructed on the basis of their reference corpora, and fine-tuned for speech – »good enough«, as defined by Sterne. A model is trained to efficiently distinguish noise from voiced information, and to rebuild a synthetic simulacrum of the latter, neutering context. When this happens, whose voice is your voice? Whose ears are your ears? And if, following Marius Schneider, sounds create the world, what is the world those models remember?
The research stems from a technical analysis of these codecs, their biases and their proven limits and virtues, to focus on the consequences of this widespread acceptance of apparently transparent audio transmission: a social, efficiency-driven homogenization of the soundscape; the elimination of more-than-human sounds, leading to a deeper dissonance in authenticity of the acoustic ecology; and creative possibilities hidden in abusing the emergence of meta-human debris.
Special thanks to Luigi Monteanni, Sissj Bassani and Pier Paolo Zimmermann.
This work is dedicated to the memory of Jonathan Sterne.
NOTES
0’00’’ music: Maguire, 2014
0’42’’ music: Bienoise, 2018
0’58’’ »just good enough« (Sterne, 2012, p. 167)
1’00’’ »commercially valuable« (Sterne, 2003, p. 230); foreground/background: see Sterne, 2003, pp. 25 and 259
1’03’’ music: Cooke, 2024
1’10’’ for a definition of more-than-human, see Price, 2023
1’40’’ »the myth of transparency« (Brooks, 2015, p. 40, and Kelly, 2009, p. 172)
2’40’’ see Opus, 2024
3’05’’ sexual biases in the Opus algorithm: see Bolton, 2022
4’40’’ »audiovideo quality becomes a cause of rejection« and 5’40’’ »the widening of the digital divide«: see Fiechter, 2018; »a deeper dissonance in authenticity of the acoustic ecology«: see Karpel, 2025
5’08’’ about Lyra, see Skoglund, 2023
10’08’’ emotional cues: see Pramod, 2023, and Ren, 2024
10’34’’ »MPEG audio is processed sound for listeners living in a processed world« (Sterne, 2012, p. 159)
10’46’’ »a neutered non-place«: see Kromhout, 2009; »preset reactions«: see Han, 2017
12’49’’ metahuman sounds: the definition is inspired by Gourgouli, 2023
13’46’’ music: Moore, 1976
13’54’’ »striving for the authenticity of a garage recording, with an anti-commercial attitude«: see Kromhout, 2009, and Harper, 2014
14’04’’ music: Oval, 1995
14’07’’ »amplifies the errors«...»shatters the myth of their transparency« (Brooks, 2015, p. 40, and Kelly, 2009, pp. 33, 172)
14’22’’ »connections with punk and free jazz«: see Kelly, 2009, pp. 180, 279; »while destabilizing the centrality of the author« (Brooks, 2015, p. 38)
14’34’’ music: Bienoise, unreleased
15’11’’ »close-up on the face«...»smoothness«: see Han, 2017
15’39’’ »visual counterparts in AI art«: see ragnar_meta, 2025
