Abstract:
Text-to-speech (TTS) systems are increasingly embedded in our daily interactions: on social media and other content platforms, TTS technologies are used to create voiceovers or to narrate videos, and they enable the creation of sophisticated voice assistants integrated into (digital) devices. The proliferation of TTS systems is due to their ever-improving ability to generate audio that mimics human speech. Moreover, over the past few years, they have also gained the ability to clone the voice of a target individual. This specific capability raises the question of how voices similar to the listener's own influence cognitive processes.
Existing research has consistently demonstrated that people who share attributes with us are generally perceived more favorably. The Similarity-Attraction theory has not only been validated across various domains (including physical appearance, attitudes, ethnicity, and origin), but research has also demonstrated that similarity alters our perception and influences attitudes and behavior. This is especially the case when we process information peripherally or heuristically, for example when we lack motivation or relevant skills. In such circumstances, similarity can serve as an influential cue. In line with the Computers Are Social Actors hypothesis, which suggests that humans extend social rules and expectations to interactions with computers and similar devices, it is plausible that artificial voices perceived as similar to our own evoke comparable reactions. Despite this theoretical foundation, research on the effects of voice similarity remains relatively scarce. In 12 experiments, this thesis investigates the effects of voice similarity on trait evaluation, information reception, and decision-making.
In the first manuscript, I tested whether a speaker recognition system trained with deep learning methods can predict human similarity judgments and whether the derived similarity of voices affects trait evaluation (Chapter 2). In all five experiments in this series, I employed an open-source speaker recognition system that generates d-vectors based on brief audio samples. D-vectors are numerical representations of the unique auditory features of a speaker and can be used to assess the similarity of two different speakers, for example, by calculating the cosine similarity between their feature vectors. To obtain a German-language model of the speaker recognition system, I combined several datasets from the web and trained it on audio samples from approximately 10,000 speakers. The first three experiments revealed a modest yet significant correlation between the system's cosine similarity values and human similarity judgments. Thus, the cosine values can serve as a proxy for human-perceived similarity. In the fourth experiment, I did not investigate the consequences of similarity but used the cosine values to address the question of whether average voices elicit higher likeability and trustworthiness ratings. Existing research has revealed such a Beauty-in-Averageness effect for faces and several other stimuli, but there is little research on average voices. My findings did not support its presence in voice perception, showing no link between vocal averageness and trait evaluation. In contrast, the final experiment revealed a positive relationship between the degree of voice similarity to the listener and both trustworthiness and likeability judgments. Consequently, the results indicate that speakers with a voice similar to the listener's are evaluated more favorably.
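As an illustration only, the cosine similarity between two d-vectors can be sketched as follows. This is a minimal example using the Python standard library, not the code used in the thesis; the function name and the four-dimensional toy vectors are hypothetical (real d-vectors typically have many more dimensions).

```python
import math
from typing import Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Return the cosine of the angle between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)


# Hypothetical toy embeddings standing in for d-vectors.
listener = [0.20, 0.70, -0.10, 0.40]
speaker_a = [0.25, 0.65, -0.05, 0.35]   # points in nearly the same direction
speaker_b = [-0.60, 0.10, 0.80, -0.20]  # points in a different direction

print(cosine_similarity(listener, speaker_a))
print(cosine_similarity(listener, speaker_b))
```

Values close to 1 indicate vectors pointing in nearly the same direction; under the approach described above, higher values are treated as a proxy for higher perceived voice similarity.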
In the second series of experiments, I investigated whether voice similarity affects decision-making (Chapter 3). For the first experiment, I adopted a standard probabilistic inference paradigm that has been used in various studies on decision-making. Participants are incentivized to find treasures hidden behind three houses and receive guidance from three advisors, each making predictions about the treasure’s location. Past research has demonstrated that participants often fail to make optimal decisions, especially when two advisors with low predictive ability make the same prediction and this prediction differs from that of the advisor with high predictive ability. This phenomenon highlights the influence of heuristic or peripheral cues on decision-making, in this case the majority rule. Before conducting the experiment, I gathered audio recordings from approximately 600 individuals articulating phrases such as ‘There is a treasure.’ and ‘There is a spider.’ In the first experiment, I manipulated whether the advisor with high predictive ability had a voice similar to the participant’s or one of average similarity. Contrary to expectations, voice similarity did not significantly increase the likelihood of following the predictions of the high-ability advisor. In the two subsequent experiments, I simplified the setup, included only two advisors with equal predictive ability, and varied whether the advisor with a similar voice appeared at the top or the bottom of the display. In both experiments, I found an interaction between voice similarity and the advisor's position in how often participants followed the advice of the similar advisor.
My final experimental series (Chapter 4) aimed to test the impact of voice similarity on information processing, especially on truth perception. Previous studies have shown that peripheral cues, such as familiarity with a statement, can enhance the likelihood of perceiving that statement as accurate. This phenomenon is called the Illusory-Truth effect and can be caused, for instance, by presenting a statement repeatedly. I investigated whether voice similarity modulates the Illusory-Truth effect. The first experiments were designed to validate the experimental materials and assess whether the Illusory-Truth effect emerges with auditory statements generated by a TTS system. In the final experiment in this series, I presented trivia statements (e.g., ‘Paris is the capital of Germany.’) either once or twice and either with a similar or an average voice. The results revealed a marginally significant interaction between the repetition of a statement and voice similarity on truth evaluation. Moreover, false trivia statements were more often judged as true when presented twice and when a voice similar to the participant’s voice delivered the statement. Therefore, the results revealed a significant effect of voice similarity on truth perception.
Overall, my thesis introduces the use of d-vectors as a new approach to investigating the effects of voice similarity on various cognitive processes. My results consistently demonstrated a (subtle) effect of voice similarity on trait evaluation, decision-making, and truth perception and raise critical questions about the possible integration of user-tailored voices into TTS systems.