How do listeners adapt to speaker-specific acoustic variation in speech?

Funding Round: 1 (2015–2017)

Research Question: What is the extent of talker variation in the acoustic-phonetic realization of speech sounds? Is variation in phonetic realization structured (or correlated) among speech sounds and across talkers? Can listeners use correlations among speech sounds to adapt to the voice of a novel talker?

Interdisciplinary Approach: This fellowship integrates methods from acoustic phonetics, speech perception, and automatic speech recognition (ASR) to address the range and limits of speaker variability that must be learned for successful speech perception.

Potential Implications of Research: The results from this project can inform techniques for improved speaker adaptation in both cognitive and automatic speech recognition systems, as well as speech perception in hearing-impaired listeners and cochlear implant users.

Project Description: 

Speakers of the same language differ significantly in their speech and language patterns. One need only recall the voices of a few people to recognize the range present in pitch, nasality, enunciation, and many other phonetic characteristics. Take, for example, the baritone of Morgan Freeman, the nasality of Steve Buscemi or Fran Drescher, or the enunciated, highly “aspirated” /t/s and /k/s of Tom Brokaw or almost any newscaster. These acoustic-phonetic differences arise from many sources, including the speaker’s vocal tract anatomy, dialect (or ‘accent’), and individual articulatory habits. Despite this vocal variation, listeners adapt to and understand novel talkers with seemingly little effort. If Fran Drescher and Tom Brokaw both produced the same sentence, the sound waves would differ considerably, much more than if either speaker produced the same sentence twice. Nevertheless, listeners derive the linguistic meaning of a speaker’s acoustic utterance almost automatically. Underlying this ability is a complex cognitive system unmatched even by the most advanced ASR technologies (e.g., Siri, Google Voice, and other speech-to-text systems). The present project aims to better understand this cognitive ability by examining the range, limits, and structure of speaker variation. If speakers vary systematically along dimensions such as pitch, nasality, or the aspiration of [tʰ] and [kʰ], this structure can be exploited by both cognitive and machine processes for rapid learning of an individual’s unique speech pattern.

To investigate patterns in acoustic-phonetic variation, the project examined how phonetic properties (such as pitch and aspiration) were correlated among different speech sounds across talkers. Specifically, is the way a talker produces [kʰ] (as measured on one acoustic dimension) related to how that same talker produces [pʰ] or [tʰ] (on the same dimension)? The project focused primarily on talker variation in the realization of stop consonants ([pʰ tʰ kʰ b d g]) and sibilant fricatives ([s z ʃ ʒ]) in several large spoken corpora of American English and Czech containing up to 200 speakers. Substantial acoustic-phonetic variation was identified across talkers in properties relevant to the perception of stops and fricatives. This variation was also highly structured among speech sounds: talkers showed strong covariation among the stop consonants and among the fricatives. Despite overall differences in the absolute realization of speech sounds, talkers maintained similar relations between categories.
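The kind of cross-talker analysis described above can be illustrated with a minimal sketch. The data here are synthetic: the acoustic measure (voice onset time, a standard correlate of stop aspiration), the means, and the variances are illustrative assumptions, not the project's actual measurements.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical per-talker mean voice onset time (VOT, ms) for the three
# voiceless stops. A shared talker-level offset induces covariation,
# mimicking the structured variation described in the text.
n_talkers = 50
talker_offset = rng.normal(0, 15, n_talkers)   # talker-specific shift
noise = lambda: rng.normal(0, 5, n_talkers)    # category-internal noise
vot = {
    "p": 55 + talker_offset + noise(),
    "t": 70 + talker_offset + noise(),
    "k": 80 + talker_offset + noise(),
}

def talker_correlation(a, b):
    """Pearson correlation of per-talker means between two stop categories.

    A high value means a talker's VOT for one stop predicts the same
    talker's VOT for the other, even though the category means differ.
    """
    return np.corrcoef(vot[a], vot[b])[0, 1]

for pair in [("p", "t"), ("p", "k"), ("t", "k")]:
    print(pair, round(talker_correlation(*pair), 2))
```

Because the talker offset is large relative to the within-category noise, every pairwise correlation comes out strongly positive, which is the signature of "structured" variation: categories shift together across talkers while their relative spacing is preserved.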

Phonetic covariation among speech sounds has direct implications for talker adaptation: a listener could generalize from how a talker produces speech sound A to how that talker produces speech sound B, without prior exposure to sound B. For example, from the single utterance ‘can I help you’, a listener could use the [kʰ] in ‘can’ to predict how the speaker would likely produce the related sounds [pʰ] and [tʰ]. The researchers found that listeners generalized talker-specific properties across stop categories in perceptual adaptation, even with exposure to only a subset of those categories. Parallel findings emerged in perceptual adaptation to fricatives (e.g., generalization of spectral properties from [z] to [s]), consistent with the phonetic covariation observed in speech production.
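The generalization step above can be sketched as a simple regression: given exposure to a talker's [z], predict that talker's unheard [s]. The data, the acoustic measure (spectral centroid), and all numeric values below are synthetic assumptions for illustration only.

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical per-talker spectral centroids (Hz) for [z] and [s].
# A common talker factor makes the two fricatives covary across talkers,
# as in the production analyses described in the text.
n = 40
talker_factor = rng.normal(0, 400, n)
z_centroid = 6800 + talker_factor + rng.normal(0, 150, n)
s_centroid = 7400 + talker_factor + rng.normal(0, 150, n)

# Fit s ~ a*z + b on all but one "held-out" talker, then predict that
# talker's [s] from their [z] alone: exposure to one category informs
# expectations about a related, unheard category.
a, b = np.polyfit(z_centroid[:-1], s_centroid[:-1], 1)
predicted_s = a * z_centroid[-1] + b
print(round(predicted_s), round(s_centroid[-1]))
```

The prediction lands close to the held-out talker's actual [s] value because the two categories share most of their cross-talker variance; with uncorrelated categories, the same regression would predict little better than the grand mean.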

Further insight into speaker variation from acoustic and cognitive perspectives can inform techniques for learning and improving speech production and perception. Patterns uncovered in speaker variability may benefit individuals who must consciously learn or improve speech production (as in second language learning) or speech perception, including those with significant hearing loss and cochlear implant recipients, who may have had little or no previous exposure to speech. The correlational relationships across talkers may also refine speaker adaptation methods in automatic speech recognition, especially for languages with limited speech training data (i.e., low-resource languages). Overall, the project broadens our current understanding of phonetic variation and speech perception, with a focus on how variation is structured and what this structure implies for speaker adaptation by humans and machines.

The project also included two primary dissemination efforts: one aimed at the general public and one at the scientific and engineering communities. The research fellow visited an all-girls high school in Baltimore, Maryland and gave an interactive presentation on concepts behind the voice recognition technology used in smartphones and other devices. The goals of this presentation were to introduce high school students to high-level concepts in phonetics, automatic speech recognition, and automatic speaker recognition; to increase awareness of and inspire interest in these topics; and to make the students smarter consumers of both technology and language. For the science and engineering communities, the fellow created an online tutorial covering several speech processing tools based on automatic speech recognition. The overarching goals of this project were to make data processing faster and more scalable for more efficient research, and to advance the state of the art in speech science and technology. The website has been widely used around the world for developing ASR systems and conducting phonetic research: in the past 10 months, over 1,500 unique users from 95 countries actively engaged with it.
