Can a machine accurately predict how adults articulate words?

Funding Round: 2 (2014-2016)

Research Questions: Can a machine accurately predict how adults articulate words? Can machine feedback help adults learn how to correctly pronounce sounds in a novel language?

Interdisciplinary Approach: This project bridges a new computational theory of language acquisition with techniques from automatic speech recognition to develop a new “multi-view” computer model that can assist humans in language learning.

Potential Implications of Research: This research will provide a solid foundation for (a) exploring the best parameters for computer-assisted rehabilitation of articulation disorders (e.g., apraxia of speech), and (b) investigating considerably cheaper and more effective data-driven approaches to automatic speech recognition.

Project Description: Human babies learn to produce speech by mimicking the sounds that they hear. To say “banana”, for example, a baby must break down the word into a string of sounds - buh-nan-uh - and recreate the sounds by expelling air through the correct series of mouth, tongue, and lip movements (“articulatory patterns”). In effect, language acquisition relies on humans’ ability to map information from a sound system to a motor system. This process seems to be effortless for babies and, for the most part, unsupervised, suggesting that there must be a powerful cognitive mechanism that guides the mapping of information across the two systems. Yet this is no simple feat. Each system processes a multitude of information. Not only must the cognitive mechanism achieve a reasonable partitioning of information within each system (e.g., distinguishing the sound “buh” from “duh”), but it must also find correlations across the two systems (e.g., “buh” corresponds to a specific articulatory pattern). To date, machine-learning models have been unable to capture the complexity of this mechanism; however, recent breakthroughs in “multi-view” modeling techniques may provide new ways for machines to synthesize vast amounts of information from different systems. Combining theories of language acquisition from cognitive psychology with new machine-learning approaches from computer science, our project will develop the first “multi-view” machine model designed to assist human language learning.
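A classical starting point for finding correlations between two views of the same events is canonical correlation analysis (CCA). As a minimal sketch only (the project's actual model, data, and feature sets are not specified here), the following NumPy code finds linear projections of an “acoustic” view and an “articulatory” view that are maximally correlated; the synthetic data and dimensions are illustrative assumptions:

```python
import numpy as np

def cca(X, Y, k=1, reg=1e-6):
    """Minimal linear CCA: find k pairs of projections that maximize
    correlation between two views of the same samples, e.g. acoustic
    features (X) and articulatory measurements (Y)."""
    X = X - X.mean(axis=0)
    Y = Y - Y.mean(axis=0)
    n = X.shape[0]
    # Regularized within-view and cross-view covariances
    Cxx = X.T @ X / n + reg * np.eye(X.shape[1])
    Cyy = Y.T @ Y / n + reg * np.eye(Y.shape[1])
    Cxy = X.T @ Y / n
    # Whiten each view, then take the SVD of the cross-covariance
    Wx = np.linalg.inv(np.linalg.cholesky(Cxx))
    Wy = np.linalg.inv(np.linalg.cholesky(Cyy))
    U, s, Vt = np.linalg.svd(Wx @ Cxy @ Wy.T)
    A = Wx.T @ U[:, :k]    # projection for the acoustic view
    B = Wy.T @ Vt[:k].T    # projection for the articulatory view
    return A, B, s[:k]     # s holds the canonical correlations

# Hypothetical synthetic data: both views are noisy readouts of one
# shared latent factor, standing in for a common speech event.
rng = np.random.default_rng(0)
z = rng.normal(size=(500, 1))
X = z @ rng.normal(size=(1, 5)) + 0.1 * rng.normal(size=(500, 5))
Y = z @ rng.normal(size=(1, 4)) + 0.1 * rng.normal(size=(500, 4))
A, B, corr = cca(X, Y, k=1)
# corr[0] is close to 1 when the two views share strong structure
```

The “deep” multi-view models the project proposes replace these linear projections with learned nonlinear ones, but the objective is analogous: representations of the two systems that expose their shared structure.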

Our project pursues two aims. Aim 1 is to develop the multi-view machine model and evaluate its suitability for training new articulatory patterns in human subjects. We will develop a machine-learning algorithm that maps speech to corresponding articulatory patterns. We will then test how well the machine can predict the correct articulatory pattern upon hearing a person speak a particular sound or word in his or her native language. Next, we will test how well the machine can train human subjects to correctly produce sounds in a novel language: the machine will evaluate each subject’s pronunciation of a novel sound and articulatory pattern and provide feedback. Aim 2 seeks to enhance the multi-view machine model in three ways: (1) develop deep neural networks that synthesize multiple layers of information, (2) develop dynamic feature-learning techniques capable of tracking critical articulation elements, and (3) develop visualization tools that provide individualized articulatory feedback to learners (see figure).
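The evaluate-and-feedback step in Aim 1 can be illustrated with a deliberately simplified sketch. The function below is hypothetical (the project does not specify a scoring metric): it compares a learner's articulatory trajectory to the pattern the model predicts for the target sound, using cosine similarity, and returns a score plus a coarse verdict:

```python
import numpy as np

def articulation_feedback(target, attempt, threshold=0.85):
    """Hypothetical feedback step: score a learner's articulatory
    pattern (e.g., tongue and lip positions over time, flattened to a
    vector) against the model-predicted target via cosine similarity."""
    t = np.asarray(target, dtype=float).ravel()
    a = np.asarray(attempt, dtype=float).ravel()
    score = float(t @ a / (np.linalg.norm(t) * np.linalg.norm(a)))
    verdict = "good match" if score >= threshold else "adjust articulation"
    return score, verdict

# Illustrative vectors standing in for articulatory measurements
score, verdict = articulation_feedback([1.0, 0.0, 0.5], [0.9, 0.1, 0.6])
```

A real system would use time-aligned trajectories and the individualized visual feedback described in Aim 2(3) rather than a single scalar, but the loop is the same: predict, compare, and report back to the learner.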