Vowel Play with Algorithms: Helping Humans and Computers Learn Baby Talk
Saturday, 01 March 2008 00:48

Few milestones stand out in a parent’s memory as clearly as his or her child’s first words. Those simple sounds are the fruition of thousands of hours of a parent’s instinctive tutoring. Constantly nurtured with “baby talk,” infants are introduced to their native tongue in a simple, accessible form – the high notes and exaggerated syllables are acoustically designed for optimal learning.

Teaching Computers Like Children
As reported in the August 14, 2007 issue of the Proceedings of the National Academy of Sciences, Dr. Gautam Vallabha and Professor James McClelland in the Stanford Department of Psychology designed a computer model to mimic an infant’s learning process. The goal of their research is to understand how infants learn linguistic structure, meaning, and ultimately complex language from hearing their mother’s voice. Wanting their algorithms to learn like children, Vallabha and McClelland taught them via baby talk, or infant-directed speech.

This infant-directed speech is characterized by a slower tempo, longer syllables, and more exaggerated vowel sounds as compared to normal speech. Pitch and intonation change as our voices adopt that peculiar, sing-song lilt of baby talk. Although we adopt these changes almost without thinking, they have a profound effect on infants’ abilities to assimilate the languages that surround them.

Can Computers “Learn” Like Humans?
The first reaction to Vallabha and McClelland’s research is to wonder whether a computer model of learning is at all relevant to the way humans learn language. Vallabha asserts that there is plenty of commonality. Computer models process information using a hierarchical set of rules. Similarly, our education is predicated on structure and regularity. We learn how to behave because we are able to extrapolate the rules of behavior from our prior experiences.

“People don’t behave randomly or arbitrarily,” Vallabha explains. “In particular contexts – restaurants, movie theaters, baseball games – they behave in regular and predictable ways. The world may seem chaotic and jumbled, but in certain contexts, it is full of statistical regularity and structure. We posit that human infants are exquisitely sensitive to statistical structure at all levels: syllable sequences, when to say certain words, when parents would scold them and so forth. Language happens to be a case where there is a lot of statistical regularity.”

Learning What’s Important
Infants are exceptionally adept at detecting different phonemes, the discrete sounds used in language. Infants also quickly learn which ones are relevant to their native language. Early on, English-speaking infants can readily recognize Hindi’s aspirated sounds even when their parents cannot. Japanese infants can distinguish the English phonemes /r/ and /l/, which are often indistinguishable to Japanese adults.

However, as children develop, their receptiveness to the full range of phonemes narrows, focusing on contrasts useful to their native language. Vallabha and McClelland theorize that during this time of diminishing phoneme repertoire, infants are focusing on learning the phonetic and syntactic rules that govern their own language through a process of intense repetition.

To mimic this process, Vallabha and McClelland’s computer models used repetition in order to classify discrete vowel sounds. The researchers employed a learning algorithm known as Expectation-Maximization. Essentially, the models began with very broad, uninformed ideas about how to categorize their data and gradually were able to form their own vowel sound categories by repeatedly analyzing and identifying similarities between phonemes.
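The idea of starting from broad guesses and refining them by repetition can be illustrated with a small sketch. This is not the researchers' code: it is a textbook batch Expectation-Maximization loop for two Gaussian categories along a single acoustic dimension, and the function names, the synthetic "vowel" values, and the choice of two categories are all assumptions made for illustration.

```python
import math
import random


def gaussian_pdf(x, mean, var):
    """Density of a normal distribution with the given mean and variance."""
    return math.exp(-(x - mean) ** 2 / (2 * var)) / math.sqrt(2 * math.pi * var)


def em_two_gaussians(data, iterations=50):
    """Fit two 1-D Gaussian categories to data with plain batch EM."""
    # Broad, uninformed start: means at the data extremes, one wide shared variance.
    means = [min(data), max(data)]
    centre = sum(data) / len(data)
    spread = sum((x - centre) ** 2 for x in data) / len(data)
    variances = [spread, spread]
    weights = [0.5, 0.5]

    for _ in range(iterations):
        # E-step: how strongly does each category "claim" each sound?
        resp = []
        for x in data:
            p = [weights[k] * gaussian_pdf(x, means[k], variances[k]) for k in range(2)]
            total = sum(p)
            resp.append([pk / total for pk in p])

        # M-step: re-estimate each category from the sounds it claimed.
        for k in range(2):
            nk = sum(r[k] for r in resp)
            means[k] = sum(r[k] * x for r, x in zip(resp, data)) / nk
            var_k = sum(r[k] * (x - means[k]) ** 2 for r, x in zip(resp, data)) / nk
            variances[k] = max(var_k, 1e-6)  # guard against a collapsing category
            weights[k] = nk / len(data)

    return means, variances, weights


# Hypothetical data: two vowel-like clusters on a made-up frequency axis.
random.seed(0)
sounds = [random.gauss(300, 30) for _ in range(200)]
sounds += [random.gauss(500, 30) for _ in range(200)]
means, variances, weights = em_two_gaussians(sounds)
print("estimated category centres:", means)
```

Each pass sharpens the categories a little: sounds are softly assigned to whichever category explains them best, and the categories are then re-centred on the sounds assigned to them.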

In their experiments, the computer models analyzed data from recordings of both English- and Japanese-speaking mothers. All the mothers read the same set of nonsense words, both spontaneously and to their infants. The algorithms attempted to learn several “i” and “e” phonemes from each language. Why vowels in particular? Vallabha and McClelland’s models characterized sounds by elements of their frequencies and durations, both of which are easily distinguishable between vowels. “For a variety of reasons, consonants such as ‘p’, ‘d’, or ‘m’ are much more difficult to describe compactly in this way,” Vallabha explained.

Assuming Everything’s Normal
The first algorithm, the Parametric Algorithm for Online Mixture Estimation (OME), learns to classify vowels by assuming that each vowel sound follows a normal, or Gaussian, distribution. In other words, the researchers assumed that the way a vowel sounds when it is repeated over and over falls onto the classic bell-shaped curve. While this is a simplification, it is a fairly accurate model of the natural distribution of vowel sounds, both on the scale of an isolated speaker repeating the same vowel multiple times and on the larger scale of an entire population of speakers.
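To give a feel for what "online" estimation of a bell curve involves, here is a minimal sketch that updates the mean and variance of a single Gaussian one sound at a time, without storing past sounds. It uses Welford's running-variance method, a standard technique chosen here for illustration; it is not taken from OME, and the class name and sample values are hypothetical.

```python
class RunningGaussian:
    """Mean and variance of one category, updated one observation at a time."""

    def __init__(self):
        self.count = 0
        self.mean = 0.0
        self._m2 = 0.0  # running sum of squared deviations from the mean

    def update(self, x):
        """Fold a single new observation into the estimate (Welford's method)."""
        self.count += 1
        delta = x - self.mean
        self.mean += delta / self.count
        self._m2 += delta * (x - self.mean)

    @property
    def variance(self):
        """Population variance of everything seen so far."""
        return self._m2 / self.count if self.count else 0.0


# Hypothetical repetitions of one vowel along a made-up acoustic dimension.
vowel = RunningGaussian()
for value in [290.0, 300.0, 310.0]:
    vowel.update(value)
print(vowel.mean, vowel.variance)
```

The point of the design is that the learner never needs to replay earlier sounds: every new repetition nudges the bell curve slightly.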

In using a Gaussian distribution for learning, OME identifies vowel sounds using a technique that is similar to the way that many researchers theorize humans perform sound categorization. This approach has also proved quite accurate: OME learned English vowel sounds with 84% accuracy and their Japanese analogs with 95% accuracy. The algorithm is also quite proficient at telling speakers apart. Just as infants are able to learn who is speaking their native language and who is not, the OME algorithm can distinguish between the English and Japanese speakers solely based upon their pronunciation of the nonsense words. The algorithm found far more commonalities among speakers of the same language than between speakers of different languages, revealing that a high degree of language-specific information must be encoded in infant-directed speech.

Taking Away the Safety Net

Vallabha and McClelland also designed a second algorithm, dubbed TOME (Topographic OME). TOME’s purpose was to mimic linguistic learning without utilizing a Gaussian distribution. Its categories are instead defined by breaking the input space of sounds into many small regions and calculating the proportion of input sounds in each region.
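The region-counting idea can be sketched as a simple two-dimensional histogram. This is only an illustration of dividing an input space into cells and measuring the share of sounds in each; it is not the TOME algorithm itself, and the function name, the grid size, the axis ranges, and the sample points are all assumptions.

```python
from collections import Counter


def region_proportions(points, x_range, y_range, bins):
    """Share of 2-D points landing in each cell of a bins-by-bins grid.

    Returns a dict mapping (row, col) to a proportion; empty cells are omitted.
    """
    (x_lo, x_hi), (y_lo, y_hi) = x_range, y_range
    counts = Counter()
    for x, y in points:
        # Scale each coordinate to a cell index, clamping points on or past the edges.
        col = min(max(int((x - x_lo) / (x_hi - x_lo) * bins), 0), bins - 1)
        row = min(max(int((y - y_lo) / (y_hi - y_lo) * bins), 0), bins - 1)
        counts[(row, col)] += 1
    return {cell: n / len(points) for cell, n in counts.items()}


# Hypothetical sounds described by (frequency, duration) on made-up scales.
sounds = [(250, 80), (260, 90), (550, 200), (560, 210)]
proportions = region_proportions(sounds, x_range=(200, 600), y_range=(50, 250), bins=4)
print(proportions)
```

Because nothing here assumes a bell curve, densely populated cells can trace out a category of any shape, at the cost of needing more data to fill the grid.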

This method of “weighting” and strengthening categories through the repetition of similar information seems more promising as a neurobiological model. The robustness of biological neural networks in the brain depends on the synapses, or connections, between neurons, which vary in both their number and their individual signal strength.

Our learning process reinforces existing synapses and stimulates the growth of new ones in a relevant neural network, thus acting quite like TOME.

TOME is more flexible than its OME counterpart since it is able to learn and classify even if the sounds do not follow the classic Gaussian distribution. However, the accuracy of the TOME algorithm at distinguishing between vowel sounds is currently inferior to that of OME.

Using Computers to Help Humans Learn Speech
“The models we propose are really first steps,” Vallabha explains. “It would be nice if they led to more detailed predictions for how infants learn language. In the long run, the hope is that the models would slowly become more complex – for example, being able to work with both consonants and vowels and fluently-spoken continuous speech – and integrate with other models of how children learn words and pronunciation and word morphology.”

Vallabha hopes that their work will have a more far-reaching impact, asserting that “the ultimate goal is to have a solid theory of how infants learn spoken language – and use that theory to design remediation for speech problems in children and to help adults learn second languages.”

 
