Lip Reading
From Wikipedia, the free encyclopedia
An audiologist conducting an audiometric hearing test in a sound-proof testing booth
Process
In everyday conversation, people with normal vision, hearing and social skills subconsciously use information from the lips and face to aid aural comprehension, and most fluent speakers of a language are able to speechread to some extent (see McGurk effect). This is because each speech sound (phoneme) has a particular facial and mouth position (viseme), and people can to some extent deduce what phoneme has been produced based on visual cues, even if the sound is unavailable or degraded (e.g. by background noise).
Lipreading while listening to spoken language provides the redundant audiovisual cues necessary to initially learn language, as evidenced by Lewkowicz's studies, which found that babies between 4 and 8 months of age pay special attention to mouth movements when learning both native and nonnative languages. After 12 months of age, infants have attained enough audiovisual experience that they no longer need to look at the mouth when encountering their native language; hearing a nonnative language, however, again prompts a shift to combined visual and auditory engagement, using lipreading alongside listening in order to process, understand and produce speech.
Research has shown that, as expected, deaf adults are better at lipreading than hearing adults, owing to their greater practice and heavier reliance on lip reading to understand speech. However, when the same research team conducted a similar study with children, deaf and hearing children were found to have similar lip reading skills. It is only after 14 years of age that skill levels between deaf and hearing children begin to differ significantly, indicating that lipreading skill in early life is independent of auditory capability. This may reflect a decline in lip reading ability with age for hearing individuals, or an increase in lip reading efficiency with age for deaf individuals.
Lipreading has been shown to activate not only the visual cortex of the brain but also the auditory cortex, in the same way as when actual speech is heard. Research has shown that, rather than having clear-cut regions dedicated to different senses, the brain works in a multisensory fashion, making a coordinated effort to consider and combine all the different types of speech information it receives, regardless of modality. Therefore, because hearing captures more articulatory detail than sight or touch, the brain uses speech and sound to compensate for the other senses.
Speechreading is limited, however, in that many phonemes share the same viseme and thus are impossible to distinguish from visual information alone. Sounds whose place of articulation is deep inside the mouth or throat are not detectable, such as glottal consonants and most gestures of the tongue. Voiced and unvoiced pairs look identical, such as [p] and [b], [k] and [g], [t] and [d], [f] and [v], and [s] and [z]; likewise for nasalisation (e.g. [m] vs. [b]). It has been estimated that only 30% to 40% of sounds in the English language are distinguishable from sight alone.
Thus, for example, the phrase "where there's life, there's hope" looks identical to "where's the lavender soap" in most English dialects. Author Henry Kisor titled his book What's That Pig Outdoors?: A Memoir of Deafness in reference to misreading the question "What's that big loud noise?" through speechreading. He used this example in the book to discuss the shortcomings of speechreading.
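The many-to-one mapping from phonemes to visemes can be illustrated with a small sketch. The viseme groups below are a simplified, hypothetical grouping chosen only for illustration (they are not a standard inventory); the point is that once voicing and nasality are discarded, distinct words can become visually indistinguishable.

```python
# Illustrative sketch: collapsing phonemes onto shared viseme classes.
# The viseme groups below are simplified and hypothetical, chosen only to
# demonstrate the many-to-one phoneme-to-viseme mapping described above.

VISEME_GROUPS = [
    {"p", "b", "m"},             # bilabials: lips closed; voicing/nasality invisible
    {"f", "v"},                  # labiodentals
    {"t", "d", "n", "s", "z"},   # alveolars: tongue mostly hidden
    {"k", "g", "ng", "h"},       # back-of-mouth sounds, barely visible
    {"ch", "j", "sh", "zh"},
    {"w", "r"},
    {"th", "dh"},
    {"l"},
]

def viseme(phoneme):
    """Return the viseme class a phoneme belongs to (the phoneme itself if unlisted)."""
    for i, group in enumerate(VISEME_GROUPS):
        if phoneme in group:
            return i
    return phoneme  # not listed (e.g. vowels): treat as its own class

def visually_identical(word_a, word_b):
    """Two phoneme sequences look the same if their viseme sequences match."""
    return [viseme(p) for p in word_a] == [viseme(p) for p in word_b]

# "pat" and "bad" differ only in voicing, which is invisible on the lips:
print(visually_identical(["p", "a", "t"], ["b", "a", "d"]))  # True
```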
As a result, a speechreader must depend heavily on cues from the environment, on the context of the communication, and on knowledge of what is likely to be said. It is much easier to speechread customary phrases such as greetings or a connected discourse on a familiar topic than utterances that appear in isolation and without supporting information, such as the name of a person never met before.
Difficult scenarios in which to speechread include:
- Lack of a clear view of the speaker's lips. This includes:
- obstructions such as moustaches or hands in front of the mouth
- the speaker's head turned aside or away
- dark environment
- a bright back-lighting source such as a window behind the speaker, darkening the face.
- Group discussions, especially when multiple people are talking in quick succession. The challenge here is to know where to look.
- Use of an unusual tone or rhythm of speech by the speaker.
Tips for Lip Reading
Lip reading, also known as speechreading, is difficult because only about 30% of speech can be seen; the remaining 70% must be inferred from context. There are, however, small things that can be done to make the process easier. Learning to lip read is like learning to read a book: a novice lip reader concentrates on each sound and may miss the meaning. Lip reading is more effective if you receive the message as a whole rather than as individual sounds.
- Make sure you can see the speaker’s face clearly.
- Hold the conversation in a quiet environment with good lighting and few visual distractions.
- Make sure that light is behind you, not the person you are trying to lip read.
- Gently remind people that you need to see their face when they forget and look down or away from you.
- Ask for the topic of the conversation, if you are not sure.
- If the speaker exaggerates, or talks too loudly, gently request that they speak normally.
- Remind speakers to move their hands or other objects away from their face.
- If you still don’t understand after a repetition, ask the speaker to rephrase.
Learning to Lip Read
Lip reading can be taught, but it also develops naturally: infants begin to lip read between 6 and 12 months of age. In order to imitate, a baby must learn to shape its lips in accordance with the sounds it hears. Even newborns have been shown to imitate adult mouth movements such as sticking out the tongue or opening the mouth, which could be a precursor to further imitation and lip reading abilities. Infants as young as 4 months can connect visual and auditory information, which is helpful when learning to lip read. For example, one study showed that infants tend to look longer at a visual stimulus that corresponds to an auditory stimulus they hear from a recording. This is also why it is jarring to watch a video in which the spoken words do not match the movements of the speaker's mouth.
New studies have shown that it is possible that aspects of lip reading may indicate signs of autism. Research from Florida Atlantic University compared groups of infants (ages four to 12 months) to a group of adults in a test of lip reading abilities. The study discusses the significance of the shift babies make between watching the eyes and the mouth of people speaking at different developmental stages. At four months of age, they typically focus their attention on the eyes for understanding. Between six and eight months, during the "babbling" stage of language acquisition, they shift their focus to the mouth of the speaker. They continue lip reading until about 10 months of age, at which point they switch their attention back to the eyes. Researchers suggest that the second stage relates to the emergence of speech and the ability to better understand "social cues, shared meanings, beliefs and desires", according to professor of psychology David J. Lewkowicz. When hearing a language different from their native language, babies revert their attention back to the mouth, regardless of what stage of language acquisition they are in, and they continue to lip read up to about 12 months of age. Although more research is needed to support the claim, the data suggest that "the infants who continue to focus most of their attention on the mouth past 12 months of age are probably not developing the age-appropriate perceptual and cognitive skills and thus may be at risk for disorders like autism".
While lip reading is a natural ability that develops in babies at a young age, people can be taught to lip read and to become better lip readers. There are even trainers and teachers who can aid people when they are learning to lip read and help them to focus on certain context cues. Here are several ways lip reading can be taught or improved:
- training your eyes to help your ears
- watching the movements of the mouth, teeth and tongue
- reading the expression on the face
- noticing body language and gestures
- using residual hearing
- anticipation
Use of Speechreading by Deaf People
Speechreaders who have grown up deaf may never have heard the spoken language and are unlikely to be fluent users of it, which makes speechreading much more difficult. They must also learn the individual visemes by conscious training in an educational setting. In addition, speechreading takes a lot of focus, and can be extremely tiring. For these and other reasons, many deaf people prefer to use other means of communication with non-signers, such as mime and gesture, writing, and sign language interpreters.
To quote from Dorothy Clegg's 1953 book The Listening Eye, "When you are deaf you live inside a well-corked glass bottle. You see the entrancing outside world, but it does not reach you. After learning to lip read, you are still inside the bottle, but the cork has come out and the outside world slowly but surely comes in to you." This view—that speechreading, though difficult, can be successful—is relatively controversial within the deaf world; for an incomplete history of this debate, see manualism and oralism.
When talking with a deaf person who uses speechreading, exaggerated mouthing of words is not considered to be helpful and may in fact obscure useful clues. However, it is possible to learn to emphasize useful clues; this is known as "lip speaking".
Speechreading may be combined with cued speech—movements of the hands that visually represent otherwise invisible details of pronunciation. One of the arguments in favor of the use of cued speech is that it helps develop lip-reading skills that may be useful even when cues are absent, i.e., when communicating with non-deaf, non-hard of hearing people.
Cued speech helps to resolve speechreading ambiguities; ultimately the combined practice of lipreading and cued speech brings greater clarity and accuracy to understanding spoken sentences. Dr. R. Orin Cornett, who invented cued speech, is known for his work at Gallaudet University in Washington, D.C., before his death in 2002. In his research he conducted a study with 18 profoundly deaf children, each with at least four years of cued speech instruction, to test their understanding of language under different conditions intended to improve the clarity of sentences (i.e. cued speech alone, lipreading alone, cued speech combined with lipreading, etc.). His research showed that comprehension can reach up to 95% for deaf individuals when lipreading is combined with cued speech, a significant increase compared to the roughly 30% of words understood by lipreading alone. Likewise, a person who was listening, lipreading and exposed to cued speech showed increased understanding of the sentences. Thus, deaf individuals who rely on both lipreading and cued speech can achieve considerably greater clarity in understanding spoken language.
See the original article:
Lip Reading From Wikipedia, the free encyclopedia
Speech Perception
From Wikipedia, the free encyclopedia
Speech perception is the process by which the sounds of language are heard, interpreted and understood. The study of speech perception is closely linked to the fields of phonetics and phonology in linguistics, and to cognitive psychology and perception in psychology. Research in speech perception seeks to understand how human listeners recognize speech sounds and use this information to understand spoken language. Speech perception research has applications in building computer systems that can recognize speech, in improving speech recognition for hearing- and language-impaired listeners, and in foreign-language teaching.
Basics
The process of perceiving speech begins at the level of the sound signal and the process of audition. (For a complete description of the process of audition see Hearing.) After processing the initial auditory signal, speech sounds are further processed to extract acoustic cues and phonetic information. This speech information can then be used for higher-level language processes, such as word recognition.
Acoustic Cues
It is not easy to identify what acoustic cues listeners are sensitive to when perceiving a particular speech sound:
At first glance, the solution to the problem of how we perceive speech seems deceptively simple. If one could identify stretches of the acoustic waveform that correspond to units of perception, then the path from sound to meaning would be clear. However, this correspondence or mapping has proven extremely difficult to find, even after some forty-five years of research on the problem.

If a specific aspect of the acoustic waveform indicated one linguistic unit, a series of tests using speech synthesizers would be sufficient to determine such a cue or cues. However, there are two significant obstacles:
- One acoustic aspect of the speech signal may cue different linguistically relevant dimensions. For example, the duration of a vowel in English can indicate whether or not the vowel is stressed, or whether it is in a syllable closed by a voiced or a voiceless consonant, and in some cases (like American English /ɛ/ and /æ/) it can distinguish the identity of vowels. Some experts even argue that duration can help in distinguishing what are traditionally called short and long vowels in English.
- One linguistic unit can be cued by several acoustic properties. For example, in a classic experiment, Alvin Liberman (1957) showed that the onset formant transitions of /d/ differ depending on the following vowel (see Figure 1), but they are all interpreted as the phoneme /d/ by listeners.
Linearity and the Segmentation Problem
Figure 2: A spectrogram of the phrase "I owe you". There are no clearly distinguishable boundaries between speech sounds.
Because the speech signal is not strictly linear (speech sounds overlap rather than following one another discretely), a segmentation problem arises: one encounters serious difficulties trying to delimit a stretch of the speech signal as belonging to a single perceptual unit. This can be illustrated by the fact that the acoustic properties of the phoneme /d/ depend on the production of the following vowel (because of coarticulation).
Lack of Invariance
The research and application of speech perception must deal with several problems which result from what has been termed the lack of invariance. As was suggested above, reliable constant relations between a phoneme of a language and its acoustic manifestation in speech are difficult to find. There are several reasons for this:
- Context-induced variation. The phonetic environment affects the acoustic properties of speech sounds. For example, /u/ in English is fronted when surrounded by coronal consonants. Similarly, the voice onset time (VOT) values marking the boundary between voiced and voiceless plosives differ for labial, alveolar and velar plosives, and they shift under stress or depending on the position within a syllable.
- Variation due to differing speech conditions. One important factor that causes variation is differing speech rate. Many phonemic contrasts are constituted by temporal characteristics (short vs. long vowels or consonants, affricates vs. fricatives, plosives vs. glides, voiced vs. voiceless plosives, etc.) and they are certainly affected by changes in speaking tempo. Another major source of variation is articulatory carefulness vs. the sloppiness typical of connected speech (articulatory "undershoot" is obviously reflected in the acoustic properties of the sounds produced).
- Variation due to different speaker identity. The resulting acoustic structure of concrete speech productions depends on the physical and psychological properties of individual speakers. Men, women, and children generally produce voices having different pitch. Because speakers have vocal tracts of different sizes (due to sex and age especially), the resonant frequencies (formants), which are important for recognition of speech sounds, will vary in their absolute values across individuals (see Figure 3 for an illustration of this). Research shows that infants at the age of 7.5 months cannot recognize information presented by speakers of different genders; however, by the age of 10.5 months, they can detect the similarities. Dialect and foreign accent can also cause variation, as can the social characteristics of the speaker and listener.
Perceptual Constancy and Normalization
Despite the great variety of speakers and conditions described above, listeners perceive vowels and consonants as constant categories. It has been proposed that this is achieved through a process of perceptual normalization, in which listeners filter out the variation to arrive at the underlying category. Whether or not normalization actually takes place, and what its exact nature is, is a matter of theoretical controversy (see theories below). Perceptual constancy is a phenomenon not specific to speech perception; it exists in other types of perception too.
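The article does not name a specific normalization procedure, so the sketch below uses one classic method from the phonetics literature, Lobanov (z-score) normalization, purely as an illustration of what normalizing formant values across speakers can look like; the formant values in the example are invented.

```python
# Sketch of Lobanov (z-score) vowel normalization: each speaker's formants are
# expressed relative to that speaker's own mean and standard deviation, so that
# the same vowel produced by a large and a small vocal tract ends up in roughly
# the same region of the normalized space. Formant values below are invented
# for illustration only.
from statistics import mean, stdev

def lobanov_normalize(formants_hz):
    """Normalize a speaker's list of formant measurements (Hz) to z-scores."""
    m, s = mean(formants_hz), stdev(formants_hz)
    return [(f - m) / s for f in formants_hz]

# Hypothetical F1 values (Hz) for the vowels [i], [a], [u]:
adult_male_f1 = [300, 750, 350]
child_f1      = [420, 1050, 490]   # higher absolute frequencies, same vowels

print(lobanov_normalize(adult_male_f1))
print(lobanov_normalize(child_f1))
# After normalization the two speakers' vowel patterns line up closely,
# even though the raw frequencies differ substantially.
```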
Categorical Perception
Figure 4: Example identification (red) and discrimination (blue) functions.
In an artificial continuum between a voiceless and a voiced bilabial plosive, each new step differs from the preceding one in the amount of VOT. The first sound is a pre-voiced [b], i.e. it has a negative VOT. Then, increasing the VOT, it reaches zero, i.e. the plosive is a plain unaspirated voiceless [p]. Gradually, adding the same amount of VOT at a time, the plosive is eventually a strongly aspirated voiceless bilabial [pʰ]. (Such a continuum was used in an experiment by Lisker and Abramson in 1970. The sounds they used are available online.) In this continuum of, for example, seven sounds, native English listeners will identify the first three sounds as /b/ and the last three sounds as /p/ with a clear boundary between the two categories. A two-alternative identification (or categorization) test will yield a discontinuous categorization function (see red curve in Figure 4).
In tests of the ability to discriminate between two sounds with varying VOT values but having a constant VOT distance from each other (20 ms for instance), listeners are likely to perform at chance level if both sounds fall within the same category and at nearly 100% level if each sound falls in a different category (see the blue discrimination curve in Figure 4).
The conclusion to make from both the identification and the discrimination test is that listeners will have different sensitivity to the same relative increase in VOT depending on whether or not the boundary between categories was crossed. Similar perceptual adjustment is attested for other acoustic cues as well.
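The identification and discrimination patterns described above can be sketched numerically. The sketch below assumes a logistic identification function over VOT and an illustrative linking rule in which discrimination accuracy grows with the difference between two stimuli's identification probabilities; the boundary, slope and step values are invented and are not Lisker and Abramson's data.

```python
# Sketch of categorical perception along a VOT continuum.
# Assumptions (illustrative only): identification of /p/ follows a logistic
# function of VOT with a boundary near +25 ms; discrimination accuracy for a
# pair of stimuli is approximated by chance (50%) plus half the difference in
# their identification probabilities.
import math

def p_identified_as_p(vot_ms, boundary=25.0, slope=0.4):
    """Probability that a stimulus with the given VOT is labeled /p/."""
    return 1.0 / (1.0 + math.exp(-slope * (vot_ms - boundary)))

continuum = [-30, -10, 10, 30, 50, 70, 90]   # seven VOT steps (ms), /b/ to aspirated /p/

print("Identification (percent /p/ responses):")
for vot in continuum:
    print(f"  VOT {vot:4d} ms: {100 * p_identified_as_p(vot):5.1f}%")

print("Discrimination of adjacent pairs (20 ms apart):")
for a, b in zip(continuum, continuum[1:]):
    pa, pb = p_identified_as_p(a), p_identified_as_p(b)
    # Near chance within a category, peaking at the category boundary.
    accuracy = 0.5 + 0.5 * abs(pa - pb)
    print(f"  {a:4d} ms vs {b:4d} ms: {100 * accuracy:5.1f}%")
```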
Top-Down Influences
The process of speech perception is not necessarily uni-directional. That is, higher-level language processes connected with morphology, syntax, or semantics may interact with basic speech perception processes to aid in the recognition of speech sounds. It may not be necessary, or perhaps even possible, for a listener to recognize phonemes before recognizing higher units such as words. After obtaining at least a fundamental piece of information about the phonemic structure of the perceived entity from the acoustic signal, listeners can compensate for missing or noise-masked phonemes using their knowledge of the spoken language.
In a classic experiment, Richard M. Warren (1970) replaced one phoneme of a word with a cough-like sound. His subjects restored the missing speech sound perceptually without any difficulty and could not accurately identify which phoneme had been disturbed. This is known as the phonemic restoration effect. Another basic experiment compares recognition of naturally spoken words presented in a sentence (or at least a phrase) and the same words presented in isolation. Perception accuracy usually drops in the latter condition. Garnes and Bond (1976) also used carrier sentences when researching the influence of semantic knowledge on perception. They created series of words differing in one phoneme (bay/day/gay, for example). The quality of the first phoneme changed along a continuum. All these stimuli were put into different sentences each of which made sense with one of the words only. Listeners had a tendency to judge the ambiguous words (when the first segment was at the boundary between categories) according to the meaning of the whole sentence.
Brain Damage
Speech Perception with Acquired Brain Disabilities
The earliest hypotheses about speech perception were developed in work with patients who had acquired an auditory comprehension deficit, also known as receptive aphasia. Since then, many disabilities have been classified, which has helped refine the definition of 'speech perception'. The term 'speech perception' describes the process of interest that uses sublexical contexts in the probing process. It comprises many different language and grammatical functions, such as: features, segments (phonemes), syllabic structure (units of pronunciation), phonological word forms (how sounds are grouped together), grammatical features, morphemes (prefixes and suffixes), and semantic information (the meaning of words). In the early years, researchers were more interested in the acoustics of speech, for instance looking at the differences between /ba/ and /da/; more recently, research has been directed to the brain's response to such stimuli. In recent years, a model known as the Dual Stream Model has been developed to make sense of how speech perception works, and it has drastically changed how psychologists look at perception. The first section of the Dual Stream Model is the ventral pathway, which incorporates the middle temporal gyrus, the inferior temporal sulcus and perhaps the inferior temporal gyrus. The ventral pathway maps phonological representations onto lexical or conceptual representations, that is, the meanings of words. The second section of the Dual Stream Model is the dorsal pathway, which includes the sylvian parietotemporal area, the inferior frontal gyrus, the anterior insula, and the premotor cortex. Its primary function is to take sensory or phonological stimuli and transform them into an articulatory-motor representation (the formation of speech).
Acquired Brain Disabilities
Aphasia: There are two main kinds of aphasia: expressive aphasia (also known as Broca's aphasia) and receptive aphasia (also known as Wernicke's aphasia). There are three distinctive dimensions to phonetics: manner of articulation, place of articulation, and voicing.
Expressive aphasia: Patients who suffer from this condition typically have lesions on the left inferior frontal cortex. These patients are described as having severe syntactic deficits, meaning they have extreme difficulty forming sentences correctly. Their deficits concern the more regular, rule-governed principles of sentence formation, a pattern closely related to that of Alzheimer's patients. For instance, instead of saying "the red ball bounced", both kinds of patients might say "bounced ball the red". This is just one example of what a person might say; there are of course many possibilities.
Receptive aphasia: These patients have lesions or damage in the left temporoparietal lobe. They mostly suffer from lexical-semantic difficulties, but also have difficulties with comprehension tasks. Though they have difficulty saying or describing things, they have been shown to do well in online comprehension tasks. This is closely related to Parkinson's disease, because patients with both conditions have trouble distinguishing irregular verbs: using the example "the dog went home", a person with one of these conditions might say "the dog goed home".
Parkinson's disease: This disease attacks the brain and leaves patients unable to stop shaking; its effects can include difficulty in walking, communicating, or functioning. Over time the symptoms progress from mild to severe, which can cause extreme difficulties in a person's life. Many psychologists relate Parkinson's disease to progressive nonfluent aphasia, which causes comprehension deficits and problems recognizing irregular verbs, as in the "the dog goed home" example above.
Treatments
Aphasia: A group of psychologists conducted a study to test the McGurk effect with aphasic patients and speech reading. The subjects watched dubbed videos in which the audio and the visual did not match. After the first part of the experiment, the experimenters taught the aphasic patients to speech read, that is, to read lips. They then conducted the same test and found that the subjects still had more of an advantage with audio only than with visual only, but also that the subjects did better with audio-visual input than with audio alone. The patients also improved on place of articulation and manner of articulation. All of this suggests that aphasic patients might benefit from learning how to speech read (lip read).
Parkinson's disease: There are quite a few possible drug therapies for Parkinson's disease (e.g. Sinemet). Since there is no cure, a patient will often end up having surgery to relieve some of the symptoms, most likely deep brain stimulation, which keeps the brain stimulated even as the disease tries to disable it. Recently a study tested whether patients are more aware of their symptoms after surgery than before. The symptoms were still present, but the patients were more aware of their difficulties than before surgery. This suggests that surgery can improve a patient's speech perception, even though it cannot cure the disease.
Research Topics
Infant Speech Perception
Infants begin the process of language acquisition by being able to detect very small differences between speech sounds. They can discriminate all possible speech contrasts (phonemes). Gradually, as they are exposed to their native language, their perception becomes language-specific, i.e. they learn how to ignore the differences within phonemic categories of the language (differences that may well be contrastive in other languages – for example, English distinguishes two voicing categories of plosives, whereas Thai has three categories; infants must learn which differences are distinctive in their native language and which are not). As infants learn how to sort incoming speech sounds into categories, ignoring irrelevant differences and reinforcing the contrastive ones, their perception becomes categorical. Infants learn to contrast different vowel phonemes of their native language by approximately 6 months of age. The native consonantal contrasts are acquired by 11 or 12 months of age. Some researchers have proposed that infants may be able to learn the sound categories of their native language through passive listening, using a process called statistical learning. Others even claim that certain sound categories are innate, that is, they are genetically specified (see discussion about innate vs. acquired categorical distinctiveness).
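One simplified way to picture the statistical (distributional) learning idea mentioned above is sketched below: unlabeled VOT values drawn from two clusters are handed to a two-component Gaussian mixture model, which recovers two category-like modes on its own. The data, the single VOT dimension, and the choice of a Gaussian mixture are illustrative assumptions, not a claim about how infants actually implement this learning.

```python
# Sketch of distributional (statistical) learning of a voicing contrast.
# An unlabeled set of VOT values drawn from two clusters is given to a
# two-component Gaussian mixture model, which recovers two category-like
# modes without being told which tokens are /b/-like and which are /p/-like.
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)

# Hypothetical English-like input: short-lag VOTs (~10 ms) and long-lag VOTs (~60 ms)
vot_samples = np.concatenate([
    rng.normal(loc=10, scale=8, size=300),   # tokens of /b/-like plosives
    rng.normal(loc=60, scale=12, size=300),  # tokens of /p/-like plosives
]).reshape(-1, 1)

gmm = GaussianMixture(n_components=2, random_state=0).fit(vot_samples)
print("Learned category means (ms):", sorted(gmm.means_.ravel()))

# A new token can now be assigned to one of the learned categories:
print("Category of a 15 ms token:", gmm.predict([[15.0]])[0])
```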
If day-old babies are presented with their mother's voice speaking normally, abnormally (in monotone), and a stranger's voice, they react only to their mother's voice speaking normally. When a human and a non-human sound is played, babies turn their head only to the source of human sound. It has been suggested that auditory learning begins as early as the prenatal period.
One of the techniques used to examine how infants perceive speech, besides the head-turn procedure mentioned above, is measuring their sucking rate. In such an experiment, a baby is sucking a special nipple while presented with sounds. First, the baby's normal sucking rate is established. Then a stimulus is played repeatedly. When the baby hears the stimulus for the first time the sucking rate increases but as the baby becomes habituated to the stimulation the sucking rate decreases and levels off. Then, a new stimulus is played to the baby. If the baby perceives the newly introduced stimulus as different from the background stimulus the sucking rate will show an increase. The sucking-rate and the head-turn method are some of the more traditional, behavioral methods for studying speech perception. Among the new methods (see Research methods below) that help us to study speech perception, near-infrared spectroscopy is widely used in infants.
It has also been discovered that even though infants' ability to distinguish between the different phonetic properties of various languages begins to decline around the age of nine months, it is possible to reverse this process with sufficient exposure to a new language. In a research study by Patricia K. Kuhl, Feng-Ming Tsao, and Huei-Mei Liu, it was discovered that if infants are spoken to and interacted with by a native speaker of Mandarin Chinese, they can be conditioned to retain their ability to distinguish speech sounds within Mandarin that are very different from the speech sounds of English. This suggests that, given the right conditions, it is possible to prevent infants from losing the ability to distinguish speech sounds in languages other than their native one.
Cross-Language and Second-Language Speech Perception
A large amount of research has studied how users of a language perceive foreign speech (referred to as cross-language speech perception) or second-language speech (second-language speech perception). The latter falls within the domain of second language acquisition.
Languages differ in their phonemic inventories. Naturally, this creates difficulties when a foreign language is encountered. For example, if two foreign-language sounds are assimilated to a single mother-tongue category the difference between them will be very difficult to discern. A classic example of this situation is the observation that Japanese learners of English will have problems with identifying or distinguishing English liquid consonants /l/ and /r/ (see Japanese speakers learning r and l).
Best (1995) proposed a Perceptual Assimilation Model which describes possible cross-language category assimilation patterns and predicts their consequences. Flege (1995) formulated a Speech Learning Model which combines several hypotheses about second-language (L2) speech acquisition and which predicts, in simple terms, that an L2 sound that is not too similar to a native-language (L1) sound will be easier to acquire than an L2 sound that is relatively similar to an L1 sound (because the former will be perceived as more obviously "different" by the learner).
Speech Perception in Language or Hearing Impairment
Research in how people with language or hearing impairment perceive speech is not only intended to discover possible treatments. It can provide insight into the principles underlying non-impaired speech perception. Two areas of research can serve as an example:
- Listeners with aphasia. Aphasia affects both the expression and reception of language. The two most common types, expressive aphasia and receptive aphasia, both affect speech perception to some extent. Expressive aphasia causes moderate difficulties for language understanding; the effect of receptive aphasia on understanding is much more severe. It is generally agreed that aphasics suffer from perceptual deficits: they usually cannot fully distinguish place of articulation and voicing. As for other features, the difficulties vary. It has not yet been proven whether low-level speech-perception skills are affected in aphasia sufferers or whether their difficulties are caused by higher-level impairment alone.
- Listeners with cochlear implants. Cochlear implantation restores access to the acoustic signal in individuals with sensorineural hearing loss. The acoustic information conveyed by an implant is usually sufficient for implant users to properly recognize the speech of people they know, even without visual clues. It is more difficult for cochlear implant users to understand unknown speakers and sounds. The perceptual abilities of children who received an implant after the age of two are significantly better than those of people implanted in adulthood. A number of factors have been shown to influence perceptual performance, specifically: duration of deafness prior to implantation, age of onset of deafness, age at implantation (such age effects may be related to the critical period hypothesis) and the duration of implant use. There are differences between children with congenital and acquired deafness: postlingually deaf children have better results than the prelingually deaf and adapt to a cochlear implant faster. In children with cochlear implants, as in children with normal hearing, sensitivity to vowels and voice onset time develops before the ability to discriminate place of articulation. Several months after implantation, children with cochlear implants can normalize their speech perception.
Noise
One of the basic problems in the study of speech is how to deal with the noise in the speech signal. This is shown by the difficulty that computer speech recognition systems have with recognizing human speech. These programs can do well at recognizing speech when they have been trained on a specific speaker's voice, and under quiet conditions. However, these systems often do poorly in more realistic listening situations where humans can understand speech without difficulty.
Music-Language Connection
Research into the relationship between music and cognition is an emerging field related to the study of speech perception. Originally it was theorized that the neural signals for music were processed in a specialized "module" in the right hemisphere of the brain, while the neural signals for language were processed by a similar "module" in the left hemisphere. However, using technologies such as fMRI, research has shown that two regions of the brain traditionally considered to be exclusively speech-processing areas, Broca's and Wernicke's areas, also become active during musical activities such as listening to a sequence of musical chords. Other studies, such as one performed by Marques et al. in 2006, showed that 8-year-olds who were given six months of musical training showed an increase in both their pitch detection performance and their electrophysiological measures when made to listen to an unknown foreign language.
Conversely, some research has revealed that, rather than music affecting our perception of speech, our native speech can affect our perception of music. One example is the tritone paradox, in which a listener is presented with two computer-generated tones (such as C and C-sharp) that are half an octave (a tritone) apart and is then asked to determine whether the pitch of the sequence is descending or ascending. One such study, performed by Diana Deutsch, found that the listener's interpretation of ascending or descending pitch was influenced by the listener's language or dialect, showing variation between those raised in the south of England and those raised in California, and between those from Vietnam and native English speakers from California. A second study, performed in 2006 on a group of English speakers and three groups of East Asian students at the University of Southern California, found that English speakers who had begun musical training at or before age 5 had an 8% chance of having perfect pitch, whereas among the East Asian students who were fluent in their native tonal language, 92% had perfect pitch.
Research Methods
The methods used in speech perception research can be roughly divided into three groups: behavioral, computational, and, more recently, neurophysiological methods. Behavioral experiments are based on an active role of a participant, i.e. subjects are presented with stimuli and asked to make conscious decisions about them. This can take the form of an identification test, a discrimination test, similarity rating, etc. These types of experiments help to provide a basic description of how listeners perceive and categorize speech sounds.
Computational modeling has also been used to simulate how speech may be processed by the brain to produce behaviors that are observed. Computer models have been used to address several questions in speech perception, including how the sound signal itself is processed to extract the acoustic cues used in speech, and how speech information is used for higher-level processes, such as word recognition.
Neurophysiological methods rely on utilizing information stemming from more direct and not necessarily conscious (pre-attentive) processes. Subjects are presented with speech stimuli in different types of tasks and the responses of the brain are measured. The brain itself can be more sensitive than it appears to be through behavioral responses. For example, the subject may not show sensitivity to the difference between two speech sounds in a discrimination test, but brain responses may reveal sensitivity to these differences. Methods used to measure neural responses to speech include event-related potentials, magnetoencephalography, and near infrared spectroscopy. One important response used with event-related potentials is the mismatch negativity, which occurs when speech stimuli are acoustically different from a stimulus that the subject heard previously.
Neurophysiological methods were introduced into speech perception research for several reasons:
- Behavioral responses may reflect late, conscious processes and be affected by other systems such as orthography, and thus they may mask a listener's ability to recognize sounds based on lower-level acoustic distributions.
- Without the necessity of taking an active part in the test, even infants can be tested; this feature is crucial in research into acquisition processes.
- The possibility of observing low-level auditory processes independently from higher-level ones makes it possible to address long-standing theoretical issues, such as whether or not humans possess a specialized module for perceiving speech, or whether or not some complex acoustic invariance (see lack of invariance above) underlies the recognition of a speech sound.
Theories
Research into speech perception (SP) has by no means explained every aspect of the processes involved, and much of what has been said about SP is a matter of theory. Several theories have been devised to address the issues mentioned above and other unclear questions. Not all of them give satisfactory explanations of all problems, but the research they have inspired has yielded a lot of useful data.
Speech Mode Hypothesis
The Speech Mode Hypothesis is the idea that the perception of speech requires specialized mental processing. It is an offshoot of Fodor's modularity theory (see Modularity of Mind) and posits a vertical processing mechanism in which limited stimuli are processed by special-purpose, stimulus-specific areas of the brain.
Two Versions of Speech Mode Hypothesis
- Weak Version - Listening to speech engages previous knowledge of language.
- Strong Version - Listening to speech engages specialized speech mechanisms for perceiving speech.
Three important experimental paradigms have evolved in the search for evidence for the speech mode hypothesis: dichotic listening, categorical perception, and duplex perception. Research in these paradigms has suggested that there may not be a specific speech mode, but instead one for auditory codes that require complicated auditory processing. It also seems that modularity is learned in perceptual systems. Despite this, the evidence and counter-evidence for the Speech Mode Hypothesis remain unclear and require further research.
Motor Theory
Some of the earliest work in the study of how humans perceive speech sounds was conducted by Alvin Liberman and his colleagues at Haskins Laboratories. Using a speech synthesizer, they constructed speech sounds that varied in place of articulation along a continuum from /bɑ/ to /dɑ/ to /ɡɑ/. Listeners were asked to identify which sound they heard and to discriminate between two different sounds. The results of the experiment showed that listeners grouped sounds into discrete categories, even though the sounds they were hearing were varying continuously. Based on these results, they proposed the notion of categorical perception as a mechanism by which humans can identify speech sounds.
More recent research using different tasks and methods suggests that listeners are highly sensitive to acoustic differences within a single phonetic category, contrary to a strict categorical account of speech perception.
To provide a theoretical account of the categorical perception data, Liberman and colleagues worked out the motor theory of speech perception, where "the complicated articulatory encoding was assumed to be decoded in the perception of speech by the same processes that are involved in production" (this is referred to as analysis-by-synthesis). For instance, the English consonant /d/ may vary in its acoustic details across different phonetic contexts (see above), yet all /d/'s as perceived by a listener fall within one category (voiced alveolar plosive) and that is because "linguistic representations are abstract, canonical, phonetic segments or the gestures that underlie these segments." When describing units of perception, Liberman later abandoned articulatory movements and moved on to the neural commands to the articulators, and later still to intended articulatory gestures, thus "the neural representation of the utterance that determines the speaker's production is the distal object the listener perceives". The theory is closely related to the modularity hypothesis, which proposes the existence of a special-purpose module, which is supposed to be innate and probably human-specific.
The theory has been criticized for being unable to "provide an account of just how acoustic signals are translated into intended gestures" by listeners. Furthermore, it is unclear how indexical information (e.g. talker identity) is encoded or decoded along with linguistically relevant information.
Direct Realist Theory
The direct realist theory of speech perception (mostly associated with Carol Fowler) is a part of the more general theory of direct realism, which postulates that perception allows us to have direct awareness of the world because it involves direct recovery of the distal source of the event that is perceived. For speech perception, the theory asserts that the objects of perception are actual vocal tract movements, or gestures, and not abstract phonemes or (as in the Motor Theory) events that are causally antecedent to these movements, i.e. intended gestures. Listeners perceive gestures not by means of a specialized decoder (as in the Motor Theory) but because information in the acoustic signal specifies the gestures that form it. By claiming that the actual articulatory gestures that produce different speech sounds are themselves the units of speech perception, the theory bypasses the problem of lack of invariance.
Fuzzy-Logical Model
The fuzzy logical theory of speech perception developed by Dominic Massaro proposes that people remember speech sounds in a probabilistic, or graded, way. It suggests that people remember descriptions of the perceptual units of language, called prototypes. Within each prototype various features may combine. However, features are not just binary (true or false); there is a fuzzy value corresponding to how likely it is that a sound belongs to a particular speech category. Thus, when perceiving a speech signal our decision about what we actually hear is based on the relative goodness of the match between the stimulus information and values of particular prototypes. The final decision is based on multiple features or sources of information, even visual information (this explains the McGurk effect). Computer models of the fuzzy logical theory have been used to demonstrate that the theory's predictions of how speech sounds are categorized correspond to the behavior of human listeners.
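The integration step of the fuzzy logical model can be made concrete: each information source assigns a fuzzy support value to each candidate category, the supports are multiplied per candidate, and the products are normalized across candidates. The sketch below uses invented support values for a McGurk-style audiovisual case; only the multiply-and-normalize combination rule is meant to reflect the model.

```python
# Sketch of the fuzzy logical model's integration rule: support from each
# information source (here auditory and visual) is a fuzzy value in [0, 1];
# supports are multiplied per candidate and normalized across candidates.
# The numeric support values are invented for illustration.

def flmp(auditory_support, visual_support):
    """Return the probability of each candidate under multiplicative integration."""
    combined = {c: auditory_support[c] * visual_support[c] for c in auditory_support}
    total = sum(combined.values())
    return {c: v / total for c, v in combined.items()}

# A McGurk-style case: the audio weakly favors /ba/, the face favors a back
# articulation, and the fused percept /da/ gets moderate support from both.
auditory = {"ba": 0.6, "da": 0.3, "ga": 0.1}
visual   = {"ba": 0.1, "da": 0.4, "ga": 0.5}

print(flmp(auditory, visual))
# Although neither source favors /da/ most strongly on its own, /da/ wins
# once the two sources are combined.
```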
Acoustic Landmarks and Distinctive Features
In addition to the proposals of Motor Theory and Direct Realism about the relation between phonological features and articulatory gestures, Kenneth N. Stevens proposed another kind of relation: between phonological features and auditory properties. According to this view, listeners are inspecting the incoming signal for the so-called acoustic landmarks which are particular events in the spectrum carrying information about gestures which produced them. Since these gestures are limited by the capacities of humans' articulators and listeners are sensitive to their auditory correlates, the lack of invariance simply does not exist in this model. The acoustic properties of the landmarks constitute the basis for establishing the distinctive features. Bundles of them uniquely specify phonetic segments (phonemes, syllables, words).
Exemplar Theory
Exemplar models of speech perception differ from the four theories mentioned above, which suppose that there is no connection between word recognition and talker recognition and that variation across talkers is "noise" to be filtered out.
The exemplar-based approaches claim listeners store information for both word and talker recognition. According to this theory, particular instances of speech sounds are stored in the memory of a listener. In the process of speech perception, the remembered instances of, for example, a syllable stored in the listener's memory are compared with the incoming stimulus so that the stimulus can be categorized. Similarly, when recognizing a talker, all the memory traces of utterances produced by that talker are activated and the talker's identity is determined. Supporting this theory are several experiments reported by Johnson that suggest that our signal identification is more accurate when we are familiar with the talker or when we have a visual representation of the talker's gender. When the talker is unpredictable or the sex is misidentified, the error rate in word identification is much higher.
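A minimal sketch of exemplar-style categorization is given below: stored tokens keep their category and talker labels, an incoming stimulus is compared with every stored token, and similarity-weighted evidence decides the category. The single acoustic feature, the exponential similarity function and the stored tokens are illustrative assumptions rather than a specific published model.

```python
# Sketch of exemplar-style categorization: every remembered token keeps its
# category label and its talker; the incoming stimulus is compared to all of
# them, and similarity-weighted votes decide the category.
import math

# Stored exemplars: (feature value, category label, talker).
# The single feature stands in for some acoustic dimension, e.g. VOT in ms.
exemplars = [
    (8, "b", "talker1"), (12, "b", "talker2"), (15, "b", "talker1"),
    (55, "p", "talker1"), (62, "p", "talker2"), (70, "p", "talker2"),
]

def similarity(x, y, sensitivity=0.1):
    """Exponentially decaying similarity between two feature values."""
    return math.exp(-sensitivity * abs(x - y))

def categorize(stimulus):
    """Sum similarity-weighted evidence per category and pick the strongest."""
    evidence = {}
    for value, category, _talker in exemplars:
        evidence[category] = evidence.get(category, 0.0) + similarity(stimulus, value)
    return max(evidence, key=evidence.get)

print(categorize(20))   # closer to the stored /b/ tokens -> "b"
print(categorize(58))   # closer to the stored /p/ tokens -> "p"
```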
The exemplar models face several objections, two of which are (1) insufficient memory capacity to store every utterance ever heard and (2), concerning the ability to produce what was heard, whether the talker's own articulatory gestures are also stored or are computed when producing utterances that would sound like the auditory memories.
See the original article:
Speech Perception From Wikipedia, the free encyclopedia