RemNote Community

Phonetics - Acoustic and Auditory Foundations

Understand the acoustic properties of speech sounds, how vowels are classified using formants, and the main theories of speech perception.


Summary

Acoustic Characteristics of Speech

Airstream Mechanisms

When we produce speech sounds, we need a source of air to create acoustic energy. The most common way to generate this airflow is the pulmonic airstream mechanism: air from the lungs passes through the vocal tract. This is how the vast majority of human speech sounds are produced. However, the lungs are not the only possible source. For certain specialized speech sounds, speakers can use the glottis (the space between the vocal folds) or the tongue as alternative airstream sources. These less common mechanisms produce sounds with different acoustic characteristics, though they are used far less frequently in everyday speech across languages.

Voicing and Phonation Details

A crucial distinction among speech sounds is whether the vocal folds vibrate. This distinction creates two categories: voiced and voiceless sounds.

Voiced sounds occur when the vocal folds vibrate as air passes through them. This vibration creates a periodic waveform: a repeating, regular pattern of sound waves. The periodic vibration has two key acoustic components: a fundamental frequency (the lowest frequency of vibration, abbreviated F0) and harmonics (whole-number multiples of that fundamental frequency that occur naturally when something vibrates). The fundamental frequency reflects how fast the vocal folds vibrate, which correlates with perceived pitch.

Voiceless sounds, by contrast, have no vocal fold vibration, so they lack the periodic structure of voiced sounds. They can be further divided into two types: voiceless plosives (like [p] or [t]) produce near-silence during their closure, while voiceless fricatives (like [s] or [f]) generate turbulent, noise-like acoustic energy as air is forced through a narrow constriction in the vocal tract.
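The structure of a voiced source, a fundamental plus its harmonics, can be sketched numerically. This is a minimal illustration, not a model of real vocal folds; the 1/k amplitude fall-off is an arbitrary choice for demonstration:

```python
import numpy as np

def voiced_source(f0, n_harmonics, sr=16000, dur=0.05):
    """Sketch of a voiced source: a periodic waveform built by summing
    a fundamental frequency (F0) and its harmonics. Amplitudes fall off
    as 1/k, an illustrative choice rather than a vocal-fold model."""
    t = np.arange(int(sr * dur)) / sr
    wave = np.zeros_like(t)
    for k in range(1, n_harmonics + 1):
        wave += (1.0 / k) * np.sin(2 * np.pi * k * f0 * t)
    return wave

# A 100 Hz fundamental repeats every 10 ms, so at a 16 kHz sample rate
# the waveform is periodic with a period of sr / f0 = 160 samples.
w = voiced_source(100, 10)
```

Because every harmonic is an exact multiple of F0, the summed waveform repeats with the fundamental's period, which is exactly the "periodic waveform" property that distinguishes voiced from voiceless sounds.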
<extrainfo> The distinction between voiced and voiceless is so fundamental that it appears in nearly all human languages and is one of the easiest ways to categorize consonant sounds. </extrainfo>

Vowel Description

Vowels are described along three primary dimensions, each relating to the physical position of the tongue and lips.

Vowel Height describes how high or low the tongue sits in the mouth. The traditional categories are:
- High (or "close") vowels, where the tongue is raised high in the mouth, like the vowel in "fleece" [i]
- Close-mid vowels, with the tongue in an intermediate high position
- Open-mid vowels, with the tongue in an intermediate low position
- Low (or "open") vowels, where the tongue is lowered, like the vowel in "lot" [ɑ]

Vowel Backness describes the front-to-back position of the tongue. There are three categories:
- Front vowels, with the tongue positioned toward the front of the mouth, like [i] in "fleece"
- Central vowels, with the tongue roughly in the middle, like [ə] (schwa)
- Back vowels, with the tongue positioned toward the back of the mouth, like [u] in "goose"

Lip Rounding is the third dimension. Vowels can be either rounded (with protruded lips, like [u]) or unrounded (with neutral or spread lips, like [i]). Interestingly, lip rounding often correlates with vowel height and backness: back vowels tend to be rounded, while front vowels tend to be unrounded.

Together, these three dimensions describe a vowel's articulatory position and allow linguists to classify vowel sounds across languages.

Formants and Acoustic Vowel Quality

While the articulatory descriptions above tell us where the tongue is positioned, they don't directly explain how we hear vowel differences. This is where formants come in. Formants are resonant frequencies of the vocal tract: specific frequencies at which the vocal tract naturally amplifies sound energy.
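How formant values relate to vowel identity can be previewed with a toy nearest-neighbor classifier over the first two formants. The (F1, F2) values below are rough textbook averages for adult male speakers, assumed purely for illustration; real formants vary widely by speaker and context:

```python
# Toy vowel classifier using the first two formant frequencies (Hz).
# The reference values are approximate textbook averages, used only
# for illustration; they are not measurements from this document.
REFERENCE_FORMANTS = {
    "i (fleece)": (280, 2250),  # high front vowel: low F1, high F2
    "ɑ (lot)":    (710, 1100),  # low back vowel: high F1, lower F2
    "u (goose)":  (310, 870),   # high back vowel: low F1, low F2
}

def classify_vowel(f1, f2):
    """Return the reference vowel whose (F1, F2) pair is closest to
    the measured values (squared Euclidean distance in Hz)."""
    return min(
        REFERENCE_FORMANTS,
        key=lambda v: (REFERENCE_FORMANTS[v][0] - f1) ** 2
                    + (REFERENCE_FORMANTS[v][1] - f2) ** 2,
    )

# A token with low F1 and high F2 patterns as a high front vowel:
print(classify_vowel(300, 2100))  # → i (fleece)
```

Even this caricature shows the pattern the next paragraphs explain: F1 separates high from low vowels, while F2 separates front from back.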
Think of it like this: when the vocal folds vibrate, they produce many frequencies (the fundamental and its harmonics). The shape of the vocal tract then acts as a filter, boosting some frequencies and damping others. The frequencies that are boosted most are the formants. Different vowel shapes (created by different tongue positions) create different vocal tract configurations, which in turn create different formants. This is the crucial link between articulation and acoustics.

The first two formants (F1 and F2) are the primary acoustic correlates of vowel quality: they are the main frequencies our ears use to distinguish one vowel from another. F1 is particularly associated with vowel height (lower values for high vowels, higher values for low vowels), while F2 is particularly associated with vowel backness (higher values for front vowels, lower values for back vowels).

<extrainfo> Additional Vowel Features

Beyond the basic three-way distinction (height, backness, rounding), some languages use additional features to create vowel contrasts:
- Vowel nasality: whether air flows through the nose during vowel production (as in French "on" [ɔ̃])
- Vowel length: longer versus shorter duration versions of the "same" vowel
- Voice quality: features like creaky voice (very low-frequency vocal fold vibration) or breathy voice (incomplete closure of the vocal folds allowing air leakage)
- Advanced tongue root: whether the base of the tongue is pushed forward, altering the shape of the pharynx
- Pharyngealization: constriction of the pharynx, which affects resonance

These features are less universal than height, backness, and rounding, but they are important in specific languages and may appear on an exam if they were covered in your course materials. </extrainfo>

Speech Perception Overview

Speech perception is the process by which listeners decode an acoustic signal into meaningful linguistic units.
This isn't a simple matter of hearing sounds and automatically understanding them. Instead, our brains must interpret the continuous acoustic stream and parse it into discrete units: individual phonemes (the smallest sound units that distinguish meaning), morphemes (meaningful units like prefixes and suffixes), and words (complete units with meaning). This process happens nearly instantaneously and unconsciously, but it involves remarkable cognitive complexity.

The Auditory System

To understand how speech is perceived, we first need to understand how sound reaches the brain. When sound waves hit your eardrum (tympanum), they cause it to vibrate. These vibrations are then transferred across the middle ear through three tiny bones called ossicles: the malleus, incus, and stapes. These bones act as mechanical amplifiers that increase the efficiency of sound transmission.

The vibrations then pass to the cochlea, a fluid-filled, spiral-shaped structure in the inner ear. This is where frequency analysis happens. The cochlea contains a structure called the basilar membrane, which varies in width and stiffness along its length. Different frequencies of sound cause different parts of this membrane to vibrate maximally, creating what's called a tonotopic map: specific locations on the basilar membrane are "tuned" to specific frequencies. High frequencies cause vibrations near the entrance of the cochlea, while low frequencies cause vibrations deeper inside.

On top of the basilar membrane sit hair cells, which are sensory receptors. These hair cells detect the mechanical vibrations of the basilar membrane and convert them into neural signals, a process called transduction. The neural signals travel via the auditory nerve to the brainstem and up to the auditory cortex, where they are processed as meaningful information.
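The cochlea's place-based frequency analysis has a loose digital analogue in the Fourier transform, which separates a complex signal into its frequency components. A minimal sketch with an invented two-tone signal (the frequencies are arbitrary illustrations):

```python
import numpy as np

# Frequency decomposition sketch: the cochlea separates sound into
# frequency components by place of maximal vibration; an FFT does
# something loosely analogous in software.
sr = 8000                              # sample rate in Hz (arbitrary)
t = np.arange(sr) / sr                 # one second of samples
# Mix a 120 Hz "low" tone with a quieter 2000 Hz "high" tone.
signal = np.sin(2 * np.pi * 120 * t) + 0.5 * np.sin(2 * np.pi * 2000 * t)

spectrum = np.abs(np.fft.rfft(signal))           # magnitude spectrum
freqs = np.fft.rfftfreq(len(signal), d=1 / sr)   # bin frequencies (Hz)

# The two strongest components recover exactly the tones we mixed in.
peaks = freqs[np.argsort(spectrum)[-2:]]
print(sorted(peaks.tolist()))  # → [120.0, 2000.0]
```

The decomposition here is global over the whole signal; the ear does it continuously and in place along the basilar membrane, but the underlying idea, sound as a sum of frequency components, is the same.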
Understanding this pathway is important because it explains why speech perception begins with frequency analysis at the physical level: our ears literally decompose sound into its frequency components before our brains interpret what we're hearing.

Prosody

While the basics of speech perception rely on distinguishing individual phonemes, prosodic features convey important additional information layered on top of the basic sounds. Prosody refers to the intonation, rhythm, and stress patterns of speech. The main components are:
- Pitch: the perceived frequency of the voice, primarily determined by the fundamental frequency (how fast the vocal folds vibrate)
- Speech rate: how fast someone is talking
- Duration: how long individual sounds or syllables are held
- Loudness: the overall volume or intensity of the speech

These prosodic features serve multiple functions. They can signal stress (which syllable in a word should be emphasized), pitch accents (where melodic emphasis falls), and intonation patterns (the overall melody of a sentence, such as whether it is a statement or a question). In English, for example, the difference between "PREsent" (noun) and "preSENT" (verb) is primarily a matter of prosodic stress, even though the consonants and vowels are the same. Prosody is also language-specific: different languages use pitch and timing patterns in different ways, and learners must acquire these patterns when learning a non-native language.

Theories of Speech Perception

How exactly do our brains convert the acoustic signal into linguistic meaning? This is a major question in speech science, and researchers have proposed several competing theories. Understanding these theories is essential because they explain different aspects of how we decode speech.
Motor Theory

The motor theory of speech perception proposes something counterintuitive: to understand a speech sound, listeners don't just passively hear it; instead, they access the articulatory gestures (the movements of the mouth, tongue, vocal folds, etc.) that would be needed to produce that sound themselves. In other words, we understand speech by internally simulating how we would produce it.

This theory has an appealing logic: it explains why we're particularly good at understanding our native language (we've practiced producing those sounds ourselves) and why speech perception seems to involve motor areas of the brain, not just auditory areas. However, strong versions of this theory face a challenge: we can perceive speech sounds that we cannot produce (such as unfamiliar sounds from other languages), and even young infants who cannot yet produce speech can perceive it. This has led to weaker forms of motor theory, which propose a nondeterministic (not perfectly predictable) relationship between production and perception: articulatory information helps inform perception, but perception isn't purely dependent on motor simulation, and other information matters too.

Abstractionist Theories

Abstractionist theories take a different approach. Rather than focusing on how sounds are produced, they focus on what listeners extract from the acoustic signal. These theories propose that perception involves:
- Extracting an idealized lexical representation: an abstract, simplified version of what the word should sound like
- Normalizing acoustic variability: accounting for the fact that the same word sounds different when spoken by different speakers, in different contexts, or with different intonations

In other words, abstractionists argue that listeners don't store or remember the specific acoustic details of speech.
Instead, they extract more abstract, generalized representations that capture the essential information needed to distinguish one word from another. This approach explains why we can understand speakers with very different voices (we abstract away the speaker-specific details) and why we interpret the same acoustic signal differently depending on context.

Episodic and Exemplar Theories

Episodic theories (also called exemplar theories) propose something quite different: listeners actually do store detailed memory traces, or "exemplars," of previously heard speech tokens. When you encounter a new word, you compare it against the previous examples of that word you've heard. According to this approach, listeners use familiarity, essentially how well the current token matches the exemplars in memory, to categorize what they're hearing and resolve variability. A familiar accent or pronunciation is easier to understand because it matches stored exemplars; an unfamiliar accent is harder because there are fewer similar exemplars to compare against. This theory has interesting implications: it suggests that our perception of speech changes throughout our lives as we encounter new speakers and accents, and it explains why we gradually adjust to understanding non-native speakers or speakers with different dialects.

Comparing the Theories

These theories are not mutually exclusive. Modern speech perception research increasingly suggests that perception involves multiple processes operating in parallel:
- Some perceptual decisions may draw on motor-based information
- Some may involve abstracting idealized representations
- Some may involve comparing against episodic memories

The relative contribution of each mechanism likely varies with factors such as the clarity of the signal, whether the speaker is familiar or unfamiliar, and the linguistic experience of the listener.
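The exemplar-matching idea above can be caricatured as nearest-neighbor classification over stored tokens. The feature vectors and word labels below are invented for illustration (imagine a two-dimensional (F1, F2)-like acoustic space), not a real model of memory:

```python
# Caricature of exemplar theory: store every heard token as a point in
# an acoustic feature space, then label an incoming token by its
# closest remembered exemplar. All values are invented for illustration.
from math import dist

# (feature vector, word label) pairs standing in for remembered tokens
exemplars = [
    ((300, 2200), "beat"),
    ((320, 2150), "beat"),
    ((700, 1100), "bot"),
    ((680, 1150), "bot"),
]

def categorize(token, memory=exemplars):
    """Label an incoming token with the word of its nearest exemplar."""
    _, label = min(memory, key=lambda ex: dist(ex[0], token))
    return label

print(categorize((310, 2180)))  # → beat
```

The sketch also captures why familiarity helps on this view: a token near a dense cluster of stored exemplars is categorized confidently, while a token far from everything in memory (an unfamiliar accent) has no close match.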
Flashcards
What is the most common airstream mechanism used in speech?
Pulmonic
What physical action creates the periodic waveform and fundamental frequency in voiced sounds?
Vocal fold vibration
What noise-like acoustic feature do voiceless fricatives generate as air is forced through a narrow constriction?
Turbulence
What acoustic state characterizes voiceless plosives?
Silence
What are the four primary levels of vowel height (vertical tongue position)?
High (close), close-mid, open-mid, low (open)
What are the three classifications for vowel backness (horizontal tongue position)?
Front, central, back
Which articulatory feature often correlates with vowel height and backness?
Lip rounding
What term refers to the resonant frequencies of the vocal tract that characterize vowel quality?
Formants
Which specific formants are the primary acoustic correlates used to distinguish vowels?
F1 and F2 (the first and second formants)
What is the definition of speech perception?
The process of decoding an acoustic signal into discrete linguistic units (phonemes, morphemes, words)
How is the vibration of the eardrum transferred to the cochlea?
By the middle-ear ossicles
Which cells on the basilar membrane transduce mechanical vibrations into neural signals?
Hair cells
Through which nerve do neural signals from the cochlea travel to the brainstem?
Auditory nerve
What four acoustic features are considered prosodic features?
Pitch, speech rate, duration, loudness
According to Motor Theory, how do listeners categorize sounds?
By accessing the articulatory gestures that would produce them
How do weaker forms of Motor Theory characterize the relationship between production and perception?
As nondeterministic
What does Abstractionist theory argue is the primary goal of speech perception?
Extracting an idealized lexical representation and normalizing acoustic variability
Upon what does Episodic (exemplar) theory contend that speech perception relies?
Detailed memory traces of previously heard tokens

Key Concepts
Speech Production Mechanisms
Pulmonic airstream mechanism
Voicing (phonation)
Vowel height
Vowel backness
Vowel nasality
Speech Perception Theories
Motor theory of speech perception
Episodic (exemplar) theory of speech perception
Acoustic Features of Speech
Formant
Cochlea
Prosody