RemNote Community

Phonetics - Acoustic and Auditory Foundations

Understand the acoustic properties of speech sounds, how vowels are classified using formants, and the main theories of speech perception.


Summary

Acoustic Characteristics of Speech

Airstream Mechanisms

When we produce speech sounds, we need a source of air to create acoustic energy. The most common way to generate this airflow is the pulmonic airstream mechanism: air from the lungs passes through the vocal tract. This is how the vast majority of human speech sounds are produced. However, the lungs are not the only possible source. For certain specialized speech sounds, speakers can use the glottis (the space between the vocal folds) or the tongue as alternative airstream sources. These less common mechanisms produce sounds with different acoustic characteristics, though they are used far less frequently in everyday speech across languages.

Voicing and Phonation Details

A crucial distinction among speech sounds is whether the vocal folds vibrate. This distinction creates two categories: voiced and voiceless sounds.

Voiced sounds occur when the vocal folds vibrate as air passes through them. This vibration creates a periodic waveform: a repeating, regular pattern of sound waves. The periodic vibration has two key acoustic components: a fundamental frequency (the lowest frequency of vibration, abbreviated F0) and harmonics (whole-number multiples of that fundamental frequency that occur naturally when something vibrates). The fundamental frequency reflects how fast the vocal folds vibrate, which correlates with perceived pitch.

Voiceless sounds, by contrast, have no vocal fold vibration, so they lack the periodic structure of voiced sounds. They can be further divided into two types: voiceless plosives (like [p] or [t]) produce near-silence during their closure, while voiceless fricatives (like [s] or [f]) generate turbulent, noise-like acoustic energy as air is forced through a narrow constriction in the vocal tract.
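The structure of a voiced source, a fundamental plus its harmonics, can be sketched numerically. This is a minimal illustration, not a model of real vocal folds; the 1/k amplitude fall-off is an arbitrary choice for demonstration:

```python
import numpy as np

def voiced_source(f0, n_harmonics, sr=16000, dur=0.05):
    """Sketch of a voiced source: a periodic waveform built by summing
    a fundamental frequency (F0) and its harmonics. Amplitudes fall off
    as 1/k, an illustrative choice rather than a vocal-fold model."""
    t = np.arange(int(sr * dur)) / sr
    wave = np.zeros_like(t)
    for k in range(1, n_harmonics + 1):
        wave += (1.0 / k) * np.sin(2 * np.pi * k * f0 * t)
    return wave

# A 100 Hz fundamental repeats every 10 ms, so at a 16 kHz sample rate
# the waveform is periodic with a period of sr / f0 = 160 samples.
w = voiced_source(100, 10)
```

Because every harmonic is an exact multiple of F0, the summed waveform repeats with the fundamental's period, which is exactly the "periodic waveform" property that distinguishes voiced from voiceless sounds.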
<extrainfo> The distinction between voiced and voiceless is so fundamental that it appears in nearly all human languages and is one of the easiest ways to categorize consonant sounds. </extrainfo>

Vowel Description

Vowels are described along three primary dimensions, each relating to the physical position of the tongue and lips.

Vowel Height describes how high or low the tongue sits in the mouth. The traditional categories are:
- High (or "close") vowels, where the tongue is raised high in the mouth, like the vowel in "fleece" [i]
- Close-mid vowels, with the tongue in an intermediate high position
- Open-mid vowels, with the tongue in an intermediate low position
- Low (or "open") vowels, where the tongue is lowered, like the vowel in "lot" [ɑ]

Vowel Backness describes the front-to-back position of the tongue. There are three categories:
- Front vowels, with the tongue positioned toward the front of the mouth, like [i] in "fleece"
- Central vowels, with the tongue roughly in the middle, like [ə] (schwa)
- Back vowels, with the tongue positioned toward the back of the mouth, like [u] in "goose"

Lip Rounding is the third dimension. Vowels can be either rounded (with protruded lips, like [u]) or unrounded (with neutral or spread lips, like [i]). Interestingly, lip rounding often correlates with vowel height and backness: back vowels tend to be rounded, while front vowels tend to be unrounded.

Together, these three dimensions describe a vowel's articulatory position and allow linguists to classify vowel sounds across languages.

Formants and Acoustic Vowel Quality

While the articulatory descriptions above tell us where the tongue is positioned, they don't directly explain how we hear vowel differences. This is where formants come in. Formants are resonant frequencies of the vocal tract: specific frequencies at which the vocal tract naturally amplifies sound energy.
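How formant values relate to vowel identity can be previewed with a toy nearest-neighbor classifier over the first two formants. The (F1, F2) values below are rough textbook averages for adult male speakers, assumed purely for illustration; real formants vary widely by speaker and context:

```python
# Toy vowel classifier using the first two formant frequencies (Hz).
# The reference values are approximate textbook averages, used only
# for illustration; they are not measurements from this document.
REFERENCE_FORMANTS = {
    "i (fleece)": (280, 2250),  # high front vowel: low F1, high F2
    "ɑ (lot)":    (710, 1100),  # low back vowel: high F1, lower F2
    "u (goose)":  (310, 870),   # high back vowel: low F1, low F2
}

def classify_vowel(f1, f2):
    """Return the reference vowel whose (F1, F2) pair is closest to
    the measured values (squared Euclidean distance in Hz)."""
    return min(
        REFERENCE_FORMANTS,
        key=lambda v: (REFERENCE_FORMANTS[v][0] - f1) ** 2
                    + (REFERENCE_FORMANTS[v][1] - f2) ** 2,
    )

# A token with low F1 and high F2 patterns as a high front vowel:
print(classify_vowel(300, 2100))  # → i (fleece)
```

Even this caricature shows the pattern the next paragraphs explain: F1 separates high from low vowels, while F2 separates front from back.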
Think of it like this: when the vocal folds vibrate, they produce many frequencies (the fundamental and its harmonics). The shape of the vocal tract then acts as a filter, boosting some frequencies and damping others. The frequencies that are boosted most are the formants. Different vowel shapes (created by different tongue positions) create different vocal tract configurations, which in turn create different formants. This is the crucial link between articulation and acoustics.

The first two formants (F1 and F2) are the primary acoustic correlates of vowel quality: they are the main frequencies our ears use to distinguish one vowel from another. F1 is particularly associated with vowel height (lower values for high vowels, higher values for low vowels), while F2 is particularly associated with vowel backness (higher values for front vowels, lower values for back vowels).

<extrainfo> Additional Vowel Features

Beyond the basic three-way distinction (height, backness, rounding), some languages use additional features to create vowel contrasts:
- Vowel nasality: whether air flows through the nose during vowel production (as in French "on" [ɔ̃])
- Vowel length: longer versus shorter duration versions of the "same" vowel
- Voice quality: features like creaky voice (very low-frequency vocal fold vibration) or breathy voice (incomplete closure of the vocal folds allowing air leakage)
- Advanced tongue root: whether the base of the tongue is pushed forward, altering the shape of the pharynx
- Pharyngealization: constriction of the pharynx, which affects resonance

These features are less universal than height, backness, and rounding, but they are important in specific languages and may appear on an exam if they were covered in your course materials. </extrainfo>

Speech Perception Overview

Speech perception is the process by which listeners decode an acoustic signal into meaningful linguistic units.
This isn't a simple matter of hearing sounds and automatically understanding them. Instead, our brains must interpret the continuous acoustic stream and parse it into discrete units: individual phonemes (the smallest sound units that distinguish meaning), morphemes (meaningful units like prefixes and suffixes), and words (complete units with meaning). This process happens nearly instantaneously and unconsciously, but it involves remarkable cognitive complexity.

The Auditory System

To understand how speech is perceived, we first need to understand how sound reaches the brain. When sound waves hit your eardrum (tympanum), they cause it to vibrate. These vibrations are then transferred across the middle ear through three tiny bones called ossicles: the malleus, incus, and stapes. These bones act as mechanical amplifiers that increase the efficiency of sound transmission.

The vibrations then pass to the cochlea, a fluid-filled, spiral-shaped structure in the inner ear. This is where frequency analysis happens. The cochlea contains a structure called the basilar membrane, which varies in width and stiffness along its length. Different frequencies of sound cause different parts of this membrane to vibrate maximally, creating what's called a tonotopic map: specific locations on the basilar membrane are "tuned" to specific frequencies. High frequencies cause vibrations near the entrance of the cochlea, while low frequencies cause vibrations deeper inside.

On top of the basilar membrane sit hair cells, which are sensory receptors. These hair cells detect the mechanical vibrations of the basilar membrane and convert them into neural signals, a process called transduction. The neural signals travel via the auditory nerve to the brainstem and up to the auditory cortex, where they are processed as meaningful information.
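The cochlea's place-based frequency analysis has a loose digital analogue in the Fourier transform, which separates a complex signal into its frequency components. A minimal sketch with an invented two-tone signal (the frequencies are arbitrary illustrations):

```python
import numpy as np

# Frequency decomposition sketch: the cochlea separates sound into
# frequency components by place of maximal vibration; an FFT does
# something loosely analogous in software.
sr = 8000                              # sample rate in Hz (arbitrary)
t = np.arange(sr) / sr                 # one second of samples
# Mix a 120 Hz "low" tone with a quieter 2000 Hz "high" tone.
signal = np.sin(2 * np.pi * 120 * t) + 0.5 * np.sin(2 * np.pi * 2000 * t)

spectrum = np.abs(np.fft.rfft(signal))           # magnitude spectrum
freqs = np.fft.rfftfreq(len(signal), d=1 / sr)   # bin frequencies (Hz)

# The two strongest components recover exactly the tones we mixed in.
peaks = freqs[np.argsort(spectrum)[-2:]]
print(sorted(peaks.tolist()))  # → [120.0, 2000.0]
```

The decomposition here is global over the whole signal; the ear does it continuously and in place along the basilar membrane, but the underlying idea, sound as a sum of frequency components, is the same.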
Understanding this pathway is important because it explains why speech perception begins with frequency analysis at the physical level: our ears literally decompose sound into its frequency components before our brains interpret what we're hearing.

Prosody

While the basics of speech perception rely on distinguishing individual phonemes, prosodic features convey important additional information layered on top of the basic sounds. Prosody refers to the intonation, rhythm, and stress patterns of speech. The main components are:
- Pitch: the perceived frequency of the voice, primarily determined by the fundamental frequency (how fast the vocal folds vibrate)
- Speech rate: how fast someone is talking
- Duration: how long individual sounds or syllables are held
- Loudness: the overall volume or intensity of the speech

These prosodic features serve multiple functions. They can signal stress (which syllable in a word should be emphasized), pitch accents (where melodic emphasis falls), and intonation patterns (the overall melody of a sentence, such as whether it is a statement or a question). In English, for example, the difference between "PREsent" (noun) and "preSENT" (verb) is primarily a matter of prosodic stress, even though the consonants and vowels are the same. Prosody is also language-specific: different languages use pitch and timing patterns in different ways, and learners must acquire these patterns when learning a non-native language.

Theories of Speech Perception

How exactly do our brains convert the acoustic signal into linguistic meaning? This is a major question in speech science, and researchers have proposed several competing theories. Understanding these theories is essential because they explain different aspects of how we decode speech.
Motor Theory

The motor theory of speech perception proposes something counterintuitive: to understand a speech sound, listeners don't just passively hear it; instead, they access the articulatory gestures (the movements of the mouth, tongue, vocal folds, etc.) that would be needed to produce that sound themselves. In other words, we understand speech by internally simulating how we would produce it.

This theory has an appealing logic: it explains why we're particularly good at understanding our native language (we've practiced producing those sounds ourselves) and why speech perception seems to involve motor areas of the brain, not just auditory areas. However, strong versions of this theory face a challenge: we can perceive speech sounds that we cannot produce (such as unfamiliar sounds from other languages), and even young infants who cannot yet produce speech can perceive it. This has led to weaker forms of motor theory, which propose a nondeterministic (not perfectly predictable) relationship between production and perception: articulatory information helps inform perception, but perception isn't purely dependent on motor simulation, and other information matters too.

Abstractionist Theories

Abstractionist theories take a different approach. Rather than focusing on how sounds are produced, they focus on what listeners extract from the acoustic signal. These theories propose that perception involves:
- Extracting an idealized lexical representation: an abstract, simplified version of what the word should sound like
- Normalizing acoustic variability: accounting for the fact that the same word sounds different when spoken by different speakers, in different contexts, or with different intonations

In other words, abstractionists argue that listeners don't store or remember the specific acoustic details of speech.
Instead, they extract more abstract, generalized representations that capture the essential information needed to distinguish one word from another. This approach explains why we can understand speakers with very different voices (we abstract away the speaker-specific details) and why we interpret the same acoustic signal differently depending on context.

Episodic and Exemplar Theories

Episodic theories (also called exemplar theories) propose something quite different: listeners actually do store detailed memory traces, or "exemplars," of previously heard speech tokens. When you encounter a new word, you compare it against the previous examples of that word you've heard. According to this approach, listeners use familiarity, essentially how well the current token matches the exemplars in memory, to categorize what they're hearing and resolve variability. A familiar accent or pronunciation is easier to understand because it matches stored exemplars; an unfamiliar accent is harder because there are fewer similar exemplars to compare against. This theory has interesting implications: it suggests that our perception of speech changes throughout our lives as we encounter new speakers and accents, and it explains why we gradually adjust to understanding non-native speakers or speakers with different dialects.

Comparing the Theories

These theories are not mutually exclusive. Modern speech perception research increasingly suggests that perception involves multiple processes operating in parallel:
- Some perceptual decisions may draw on motor-based information
- Some may involve abstracting idealized representations
- Some may involve comparing against episodic memories

The relative contribution of each mechanism likely varies with factors such as the clarity of the signal, whether the speaker is familiar or unfamiliar, and the linguistic experience of the listener.
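The exemplar-matching idea above can be caricatured as nearest-neighbor classification over stored tokens. The feature vectors and word labels below are invented for illustration (imagine a two-dimensional (F1, F2)-like acoustic space), not a real model of memory:

```python
# Caricature of exemplar theory: store every heard token as a point in
# an acoustic feature space, then label an incoming token by its
# closest remembered exemplar. All values are invented for illustration.
from math import dist

# (feature vector, word label) pairs standing in for remembered tokens
exemplars = [
    ((300, 2200), "beat"),
    ((320, 2150), "beat"),
    ((700, 1100), "bot"),
    ((680, 1150), "bot"),
]

def categorize(token, memory=exemplars):
    """Label an incoming token with the word of its nearest exemplar."""
    _, label = min(memory, key=lambda ex: dist(ex[0], token))
    return label

print(categorize((310, 2180)))  # → beat
```

The sketch also captures why familiarity helps on this view: a token near a dense cluster of stored exemplars is categorized confidently, while a token far from everything in memory (an unfamiliar accent) has no close match.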
Flashcards
What is the most common airstream mechanism used in speech?
Pulmonic
What physical action creates the periodic waveform and fundamental frequency in voiced sounds?
Vocal fold vibration
What noise-like acoustic feature do voiceless fricatives generate as air is forced through a narrow constriction?
Turbulence
What acoustic state characterizes voiceless plosives?
Silence
What are the four primary levels of vowel height (vertical tongue position)?
High (close), close-mid, open-mid, low (open)
What are the three classifications for vowel backness (horizontal tongue position)?
Front, central, back
Which articulatory feature often correlates with vowel height and backness?
Lip rounding
What term refers to the resonant frequencies of the vocal tract that characterize vowel quality?
Formants
Which specific formants are the primary acoustic correlates used to distinguish vowels?
F1 and F2 (the first and second formants)
What is the definition of speech perception?
The process of decoding an acoustic signal into discrete linguistic units (phonemes, morphemes, words)
How is the vibration of the eardrum transferred to the cochlea?
By the middle-ear ossicles
Which cells on the basilar membrane transduce mechanical vibrations into neural signals?
Hair cells
Through which nerve do neural signals from the cochlea travel to the brainstem?
Auditory nerve
What four acoustic features are considered prosodic features?
Pitch, speech rate, duration, loudness
According to Motor Theory, how do listeners categorize sounds?
By accessing the articulatory gestures that would produce them
How do weaker forms of Motor Theory characterize the relationship between production and perception?
As nondeterministic
What does Abstractionist theory argue is the primary goal of speech perception?
Extracting an idealized lexical representation and normalizing acoustic variability
Upon what does Episodic (exemplar) theory contend that speech perception relies?
Detailed memory traces of previously heard tokens

Key Concepts
Speech Production Mechanisms
Pulmonic airstream mechanism
Voicing (phonation)
Vowel height
Vowel backness
Vowel nasality
Speech Perception Theories
Motor theory of speech perception
Episodic (exemplar) theory of speech perception
Acoustic Features of Speech
Formant
Cochlea
Prosody