RemNote Community

Phonetics - Speech Production Mechanisms

Understand the stages of speech production, the anatomy and physics of the vocal system, and how places and manners of articulation are classified.

Summary

Speech Production Process

Introduction

Speech production is a complex, multi-stage process that transforms thoughts into acoustic signals that others can hear. Understanding this process requires knowledge of how we plan and encode speech, how the respiratory and laryngeal systems generate sound, and how the articulators (lips, tongue, etc.) shape that sound into the consonants and vowels of language.

The Four Stages of Speech Production

Speech production unfolds through four sequential stages, each building on the previous one.

Stage 1: Phonological Encoding
The first stage maps selected words onto their phonological representations, that is, sequences of phonemes. This stage answers the question: "What sounds do I need to produce?" For example, producing the word "cat" requires retrieving its phoneme sequence: /k/ + /æ/ + /t/.

Stage 2: Articulatory Specification
Once phonemes are selected, each one must be translated into concrete articulatory instructions. This stage specifies exactly which articulatory features are needed, such as whether the lips should be closed, where the tongue should be positioned, or how much constriction should occur. The phoneme /p/ requires the two lips to come together completely, while /f/ requires the lower lip to contact the upper teeth.

Stage 3: Motor Commands
The articulatory specifications are converted into precise commands sent to the speech muscles. These commands activate the muscles controlling the lungs, larynx, jaw, lips, and tongue in a carefully timed sequence.

Stage 4: Articulation
Finally, the muscles execute the commands, producing the physical movements that generate speech sounds. This is the only stage that actually produces audible acoustic energy.

The Respiratory Foundation: Pulmonary and Subglottal Systems

Pulmonic Airflow

The vast majority of speech sounds worldwide are pulmonic egressive sounds, meaning they are produced by exhaling air from the lungs. This pushes air upward through the vocal tract, where it can be shaped into various consonants and vowels. Some languages use pulmonic ingressive sounds (sounds made while breathing in) for paralinguistic purposes, for example the inhaled "yes" reported in several Scandinavian languages, but no known language uses pulmonic ingressive sounds as regular phonemes in its sound inventory. Sounds such as the "tsk" of disapproval are also ingressive, but they are clicks made with air trapped in the mouth rather than air drawn into the lungs.

Subglottal Pressure and Suprasegmental Features

The air pressure below the vocal folds (called subglottal pressure) is not constant. Fine adjustments to this pressure are made during speech to modify suprasegmental features, which are features that extend across multiple sounds, such as stress and intonation. For example, when we emphasize a particular syllable, we briefly increase subglottal pressure to make it louder and more prominent.

The Larynx: Anatomy and Phonation

Laryngeal Structure

The larynx, commonly called the voice box, is a cartilaginous structure located at the top of the trachea (windpipe). Its most important feature for speech is that it contains the vocal folds (also called vocal cords), two small muscular folds that can vibrate to produce voiced sounds. The position and tension of the vocal folds are controlled by the arytenoid cartilages, which can move to bring the folds closer together, pull them apart, or adjust their tension. These movements allow us to produce different types of phonation.

Phonation Types

Phonation refers to how the vocal folds vibrate (or don't vibrate) during sound production. The larynx can operate in several distinct modes.

Modal Voice
This is the typical phonation used in everyday speech. The vocal folds are positioned close together with moderate tension. They vibrate regularly, producing a clear periodic sound with a definable pitch. Most vowels and voiced consonants use modal voice.

Breathy Voice
In breathy voice, the vocal folds are held slightly farther apart than in modal voice, leaving a small gap between them. Because the folds don't come completely together, they vibrate less regularly and more noisily. This produces a distinctly "breathy" quality: you can hear aspiration (turbulent airflow) mixed in with the periodic vibration. In acoustic analysis, breathy voice shows a dampened (lower-amplitude) first formant compared to modal voice.

Creaky Voice
Creaky voice (also called laryngealization) occurs when the vocal folds are pressed tightly together but with very low tension. This produces irregular, low-frequency vibrations that sound like popcorn popping or creaking wood. The acoustic signal is less regular than in modal voice.

Voiceless Glottal Stop
When the vocal folds are pressed firmly shut, no air can pass through the larynx and the folds cannot vibrate at all. This produces the voiceless glottal stop [ʔ], a brief silence used as a consonant in many languages and heard in English "uh-oh".

Source-Filter Theory: The Fundamental Model

The Core Concept

The source-filter theory provides a powerful way to understand speech production acoustically. The theory states that speech can be decomposed into two independent components:

The source: Usually the larynx, which generates the raw acoustic energy. This might be the periodic vibration of the vocal folds (for voiced sounds) or turbulent airflow at a constriction (for fricatives).

The filter: The supraglottal vocal tract (everything above the larynx), which shapes and modifies the source energy.

Think of it like an audio speaker and an equalizer: the speaker generates raw sound, and the equalizer determines which frequencies are boosted or reduced.

The Vocal Tract as an Acoustic Filter

The vocal tract, the space from the larynx to the lips, can be modeled as a series of tubes with varying diameters. The shape of this tube system creates resonances: certain frequencies are naturally amplified while others are dampened. These resonances give different vowels and consonants their distinct quality.

For example, when you say the vowel /i/ (as in "fleece"), your tongue is positioned high and forward, creating a particular tube shape with characteristic resonances. When you say /u/ (as in "goose"), your tongue is high and back, creating a different tube shape and different resonances. Listeners hear these resonance differences as distinct vowels.

Inverse Filtering

One powerful application of source-filter theory is inverse filtering, a signal processing technique that removes the estimated filter effect to reveal what the source alone produces. By measuring the acoustic output and mathematically removing the estimated vocal-tract filtering, researchers can isolate the glottal source spectrum, the raw sound produced by the vocal folds.

This technique is invaluable for quantitative analysis. For instance, researchers can use inverse filtering to measure how spectral properties differ between modal voice, breathy voice, and creaky voice by examining only the source, separated from vocal-tract effects.

Places of Articulation

General Principle

Consonants are created by constricting (narrowing) the vocal tract at some location. The place of articulation is where this constriction occurs, and different places of articulation create perceptually distinct sounds. Constrictions can be made with the lips (labial), with the tongue (lingual), or at the larynx itself.

Labial Articulations

Bilabial consonants are produced with both lips. In the English examples /p/, /b/, and /m/, the lips close completely and block oral airflow.

Labiodental consonants involve the lower lip contacting the upper front teeth, creating a narrow constriction. English examples are /f/ and /v/.

Coronal Articulations

Coronal consonants are produced with the tongue tip or blade (the foremost parts of the tongue). This category includes several subcategories.

Dental consonants place the tongue against the upper front teeth. Apical dentals use the tongue tip, while interdentals (such as the English "th" sound /θ/) use the blade, with the tongue positioned between the teeth.

Alveolar consonants are produced with the tongue tip or blade against the alveolar ridge, the bumpy area just behind the upper front teeth. Many English consonants, including /t/, /d/, /n/, /s/, and /z/, are alveolar.

Post-alveolar consonants are produced just behind the alveolar ridge. Apical post-alveolar consonants (called retroflex consonants) involve the tongue tip curled upward, with contact made either by the tip itself or by the underside of the tip (sub-apical). Laminal post-alveolar consonants (called palato-alveolar) use the blade of the tongue and are common in English, as in the initial consonant of "shoe" (/ʃ/).

Dorsal Articulations

Dorsal consonants are produced with the tongue body (the main bulk of the tongue).

Palatal consonants place the tongue body against the hard palate (the bony roof of the mouth).

Velar consonants place the tongue body against the velum (the soft palate, the fleshy back of the roof of the mouth). The voiceless velar stop /k/ and voiced velar stop /g/ are among the most common consonants across languages. In fact, almost all languages have at least one velar stop.

Uvular consonants place the tongue body further back, contacting the uvula (the small fleshy extension hanging from the soft palate). Uvular consonants are less common globally, appearing in only about 19% of languages, but they occur in French, Arabic, and Inuit languages.

Pharyngeal, Glottal, and Radical Articulations

Pharyngeal consonants are produced by retracting the root of the tongue backward toward the pharyngeal wall, creating a narrow constriction in the pharynx. Because of the location and the difficulty of creating complete closure, pharyngeal consonants are generally produced only as fricatives (narrow constrictions that create turbulence) or approximants (wider constrictions that do not create turbulence).

Glottal consonants are produced using the vocal folds themselves as articulators. The most common is the voiceless glottal stop [ʔ], which is a phoneme in many languages, including Arabic, and also occurs in English (as in "uh-oh").

Manner of Articulation

Introduction to Manner

While place of articulation answers where the constriction is made, manner of articulation answers what type of constriction is made. The same place can produce different sounds depending on the manner.

Stops (Plosives)

Stops (also called plosives) completely obstruct the airstream at some place of articulation. This blockage traps air, which builds up pressure in the vocal tract. When the closure is released, the trapped air bursts out, creating the characteristic explosive sound.

When the velum (soft palate) is raised, it seals off the nasal cavity so that no air escapes through the nose. Stops produced this way are called oral stops. The English sounds /p/, /b/, /t/, /d/, /k/, and /g/ are all oral stops at different places of articulation.

Nasals

Nasals also have complete oral closure, so airflow cannot escape through the mouth. However, the velum is lowered, which opens the nasal cavity and lets all airflow pass through the nose instead. English /m/, /n/, and /ŋ/ (as in "singing") are nasals.

Affricates

Affricates combine two manners: they begin as a stop (complete blockage), but instead of releasing with a burst, they release into a fricative (narrow constriction). The key feature is that the stop and the following fricative occur at the same place of articulation. English /tʃ/ (as in "church") is an affricate: it starts with a stop closure just behind the alveolar ridge (written /t/) and releases into the fricative /ʃ/ at the same place.

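The three closure-based manners described so far (oral stops, nasals, and affricates) differ only in velum position and in how the closure is released. The following is a minimal Python sketch of that decision, not a standard phonological feature system: the `Closure` class, its field names, and the `classify_manner` function are hypothetical names introduced for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Closure:
    """Hypothetical feature bundle for a consonant made with complete oral closure."""
    velum_raised: bool       # True seals off the nasal cavity
    fricative_release: bool  # True if the closure releases into a fricative


def classify_manner(c: Closure) -> str:
    """Classify a complete-closure consonant using the criteria in the text."""
    if not c.velum_raised:
        # Lowered velum: air escapes through the nose.
        return "nasal"
    if c.fricative_release:
        # Stop closure released into a fricative at the same place.
        return "affricate"
    # Raised velum with a plain burst release.
    return "oral stop"


# Illustrative English examples taken from the text above.
EXAMPLES = {
    "p": Closure(velum_raised=True, fricative_release=False),
    "m": Closure(velum_raised=False, fricative_release=False),
    "tʃ": Closure(velum_raised=True, fricative_release=True),
}

for symbol, closure in EXAMPLES.items():
    print(symbol, "->", classify_manner(closure))
```

The sketch deliberately ignores place of articulation and voicing, which vary independently of manner.
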
Fricatives

Fricatives create turbulent, noisy airflow by partially narrowing (but not completely blocking) the vocal tract. The narrowing causes the air to flow rapidly, creating turbulence that produces a hissing or whooshing sound. English fricatives include /f/, /v/, /θ/, /ð/, /s/, /z/, /ʃ/, /ʒ/, and /h/.

Sibilants

Sibilants are a special subtype of fricatives that direct the turbulent airflow specifically toward the teeth. This focus creates a high-pitched, clear hissing sound. English sibilants include /s/, /z/, /ʃ/, and /ʒ/. Non-sibilant fricatives like /f/ and /v/ produce turbulence but don't direct it toward the teeth, so they lack the sharp hissing quality.

Articulatory Models: The Gestural Approach

Beyond Linear Models

Traditional descriptions of consonants and vowels treat them as static positions: a consonant is defined by its place and manner, and we might imagine the articulators moving to that position, holding it, then moving away. However, this linear view misses important dynamic aspects of speech.

Gestural Units

The gestural approach to articulation proposes that the fundamental unit of speech is not the phoneme but the gesture: a coordinated pattern of muscle activity directed toward a specific speech goal. A gesture might be "close the lips," "raise the velum," or "advance the tongue."

Importantly, individual gestures are often executed simultaneously, not sequentially. For example, producing /p/ requires two gestures happening at the same time: closing the lips (oral closure) and raising the velum (to prevent nasal airflow).

Coarticulation as Gesture Overlap

This framework explains coarticulation, the phenomenon where the pronunciation of a sound is influenced by its neighboring sounds. In the linear view, coarticulation seems like a mysterious intrusion of one sound into another. In the gestural view, it is simply the natural consequence of how gestures overlap in time. As we speak faster, each gesture begins earlier relative to the end of the previous gesture, creating more overlap. This overlap produces the acoustic changes we perceive as coarticulation.

For example, when you say the words "key" versus "caw," the /k/ sounds different because the tongue positioning for the following vowel overlaps with the /k/ closure in time. In "key," the tongue is already moving toward the front (for /i/) while still executing the velar closure. In "caw," the tongue is already moving back (for /ɔ/) during the /k/. The place and manner of articulation are the same, but the acoustic results differ because of gesture overlap.

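The relationship between speech rate and gesture overlap can be illustrated with a toy timing model. This is a sketch under stated assumptions, not a published gestural model: each gesture is treated as a simple time interval, the durations and the `base_lag` value are made-up numbers, and the names `Gesture`, `overlap_ms`, and `schedule` are hypothetical. The only idea taken from the text is that faster speech shrinks the lag between gesture onsets, which increases overlap.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Gesture:
    name: str
    onset: float   # milliseconds
    offset: float  # milliseconds


def overlap_ms(a: Gesture, b: Gesture) -> float:
    """Length of time during which both gestures are active."""
    return max(0.0, min(a.offset, b.offset) - max(a.onset, b.onset))


def schedule(rate: float, duration: float = 100.0, base_lag: float = 80.0):
    """Place the /k/ closure and the following vowel gesture for "key".

    Assumption: gesture durations stay fixed, while the lag between the two
    onsets shrinks in proportion to speech rate (1.0 = baseline rate).
    """
    lag = base_lag / rate
    closure = Gesture("velar closure for /k/", 0.0, duration)
    vowel = Gesture("tongue fronting for /i/", lag, lag + duration)
    return closure, vowel


for rate in (1.0, 1.5, 2.0):
    closure, vowel = schedule(rate)
    print(f"rate {rate}: overlap {overlap_ms(closure, vowel):.1f} ms")
```

In this toy setup the overlap grows as the rate increases, which mirrors the claim above that faster speech produces more coarticulation.
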
Flashcards
What occurs during the retrieval and assignment of phonological word forms?
Selected words are encoded as sequences of phonemes.
What is the function of articulatory specification in speech production?
It assigns specific articulatory features (like lip closure) to each phoneme.
What is the role of muscle commands in the speech production process?
They instruct speech muscles to execute articulatory gestures.
How is articulation defined in the context of language production?
The physical execution that generates speech sounds.
What is the specific mapping performed during phonological encoding?
Mapping each lexical item to its corresponding phoneme sequence.
How are pulmonic egressive sounds produced?
By exhaling air from the lungs.
Where are the vocal folds housed?
In the larynx (the voice box).
What is the function of the arytenoid cartilages?
To adjust the position and tension of the vocal folds.
How are the vocal folds positioned during modal voice phonation?
Close together with moderate tension.
What are the acoustic characteristics of a breathy voice?
A noisy waveform and reduced first formant amplitude.
What vocal fold conditions produce a creaky voice?
Folds are tightly together with low tension.
What happens to the vocal folds during a voiceless glottal stop?
They are tightly closed, preventing vibration.
What are the two main components of the source–filter model?
A sound source (usually the larynx) and an acoustic filter (the supraglottal vocal tract).
How is the vocal tract modeled as an acoustic filter?
As a series of tubes closed at one end with varying diameters.
What is the purpose of inverse filtering in speech analysis?
To reveal the source spectrum by removing the vocal-tract filter.
How are bilabial consonants produced?
With both lips contacting each other.
Which articulators contact each other for labiodental consonants?
The lower lip and the upper teeth.
What is the difference between apical and interdental dental consonants?
Apical uses the tongue tip; interdental uses the tongue blade.
Where is the constriction located for alveolar consonants?
At the alveolar ridge.
What is the distinct articulatory feature of retroflex consonants?
The tongue tip is curled upward.
What part of the tongue is used to produce palatal consonants?
The tongue body.
Which place of articulation involves contact with the velum?
Velar.
How are pharyngeal consonants produced?
By retracting the root of the tongue toward the pharyngeal wall.
What physical process produces the 'burst' in a stop consonant?
The release of built-up pressure after a complete airstream obstruction.
What position of the velum is required to produce oral stops?
Raised (to block the nasal cavity).
How is airflow directed during the production of nasals?
Through the nose (due to a lowered velum and oral closure).
What two components make up an affricate?
A stop followed by a fricative at the same place of articulation.
How do fricatives generate sound?
By creating turbulent airflow through a partially narrowed vocal tract.
What defines sibilants as a specific subtype of fricatives?
They direct turbulent airflow toward the teeth.
What is a 'gesture' in gestural articulatory models?
A coordinated pattern of muscle activity directed toward a speech goal.
How do gestural models explain the phenomenon of coarticulation?
As the overlap of gestures in time, which increases at faster speech rates.
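
One of the flashcards above describes the vocal tract as a series of tubes closed at one end. The simplest case, a single uniform tube closed at the glottis and open at the lips, has resonances at odd multiples of a quarter wavelength: F_n = (2n - 1) * c / (4 * L). The Python sketch below evaluates that formula. The tube length of 0.175 m and the speed of sound of 350 m/s are assumed textbook-style values, not figures from this summary, and `tube_resonances` is a hypothetical helper name.

```python
def tube_resonances(length_m: float, n_resonances: int = 3,
                    speed_of_sound: float = 350.0) -> list[float]:
    """Resonant frequencies (Hz) of a uniform tube closed at one end.

    Uses the quarter-wavelength formula F_n = (2n - 1) * c / (4 * L).
    A real vocal tract has varying diameters, so this is only a rough
    approximation of a neutral, schwa-like vowel.
    """
    return [
        (2 * n - 1) * speed_of_sound / (4.0 * length_m)
        for n in range(1, n_resonances + 1)
    ]


# Assumed vocal-tract length of 17.5 cm.
for i, freq in enumerate(tube_resonances(0.175), start=1):
    print(f"resonance {i}: about {freq:.0f} Hz")
```

With these assumed values the first three resonances come out near 500, 1500, and 2500 Hz. Changing the tube shape, as the tongue does for /i/ or /u/, shifts these resonances away from the uniform-tube values.
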

Key Concepts
Speech Production Mechanisms
Speech production process
Pulmonic egressive sounds
Larynx
Phonation types
Source‑filter theory
Inverse filtering
Articulation Features
Place of articulation
Manner of articulation
Gestural model
Subglottal pressure