Machine Translation Study Guide
📖 Core Concepts
Machine Translation (MT) – Use of computers to convert text or speech from a source language to a target language, aiming to preserve meaning, idiom, and pragmatics.
Rule‑Based MT – Relies on hand‑crafted dictionaries, grammar rules, and intermediate representations (transfer‑based, interlingual).
Statistical MT (SMT) – Learns translation probabilities from large bilingual corpora; performance hinges on the amount and quality of parallel data.
Neural MT (NMT) – End‑to‑end deep‑learning models that encode a source sentence and decode it into the target language.
Prompt‑Based LLM Translation – Large language models (e.g., GPT) are asked to translate via natural‑language prompts instead of being fine‑tuned on parallel data.
Post‑editing – Human review and correction of MT output; the most reliable way to reach professional quality, especially for sensitive domains.
📌 Must Remember
Human parity claims for NMT are limited to narrow domains and language pairs – do not assume universal parity.
Optimal parallel‑sentence count ≈ 100 k pairs; too few → under‑fit, too many → diminishing returns/possible degradation.
BLEU, NIST, METEOR, LEPOR are the standard automated metrics; BLEU compares n‑gram overlap, NIST weights rarer n‑grams higher.
Low‑resource languages suffer from scarce parallel corpora → lower MT accuracy.
Domain‑specific customization (legal, medical, technical) dramatically improves quality over generic models.
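The BLEU idea above (n-gram overlap weighted by a brevity penalty) can be sketched in a few lines. This is a simplified sentence-level version for intuition only; the standard metric is corpus-level, uses up to 4-grams, and applies smoothing:

```python
from collections import Counter
import math

def ngrams(tokens, n):
    """Return a Counter of all n-grams in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def bleu(candidate, reference, max_n=4):
    """Toy sentence-level BLEU: geometric mean of clipped n-gram
    precisions, times a brevity penalty for short candidates."""
    precisions = []
    for n in range(1, max_n + 1):
        cand, ref = ngrams(candidate, n), ngrams(reference, n)
        total = sum(cand.values())
        if total == 0:
            return 0.0
        overlap = sum((cand & ref).values())   # clipped counts
        precisions.append(overlap / total)
    if min(precisions) == 0:                   # no smoothing in this toy version
        return 0.0
    geo_mean = math.exp(sum(math.log(p) for p in precisions) / max_n)
    bp = min(1.0, math.exp(1 - len(reference) / len(candidate)))  # brevity penalty
    return bp * geo_mean

cand = "the cat sat on the mat".split()
ref = "the cat is on the mat".split()
score = bleu(cand, ref, max_n=2)  # BLEU-4 is standard, but short toy sentences rarely share 4-grams
```

Note how surface overlap drives the score: a semantically perfect paraphrase with different wording would score poorly, which is exactly why BLEU alone is insufficient.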
🔄 Key Processes
Rule‑Based Translation Pipeline
Lexicon lookup → Morphological analysis → Syntactic parsing → Transfer to intermediate representation → Generation in target language.
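The pipeline above can be miniaturized to show why rule-based MT works well in closed domains and breaks down elsewhere. This sketch uses a hypothetical three-word lexicon and a single transfer rule (French places most adjectives after the noun); a real system would have thousands of rules and a full morphological analyzer:

```python
# Hypothetical mini-lexicon and adjective list, for illustration only.
LEXICON = {"the": "le", "red": "rouge", "book": "livre"}
ADJECTIVES = {"red"}

def translate(sentence):
    tokens = sentence.lower().split()          # trivial "morphological analysis"
    # Transfer rule: swap adjective + noun order for French.
    out, i = [], 0
    while i < len(tokens):
        if tokens[i] in ADJECTIVES and i + 1 < len(tokens):
            out.extend([tokens[i + 1], tokens[i]])
            i += 2
        else:
            out.append(tokens[i])
            i += 1
    # Generation: lexicon lookup, passing unknown words through unchanged.
    return " ".join(LEXICON.get(t, t) for t in out)
```

Every word outside the lexicon passes through untranslated, which is the coverage problem that motivated data-driven approaches.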
Statistical MT Training
Collect parallel corpus → Align sentences → Estimate phrase‑translation probabilities → Build language model for target → Decode using beam search.
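The "estimate translation probabilities" step reduces to relative-frequency counting over aligned units. The sketch below assumes word alignments are already given (in practice they come from an alignment model such as IBM Model 1, not hand-labeled pairs):

```python
from collections import Counter, defaultdict

# Toy word-aligned pairs standing in for an aligned parallel corpus.
aligned_pairs = [
    ("house", "maison"), ("house", "maison"), ("house", "domicile"),
    ("cat", "chat"),
]

counts = defaultdict(Counter)
for src, tgt in aligned_pairs:
    counts[src][tgt] += 1

def p_translate(tgt, src):
    """Relative-frequency estimate of P(target word | source word)."""
    total = sum(counts[src].values())
    return counts[src][tgt] / total if total else 0.0
```

This is why SMT quality hinges on parallel-data volume: probabilities for rare words rest on a handful of counts, and unseen words get probability zero.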
Neural MT (Seq2Seq) Training
Encode source tokens → Produce context vectors (attention) → Decode token‑by‑token → Optimize cross‑entropy loss over parallel pairs.
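The attention step in that pipeline can be shown with plain numbers: score each encoder state against the current decoder state, softmax the scores, and take the weighted sum as the context vector. Toy 2-dimensional vectors replace real hidden states here:

```python
import math

def softmax(xs):
    """Numerically stable softmax over a list of scores."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]

def attention_context(decoder_state, encoder_states):
    """One dot-product attention step: returns (context vector, weights)."""
    scores = [sum(d * e for d, e in zip(decoder_state, h)) for h in encoder_states]
    weights = softmax(scores)
    dim = len(encoder_states[0])
    ctx = [sum(w * h[i] for w, h in zip(weights, encoder_states)) for i in range(dim)]
    return ctx, weights

# Decoder state aligned with the first encoder state gets most of the weight.
ctx, weights = attention_context([1.0, 0.0], [[1.0, 0.0], [0.0, 1.0]])
```

This is what lets the decoder "look back" at relevant source positions at every step, the key advance over fixed-vector encoder-decoder models.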
Prompting an LLM for Translation
Craft prompt: “Translate the following English sentence into French: ‘…’ ” → Send to model → Retrieve generated target text → (Optional) post‑edit.
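The prompt-crafting step is just string assembly; the template below is one reasonable wording among many, not a fixed standard, and the actual model call (hosted API or local model) is deliberately omitted since client interfaces vary:

```python
def build_translation_prompt(text, source_lang="English", target_lang="French"):
    """Assemble a translation prompt for an LLM. The instruction to
    return only the translation reduces the chance of chatty preambles
    that would need stripping before post-editing."""
    return (
        f"Translate the following {source_lang} sentence into {target_lang}. "
        f"Return only the translation.\n\n{text}"
    )

prompt = build_translation_prompt("The cat sleeps.")
```

In practice the prompt wording itself affects output quality, which is a notable contrast with NMT systems, where the input format is fixed.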
🔍 Key Comparisons
Rule‑Based vs. Statistical – Hand‑crafted rules vs. data‑driven probability tables.
Statistical vs. Neural – Phrase‑based probabilities vs. end‑to‑end learned representations; NMT usually higher fluency.
Neural MT vs. Prompted LLM – A dedicated NMT model is more compute‑efficient at inference; an LLM offers flexibility (zero‑shot pairs, style instructions) at higher resource cost.
Domain‑General vs. Domain‑Specific MT – General models are versatile but less accurate on specialized terminology; domain‑specific models excel on jargon.
⚠️ Common Misunderstandings
“MT = human‑level quality.” – Only true for narrow, well‑studied pairs; most outputs still need human post‑editing.
“More data always helps.” – Past 100 k sentence pairs, extra data can introduce noise and hurt performance.
“BLEU = true quality.” – BLEU measures surface similarity; it misses semantic adequacy, cultural nuance, and named‑entity handling.
🧠 Mental Models / Intuition
“Translation as a pipeline vs. a single brain.” – Rule‑based = assembly line; Neural = a single brain that sees the whole sentence at once.
“Data = diet.” – A well‑balanced, domain‑matched corpus feeds a healthy MT model; a junk‑food corpus leads to poor health (errors).
🚩 Exceptions & Edge Cases
Morphologically rich target languages – SMT struggles; NMT mitigates but still needs subword segmentation.
Non‑standard speech / slang – All MT systems perform poorly; rule‑based systems lack coverage, while statistical and neural systems lack training examples.
Transliteration vs. Translation – Names may need phonetic rendering (transliteration) rather than semantic translation; wrong choice reduces readability.
📍 When to Use Which
Rule‑Based – Small, closed‑domain projects with strict terminology and limited language pairs.
Statistical – When a large, clean parallel corpus exists but computational resources are modest.
Neural – Default choice for most modern applications; especially when fluency matters.
Prompted LLM – Rapid prototyping, low‑resource languages (leveraging massive pre‑training), or when no parallel data are available.
Hybrid (Rule + Stat + Neural) – High‑stakes domains (legal, medical) where each method’s strengths can compensate for the others.
👀 Patterns to Recognize
Repetition of “domain‑specific” phrasing → Expect a recommendation for custom training data.
Mentions of “parallel corpus” size → Check if it falls near the 100 k optimal range.
References to “named entity” → Anticipate a need for NER or class‑based token replacement before translation.
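The class-based token replacement mentioned above can be sketched with a crude regex standing in for a real NER model: proper names are masked with placeholders before translation (so the MT system cannot "translate" them) and restored afterward:

```python
import re

def mask_entities(sentence):
    """Replace runs of capitalized tokens (a crude NER stand-in;
    real systems use a trained NER model) with class placeholders."""
    entities = []
    def repl(match):
        entities.append(match.group(0))
        return f"__NE{len(entities) - 1}__"
    masked = re.sub(r"\b[A-Z][a-z]+(?: [A-Z][a-z]+)*\b", repl, sentence)
    return masked, entities

def unmask(sentence, entities):
    """Restore the original entities after translation."""
    for i, ent in enumerate(entities):
        sentence = sentence.replace(f"__NE{i}__", ent)
    return sentence

masked, ents = mask_entities("ask Alice Smith about Paris")
```

The same placeholder mechanism handles transliteration: the restored token can be a phonetic rendering of the name rather than the original string.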
🗂️ Exam Traps
Choosing BLEU as the sole metric – Exam may ask why BLEU is insufficient for literary or legal texts.
Assuming “large language model = best MT” – They are powerful but still lag behind specialized NMT in many benchmarks.
Confusing “translation” with “transliteration.” – Answers that treat all proper nouns as translated are wrong; the correct choice mentions phonetic conversion when appropriate.
Over‑stating human‑parity – Any statement that NMT has universally reached human parity is a distractor.
Ignoring low‑resource constraints – Selecting a method that requires massive parallel data for a low‑resource language will be penalized.