Machine Translation Study Guide
📖 Core Concepts
Machine Translation (MT) – Use of computers to convert text or speech from a source language to a target language, aiming to preserve meaning, idiom, and pragmatics.
Rule‑Based MT – Relies on hand‑crafted dictionaries, grammar rules, and intermediate representations (transfer‑based, interlingual).
Statistical MT (SMT) – Learns translation probabilities from large bilingual corpora; performance hinges on the amount and quality of parallel data.
Neural MT (NMT) – End‑to‑end deep‑learning models that encode a source sentence and decode it into the target language.
Prompt‑Based LLM Translation – Large language models (e.g., GPT) are asked to translate via natural‑language prompts instead of being fine‑tuned on parallel data.
Post‑editing – Human review and correction of MT output; the most reliable way to reach professional quality, especially for sensitive domains.
📌 Must Remember
Human parity claims for NMT are limited to narrow domains and language pairs – do not assume universal parity.
Optimal parallel‑sentence count ≈ 100 k pairs; too few → under‑fit, too many → diminishing returns/possible degradation.
BLEU, NIST, METEOR, LEPOR are the standard automated metrics; BLEU compares n‑gram overlap, NIST weights rarer n‑grams higher.
Low‑resource languages suffer from scarce parallel corpora → lower MT accuracy.
Domain‑specific customization (legal, medical, technical) dramatically improves quality over generic models.
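The BLEU idea above (n-gram overlap weighted by a brevity penalty) can be sketched in a few lines. This is a simplified sentence-level version for intuition only; the standard metric is corpus-level, uses up to 4-grams, and applies smoothing:

```python
from collections import Counter
import math

def ngrams(tokens, n):
    """Return a Counter of all n-grams in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def bleu(candidate, reference, max_n=4):
    """Toy sentence-level BLEU: geometric mean of clipped n-gram
    precisions, times a brevity penalty for short candidates."""
    precisions = []
    for n in range(1, max_n + 1):
        cand, ref = ngrams(candidate, n), ngrams(reference, n)
        total = sum(cand.values())
        if total == 0:
            return 0.0
        overlap = sum((cand & ref).values())   # clipped counts
        precisions.append(overlap / total)
    if min(precisions) == 0:                   # no smoothing in this toy version
        return 0.0
    geo_mean = math.exp(sum(math.log(p) for p in precisions) / max_n)
    bp = min(1.0, math.exp(1 - len(reference) / len(candidate)))  # brevity penalty
    return bp * geo_mean

cand = "the cat sat on the mat".split()
ref = "the cat is on the mat".split()
score = bleu(cand, ref, max_n=2)  # BLEU-4 is standard, but short toy sentences rarely share 4-grams
```

Note how surface overlap drives the score: a semantically perfect paraphrase with different wording would score poorly, which is exactly why BLEU alone is insufficient.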
🔄 Key Processes
Rule‑Based Translation Pipeline
Lexicon lookup → Morphological analysis → Syntactic parsing → Transfer to intermediate representation → Generation in target language.
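The pipeline above can be miniaturized to show why rule-based MT works well in closed domains and breaks down elsewhere. This sketch uses a hypothetical three-word lexicon and a single transfer rule (French places most adjectives after the noun); a real system would have thousands of rules and a full morphological analyzer:

```python
# Hypothetical mini-lexicon and adjective list, for illustration only.
LEXICON = {"the": "le", "red": "rouge", "book": "livre"}
ADJECTIVES = {"red"}

def translate(sentence):
    tokens = sentence.lower().split()          # trivial "morphological analysis"
    # Transfer rule: swap adjective + noun order for French.
    out, i = [], 0
    while i < len(tokens):
        if tokens[i] in ADJECTIVES and i + 1 < len(tokens):
            out.extend([tokens[i + 1], tokens[i]])
            i += 2
        else:
            out.append(tokens[i])
            i += 1
    # Generation: lexicon lookup, passing unknown words through unchanged.
    return " ".join(LEXICON.get(t, t) for t in out)
```

Every word outside the lexicon passes through untranslated, which is the coverage problem that motivated data-driven approaches.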
Statistical MT Training
Collect parallel corpus → Align sentences → Estimate phrase‑translation probabilities → Build language model for target → Decode using beam search.
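The "estimate translation probabilities" step reduces to relative-frequency counting over aligned units. The sketch below assumes word alignments are already given (in practice they come from an alignment model such as IBM Model 1, not hand-labeled pairs):

```python
from collections import Counter, defaultdict

# Toy word-aligned pairs standing in for an aligned parallel corpus.
aligned_pairs = [
    ("house", "maison"), ("house", "maison"), ("house", "domicile"),
    ("cat", "chat"),
]

counts = defaultdict(Counter)
for src, tgt in aligned_pairs:
    counts[src][tgt] += 1

def p_translate(tgt, src):
    """Relative-frequency estimate of P(target word | source word)."""
    total = sum(counts[src].values())
    return counts[src][tgt] / total if total else 0.0
```

This is why SMT quality hinges on parallel-data volume: probabilities for rare words rest on a handful of counts, and unseen words get probability zero.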
Neural MT (Seq2Seq) Training
Encode source tokens → Produce context vectors (attention) → Decode token‑by‑token → Optimize cross‑entropy loss over parallel pairs.
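The attention step in that pipeline can be shown with plain numbers: score each encoder state against the current decoder state, softmax the scores, and take the weighted sum as the context vector. Toy 2-dimensional vectors replace real hidden states here:

```python
import math

def softmax(xs):
    """Numerically stable softmax over a list of scores."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]

def attention_context(decoder_state, encoder_states):
    """One dot-product attention step: returns (context vector, weights)."""
    scores = [sum(d * e for d, e in zip(decoder_state, h)) for h in encoder_states]
    weights = softmax(scores)
    dim = len(encoder_states[0])
    ctx = [sum(w * h[i] for w, h in zip(weights, encoder_states)) for i in range(dim)]
    return ctx, weights

# Decoder state aligned with the first encoder state gets most of the weight.
ctx, weights = attention_context([1.0, 0.0], [[1.0, 0.0], [0.0, 1.0]])
```

This is what lets the decoder "look back" at relevant source positions at every step, the key advance over fixed-vector encoder-decoder models.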
Prompting an LLM for Translation
Craft prompt: “Translate the following English sentence into French: ‘…’ ” → Send to model → Retrieve generated target text → (Optional) post‑edit.
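The prompt-crafting step is just string assembly; the template below is one reasonable wording among many, not a fixed standard, and the actual model call (hosted API or local model) is deliberately omitted since client interfaces vary:

```python
def build_translation_prompt(text, source_lang="English", target_lang="French"):
    """Assemble a translation prompt for an LLM. The instruction to
    return only the translation reduces the chance of chatty preambles
    that would need stripping before post-editing."""
    return (
        f"Translate the following {source_lang} sentence into {target_lang}. "
        f"Return only the translation.\n\n{text}"
    )

prompt = build_translation_prompt("The cat sleeps.")
```

In practice the prompt wording itself affects output quality, which is a notable contrast with NMT systems, where the input format is fixed.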
🔍 Key Comparisons
Rule‑Based vs. Statistical – Hand‑crafted rules vs. data‑driven probability tables.
Statistical vs. Neural – Phrase‑based probabilities vs. end‑to‑end learned representations; NMT usually higher fluency.
Neural MT vs. Prompted LLM – A dedicated NMT model is more compute‑efficient at inference; an LLM offers flexibility (zero‑shot pairs, style instructions) at higher resource cost.
Domain‑General vs. Domain‑Specific MT – General models are versatile but less accurate on specialized terminology; domain‑specific models excel on jargon.
⚠️ Common Misunderstandings
“MT = human‑level quality.” – Only true for narrow, well‑studied pairs; most outputs still need human post‑editing.
“More data always helps.” – Past 100 k sentence pairs, extra data can introduce noise and hurt performance.
“BLEU = true quality.” – BLEU measures surface similarity; it misses semantic adequacy, cultural nuance, and named‑entity handling.
🧠 Mental Models / Intuition
“Translation as a pipeline vs. a single brain.” – Rule‑based = assembly line; Neural = a single brain that sees the whole sentence at once.
“Data = diet.” – A well‑balanced, domain‑matched corpus feeds a healthy MT model; a junk‑food corpus leads to poor health (errors).
🚩 Exceptions & Edge Cases
Morphologically rich target languages – SMT struggles; NMT mitigates but still needs subword segmentation.
Non‑standard speech / slang – All MT systems perform poorly; rule‑based systems lack coverage, while statistical and neural systems lack training examples.
Transliteration vs. Translation – Names may need phonetic rendering (transliteration) rather than semantic translation; wrong choice reduces readability.
📍 When to Use Which
Rule‑Based – Small, closed‑domain projects with strict terminology and limited language pairs.
Statistical – When a large, clean parallel corpus exists but computational resources are modest.
Neural – Default choice for most modern applications; especially when fluency matters.
Prompted LLM – Rapid prototyping, low‑resource languages (leveraging massive pre‑training), or when no parallel data are available.
Hybrid (Rule + Stat + Neural) – High‑stakes domains (legal, medical) where each method’s strengths can compensate for the others.
👀 Patterns to Recognize
Repetition of “domain‑specific” phrasing → Expect a recommendation for custom training data.
Mentions of “parallel corpus” size → Check if it falls near the 100 k optimal range.
References to “named entity” → Anticipate a need for NER or class‑based token replacement before translation.
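The class-based token replacement mentioned above can be sketched with a crude regex standing in for a real NER model: proper names are masked with placeholders before translation (so the MT system cannot "translate" them) and restored afterward:

```python
import re

def mask_entities(sentence):
    """Replace runs of capitalized tokens (a crude NER stand-in;
    real systems use a trained NER model) with class placeholders."""
    entities = []
    def repl(match):
        entities.append(match.group(0))
        return f"__NE{len(entities) - 1}__"
    masked = re.sub(r"\b[A-Z][a-z]+(?: [A-Z][a-z]+)*\b", repl, sentence)
    return masked, entities

def unmask(sentence, entities):
    """Restore the original entities after translation."""
    for i, ent in enumerate(entities):
        sentence = sentence.replace(f"__NE{i}__", ent)
    return sentence

masked, ents = mask_entities("ask Alice Smith about Paris")
```

The same placeholder mechanism handles transliteration: the restored token can be a phonetic rendering of the name rather than the original string.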
🗂️ Exam Traps
Choosing BLEU as the sole metric – Exam may ask why BLEU is insufficient for literary or legal texts.
Assuming “large language model = best MT” – They are powerful but still lag behind specialized NMT in many benchmarks.
Confusing “translation” with “transliteration.” – Answers that treat all proper nouns as translated are wrong; the correct choice mentions phonetic conversion when appropriate.
Over‑stating human‑parity – Any statement that NMT has universally reached human parity is a distractor.
Ignoring low‑resource constraints – Selecting a method that requires massive parallel data for a low‑resource language will be penalized.