Natural Language Processing Study Guide
📖 Core Concepts
Natural Language Processing (NLP) – Computer‑based processing of human language; a subfield of Artificial Intelligence that overlaps with information retrieval, computational linguistics, and knowledge representation.
Major NLP Tasks
Speech recognition → convert spoken audio to text.
Text classification → assign text to predefined categories.
Natural language understanding → interpret meaning.
Natural language generation → produce human‑like text from data.
Historical Eras
Symbolic (1950s‑early‑1990s) – Hand‑crafted rules, e.g., Chinese Room, Lesk algorithm.
Statistical (1990s‑present) – Probabilistic models, IBM alignment models, HMM POS taggers.
Neural‑Network (2010s‑present) – Word2vec embeddings, seq2seq translation, transformer models.
Three Main Approaches
Symbolic / Rule‑Based – Explicit linguistic rules, useful when data are scarce.
Statistical / Machine‑Learning – Learn probabilities from corpora; more robust to noise.
Neural / Deep Learning – End‑to‑end learned representations; capture long‑range dependencies.
---
📌 Must Remember
Definition: NLP = processing natural language information by a computer.
Key Milestones
1950 – Turing test (automated language interpretation/generation).
1980s – Quantitative evaluation pushes statistical shift.
1990s – IBM alignment models → statistical MT.
2010 – Mikolov’s RNN language model; 2013 – Word2vec embeddings.
2017 – Transformer architecture introduced; it soon dominates.
Approach Trade‑offs
Symbolic = high expert effort, low data needs.
Statistical = better scalability, needs annotated corpora.
Neural = highest accuracy, requires massive data & compute.
Word Embeddings: Continuous vectors where cosine similarity ≈ semantic similarity.
Sequence‑to‑Sequence (seq2seq): Encoder–decoder networks that map an input sequence (e.g., source sentence) directly to an output sequence (e.g., translation).
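The cosine-similarity idea behind embeddings can be sketched with toy vectors (the 3-dimensional values below are made up for illustration; real embeddings have hundreds of dimensions and are learned from data):

```python
import math

def cosine(u, v):
    # Cosine similarity: dot(u, v) / (|u| * |v|), in [-1, 1].
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

# Toy "embeddings" (invented values, not trained)
vec = {
    "king":  [0.90, 0.80, 0.10],
    "queen": [0.85, 0.82, 0.15],
    "apple": [0.10, 0.20, 0.95],
}

# Semantically related words point in similar directions.
assert cosine(vec["king"], vec["queen"]) > cosine(vec["king"], vec["apple"])
```

Trained embeddings behave the same way: related words score close to 1, unrelated words close to 0 or below.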
---
🔄 Key Processes
Speech Recognition Pipeline
Audio capture → feature extraction (e.g., MFCC) → acoustic model (often a neural net) → language model → decoded text.
Text Classification
Raw text → tokenization → vectorization (e.g., TF‑IDF or embeddings) → classifier (logistic regression, SVM, or neural net) → label output.
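A minimal sketch of that pipeline, substituting a bag-of-words count vectorizer and a multinomial Naive Bayes classifier for the heavier options named above (the four training sentences are invented):

```python
from collections import Counter
import math

# Toy labeled corpus (invented examples)
train = [
    ("great movie loved it", "pos"),
    ("wonderful acting great plot", "pos"),
    ("terrible movie hated it", "neg"),
    ("awful plot terrible acting", "neg"),
]

def tokenize(text):
    return text.lower().split()

# Vectorization step collapsed into per-class word counts
counts = {"pos": Counter(), "neg": Counter()}
for text, label in train:
    counts[label].update(tokenize(text))

vocab = set(w for c in counts.values() for w in c)

def predict(text):
    # Multinomial Naive Bayes with add-one smoothing, uniform priors
    scores = {}
    for label, c in counts.items():
        total = sum(c.values())
        scores[label] = sum(
            math.log((c[w] + 1) / (total + len(vocab)))
            for w in tokenize(text)
        )
    return max(scores, key=scores.get)

assert predict("loved the great acting") == "pos"
```

The same tokenize → vectorize → classify shape holds when the pieces are swapped for TF-IDF features and a logistic regression or neural classifier.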
Training Word2vec (skip‑gram)
Input word → project to embedding → predict surrounding context words → update embeddings via gradient descent.
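The first step of that loop, generating (center word, context word) training pairs, can be sketched as follows (the embedding projection and gradient update are omitted here):

```python
def skipgram_pairs(tokens, window=2):
    # For each center word, emit one (center, context) pair per word
    # within `window` positions; these are skip-gram's training targets.
    pairs = []
    for i, center in enumerate(tokens):
        lo = max(0, i - window)
        hi = min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:
                pairs.append((center, tokens[j]))
    return pairs

sent = "the cat sat on the mat".split()
pairs = skipgram_pairs(sent, window=1)
assert ("cat", "the") in pairs and ("cat", "sat") in pairs
```

Training then repeatedly nudges each center word's embedding so that it predicts its paired context words with higher probability.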
Seq2seq Machine Translation
Encoder RNN/Transformer reads source sentence → creates context vector → Decoder RNN/Transformer generates target sentence token‑by‑token, using attention to focus on relevant source parts.
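The attention step can be sketched as plain scaled dot-product attention over toy 2-dimensional encoder states (the vectors are made up; real models use learned, high-dimensional representations):

```python
import math

def attention(query, keys, values):
    # Score each encoder state (key) against the decoder query,
    # softmax the scores into weights, then return the
    # weighted average of the encoder values.
    d = len(query)
    scores = [sum(q * k for q, k in zip(query, key)) / math.sqrt(d)
              for key in keys]
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]   # numerically stable softmax
    z = sum(exps)
    weights = [e / z for e in exps]
    context = [sum(w * v[i] for w, v in zip(weights, values))
               for i in range(len(values[0]))]
    return context, weights

# A query aligned with the first encoder state gets the larger weight:
# the decoder's "spotlight" lands on the relevant source position.
context, weights = attention([1.0, 0.0],
                             keys=[[1.0, 0.0], [0.0, 1.0]],
                             values=[[10.0, 0.0], [0.0, 10.0]])
assert weights[0] > weights[1]
```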
---
🔍 Key Comparisons
Symbolic vs. Statistical vs. Neural
Data requirement: low → moderate → high.
Feature engineering: hand‑crafted → learned probabilities → learned representations.
Typical performance: limited → good → state‑of‑the‑art.
Rule‑Based vs. Statistical POS Tagging
Rule‑Based: explicit tag rules, brittle on unseen words.
Statistical (HMM): learns tag transition/emission probabilities, handles unknowns better.
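A tiny Viterbi decoder makes the HMM side concrete (the transition and emission probabilities below are invented; the 1e-6 floor is a crude stand-in for smoothing, which is what lets the tagger cope with unseen words):

```python
def viterbi(obs, states, start_p, trans_p, emit_p):
    # Most-likely tag sequence under an HMM: each cell holds the best
    # probability of reaching a state plus the path that achieved it.
    V = [{s: (start_p[s] * emit_p[s].get(obs[0], 1e-6), [s])
          for s in states}]
    for w in obs[1:]:
        row = {}
        for s in states:
            prob, path = max(
                (V[-1][p][0] * trans_p[p][s] * emit_p[s].get(w, 1e-6),
                 V[-1][p][1])
                for p in states
            )
            row[s] = (prob, path + [s])
        V.append(row)
    return max(V[-1].values())[1]

# Toy two-tag model (hypothetical probabilities)
states = ["NOUN", "VERB"]
start_p = {"NOUN": 0.7, "VERB": 0.3}
trans_p = {"NOUN": {"NOUN": 0.3, "VERB": 0.7},
           "VERB": {"NOUN": 0.8, "VERB": 0.2}}
emit_p = {"NOUN": {"dogs": 0.6, "run": 0.1},
          "VERB": {"dogs": 0.05, "run": 0.7}}

assert viterbi(["dogs", "run"], states, start_p, trans_p, emit_p) == ["NOUN", "VERB"]
```

A rule-based tagger would need an explicit rule for every such pattern; the HMM recovers it from the learned probabilities.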
Supervised vs. Unsupervised Learning
Supervised: uses hand‑annotated labels → higher per‑sample accuracy.
Unsupervised: no labels → cheaper data, but typically lower accuracy for the same amount of data.
---
⚠️ Common Misunderstandings
“Neural nets need no data.” – They actually demand large labeled or unlabeled corpora plus compute.
“Statistical = machine learning.” – The terms overlap but are not identical; early statistical methods (e.g., IBM alignment models) estimated probabilities with techniques like expectation-maximization rather than modern gradient-based learning.
“Rule‑based systems are always inferior.” – For extremely low‑resource languages, rule‑based pipelines can outperform data‑hungry neural models.
“Word2vec is a deep network.” – It is a shallow two‑layer model; depth comes later in transformer‑based embeddings.
---
🧠 Mental Models / Intuition
Language ↔ Vector Space: Imagine every word as a point in a high‑dimensional room; distances capture meaning.
Pipeline → End‑to‑End: Traditional NLP = assembly line (tokenizer → POS → parser …); modern NLP = a single factory where raw text goes in and the desired output pops out.
Attention = Spotlight: In seq2seq, the decoder shines a spotlight on relevant encoder states, like a reader focusing on specific words when translating.
---
🚩 Exceptions & Edge Cases
Low‑resource languages – Rule‑based preprocessing (tokenization, morphological analysis) may be the only viable option.
Unsupervised methods – Often lag behind supervised counterparts on accuracy, but they can be the only path when annotation is impossible.
Compute‑heavy models – Transformers deliver top performance but may be impractical on limited hardware; smaller RNNs or statistical models become fallback choices.
---
📍 When to Use Which
Rule‑Based → No sizable corpus, domain‑specific expert knowledge, quick prototype for rare languages.
Statistical (HMM, CRF, etc.) → Moderate labeled data, need robustness to noisy input (misspellings).
Neural (embeddings, Transformers) → Large annotated or raw corpora, high‑accuracy demand, resources for GPU/TPU training.
Unsupervised / Semi‑Supervised → Massive unlabeled web text, pre‑training stage before fine‑tuning on a small labeled set.
---
👀 Patterns to Recognize
“Sequence‑to‑sequence” → Look for encoder‑decoder architecture, often paired with attention → likely translation or summarization task.
“Embedding + cosine similarity” → Indicates lexical semantics or similarity‑based retrieval.
“Parse tree / dependency graph” → Syntactic analysis problem.
“Multi‑modal” → Presence of text + audio/video → expect multimodal fusion models.
---
🗂️ Exam Traps
Distractor: “Word2vec is a deep neural network.” – It is a shallow model; depth appears later in transformer‑based embeddings.
Distractor: “Statistical models cannot handle misspellings.” – Actually, probabilistic models are more robust to noisy input than rule‑based ones.
Distractor: “Rule‑based NLP always outperforms neural methods.” – True only in extreme low‑data scenarios; otherwise neural approaches dominate benchmarks.
Distractor: “Morphological analysis is the same as lexical semantics.” – Morphology deals with word form structure; lexical semantics concerns word meaning in context.