Natural Language Processing Study Guide
📖 Core Concepts
Natural Language Processing (NLP) – Computer‑based processing of human language; a subfield of Artificial Intelligence that overlaps with information retrieval, computational linguistics, and knowledge representation.
Major NLP Tasks
Speech recognition → convert spoken audio to text.
Text classification → assign text to predefined categories.
Natural language understanding → interpret meaning.
Natural language generation → produce human‑like text from data.
Historical Eras
Symbolic (1950s‑early‑1990s) – Hand‑crafted rules, e.g., Chinese Room, Lesk algorithm.
Statistical (1990s‑present) – Probabilistic models, IBM alignment models, HMM POS taggers.
Neural‑Network (2010s‑present) – Word2vec embeddings, seq2seq translation, transformer models.
Three Main Approaches
Symbolic / Rule‑Based – Explicit linguistic rules, useful when data are scarce.
Statistical / Machine‑Learning – Learn probabilities from corpora; more robust to noise.
Neural / Deep Learning – End‑to‑end learned representations; capture long‑range dependencies.
---
📌 Must Remember
Definition: NLP = processing natural language information by a computer.
Key Milestones
1950 – Turing test (automated language interpretation/generation).
1980s – Quantitative evaluation pushes statistical shift.
1990s – IBM alignment models → statistical MT.
2010 – Mikolov’s RNN language model; 2013 – Word2vec embeddings.
2017 – Transformer architecture introduced; it soon dominates.
Approach Trade‑offs
Symbolic = high expert effort, low data needs.
Statistical = better scalability, needs annotated corpora.
Neural = highest accuracy, requires massive data & compute.
Word Embeddings: Continuous vectors where cosine similarity ≈ semantic similarity.
Sequence‑to‑Sequence (seq2seq): Encoder–decoder networks that map an input sequence (e.g., source sentence) directly to an output sequence (e.g., translation).
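The cosine-similarity idea behind embeddings can be sketched with toy vectors (the 3-dimensional values below are made up for illustration; real embeddings have hundreds of dimensions and are learned from data):

```python
import math

def cosine(u, v):
    # Cosine similarity: dot(u, v) / (|u| * |v|), in [-1, 1].
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

# Toy "embeddings" (invented values, not trained)
vec = {
    "king":  [0.90, 0.80, 0.10],
    "queen": [0.85, 0.82, 0.15],
    "apple": [0.10, 0.20, 0.95],
}

# Semantically related words point in similar directions.
assert cosine(vec["king"], vec["queen"]) > cosine(vec["king"], vec["apple"])
```

Trained embeddings behave the same way: related words score close to 1, unrelated words close to 0 or below.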
---
🔄 Key Processes
Speech Recognition Pipeline
Audio capture → feature extraction (e.g., MFCC) → acoustic model (often a neural net) → language model → decoded text.
Text Classification
Raw text → tokenization → vectorization (e.g., TF‑IDF or embeddings) → classifier (logistic regression, SVM, or neural net) → label output.
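A minimal sketch of that pipeline, substituting a bag-of-words count vectorizer and a multinomial Naive Bayes classifier for the heavier options named above (the four training sentences are invented):

```python
from collections import Counter
import math

# Toy labeled corpus (invented examples)
train = [
    ("great movie loved it", "pos"),
    ("wonderful acting great plot", "pos"),
    ("terrible movie hated it", "neg"),
    ("awful plot terrible acting", "neg"),
]

def tokenize(text):
    return text.lower().split()

# Vectorization step collapsed into per-class word counts
counts = {"pos": Counter(), "neg": Counter()}
for text, label in train:
    counts[label].update(tokenize(text))

vocab = set(w for c in counts.values() for w in c)

def predict(text):
    # Multinomial Naive Bayes with add-one smoothing, uniform priors
    scores = {}
    for label, c in counts.items():
        total = sum(c.values())
        scores[label] = sum(
            math.log((c[w] + 1) / (total + len(vocab)))
            for w in tokenize(text)
        )
    return max(scores, key=scores.get)

assert predict("loved the great acting") == "pos"
```

The same tokenize → vectorize → classify shape holds when the pieces are swapped for TF-IDF features and a logistic regression or neural classifier.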
Training Word2vec (skip‑gram)
Input word → project to embedding → predict surrounding context words → update embeddings via gradient descent.
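The first step of that loop, generating (center word, context word) training pairs, can be sketched as follows (the embedding projection and gradient update are omitted here):

```python
def skipgram_pairs(tokens, window=2):
    # For each center word, emit one (center, context) pair per word
    # within `window` positions; these are skip-gram's training targets.
    pairs = []
    for i, center in enumerate(tokens):
        lo = max(0, i - window)
        hi = min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:
                pairs.append((center, tokens[j]))
    return pairs

sent = "the cat sat on the mat".split()
pairs = skipgram_pairs(sent, window=1)
assert ("cat", "the") in pairs and ("cat", "sat") in pairs
```

Training then repeatedly nudges each center word's embedding so that it predicts its paired context words with higher probability.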
Seq2seq Machine Translation
Encoder RNN/Transformer reads source sentence → creates context vector → Decoder RNN/Transformer generates target sentence token‑by‑token, using attention to focus on relevant source parts.
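The attention step can be sketched as plain scaled dot-product attention over toy 2-dimensional encoder states (the vectors are made up; real models use learned, high-dimensional representations):

```python
import math

def attention(query, keys, values):
    # Score each encoder state (key) against the decoder query,
    # softmax the scores into weights, then return the
    # weighted average of the encoder values.
    d = len(query)
    scores = [sum(q * k for q, k in zip(query, key)) / math.sqrt(d)
              for key in keys]
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]   # numerically stable softmax
    z = sum(exps)
    weights = [e / z for e in exps]
    context = [sum(w * v[i] for w, v in zip(weights, values))
               for i in range(len(values[0]))]
    return context, weights

# A query aligned with the first encoder state gets the larger weight:
# the decoder's "spotlight" lands on the relevant source position.
context, weights = attention([1.0, 0.0],
                             keys=[[1.0, 0.0], [0.0, 1.0]],
                             values=[[10.0, 0.0], [0.0, 10.0]])
assert weights[0] > weights[1]
```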
---
🔍 Key Comparisons
Symbolic vs. Statistical vs. Neural
Data requirement: low → moderate → high.
Feature engineering: hand‑crafted → learned probabilities → learned representations.
Typical performance: limited → good → state‑of‑the‑art.
Rule‑Based vs. Statistical POS Tagging
Rule‑Based: explicit tag rules, brittle on unseen words.
Statistical (HMM): learns tag transition/emission probabilities, handles unknowns better.
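A tiny Viterbi decoder makes the HMM side concrete (the transition and emission probabilities below are invented; the 1e-6 floor is a crude stand-in for smoothing, which is what lets the tagger cope with unseen words):

```python
def viterbi(obs, states, start_p, trans_p, emit_p):
    # Most-likely tag sequence under an HMM: each cell holds the best
    # probability of reaching a state plus the path that achieved it.
    V = [{s: (start_p[s] * emit_p[s].get(obs[0], 1e-6), [s])
          for s in states}]
    for w in obs[1:]:
        row = {}
        for s in states:
            prob, path = max(
                (V[-1][p][0] * trans_p[p][s] * emit_p[s].get(w, 1e-6),
                 V[-1][p][1])
                for p in states
            )
            row[s] = (prob, path + [s])
        V.append(row)
    return max(V[-1].values())[1]

# Toy two-tag model (hypothetical probabilities)
states = ["NOUN", "VERB"]
start_p = {"NOUN": 0.7, "VERB": 0.3}
trans_p = {"NOUN": {"NOUN": 0.3, "VERB": 0.7},
           "VERB": {"NOUN": 0.8, "VERB": 0.2}}
emit_p = {"NOUN": {"dogs": 0.6, "run": 0.1},
          "VERB": {"dogs": 0.05, "run": 0.7}}

assert viterbi(["dogs", "run"], states, start_p, trans_p, emit_p) == ["NOUN", "VERB"]
```

A rule-based tagger would need an explicit rule for every such pattern; the HMM recovers it from the learned probabilities.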
Supervised vs. Unsupervised Learning
Supervised: uses hand‑annotated labels → higher per‑sample accuracy.
Unsupervised: no labels → cheaper data, but typically lower accuracy for the same amount of data.
---
⚠️ Common Misunderstandings
“Neural nets need no data.” – They actually demand large labeled or unlabeled corpora plus compute.
“Statistical = machine learning.” – The terms overlap but are not identical; early statistical methods (e.g., IBM alignment models) estimated probabilities with techniques like expectation-maximization rather than modern gradient-based learning.
“Rule‑based systems are always inferior.” – For extremely low‑resource languages, rule‑based pipelines can outperform data‑hungry neural models.
“Word2vec is a deep network.” – It is a shallow two‑layer model; depth comes later in transformer‑based embeddings.
---
🧠 Mental Models / Intuition
Language ↔ Vector Space: Imagine every word as a point in a high‑dimensional room; distances capture meaning.
Pipeline → End‑to‑End: Traditional NLP = assembly line (tokenizer → POS → parser …); modern NLP = a single factory where raw text goes in and the desired output pops out.
Attention = Spotlight: In seq2seq, the decoder shines a spotlight on relevant encoder states, like a reader focusing on specific words when translating.
---
🚩 Exceptions & Edge Cases
Low‑resource languages – Rule‑based preprocessing (tokenization, morphological analysis) may be the only viable option.
Unsupervised methods – Often lag behind supervised counterparts on accuracy, but they can be the only path when annotation is impossible.
Compute‑heavy models – Transformers deliver top performance but may be impractical on limited hardware; smaller RNNs or statistical models become fallback choices.
---
📍 When to Use Which
Rule‑Based → No sizable corpus, domain‑specific expert knowledge, quick prototype for rare languages.
Statistical (HMM, CRF, etc.) → Moderate labeled data, need robustness to noisy input (misspellings).
Neural (embeddings, Transformers) → Large annotated or raw corpora, high‑accuracy demand, resources for GPU/TPU training.
Unsupervised / Semi‑Supervised → Massive unlabeled web text, pre‑training stage before fine‑tuning on a small labeled set.
---
👀 Patterns to Recognize
“Sequence‑to‑sequence” → Look for encoder‑decoder architecture, often paired with attention → likely translation or summarization task.
“Embedding + cosine similarity” → Indicates lexical semantics or similarity‑based retrieval.
“Parse tree / dependency graph” → Syntactic analysis problem.
“Multi‑modal” → Presence of text + audio/video → expect multimodal fusion models.
---
🗂️ Exam Traps
Distractor: “Word2vec is a deep neural network.” – It is a shallow model; depth appears later in transformer‑based embeddings.
Distractor: “Statistical models cannot handle misspellings.” – Actually, probabilistic models are more robust to noisy input than rule‑based ones.
Distractor: “Rule‑based NLP always outperforms neural methods.” – True only in extreme low‑data scenarios; otherwise neural approaches dominate benchmarks.
Distractor: “Morphological analysis is the same as lexical semantics.” – Morphology deals with word form structure; lexical semantics concerns word meaning in context.