Corpus Linguistics Study Guide
📖 Core Concepts
Corpus Linguistics – the scientific study of language through large, authentic collections of text (corpora).
Corpus – a balanced, often stratified, machine‑readable set of real‑world speech or writing that reflects a specific linguistic variety.
Natural‑Context Emphasis – corpora are gathered with minimal experimental interference, preserving how language is actually used.
Quantitative Analysis – statistical techniques applied to corpora to test hypotheses that are hard to assess qualitatively.
Text‑Corpus Method – derives abstract linguistic rules directly from the body of natural‑language texts.
3A Perspective – the workflow of Annotation → Abstraction → Analysis that structures corpus research.
📌 Must Remember
First major machine‑readable corpus: Brown Corpus (compiled 1961–1964; ~1 M words of American English).
First modern, systematic corpus project: Survey of English Usage (founded by Randolph Quirk, 1959).
First dictionary built with corpus data: American Heritage Dictionary (1969).
Major English corpora:
British National Corpus (BNC) – 100 M words, 1990s British English.
Corpus of Contemporary American English (COCA) – American English, 1990 onward; early releases exceeded 400 M words, and the current release tops one billion.
International Corpus of English (ICE) – comparable one‑million‑word corpora of many national and regional varieties of English.
Typical annotation: part‑of‑speech (POS) tagging; most lexical corpora are POS‑tagged.
Key 3A steps:
Annotation – add markup (POS, parsing, etc.).
Abstraction – map annotation to theoretical constructs or searchable patterns.
Analysis – statistical probing, rule optimisation, knowledge discovery.
🔄 Key Processes
Building a Corpus
Define target variety → collect authentic texts → digitise → (optional) balance/stratify → store in machine‑readable format.
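The pipeline above can be sketched in a few lines of Python. The two sample texts and their metadata fields are invented for illustration; a real project would read digitised files collected for the target variety.

```python
import json

# Hypothetical raw texts with metadata (invented samples, not real corpus data).
raw_texts = [
    {"genre": "news", "year": 1994, "text": "The committee reached a decision."},
    {"genre": "fiction", "year": 1991, "text": "She made a quiet decision to leave."},
]

def build_corpus(samples):
    """Tokenise each sample and store it as a machine-readable record."""
    corpus = []
    for sample in samples:
        tokens = sample["text"].lower().rstrip(".").split()
        corpus.append({"genre": sample["genre"],
                       "year": sample["year"],
                       "tokens": tokens})
    return corpus

corpus = build_corpus(raw_texts)
print(json.dumps(corpus[0], indent=2))
```

Keeping genre and year with every record is what later makes balancing and stratified sampling possible.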
Annotation Pipeline
Raw text → automatic POS tagger → manual correction (if needed) → add structural markup (sentence, speaker, genre).
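A toy version of the automatic-tag-then-correct loop, assuming a made-up lookup lexicon rather than a trained tagger (real pipelines use statistical or neural taggers):

```python
# Toy lexicon (invented for illustration; not a real tagset resource).
LEXICON = {"the": "DET", "a": "DET", "dog": "NOUN",
           "barks": "VERB", "loudly": "ADV"}

def auto_tag(tokens):
    """Automatic pass: look each token up; default to NOUN when unknown."""
    return [(tok, LEXICON.get(tok, "NOUN")) for tok in tokens]

def correct(tagged, corrections):
    """Manual-correction pass: apply (index, tag) fixes from a human annotator."""
    tagged = list(tagged)
    for i, tag in corrections:
        tagged[i] = (tagged[i][0], tag)
    return tagged

tagged = auto_tag("the dog barks loudly".split())
tagged = correct(tagged, [])  # reviewer found nothing to fix here
print(tagged)
```

The "default to NOUN" fallback mirrors a common tagger heuristic and is exactly the kind of systematic error the manual-correction pass exists to catch.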
Abstraction to Research Questions
Choose linguistic feature → formulate search/query language → retrieve instances → map to theoretical model.
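As a minimal sketch of the abstraction step, the theoretical construct "simple noun phrase" can be mapped onto a searchable pattern (DET followed by NOUN) over tagged tokens; the tagged sentence below is invented for illustration:

```python
tagged = [("the", "DET"), ("dog", "NOUN"), ("barks", "VERB"),
          ("a", "DET"), ("bone", "NOUN")]

def find_det_noun(tagged_tokens):
    """Retrieve every DET+NOUN bigram as a candidate noun phrase."""
    hits = []
    for (w1, t1), (w2, t2) in zip(tagged_tokens, tagged_tokens[1:]):
        if t1 == "DET" and t2 == "NOUN":
            hits.append((w1, w2))
    return hits

print(find_det_noun(tagged))  # [('the', 'dog'), ('a', 'bone')]
```

Real corpus tools express the same idea in a query language (e.g. regular expressions over tag sequences), but the mapping from construct to pattern is identical.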
Statistical Analysis
Frequency counts → collocation measures (e.g., MI, t‑score) → hypothesis testing (χ², t‑test) → generalisation to the whole language.
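The MI (pointwise mutual information) collocation measure mentioned above compares how often a word pair co-occurs with how often it would co-occur by chance. A minimal sketch, with all counts invented for illustration:

```python
import math

def mi_score(f_xy, f_x, f_y, n):
    """Mutual information for a word pair:
    MI = log2( f(x,y) * N / (f(x) * f(y)) ),
    where N is the corpus size in tokens."""
    return math.log2(f_xy * n / (f_x * f_y))

# Hypothetical counts in a 1,000,000-word corpus:
# "strong" occurs 1,000 times, "tea" 500 times, "strong tea" 50 times.
print(round(mi_score(50, 1000, 500, 1_000_000), 2))  # 6.64
```

An MI score well above zero (values of 3 or more are a common rule of thumb) suggests the pair co-occurs far more often than chance, flagging it as a candidate collocation.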
🔍 Key Comparisons
Brown Corpus vs. BNC –
Brown: 1 M words, American English, 1960s, manually compiled.
BNC: 100 M words, British English, 1990s, largely automated.
Annotated vs. Unannotated Corpora –
Annotated: POS‑tagged, searchable by grammatical categories, supports complex queries.
Unannotated: plain text only; requires on‑the‑fly annotation or lexical searches.
Domain‑Specific vs. General‑Purpose Corpora –
Legal corpus: focuses on statutory language, terminology, and argument structure.
General corpus (e.g., COCA): broad genre coverage, suitable for everyday language patterns.
⚠️ Common Misunderstandings
“Corpus = Dictionary” – a corpus is raw data; dictionaries are products that may use corpora but are not the same.
“More data = better results” – quality (balance, representativeness) matters as much as size; a biased corpus yields biased conclusions.
“Annotation is optional” – without annotation, many grammatical or syntactic analyses are impossible or extremely labor‑intensive.
“Statistical significance = linguistic importance” – a statistically significant frequency may be pragmatically trivial; always interpret in context.
🧠 Mental Models / Intuition
“Library Analogy” – think of a corpus as a library of real‑world language; annotation adds a detailed index (POS, syntax) that lets you locate patterns quickly.
“3A Assembly Line” – imagine raw metal (text) → Annotation (cutting, shaping) → Abstraction (blueprints) → Analysis (quality testing). Each stage depends on the previous one.
🚩 Exceptions & Edge Cases
Sign‑Language Corpora – rely on video data; annotation includes gestural features, not just textual tags.
Historical Corpora – older texts may lack modern orthography, requiring custom tokenisation and POS models.
Highly Specialized Domains (e.g., legal) – generic POS taggers often mis‑tag domain‑specific terminology; custom tagsets may be needed.
📍 When to Use Which
Choose a General Corpus (BNC/COCA) when you need broad, representative language statistics across genres.
Pick a Domain‑Specific Corpus (legal, translation) when research questions target terminology, discourse structure, or genre‑specific patterns.
Use an Annotated Corpus for syntactic, morphological, or POS‑based investigations; resort to lexical search on unannotated texts only for pure frequency or keyword‑in‑context (KWIC) work.
Apply the 3A workflow when you have a clear theoretical model to test; skip abstraction if the research question is purely descriptive (e.g., raw frequency list).
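The keyword-in-context (KWIC) search mentioned above needs no annotation at all, which is why it works on plain-text corpora. A minimal sketch over an invented sentence:

```python
def kwic(tokens, keyword, window=3):
    """Keyword-in-context: show each hit with `window` tokens on either side."""
    lines = []
    for i, tok in enumerate(tokens):
        if tok == keyword:
            left = " ".join(tokens[max(0, i - window):i])
            right = " ".join(tokens[i + 1:i + 1 + window])
            lines.append(f"{left} [{keyword}] {right}")
    return lines

text = "we had to make a decision before the board could make an offer"
for line in kwic(text.split(), "make"):
    print(line)
```

Scanning the aligned contexts is how collocations such as "make a decision" are typically spotted before any statistical testing.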
👀 Patterns to Recognize
Frequency‑Collocation Pattern – high‑frequency words often form predictable collocations (e.g., “make a decision”). Spotting these guides hypothesis formation.
Genre‑Specific Lexical Sets – legal corpora show over‑use of modal verbs (“shall,” “must”); teaching corpora emphasize didactic language (“you should”).
Annotation Gaps – systematic tagging errors (e.g., proper nouns mis‑tagged as nouns) often appear in automatically tagged corpora; recognize and correct before analysis.
🗂️ Exam Traps
Distractor: “The Brown Corpus is the largest English corpus.” – False; it is tiny (1 M words) compared to BNC/COCA.
Misleading Choice: “All corpora are automatically compiled.” – Incorrect; early corpora (e.g., Brown) were manually assembled.
Trap: “POS tagging is unnecessary for frequency counts.” – While raw counts are possible, many linguistic questions (e.g., verb‑type distribution) require POS information.
Near‑Miss: “The International Corpus of English contains only British English data.” – Wrong; ICE comprises comparable corpora of many national and regional varieties of English (British, Indian, Singaporean, etc.).
---
Use this guide for a rapid review before your exam: focus on the key terms, the 3A workflow, and the distinctions among the major corpora.