Corpus Linguistics Study Guide
📖 Core Concepts
Corpus Linguistics – the scientific study of language through large, authentic collections of text (corpora).
Corpus – a balanced, often stratified, machine‑readable set of real‑world speech or writing that reflects a specific linguistic variety.
Natural‑Context Emphasis – corpora are gathered with minimal experimental interference, preserving how language is actually used.
Quantitative Analysis – statistical techniques applied to corpora to test hypotheses that are hard to assess qualitatively.
Text‑Corpus Method – derives abstract linguistic rules directly from the body of natural‑language texts.
3A Perspective – the workflow of Annotation → Abstraction → Analysis that structures corpus research.
📌 Must Remember
First major machine‑readable corpus: Brown Corpus (compiled 1961–1964; ~1 M words of American English).
First modern, systematic corpus project: Survey of English Usage (founded by Randolph Quirk, 1959).
First dictionary built with corpus data: American Heritage Dictionary (1969).
Major English corpora:
British National Corpus (BNC) – 100 M words, 1990s British English.
Corpus of Contemporary American English (COCA) – American English, 1990 onward; early releases exceeded 400 M words, and the current release tops one billion.
International Corpus of English (ICE) – comparable one‑million‑word corpora of many national and regional varieties of English.
Typical annotation: part‑of‑speech (POS) tagging; most lexical corpora are POS‑tagged.
Key 3A steps:
Annotation – add markup (POS, parsing, etc.).
Abstraction – map annotation to theoretical constructs or searchable patterns.
Analysis – statistical probing, rule optimisation, knowledge discovery.
🔄 Key Processes
Building a Corpus
Define target variety → collect authentic texts → digitise → (optional) balance/stratify → store in machine‑readable format.
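The pipeline above can be sketched in a few lines of Python. The two sample texts and their metadata fields are invented for illustration; a real project would read digitised files collected for the target variety.

```python
import json

# Hypothetical raw texts with metadata (invented samples, not real corpus data).
raw_texts = [
    {"genre": "news", "year": 1994, "text": "The committee reached a decision."},
    {"genre": "fiction", "year": 1991, "text": "She made a quiet decision to leave."},
]

def build_corpus(samples):
    """Tokenise each sample and store it as a machine-readable record."""
    corpus = []
    for sample in samples:
        tokens = sample["text"].lower().rstrip(".").split()
        corpus.append({"genre": sample["genre"],
                       "year": sample["year"],
                       "tokens": tokens})
    return corpus

corpus = build_corpus(raw_texts)
print(json.dumps(corpus[0], indent=2))
```

Keeping genre and year with every record is what later makes balancing and stratified sampling possible.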
Annotation Pipeline
Raw text → automatic POS tagger → manual correction (if needed) → add structural markup (sentence, speaker, genre).
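A toy version of the automatic-tag-then-correct loop, assuming a made-up lookup lexicon rather than a trained tagger (real pipelines use statistical or neural taggers):

```python
# Toy lexicon (invented for illustration; not a real tagset resource).
LEXICON = {"the": "DET", "a": "DET", "dog": "NOUN",
           "barks": "VERB", "loudly": "ADV"}

def auto_tag(tokens):
    """Automatic pass: look each token up; default to NOUN when unknown."""
    return [(tok, LEXICON.get(tok, "NOUN")) for tok in tokens]

def correct(tagged, corrections):
    """Manual-correction pass: apply (index, tag) fixes from a human annotator."""
    tagged = list(tagged)
    for i, tag in corrections:
        tagged[i] = (tagged[i][0], tag)
    return tagged

tagged = auto_tag("the dog barks loudly".split())
tagged = correct(tagged, [])  # reviewer found nothing to fix here
print(tagged)
```

The "default to NOUN" fallback mirrors a common tagger heuristic and is exactly the kind of systematic error the manual-correction pass exists to catch.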
Abstraction to Research Questions
Choose linguistic feature → formulate search/query language → retrieve instances → map to theoretical model.
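As a minimal sketch of the abstraction step, the theoretical construct "simple noun phrase" can be mapped onto a searchable pattern (DET followed by NOUN) over tagged tokens; the tagged sentence below is invented for illustration:

```python
tagged = [("the", "DET"), ("dog", "NOUN"), ("barks", "VERB"),
          ("a", "DET"), ("bone", "NOUN")]

def find_det_noun(tagged_tokens):
    """Retrieve every DET+NOUN bigram as a candidate noun phrase."""
    hits = []
    for (w1, t1), (w2, t2) in zip(tagged_tokens, tagged_tokens[1:]):
        if t1 == "DET" and t2 == "NOUN":
            hits.append((w1, w2))
    return hits

print(find_det_noun(tagged))  # [('the', 'dog'), ('a', 'bone')]
```

Real corpus tools express the same idea in a query language (e.g. regular expressions over tag sequences), but the mapping from construct to pattern is identical.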
Statistical Analysis
Frequency counts → collocation measures (e.g., MI, t‑score) → hypothesis testing (χ², t‑test) → generalisation to the whole language.
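The MI (pointwise mutual information) collocation measure mentioned above compares how often a word pair co-occurs with how often it would co-occur by chance. A minimal sketch, with all counts invented for illustration:

```python
import math

def mi_score(f_xy, f_x, f_y, n):
    """Mutual information for a word pair:
    MI = log2( f(x,y) * N / (f(x) * f(y)) ),
    where N is the corpus size in tokens."""
    return math.log2(f_xy * n / (f_x * f_y))

# Hypothetical counts in a 1,000,000-word corpus:
# "strong" occurs 1,000 times, "tea" 500 times, "strong tea" 50 times.
print(round(mi_score(50, 1000, 500, 1_000_000), 2))  # 6.64
```

An MI score well above zero (values of 3 or more are a common rule of thumb) suggests the pair co-occurs far more often than chance, flagging it as a candidate collocation.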
🔍 Key Comparisons
Brown Corpus vs. BNC –
Brown: 1 M words, American English, 1960s, manually compiled.
BNC: 100 M words, British English, 1990s, largely automated.
Annotated vs. Unannotated Corpora –
Annotated: POS‑tagged, searchable by grammatical categories, supports complex queries.
Unannotated: plain text only; requires on‑the‑fly annotation or lexical searches.
Domain‑Specific vs. General‑Purpose Corpora –
Legal corpus: focuses on statutory language, terminology, and argument structure.
General corpus (e.g., COCA): broad genre coverage, suitable for everyday language patterns.
⚠️ Common Misunderstandings
“Corpus = Dictionary” – a corpus is raw data; dictionaries are products that may use corpora but are not the same.
“More data = better results” – quality (balance, representativeness) matters as much as size; a biased corpus yields biased conclusions.
“Annotation is optional” – without annotation, many grammatical or syntactic analyses are impossible or extremely labor‑intensive.
“Statistical significance = linguistic importance” – a statistically significant frequency may be pragmatically trivial; always interpret in context.
🧠 Mental Models / Intuition
“Library Analogy” – think of a corpus as a library of real‑world language; annotation adds a detailed index (POS, syntax) that lets you locate patterns quickly.
“3A Assembly Line” – imagine raw metal (text) → Annotation (cutting, shaping) → Abstraction (blueprints) → Analysis (quality testing). Each stage depends on the previous one.
🚩 Exceptions & Edge Cases
Sign‑Language Corpora – rely on video data; annotation includes gestural features, not just textual tags.
Historical Corpora – older texts may lack modern orthography, requiring custom tokenisation and POS models.
Highly Specialized Domains (e.g., legal) – generic POS taggers often mis‑tag domain‑specific terminology; custom tagsets may be needed.
📍 When to Use Which
Choose a General Corpus (BNC/COCA) when you need broad, representative language statistics across genres.
Pick a Domain‑Specific Corpus (legal, translation) when research questions target terminology, discourse structure, or genre‑specific patterns.
Use an Annotated Corpus for syntactic, morphological, or POS‑based investigations; resort to lexical search on unannotated texts only for pure frequency or keyword‑in‑context (KWIC) work.
Apply the 3A workflow when you have a clear theoretical model to test; skip abstraction if the research question is purely descriptive (e.g., raw frequency list).
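The keyword-in-context (KWIC) search mentioned above needs no annotation at all, which is why it works on plain-text corpora. A minimal sketch over an invented sentence:

```python
def kwic(tokens, keyword, window=3):
    """Keyword-in-context: show each hit with `window` tokens on either side."""
    lines = []
    for i, tok in enumerate(tokens):
        if tok == keyword:
            left = " ".join(tokens[max(0, i - window):i])
            right = " ".join(tokens[i + 1:i + 1 + window])
            lines.append(f"{left} [{keyword}] {right}")
    return lines

text = "we had to make a decision before the board could make an offer"
for line in kwic(text.split(), "make"):
    print(line)
```

Scanning the aligned contexts is how collocations such as "make a decision" are typically spotted before any statistical testing.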
👀 Patterns to Recognize
Frequency‑Collocation Pattern – high‑frequency words often form predictable collocations (e.g., “make a decision”). Spotting these guides hypothesis formation.
Genre‑Specific Lexical Sets – legal corpora show over‑use of modal verbs (“shall,” “must”); teaching corpora emphasize didactic language (“you should”).
Annotation Gaps – systematic tagging errors (e.g., proper nouns mis‑tagged as nouns) often appear in automatically tagged corpora; recognize and correct before analysis.
🗂️ Exam Traps
Distractor: “The Brown Corpus is the largest English corpus.” – False; it is tiny (1 M words) compared to BNC/COCA.
Misleading Choice: “All corpora are automatically compiled.” – Incorrect; early corpora (e.g., Brown) were manually assembled.
Trap: “POS tagging is unnecessary for frequency counts.” – While raw counts are possible, many linguistic questions (e.g., verb‑type distribution) require POS information.
Near‑Miss: “The International Corpus of English contains only British English data.” – Wrong; ICE comprises comparable corpora of many national and regional varieties of English (British, Indian, Singaporean, etc.).
---
Use this guide for a rapid review before your exam: focus on the key terms, the 3A workflow, and the distinctions among the major corpora.