Sequence alignment Study Guide
📖 Core Concepts
Sequence Alignment – arranging DNA, RNA, or protein strings so that equivalent residues line up in columns, revealing similarity.
Global vs. Local – Global aligns the entire length of each sequence (Needleman‑Wunsch). Local finds the best matching subsections (Smith‑Waterman).
Scoring Scheme – combines substitution scores (from PAM or BLOSUM matrices) with gap penalties (often affine: open + extend).
CIGAR String – a compact text code that records the series of alignment operations (M = match/mismatch, I = insertion, D = deletion, etc.).
Multiple Sequence Alignment (MSA) – simultaneous alignment of ≥3 sequences; optimal MSA is NP‑complete, so heuristic methods (progressive, iterative) are used.
Conservation – identical residues → strong functional/structural constraint; conservative substitutions → similar side‑chain properties, still indicate importance.
📌 Must Remember
Global alignment = whole‑sequence, local alignment = high‑similarity region only.
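Affine gap penalty: penalty = gap‑open + (gap‑extend × length). Typical example: open = ‑10, extend = ‑2, so a gap of length 3 costs ‑10 + (3 × ‑2) = ‑16. Some tools instead charge gap‑open for the first gap position and gap‑extend only for each additional one.
PAM vs. BLOSUM: PAM matrices are extrapolated from an evolutionary model fitted to closely related sequences (e.g., PAM250); BLOSUM matrices are derived from observed substitutions in conserved blocks (e.g., BLOSUM62). The numbering runs in opposite directions: higher PAM numbers and lower BLOSUM numbers suit more divergent proteins (PAM250, BLOSUM45), while lower PAM numbers and higher BLOSUM numbers suit closer relationships (PAM30, BLOSUM80).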
CIGAR symbols – M (match/mismatch), I (insertion), D (deletion), S (soft‑clip), H (hard‑clip).
Progressive MSA workflow – build guide tree → align most similar pair → add next sequence/group according to tree.
MUM (maximal unique match) – a substring that appears exactly once in each genome and cannot be extended in either direction without a mismatch; useful for rapid anchoring of whole‑genome alignments.
Profile/HMM – after an MSA, a position‑specific scoring matrix (PSSM) or hidden Markov model captures conserved patterns for sensitive searches.
🔄 Key Processes
Needleman‑Wunsch (global DP)
Initialize the first row and column with cumulative gap penalties (under a linear penalty, cell i gets i × gap).
Fill matrix: score(i,j) = max[ diag + sub, up + gap, left + gap ].
Traceback from bottom‑right to top‑left to produce alignment.
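The three steps above can be sketched in Python as follows. This is a minimal illustration, not a reference implementation: the function name `needleman_wunsch`, the simple match/mismatch scores (in place of a PAM or BLOSUM matrix), and the linear gap penalty are all assumptions chosen for brevity.

```python
def needleman_wunsch(a, b, match=1, mismatch=-1, gap=-2):
    """Global alignment with a linear gap penalty (illustrative scores).

    Returns (score, aligned_a, aligned_b).
    """
    n, m = len(a), len(b)
    # score[i][j] = best score for aligning a[:i] with b[:j]
    score = [[0] * (m + 1) for _ in range(n + 1)]
    # Step 1: first row/column hold cumulative gap penalties.
    for i in range(1, n + 1):
        score[i][0] = i * gap
    for j in range(1, m + 1):
        score[0][j] = j * gap
    # Step 2: fill each cell from its diagonal, upper, and left neighbours.
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = match if a[i - 1] == b[j - 1] else mismatch
            score[i][j] = max(score[i - 1][j - 1] + sub,
                              score[i - 1][j] + gap,
                              score[i][j - 1] + gap)
    # Step 3: trace back from bottom-right to top-left.
    out_a, out_b = [], []
    i, j = n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0:
            sub = match if a[i - 1] == b[j - 1] else mismatch
        if i > 0 and j > 0 and score[i][j] == score[i - 1][j - 1] + sub:
            out_a.append(a[i - 1]); out_b.append(b[j - 1])
            i -= 1; j -= 1
        elif i > 0 and score[i][j] == score[i - 1][j] + gap:
            out_a.append(a[i - 1]); out_b.append("-")
            i -= 1
        else:
            out_a.append("-"); out_b.append(b[j - 1])
            j -= 1
    return score[n][m], "".join(reversed(out_a)), "".join(reversed(out_b))
```

With these illustrative scores, `needleman_wunsch("ACGT", "AGT")` aligns `ACGT` against `A-GT` with a score of 1 (three matches, one gap).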
Smith‑Waterman (local DP)
Same fill rule, but take the maximum with zero so an alignment can restart at any cell; the first row and column are initialized to zero.
Start traceback at the highest‑scoring cell; stop when a zero cell is reached.
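A minimal Python sketch of the local variant, under the same assumptions as any toy example: the name `smith_waterman`, the match/mismatch scores, and the linear gap penalty are illustrative choices, not values prescribed by the guide.

```python
def smith_waterman(a, b, match=2, mismatch=-1, gap=-2):
    """Local alignment with a linear gap penalty (illustrative scores).

    Returns (best_score, aligned_a_segment, aligned_b_segment).
    """
    n, m = len(a), len(b)
    # First row and column stay at zero: a local alignment may start anywhere.
    H = [[0] * (m + 1) for _ in range(n + 1)]
    best, best_pos = 0, (0, 0)
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = match if a[i - 1] == b[j - 1] else mismatch
            # The extra 0 lets the alignment restart instead of going negative.
            H[i][j] = max(0,
                          H[i - 1][j - 1] + sub,
                          H[i - 1][j] + gap,
                          H[i][j - 1] + gap)
            if H[i][j] > best:
                best, best_pos = H[i][j], (i, j)
    # Traceback starts at the highest-scoring cell and stops at a zero cell.
    i, j = best_pos
    out_a, out_b = [], []
    while i > 0 and j > 0 and H[i][j] > 0:
        sub = match if a[i - 1] == b[j - 1] else mismatch
        if H[i][j] == H[i - 1][j - 1] + sub:
            out_a.append(a[i - 1]); out_b.append(b[j - 1])
            i -= 1; j -= 1
        elif H[i][j] == H[i - 1][j] + gap:
            out_a.append(a[i - 1]); out_b.append("-")
            i -= 1
        else:
            out_a.append("-"); out_b.append(b[j - 1])
            j -= 1
    return best, "".join(reversed(out_a)), "".join(reversed(out_b))
```

For example, `smith_waterman("TTTACGTAAA", "CCCACGTGGG")` recovers only the shared core `ACGT` and ignores the unrelated flanks.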
Affine Gap Penalty DP
Maintain three matrices: M (match/mismatch), I (insertion in query), D (deletion).
Update with separate open/extend costs to discourage many short gaps.
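The three‑matrix recurrence can be sketched as below (score only, no traceback). The function name `affine_global_score` and the match/mismatch values are assumptions for illustration; the gap cost follows the convention stated in this guide, where a gap of length L costs gap‑open + gap‑extend × L.

```python
NEG_INF = float("-inf")

def affine_global_score(query, ref, match=1, mismatch=-1,
                        gap_open=-10, gap_extend=-2):
    """Optimal global score with affine gaps: cost(L) = gap_open + gap_extend * L."""
    n, m = len(query), len(ref)
    # M: last column pairs query[i-1] with ref[j-1]
    # I: last column pairs query[i-1] with a gap (insertion in query)
    # D: last column pairs ref[j-1] with a gap (deletion from query)
    M = [[NEG_INF] * (m + 1) for _ in range(n + 1)]
    I = [[NEG_INF] * (m + 1) for _ in range(n + 1)]
    D = [[NEG_INF] * (m + 1) for _ in range(n + 1)]
    M[0][0] = 0
    for i in range(1, n + 1):
        I[i][0] = gap_open + gap_extend * i
    for j in range(1, m + 1):
        D[0][j] = gap_open + gap_extend * j
    start = gap_open + gap_extend  # cost of the first position of a new gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = match if query[i - 1] == ref[j - 1] else mismatch
            M[i][j] = max(M[i - 1][j - 1], I[i - 1][j - 1], D[i - 1][j - 1]) + sub
            # Opening a gap pays `start`; staying in the same gap pays gap_extend.
            I[i][j] = max(M[i - 1][j] + start,
                          D[i - 1][j] + start,
                          I[i - 1][j] + gap_extend)
            D[i][j] = max(M[i][j - 1] + start,
                          I[i][j - 1] + start,
                          D[i][j - 1] + gap_extend)
    return max(M[n][m], I[n][m], D[n][m])
```

Aligning `AAAGGG` to `AAACCCGGG` under these defaults gives six matches and one gap of length 3, for a score of 6 + (‑10 + 3 × ‑2) = ‑10; splitting the same three gap positions into separate gaps would pay the open cost repeatedly and score lower.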
Progressive MSA
Compute all‑pair distances (e.g., percent identity).
Build a guide tree (UPGMA/Neighbor‑Joining).
Align leaf pair, then iteratively align clusters following the tree.
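The first two steps (distances, then guide tree) can be sketched as follows. This is a simplified illustration: `p_distance` and `upgma_tree` are hypothetical helper names, and the distance assumes equal‑length sequences compared position by position, whereas real tools derive distances from pairwise alignments or k‑mer counts.

```python
from itertools import combinations

def p_distance(a, b):
    """Fraction of differing positions (1 - percent identity); equal lengths assumed."""
    if len(a) != len(b):
        raise ValueError("this toy distance needs equal-length sequences")
    return sum(x != y for x, y in zip(a, b)) / len(a)

def upgma_tree(seqs):
    """Build a guide tree (nested tuples of names) by UPGMA average linkage.

    `seqs` maps sequence name -> sequence string.
    """
    names = list(seqs)
    # Pairwise distances between the original sequences.
    dist = {frozenset(p): p_distance(seqs[p[0]], seqs[p[1]])
            for p in combinations(names, 2)}
    # Each cluster key is a subtree; its value lists the member sequence names.
    clusters = {name: [name] for name in names}

    def average_distance(c1, c2):
        pairs = [(x, y) for x in clusters[c1] for y in clusters[c2]]
        return sum(dist[frozenset(p)] for p in pairs) / len(pairs)

    while len(clusters) > 1:
        # Merge the two closest clusters; the merge order is the alignment order.
        c1, c2 = min(combinations(clusters, 2),
                     key=lambda pair: average_distance(*pair))
        members = clusters.pop(c1) + clusters.pop(c2)
        clusters[(c1, c2)] = members
    return next(iter(clusters))
```

Reading the returned tuple from the innermost pair outward gives the order in which a progressive aligner would align sequences and then groups.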
Iterative Refinement
Start with a progressive MSA.
Re‑align subsets (e.g., one sequence vs. the rest) repeatedly to improve the sum‑of‑pairs score.
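The objective that refinement tries to improve can be sketched as a small scoring function. The name `sum_of_pairs` and the match/mismatch/gap values are illustrative assumptions; real tools typically use a substitution matrix and affine gap costs.

```python
from itertools import combinations

def sum_of_pairs(msa, match=1, mismatch=-1, gap=-2):
    """Sum-of-pairs score of an MSA given as equal-length gapped strings."""
    if len({len(row) for row in msa}) > 1:
        raise ValueError("all rows of an MSA must have the same length")
    total = 0
    for column in zip(*msa):
        # Score every pair of residues within the column.
        for x, y in combinations(column, 2):
            if x == "-" and y == "-":
                continue  # gap-gap pairs are conventionally scored 0
            if x == "-" or y == "-":
                total += gap
            else:
                total += match if x == y else mismatch
    return total
```

A refinement loop would remove one sequence (or subset), re‑align it to the rest, and keep the new alignment only if this score goes up.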
Generating a CIGAR String (from alignment)
Scan alignment column‑wise: count consecutive identical operation types, output <count><op>.
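A short Python sketch of this run‑length scan, assuming the alignment is given as two gapped strings of equal length (reference first, query second). The function names are hypothetical, and only the M, I, and D operations are produced; clipping (S, H) is not modelled.

```python
from itertools import groupby

def alignment_to_cigar(ref_aln, query_aln):
    """Convert a gapped pairwise alignment into a CIGAR string (M/I/D only)."""
    if len(ref_aln) != len(query_aln):
        raise ValueError("aligned strings must have equal length")
    ops = []
    for r, q in zip(ref_aln, query_aln):
        if r == "-" and q == "-":
            raise ValueError("column with two gaps is not a valid alignment")
        if r == "-":
            ops.append("I")   # base present only in the query
        elif q == "-":
            ops.append("D")   # base present only in the reference
        else:
            ops.append("M")   # aligned column: match OR mismatch
    # Collapse consecutive identical operations into <count><op>.
    return "".join(f"{len(list(run))}{op}" for op, run in groupby(ops))

def percent_identity(ref_aln, query_aln):
    """Identical columns divided by aligned (gap-free) columns."""
    aligned = [(r, q) for r, q in zip(ref_aln, query_aln) if r != "-" and q != "-"]
    return sum(r == q for r, q in aligned) / len(aligned)
```

For the alignment `ACGT-ACGT` / `ACCTTA-GT` the CIGAR is `4M1I1M1D2M`, yet only 6 of the 7 M columns are identical, which is why identity has to be computed from the sequences rather than read off the M counts.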
🔍 Key Comparisons
Global (Needleman‑Wunsch) vs. Local (Smith‑Waterman)
Goal: align entire sequences vs. find highest‑scoring subsequence.
Use: full‑length homologs vs. domains/short motifs.
PAM vs. BLOSUM
Construction: PAM from inferred evolutionary steps; BLOSUM from observed blocks.
Numbering: higher PAM numbers target more divergent sequences (e.g., PAM250); higher BLOSUM numbers target more closely related sequences (e.g., BLOSUM80).
Progressive vs. Iterative MSA
Progressive: fast, tree‑driven, prone to early‑error propagation.
Iterative: slower, repeatedly revisits earlier decisions, usually reaches a higher sum‑of‑pairs score.
Exact DP vs. Heuristic (BLAST/FASTA)
Exact: guarantees optimal score, O(mn) time.
Heuristic: uses word seeds (k‑tuples) for speed; may miss optimal alignments.
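The seeding idea can be sketched with an exact k‑mer index. This is a simplification under stated assumptions: `find_seeds` is a hypothetical name, only exact word matches are used, and the extension and scoring stages that BLAST and FASTA apply after seeding are omitted.

```python
from collections import defaultdict

def find_seeds(query, subject, k=3):
    """Return (query_pos, subject_pos) pairs where a k-mer matches exactly."""
    # Index every k-mer of the subject by its start position.
    index = defaultdict(list)
    for s in range(len(subject) - k + 1):
        index[subject[s:s + k]].append(s)
    # Look up each query k-mer; every hit is a candidate seed for extension.
    seeds = []
    for q in range(len(query) - k + 1):
        for s in index.get(query[q:q + k], []):
            seeds.append((q, s))
    return seeds
```

Seeds that share the same offset (query position minus subject position) lie on one diagonal and are natural candidates for extension. A similar region that contains no exact k‑mer match produces no seed at all, which is how a heuristic search can miss an alignment that exact DP would find.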
⚠️ Common Misunderstandings
“A high score always means true homology.” – Scores must be evaluated for statistical significance (E‑value) and database composition.
“Gaps are always penalized equally.” – Affine penalties treat opening a gap as costlier than extending one.
“Conservative substitution = no functional impact.” – Even conservative changes can affect activity; they merely suggest tolerated variation.
“MSA tools always give the same result.” – Different algorithms, gap settings, and guide trees can produce divergent alignments.
🧠 Mental Models / Intuition
Alignment as a “road map” – Think of sequences as parallel roads; gaps are detours (insertions/deletions). The scoring system rewards staying on the same road (matches) and penalizes detours.
CIGAR as a “run‑length encode” – Like compressing a bitmap: consecutive identical actions are collapsed into a count + symbol.
Progressive MSA = “building a puzzle from the edges inward.” – Start with the most obvious piece pair, then add surrounding pieces guided by the picture (tree).
🚩 Exceptions & Edge Cases
Semi‑global (glocal) alignment – Allows free gaps at one or both ends (useful for overlapping reads or short query vs. long reference).
Highly repetitive databases – Inflated chance scores; must adjust significance calculations.
Short, low‑complexity regions – Often masked before BLAST/FASTA to avoid spurious high scores.
📍 When to Use Which
| Situation | Recommended Method |
|-----------|--------------------|
| Align full‑length orthologous proteins | Needleman‑Wunsch with BLOSUM62 |
| Find conserved domain in a large protein | Smith‑Waterman (local) or BLAST with appropriate word size |
| Align many (>10) sequences of moderate length | Progressive MSA (ClustalW2, T‑Coffee) |
| Refine an existing MSA for highest sum‑of‑pairs | Iterative refinement (e.g., MAFFT L‑INS‑i) |
| Map short reads to a reference genome | BWT‑based aligner (Bowtie, BWA) |
| Detect remote homologs | Profile HMM (e.g., HMMER) |
| Generate a compact alignment descriptor | Produce CIGAR string from alignment |
👀 Patterns to Recognize
Diagonal streaks in dot‑matrix plots → long conserved regions (potential exons or domains).
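Clusters of colons (:) and periods (.) in a Clustal‑style consensus line → conserved but not identical residues (strongly and weakly similar substitutions, respectively); an asterisk (*) marks a fully identical column.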
Long runs of “M” in CIGAR → high similarity; interspersed “I/D” indicate indels.
Positive BLOSUM scores between different residues (e.g., F↔Y, I↔V) mark conservative substitutions, which often appear in functional sites despite being different letters.
🗂️ Exam Traps
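Choosing a matrix tuned for close relationships (e.g., PAM30 or BLOSUM80) for distant proteins – leads to under‑scoring true matches; exam may ask which matrix is appropriate for low identity (a high‑number PAM such as PAM250 or a low‑number BLOSUM such as BLOSUM45).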
Assuming global alignment is always best – if only a domain is conserved, a local algorithm yields a higher‑quality answer.
Confusing “match” in CIGAR (M) with “identical” – M includes both matches and mismatches; the actual identity must be computed separately.
Neglecting affine gap penalties – using a single gap cost can artificially inflate the number of short gaps, producing biologically implausible alignments.
Interpreting a high raw score without E‑value – significance depends on database size; a raw score alone is insufficient.
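The dependence on database size can be made explicit with the Karlin‑Altschul statistics used for local alignment searches such as BLAST. In the standard relations sketched below, S is the raw score, m the query length, n the total database length, and K and λ are constants that depend on the scoring matrix and gap penalties; S' is the normalized bit score.

```latex
E = K \, m \, n \, e^{-\lambda S}
\qquad
S' = \frac{\lambda S - \ln K}{\ln 2}
\qquad
E = m \, n \, 2^{-S'}
```

Because E is proportional to n, the same raw score is half as significant in a database twice as large.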
---
Use this guide to skim before the test – each bullet is an exam‑ready fact or decision point.