
Sequence alignment Study Guide

📖 Core Concepts

- Sequence Alignment – arranging DNA, RNA, or protein sequences so that equivalent residues line up in columns, revealing similarity.
- Global vs. Local – Global aligns the entire length of each sequence (Needleman‑Wunsch). Local finds the best‑matching subsequences (Smith‑Waterman).
- Scoring Scheme – combines substitution scores (for proteins, from PAM or BLOSUM matrices) with gap penalties (often affine: open + extend).
- CIGAR String – a compact text code that records the series of alignment operations (M = match/mismatch, I = insertion, D = deletion, etc.).
- Multiple Sequence Alignment (MSA) – simultaneous alignment of ≥3 sequences; finding the optimal MSA under the sum‑of‑pairs score is NP‑hard, so heuristic methods (progressive, iterative) are used.
- Conservation – identical residues → strong functional/structural constraint; conservative substitutions → similar side‑chain properties, still indicate importance.

📌 Must Remember

- Global alignment = whole‑sequence; local alignment = high‑similarity region only.
- Affine gap penalty: penalty = gap‑open + (gap‑extend × length). Typical example: open = ‑10, extend = ‑2, so a gap of length 3 costs ‑10 + (‑2 × 3) = ‑16. Tools differ on whether the first gap position also pays the extension cost.
- PAM vs. BLOSUM: PAM matrices are extrapolated from an evolutionary model fitted to closely related sequences (e.g., PAM250); BLOSUM matrices are counted directly from observed substitutions in conserved blocks (e.g., BLOSUM62). The numbering runs in opposite directions: higher PAM numbers and lower BLOSUM numbers suit more divergent proteins (PAM250, BLOSUM45); lower PAM numbers and higher BLOSUM numbers suit closer relationships (PAM30, BLOSUM80).
- CIGAR symbols – M (match/mismatch), I (insertion), D (deletion), S (soft‑clip), H (hard‑clip).
- Progressive MSA workflow – build guide tree → align most similar pair → add next sequence/group according to tree.
- MUM (maximal unique match) – a substring that appears exactly once in each genome and cannot be extended in either direction without a mismatch; useful for rapid anchoring.
- Profile/HMM – after an MSA, a position‑specific scoring matrix (PSSM) or hidden Markov model captures conserved patterns for sensitive searches.

🔄 Key Processes

Needleman‑Wunsch (global DP)
1. Initialize the first row/column with cumulative gap penalties (cell k gets k × gap under a linear gap cost).
2. Fill matrix: score(i,j) = max[ diag + sub, up + gap, left + gap ].
3. Traceback from bottom‑right to top‑left to produce the alignment.

Smith‑Waterman (local DP)
1. Initialize the first row/column to zero.
2. Same filling rules, but also allow zero (restart): score(i,j) = max[ 0, diag + sub, up + gap, left + gap ].
3. Start traceback at the highest‑scoring cell; stop when a zero cell is reached.

Affine Gap Penalty DP
1. Maintain three matrices: M (match/mismatch), I (insertion in query), D (deletion).
2. Update with separate open/extend costs, so that one long gap costs less than many short gaps of the same total length.

Progressive MSA
1. Compute all‑pair distances (e.g., from percent identity).
2. Build a guide tree (UPGMA/Neighbor‑Joining).
3. Align the closest leaf pair, then iteratively align clusters following the tree.

Iterative Refinement
1. Start with a progressive MSA.
2. Re‑align subsets (e.g., one sequence vs. the rest) repeatedly to improve the sum‑of‑pairs score.

Generating a CIGAR String (from alignment)
1. Scan the alignment column‑wise and classify each column as an operation.
2. Count consecutive identical operation types and output <count><op> for each run.

🔍 Key Comparisons

Global (Needleman‑Wunsch) vs. Local (Smith‑Waterman)
- Goal: align entire sequences vs. find highest‑scoring subsequence.
- Use: full‑length homologs vs. domains/short motifs.

PAM vs. BLOSUM
- Construction: PAM from inferred evolutionary steps; BLOSUM from observed blocks.
- Best for: closely related sequences → low‑number PAM or high‑number BLOSUM; divergent sequences → high‑number PAM or low‑number BLOSUM.

Progressive vs. Iterative MSA
- Progressive: fast, tree‑driven, prone to early‑error propagation.
- Iterative: slower, repeatedly fixes errors, usually reaches a higher sum‑of‑pairs score.

Exact DP vs. Heuristic (BLAST/FASTA)
- Exact: guarantees optimal score, O(mn) time.
- Heuristic: uses word seeds (k‑tuples) for speed; may miss optimal alignments.

⚠️ Common Misunderstandings

- “A high score always means true homology.” – Scores must be evaluated for statistical significance (E‑value) and database composition.
- “Gaps are always penalized equally.” – Affine penalties treat opening a gap as costlier than extending one.
- “Conservative substitution = no functional impact.” – Even conservative changes can affect activity; they merely suggest tolerated variation.
- “MSA tools always give the same result.” – Different algorithms, gap settings, and guide trees can produce divergent alignments.

🧠 Mental Models / Intuition

- Alignment as a “road map” – Think of sequences as parallel roads; gaps are detours (insertions/deletions). The scoring system rewards staying on the same road (matches) and penalizes detours.
- CIGAR as a “run‑length encode” – Like compressing a bitmap: consecutive identical actions are collapsed into a count + symbol.
- Progressive MSA = “building a puzzle from the edges inward.” – Start with the most obvious piece pair, then add surrounding pieces guided by the picture (tree).

🚩 Exceptions & Edge Cases

- Semi‑global (glocal) alignment – Allows free gaps at one or both ends (useful for overlapping reads or short query vs. long reference).
- Highly repetitive databases – Inflated chance scores; must adjust significance calculations.
- Short, low‑complexity regions – Often masked before BLAST/FASTA to avoid spurious high scores.

📍 When to Use Which

| Situation | Recommended Method |
|-----------|--------------------|
| Align full‑length orthologous proteins | Needleman‑Wunsch with BLOSUM62 |
| Find conserved domain in a large protein | Smith‑Waterman (local) or BLAST with appropriate word size |
| Align many (>10) sequences of moderate length | Progressive MSA (ClustalW2, T‑Coffee) |
| Refine an existing MSA for highest sum‑of‑pairs | Iterative refinement (e.g., MAFFT L‑INS‑i) |
| Map short reads to a reference genome | BWT‑based aligner (Bowtie, BWA) |
| Detect remote homologs | Profile HMM (e.g., HMMER) |
| Generate a compact alignment descriptor | Produce CIGAR string from alignment |

👀 Patterns to Recognize

- Diagonal streaks in dot‑matrix plots → long conserved regions (potential exons or domains).
- Clusters of colons (:) and periods (.) → conserved but not identical residues (conservative/semiconservative).
- Long runs of “M” in CIGAR → high similarity; interspersed “I/D” indicate indels.
- Substitutions with positive BLOSUM scores (e.g., F↔Y, which scores +3 in BLOSUM62) often appear in functional sites despite being different letters.

🗂️ Exam Traps

- Choosing the wrong matrix number for distant proteins – a low‑number PAM (e.g., PAM30) or a high‑number BLOSUM (e.g., BLOSUM80) under‑scores true matches between divergent sequences; an exam may ask which matrix is appropriate for low identity (e.g., PAM250 or BLOSUM45).
- Assuming global alignment is always best – if only a domain is conserved, a local algorithm yields a higher‑quality answer.
- Confusing “match” in CIGAR (M) with “identical” – M includes both matches and mismatches; the actual identity must be computed separately.
- Neglecting affine gap penalties – using a single per‑position gap cost can artificially inflate the number of short gaps, producing biologically implausible alignments.
- Interpreting a high raw score without E‑value – significance depends on database size; a raw score alone is insufficient.

---

Use this guide to skim before the test: each bullet is an exam‑ready fact or decision point.
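The Needleman‑Wunsch steps under Key Processes (initialize, fill, traceback) can be sketched in a few lines of Python. This is a minimal illustration, not a reference implementation: the function name `needleman_wunsch`, the simple match/mismatch scores used in place of a PAM/BLOSUM matrix, and the linear gap cost are all assumptions made for the example.

```python
def needleman_wunsch(a, b, match=1, mismatch=-1, gap=-2):
    """Global alignment with a linear gap cost (illustrative scores, not a real matrix).

    Returns (score, aligned_a, aligned_b).
    """
    n, m = len(a), len(b)
    score = [[0] * (m + 1) for _ in range(n + 1)]

    # Step 1: first row/column hold cumulative gap penalties.
    for i in range(1, n + 1):
        score[i][0] = i * gap
    for j in range(1, m + 1):
        score[0][j] = j * gap

    # Step 2: each cell is the best of diagonal, up, and left moves.
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = match if a[i - 1] == b[j - 1] else mismatch
            score[i][j] = max(
                score[i - 1][j - 1] + sub,  # align a[i-1] with b[j-1]
                score[i - 1][j] + gap,      # a[i-1] against a gap
                score[i][j - 1] + gap,      # b[j-1] against a gap
            )

    # Step 3: trace back from bottom-right to top-left.
    out_a, out_b = [], []
    i, j = n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0:
            sub = match if a[i - 1] == b[j - 1] else mismatch
            if score[i][j] == score[i - 1][j - 1] + sub:
                out_a.append(a[i - 1])
                out_b.append(b[j - 1])
                i -= 1
                j -= 1
                continue
        if i > 0 and score[i][j] == score[i - 1][j] + gap:
            out_a.append(a[i - 1])
            out_b.append("-")
            i -= 1
        else:
            out_a.append("-")
            out_b.append(b[j - 1])
            j -= 1

    return score[n][m], "".join(reversed(out_a)), "".join(reversed(out_b))


print(needleman_wunsch("ACGT", "AGT"))  # (1, 'ACGT', 'A-GT')
```

When several alignments tie for the best score, this traceback prefers the diagonal move, so it reports only one of the optimal alignments.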
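The Smith‑Waterman variant differs from the global case in three places: zero initialization, a zero floor in every cell, and a traceback that starts at the maximum and stops at zero. A hedged sketch follows, again with hypothetical names (`smith_waterman`) and illustrative match/mismatch/gap scores rather than a real substitution matrix.

```python
def smith_waterman(a, b, match=2, mismatch=-1, gap=-2):
    """Local alignment with a linear gap cost (illustrative scores).

    Returns (best_score, aligned_a, aligned_b) for one best local alignment.
    """
    n, m = len(a), len(b)
    H = [[0] * (m + 1) for _ in range(n + 1)]  # first row/column stay zero
    best, best_pos = 0, (0, 0)

    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = match if a[i - 1] == b[j - 1] else mismatch
            H[i][j] = max(
                0,                      # restart: drop a negative prefix
                H[i - 1][j - 1] + sub,
                H[i - 1][j] + gap,
                H[i][j - 1] + gap,
            )
            if H[i][j] > best:
                best, best_pos = H[i][j], (i, j)

    # Traceback from the highest-scoring cell until a zero cell is reached.
    out_a, out_b = [], []
    i, j = best_pos
    while i > 0 and j > 0 and H[i][j] > 0:
        sub = match if a[i - 1] == b[j - 1] else mismatch
        if H[i][j] == H[i - 1][j - 1] + sub:
            out_a.append(a[i - 1])
            out_b.append(b[j - 1])
            i -= 1
            j -= 1
        elif H[i][j] == H[i - 1][j] + gap:
            out_a.append(a[i - 1])
            out_b.append("-")
            i -= 1
        else:
            out_a.append("-")
            out_b.append(b[j - 1])
            j -= 1

    return best, "".join(reversed(out_a)), "".join(reversed(out_b))


print(smith_waterman("TTTACGTAAA", "GGGACGTCCC"))  # (8, 'ACGT', 'ACGT')
```

The flanking regions of both inputs share no residues, so only the common `ACGT` core survives, which is the behavior the "domains/short motifs" use case relies on.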
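For the affine gap DP, a score‑only sketch with three matrices may help make the open/extend bookkeeping concrete. It assumes the guide's formula (a gap of length L costs gap‑open + gap‑extend × L) and treats `b` as the query, so `I` holds columns where `b` has a residue opposite a gap and `D` holds columns where `a` has a residue opposite a gap. The function names and the match/mismatch scores are assumptions for illustration.

```python
NEG_INF = float("-inf")


def gap_cost(length, gap_open=-10, gap_extend=-2):
    """Cost of a single gap: open + extend * length (the guide's formula)."""
    return gap_open + gap_extend * length


def affine_global_score(a, b, match=1, mismatch=-1, gap_open=-10, gap_extend=-2):
    """Optimal global alignment score under an affine gap penalty (score only)."""
    n, m = len(a), len(b)
    # M: last column aligns two residues.
    # D: last column is a residue of a opposite a gap.
    # I: last column is a residue of b opposite a gap.
    M = [[NEG_INF] * (m + 1) for _ in range(n + 1)]
    D = [[NEG_INF] * (m + 1) for _ in range(n + 1)]
    I = [[NEG_INF] * (m + 1) for _ in range(n + 1)]

    M[0][0] = 0
    for i in range(1, n + 1):
        D[i][0] = gap_cost(i, gap_open, gap_extend)
    for j in range(1, m + 1):
        I[0][j] = gap_cost(j, gap_open, gap_extend)

    first = gap_open + gap_extend  # cost of the first position of a new gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = match if a[i - 1] == b[j - 1] else mismatch
            M[i][j] = sub + max(M[i - 1][j - 1], D[i - 1][j - 1], I[i - 1][j - 1])
            D[i][j] = max(
                M[i - 1][j] + first,       # open a new gap
                I[i - 1][j] + first,       # switch gap direction
                D[i - 1][j] + gap_extend,  # extend the current gap
            )
            I[i][j] = max(
                M[i][j - 1] + first,
                D[i][j - 1] + first,
                I[i][j - 1] + gap_extend,
            )

    return max(M[n][m], D[n][m], I[n][m])


print(gap_cost(3))                                    # -16
print(affine_global_score("AAAGGGTTT", "AAATTT"))     # -10
```

In the second call the best alignment keeps all six matches and pays for one gap of length 3 (6 − 16 = −10); splitting the same three gap positions into separate gaps would pay the open cost three times.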
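The CIGAR procedure (classify each column, then run‑length encode) and the exam trap about M versus identity can both be illustrated with a short example. The helper names `alignment_to_cigar` and `identity_over_m_columns` are hypothetical, and the sketch only emits M, I, and D; clipping operations (S, H) are outside its scope.

```python
from itertools import groupby


def alignment_to_cigar(ref_aln, query_aln):
    """Build a CIGAR string from two gapped, equal-length alignment rows."""
    if len(ref_aln) != len(query_aln):
        raise ValueError("aligned rows must have the same length")
    ops = []
    for r, q in zip(ref_aln, query_aln):
        if r == "-" and q == "-":
            continue              # column carries no information
        if r == "-":
            ops.append("I")       # query has a residue the reference lacks
        elif q == "-":
            ops.append("D")       # reference has a residue the query lacks
        else:
            ops.append("M")       # aligned pair: match OR mismatch
    # Run-length encode consecutive identical operations.
    return "".join(f"{len(list(run))}{op}" for op, run in groupby(ops))


def identity_over_m_columns(ref_aln, query_aln):
    """Return (identical_columns, m_columns); M alone does not imply identity."""
    m_cols = [(r, q) for r, q in zip(ref_aln, query_aln) if r != "-" and q != "-"]
    identical = sum(r == q for r, q in m_cols)
    return identical, len(m_cols)


ref = "ACGT-ACGT"
qry = "AC-TTACGA"
print(alignment_to_cigar(ref, qry))       # 2M1D1M1I4M
print(identity_over_m_columns(ref, qry))  # (6, 7)
```

The final `4M` run contains a T/A mismatch, which is why the identity count (6 of 7 aligned columns) has to be computed from the sequences and cannot be read off the CIGAR string.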
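Iterative refinement is described above as improving the sum‑of‑pairs score, so it can help to see how that objective is computed for a given MSA. The sketch below assumes a simple scheme (match/mismatch/gap constants, gap‑versus‑gap pairs scored as zero); real tools typically use a substitution matrix and affine gaps, and the name `sum_of_pairs` is chosen for illustration.

```python
from itertools import combinations


def sum_of_pairs(msa, match=1, mismatch=-1, gap=-2):
    """Sum-of-pairs score of an MSA given as equal-length gapped strings."""
    if len({len(row) for row in msa}) > 1:
        raise ValueError("all rows of the MSA must have the same length")
    total = 0
    for column in zip(*msa):
        # Score every pair of rows within this column.
        for x, y in combinations(column, 2):
            if x == "-" and y == "-":
                continue          # gap vs. gap: scored as zero here
            if x == "-" or y == "-":
                total += gap
            elif x == y:
                total += match
            else:
                total += mismatch
    return total


msa = ["AC-T", "ACGT", "A-GT"]
print(sum_of_pairs(msa))  # 0
```

A refinement loop would re‑align one sequence against the rest, recompute this score, and keep the new alignment only if the score improved.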
