Bioinformatics Study Guide

📖 Core Concepts

Bioinformatics – computational methods and software for analyzing large, complex biological data; blends biology, chemistry, physics, CS, statistics, and math.
Genome annotation – labeling DNA sequences with functional features (genes, start/stop sites, protein functions, pathways).
Comparative genomics / orthology – matching genes/features across species to infer evolutionary relationships.
High‑throughput sequencing (NGS) – technology that produces millions of short DNA reads for rapid, large‑scale sequencing.
Genome‑wide association study (GWAS) – statistical scan of common variants across many genomes to link them with traits/diseases.
Protein structure hierarchy – primary (amino‑acid sequence) → secondary → tertiary → quaternary; the primary structure usually dictates the final fold.
Homology modeling – predicting a protein’s 3‑D structure from the known structure of a homologous sequence.
Molecular interaction network – graph of physical/functional contacts (protein‑protein, protein‑ligand, etc.) that underlies cellular systems.
Gene Ontology (GO) – a directed acyclic graph of controlled terms describing gene product functions, processes, and locations.
BioCompute Object (BCO) – JSON‑formatted, standardized record of a bioinformatics pipeline for reproducibility and regulatory transparency.

---

📌 Must Remember

NGS → short reads → shotgun assembly → genome reconstruction.
Genome annotation levels: nucleotide (gene finding), protein (function), process (GO pathways).
Evolutionary events: point mutation, duplication, inversion, transposition, lateral transfer, insertion, deletion.
GWAS: identifies common variants, which typically explain only a small fraction of heritability (the “missing heritability” problem).
Rare‑variant analysis can recover part of the missing heritability.
Cancer mutations: driver (causative) vs passenger (neutral).
Protein expression measurement: protein microarrays & high‑throughput mass spectrometry.
Key databases: GenBank (nucleotides), UniProt (proteins), PDB (structures), KEGG/BioCyc (pathways), SRA (raw reads).
AlphaFold: deep‑learning tool that outperforms prior structure‑prediction methods.
BCO: stored as JSON; development funded by the FDA for regulatory transparency.

---

🔄 Key Processes

Shotgun Sequencing & Assembly: fragment DNA → generate millions of overlapping short reads → align overlaps → assemble contigs → scaffold contigs into a full genome.
Genome Annotation Workflow:
  Nucleotide‑level: ab‑initio gene prediction + similarity search vs EST databases.
  Protein‑level: compare predicted proteins to UniProt / domain databases → assign function.
  Process‑level: map proteins to GO terms → infer pathway/physiological role.
GWAS Pipeline: collect genotype data → quality control → test each SNP for association with phenotype → correct for multiple testing → report significant loci.
Homology Modeling Steps: identify a template with high sequence similarity → align target and template sequences → copy backbone coordinates → remodel side chains → refine/validate model.
Network Construction: gather interaction data (experiment or prediction) → represent as nodes (proteins) & edges (interactions) → analyze topology (hubs, modules).

---

🔍 Key Comparisons

NGS vs. Sanger sequencing – NGS: millions of short reads, high throughput, lower per‑base cost; Sanger: longer reads, low throughput, higher per‑base cost.
Driver vs. Passenger mutations – Drivers: confer growth advantage, recurrent in cancers; Passengers: incidental, not selected.
Common vs. Rare variants (GWAS) – Common: high allele frequency, modest effect, captured by GWAS; Rare: low frequency, often larger effect, require sequencing‑based studies.
Ab‑initio vs. Homology‑based gene prediction – Ab‑initio: relies on statistical signals in DNA; Homology‑based: uses similarity to known genes/proteins.

---

⚠️ Common Misunderstandings

“All GWAS hits explain disease risk.” – Together they typically explain only a small fraction of heritability; much of it remains missing.
“Primary structure always determines final protein shape.” – True for most proteins, but exceptions exist (e.g., prions misfold).
“More sequencing depth = better assembly.” – Beyond a point, extra depth stops helping: repetitive regions still cause gaps, and algorithm choice matters.
“A GO term guarantees functional annotation.” – Many GO annotations are computational predictions; experimental validation may be needed.

---

🧠 Mental Models / Intuition

Puzzle‑piece model for assembly: imagine a jigsaw where each short read is a piece; overlaps guide placement, but look‑alike pieces (repeats) create ambiguity.
Evolutionary “edit script”: think of a genome as a document; mutations are edits (insert, delete, copy, move) that can be traced across species.
Network hub analogy: proteins with many connections act like airport hubs; disrupting them (e.g., by mutation) often has a large phenotypic impact.

---

🚩 Exceptions & Edge Cases

Protein misfolding (prions): the same primary sequence can adopt alternative, pathogenic conformations.
Horizontal gene transfer: bacterial genomes can acquire genes from distantly related species, breaking the simple tree‑of‑life model.
Non‑coding regulatory elements: enhancers can act hundreds of kilobases to about a megabase away from their target gene; promoter‑only analysis may miss key regulation.

---

📍 When to Use Which

De‑novo assembly vs. reference‑guided mapping: use de‑novo assembly when no close reference exists; map reads to a reference for variant calling in well‑studied organisms.
GWAS vs. rare‑variant sequencing: GWAS for common, modest‑effect variants in large cohorts; rare‑variant sequencing (exome/genome) for low‑frequency, high‑impact mutations.
Homology modeling vs. AlphaFold: use homology modeling when a close structural template (>30% identity) is available and computational resources are limited; use AlphaFold for high‑accuracy predictions without a template.
Ab‑initio gene prediction: use when no homologous proteins exist in databases (e.g., novel organisms).
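
The de‑novo assembly option and the puzzle‑piece model above can be made concrete with a toy greedy overlap‑merge assembler. This is only a minimal sketch: production assemblers typically work on overlap or de Bruijn graphs and tolerate sequencing errors. The names `overlap`, `greedy_assemble`, and `min_len`, and the example reads, are invented for illustration; reads are assumed error‑free and from a single strand.

```python
def overlap(a: str, b: str, min_len: int = 3) -> int:
    """Length of the longest suffix of `a` that equals a prefix of `b`.

    Returns 0 if the best overlap is shorter than `min_len`.
    """
    for k in range(min(len(a), len(b)), min_len - 1, -1):
        if a.endswith(b[:k]):
            return k
    return 0


def greedy_assemble(reads: list[str], min_len: int = 3) -> list[str]:
    """Repeatedly merge the two sequences with the largest overlap.

    Stops when no pair overlaps by at least `min_len` bases and returns
    the remaining contigs.
    """
    contigs = list(dict.fromkeys(reads))  # drop exact duplicate reads, keep order
    while True:
        best_k, best_i, best_j = 0, -1, -1
        for i, a in enumerate(contigs):
            for j, b in enumerate(contigs):
                if i != j:
                    k = overlap(a, b, min_len)
                    if k > best_k:
                        best_k, best_i, best_j = k, i, j
        if best_k == 0:
            return contigs  # nothing left to merge
        # Join the pair, writing the shared bases only once.
        merged = contigs[best_i] + contigs[best_j][best_k:]
        contigs = [c for idx, c in enumerate(contigs) if idx not in (best_i, best_j)]
        contigs.append(merged)


if __name__ == "__main__":
    # Hypothetical reads sampled from the made-up sequence ATGGCGTACCTGA.
    reads = ["TACCTGA", "ATGGCGT", "GCGTACC"]
    print(greedy_assemble(reads))  # ['ATGGCGTACCTGA']
```

When a genome contains repeats longer than the reads, several merges tie and the greedy choice can join the wrong pieces, which is the ambiguity the puzzle‑piece model describes.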

---

👀 Patterns to Recognize

Read‑coverage spikes → likely repeats that the assembler has collapsed into a single copy.
Clustered GWAS hits in the same pathway → suggests a biological process underlying the trait.
High‑degree nodes in an interaction network + disease‑associated mutations → potential therapeutic targets.
Conserved motifs upstream of genes → candidate promoter elements controlling transcription.

---

🗂️ Exam Traps

Distractor: “All GWAS variants are causal.” – GWAS only flags loci; many hits are merely in linkage disequilibrium with the true causal variant.
Trap: “AlphaFold replaces all experimental structure work.” – AlphaFold predicts structures but does not supplant experimental validation, especially for complexes or dynamics.
Misleading choice: “Protein function can be inferred solely from primary sequence.” – Inference usually requires domain/motif analysis and sometimes structural context.
Wrong answer: “BioCompute Object stores raw sequencing reads.” – A BCO stores pipeline metadata, not raw data.

---
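
The core GWAS steps used throughout this guide (test each SNP, then correct for multiple testing) can be sketched with a per‑SNP allelic chi‑square test and a Bonferroni threshold. This is a minimal illustration under stated assumptions: quality control is taken as already done, the allele counts and SNP names are invented, and the function names are hypothetical. Real studies usually fit regression models with covariates and commonly apply a genome‑wide threshold of 5×10⁻⁸.

```python
import math


def allelic_chi2(case_alt: int, case_ref: int, ctrl_alt: int, ctrl_ref: int):
    """Pearson chi-square test (1 degree of freedom) on a 2x2 allele-count table.

    Returns (chi2, p_value).
    """
    n = case_alt + case_ref + ctrl_alt + ctrl_ref
    row_case, row_ctrl = case_alt + case_ref, ctrl_alt + ctrl_ref
    col_alt, col_ref = case_alt + ctrl_alt, case_ref + ctrl_ref
    if 0 in (row_case, row_ctrl, col_alt, col_ref):
        return 0.0, 1.0  # empty group or monomorphic SNP: nothing to test
    diff = case_alt * ctrl_ref - case_ref * ctrl_alt
    chi2 = n * diff ** 2 / (row_case * row_ctrl * col_alt * col_ref)
    # For 1 degree of freedom, P(X > chi2) = erfc(sqrt(chi2 / 2)).
    p_value = math.erfc(math.sqrt(chi2 / 2))
    return chi2, p_value


def significant_snps(counts: dict, alpha: float = 0.05) -> dict:
    """Return {snp: p_value} for SNPs passing a Bonferroni-corrected threshold."""
    threshold = alpha / len(counts)  # Bonferroni: divide alpha by number of tests
    hits = {}
    for snp, table in counts.items():
        _, p_value = allelic_chi2(*table)
        if p_value < threshold:
            hits[snp] = p_value
    return hits


# Invented allele counts: (case_alt, case_ref, ctrl_alt, ctrl_ref).
EXAMPLE_COUNTS = {
    "snp1": (300, 700, 200, 800),  # clear allele-frequency difference
    "snp2": (250, 750, 240, 760),  # almost no difference
    "snp3": (220, 780, 200, 800),  # small difference
}

if __name__ == "__main__":
    for snp, p_value in significant_snps(EXAMPLE_COUNTS).items():
        print(snp, f"p = {p_value:.2e}")
```

With these made‑up counts only `snp1` passes the corrected threshold. As the exam trap above notes, a hit like this flags a locus and does not show that the variant itself is causal.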
