Bioinformatics Study Guide
📖 Core Concepts
Bioinformatics – computational methods & software for analyzing large, complex biological data; blends biology, chemistry, physics, CS, statistics, and math.
Genome annotation – labeling DNA sequences with functional features (genes, start/stop sites, protein functions, pathways).
Comparative genomics / orthology – matching genes/features across species to infer evolutionary relationships.
High‑throughput sequencing (NGS) – technology that produces millions of short DNA reads for rapid, large‑scale sequencing.
Genome‑wide association study (GWAS) – statistical scan of common variants across many genomes to link them with traits/diseases.
Protein structure hierarchy – primary (AA sequence) → secondary → tertiary → quaternary; primary usually dictates the final fold.
Homology modeling – predicting a protein’s 3‑D structure using a known structure of a homologous sequence.
Molecular interaction network – graph of physical/functional contacts (protein‑protein, protein‑ligand, etc.) that underlies cellular systems.
Gene Ontology (GO) – a directed‑acyclic‑graph of controlled terms describing gene product functions, processes, and locations.
BioCompute Object (BCO) – JSON‑formatted, standardized record of a bio‑informatics pipeline for reproducibility and regulatory transparency.
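The Gene Ontology entry above describes a directed acyclic graph, which means a term may have more than one parent. A minimal sketch of walking such a graph upward is shown here; the term IDs (`GO:A` to `GO:D`) and the `PARENTS` table are made up for illustration and are not real GO identifiers.

```python
# Toy GO-like DAG: each term maps to its parent terms.
# All IDs here are hypothetical placeholders, not real GO terms.
PARENTS = {
    "GO:A": [],                # root term
    "GO:B": ["GO:A"],
    "GO:C": ["GO:A"],
    "GO:D": ["GO:B", "GO:C"],  # two parents: this is why GO is a DAG, not a tree
}


def ancestors(term, parents=PARENTS):
    """Return every term reachable by following parent links upward."""
    seen = set()
    stack = list(parents[term])
    while stack:
        current = stack.pop()
        if current not in seen:
            seen.add(current)
            stack.extend(parents[current])
    return seen
```

A gene product annotated with `GO:D` is implicitly annotated with every ancestor as well, so `ancestors("GO:D")` collects `GO:A`, `GO:B`, and `GO:C`.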
---
📌 Must Remember
NGS → short reads → shotgun assembly → genome reconstruction.
Genome annotation levels: nucleotide (gene finding), protein (function), process (GO pathways).
Evolutionary events: point mutation, duplication, inversion, transposition, lateral transfer, insertion, deletion.
GWAS: identifies common variants; explains only a small fraction of heritability → “missing heritability”.
Rare variant analysis may recover part of the missing heritability that common‑variant GWAS does not capture.
Cancer mutations: driver (causative) vs passenger (neutral).
Protein expression measurement: protein microarrays & high‑throughput mass spectrometry.
Key databases: GenBank (nucleotides), UniProt (proteins), PDB (structures), KEGG/BioCyc (pathways), SRA (raw reads).
AlphaFold = deep‑learning tool that outperforms prior structure‑prediction methods.
BCO storage: JSON format; funded by FDA for regulatory transparency.
---
🔄 Key Processes
Shotgun Sequencing & Assembly
Fragment DNA → generate millions of overlapping short reads → align overlaps → assemble contigs → scaffold into full genome.
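The overlap-and-merge idea behind shotgun assembly can be sketched with a greedy merger. This is a toy illustration only: it assumes error-free reads from one strand, and the function names and the `min_len` parameter are hypothetical. Real assemblers use overlap or de Bruijn graphs and handle sequencing errors and repeats.

```python
def overlap(a, b, min_len=3):
    """Length of the longest suffix of a that equals a prefix of b (0 if < min_len)."""
    for k in range(min(len(a), len(b)), min_len - 1, -1):
        if a.endswith(b[:k]):
            return k
    return 0


def greedy_assemble(reads, min_len=3):
    """Repeatedly merge the pair of reads with the largest overlap."""
    reads = list(dict.fromkeys(reads))  # drop exact duplicates, keep order
    while len(reads) > 1:
        best_k, best_i, best_j = 0, None, None
        for i, a in enumerate(reads):
            for j, b in enumerate(reads):
                if i != j:
                    k = overlap(a, b, min_len)
                    if k > best_k:
                        best_k, best_i, best_j = k, i, j
        if best_k == 0:
            break  # no overlaps left: remaining reads stay as separate contigs
        merged = reads[best_i] + reads[best_j][best_k:]
        reads = [r for idx, r in enumerate(reads) if idx not in (best_i, best_j)]
        reads.append(merged)
    return reads


contigs = greedy_assemble(["ATGGCGT", "GCGTACC", "TACCTGA"])
```

With these three reads the overlaps chain cleanly into a single contig. When two reads share no overlap of at least `min_len` bases, they are returned as separate contigs, which mirrors the gaps left before scaffolding.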
Genome Annotation Workflow
Nucleotide‑level: ab‑initio gene prediction + similarity search vs EST databases.
Protein‑level: compare predicted proteins to UniProt / domain databases → assign function.
Process‑level: map proteins to GO terms → infer pathway/physiological role.
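As a rough illustration of the nucleotide-level step, the sketch below scans for open reading frames (ATG to the first in-frame stop codon). This is far simpler than real ab-initio gene predictors, which use statistical models; it assumes the forward strand only, no introns, and a hypothetical `min_codons` cutoff.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}


def find_orfs(dna, min_codons=3):
    """Return (start, end, sequence) for ATG...stop ORFs on the forward strand.

    Coordinates are 0-based and end-exclusive; the stop codon is included.
    """
    dna = dna.upper()
    orfs = []
    for frame in range(3):
        start = None
        for pos in range(frame, len(dna) - 2, 3):
            codon = dna[pos:pos + 3]
            if start is None and codon == "ATG":
                start = pos  # first start codon opens a candidate ORF
            elif start is not None and codon in STOP_CODONS:
                if (pos + 3 - start) // 3 >= min_codons:
                    orfs.append((start, pos + 3, dna[start:pos + 3]))
                start = None  # look for the next ORF in this frame
    return orfs


example_orfs = find_orfs("CCATGAAATTTTAGGG")
```

The predicted ORFs would then be translated and compared against protein databases for the protein-level step.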
GWAS Pipeline
Collect genotype data → quality control → test each SNP for association with phenotype → correct for multiple testing → report significant loci.
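The association-test and multiple-testing steps of the pipeline can be sketched as follows. This assumes a simple allelic chi-square test on a 2x2 table of allele counts and a Bonferroni correction. The SNP names and counts are invented, and real analyses typically use regression models with covariates such as ancestry.

```python
import math


def allelic_chi2(case_alt, case_ref, ctrl_alt, ctrl_ref):
    """Chi-square statistic (1 df) for a 2x2 table of allele counts."""
    n = case_alt + case_ref + ctrl_alt + ctrl_ref
    num = n * (case_alt * ctrl_ref - case_ref * ctrl_alt) ** 2
    den = ((case_alt + case_ref) * (ctrl_alt + ctrl_ref)
           * (case_alt + ctrl_alt) * (case_ref + ctrl_ref))
    return num / den if den else 0.0


def chi2_pvalue_1df(stat):
    """Upper-tail p-value of a chi-square distribution with 1 degree of freedom."""
    return math.erfc(math.sqrt(stat / 2.0))


def significant_snps(tables, alpha=0.05):
    """Bonferroni correction: compare each p-value to alpha / number of tests."""
    threshold = alpha / len(tables)
    hits = {}
    for snp, counts in tables.items():
        p = chi2_pvalue_1df(allelic_chi2(*counts))
        if p < threshold:
            hits[snp] = p
    return hits


# Made-up allele counts: (case_alt, case_ref, ctrl_alt, ctrl_ref)
toy_tables = {
    "snp_1": (300, 700, 200, 800),
    "snp_2": (250, 750, 245, 755),
}
hits = significant_snps(toy_tables)
```

Bonferroni is the simplest correction and is conservative; with around a million tests it leads to the familiar genome-wide threshold near 5e-8.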
Homology Modeling Steps
Identify template with high sequence similarity → align target‑template sequences → copy backbone coordinates → remodel side‑chains → refine/validate model.
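Template selection usually starts from sequence identity between target and candidate templates. The sketch below computes percent identity over non-gap columns of an alignment that is assumed to exist already. Identity has several definitions in practice, and the sequences and template names here are invented.

```python
def percent_identity(aligned_target, aligned_template):
    """Percent of aligned (non-gap) columns where both sequences share a residue."""
    if len(aligned_target) != len(aligned_template):
        raise ValueError("aligned sequences must have equal length")
    compared = matches = 0
    for a, b in zip(aligned_target, aligned_template):
        if a == "-" or b == "-":
            continue  # skip gap columns
        compared += 1
        if a == b:
            matches += 1
    return 100.0 * matches / compared if compared else 0.0


def pick_template(aligned_pairs):
    """Choose the template name with the highest identity to the target."""
    return max(aligned_pairs, key=lambda name: percent_identity(*aligned_pairs[name]))


# Hypothetical pre-aligned (target, template) pairs.
candidates = {
    "template_A": ("MKT-AYIA", "MKTWAYLA"),
    "template_B": ("MKTAYIA", "MRSAFLA"),
}
best = pick_template(candidates)
```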
Network Construction
Gather interaction data (experiment or prediction) → represent as nodes (proteins) & edges (interactions) → analyze topology (hubs, modules).
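The node-and-edge representation can be sketched with a plain adjacency map, and hubs found by counting neighbours. The protein names and the `min_degree` cutoff are hypothetical; what counts as a hub depends on the network.

```python
from collections import defaultdict


def build_network(interactions):
    """Build an undirected adjacency map from (protein_a, protein_b) pairs."""
    adjacency = defaultdict(set)
    for a, b in interactions:
        if a != b:  # ignore self-interactions in this sketch
            adjacency[a].add(b)
            adjacency[b].add(a)
    return adjacency


def hubs(adjacency, min_degree=3):
    """Return proteins with at least min_degree distinct partners, sorted by name."""
    return sorted(p for p, nbrs in adjacency.items() if len(nbrs) >= min_degree)


# Made-up interaction list.
network = build_network([("P1", "P2"), ("P1", "P3"), ("P1", "P4"), ("P4", "P5")])
hub_list = hubs(network)
```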
---
🔍 Key Comparisons
NGS vs. Sanger sequencing – NGS: millions of short reads, high throughput, lower per‑base cost; Sanger: long reads, low throughput, higher cost.
Driver vs. Passenger mutations – Drivers: confer growth advantage, recurrent in cancers; Passengers: incidental, not selected.
Common vs. Rare variants (GWAS) – Common: high allele frequency, modest effect, captured by GWAS; Rare: low frequency, larger effect, require sequencing‑based studies.
Ab‑initio vs. Homology‑based gene prediction – Ab‑initio: relies on statistical signals in DNA; Homology‑based: uses similarity to known genes/proteins.
---
⚠️ Common Misunderstandings
“All GWAS hits explain disease risk.” → They explain only a small fraction; most heritability remains missing.
“Primary structure always determines final protein shape.” – True for most proteins, but exceptions exist (e.g., prions misfold).
“More sequencing depth = better assembly.” – Beyond a point, repetitive regions still cause gaps; algorithm choice matters.
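“A GO term guarantees functional annotation.” – Many GO annotations are computationally inferred rather than experimentally confirmed; check the evidence behind the annotation, and validate experimentally where it matters.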
---
🧠 Mental Models / Intuition
Puzzle‑piece model for assembly: imagine a jigsaw where each short read is a piece; overlaps guide placement, but near‑identical pieces (repeats) fit in several places and create ambiguity.
Evolutionary “edit script”: think of a genome as a document; mutations are edits (insert, delete, copy, move) that can be traced across species.
Network hub analogy: proteins with many connections act like airport hubs—disrupting them (mutations) often has large phenotypic impact.
---
🚩 Exceptions & Edge Cases
Protein misfolding (prions): same primary sequence can adopt alternative, pathogenic conformations.
Horizontal gene transfer: bacterial genomes acquire genes from unrelated species, breaking the simple tree‑of‑life model.
Non‑coding regulatory elements: enhancers can act megabases away; promoter‑only analysis may miss key regulation.
---
📍 When to Use Which
Choose de‑novo assembly vs. reference‑guided mapping: Use de‑novo assembly when no close reference exists; map reads to a reference for variant calling in well‑studied organisms.
Select GWAS vs. rare‑variant sequencing: GWAS for common, modest‑effect variants in large cohorts; rare‑variant sequencing (exome/genome) for low‑frequency, high‑impact mutations.
Pick homology modeling vs. AlphaFold: Use homology modeling when a close structural template (>30% identity) is available and computational resources are limited; use AlphaFold for high‑accuracy predictions without a template.
Apply ab‑initio gene prediction when: no homologous proteins exist in databases (e.g., novel organisms).
---
👀 Patterns to Recognize
Read‑coverage spikes → likely repetitive genomic regions that the assembler has collapsed into one copy.
Clustered GWAS hits in same pathway → suggests a biological process underlying the trait.
High‑degree nodes in interaction network + disease‑associated mutations → potential therapeutic targets.
Conserved motifs upstream of genes → promoter elements controlling transcription.
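The first pattern in this list, coverage spikes, can be flagged with a simple threshold on per-base depth. The sketch below assumes a list of depths along a contig and a hypothetical `fold` cutoff relative to the median; the numbers are invented.

```python
from statistics import median


def coverage_spikes(coverage, fold=2.0):
    """Return (start, end) half-open intervals where depth >= fold * median depth."""
    cutoff = fold * median(coverage)
    intervals, start = [], None
    for i, depth in enumerate(coverage):
        if depth >= cutoff and start is None:
            start = i  # spike begins
        elif depth < cutoff and start is not None:
            intervals.append((start, i))  # spike ends
            start = None
    if start is not None:
        intervals.append((start, len(coverage)))
    return intervals


# Made-up per-base depths with one elevated region.
spikes = coverage_spikes([30, 31, 29, 30, 90, 95, 88, 30, 32, 31])
```

A region at roughly three times the median depth, as in this toy input, would be consistent with about three repeat copies collapsed into one.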
---
🗂️ Exam Traps
Distractor: “All GWAS variants are causal.” – GWAS only flags loci; many reported variants are merely in linkage disequilibrium with the true causal variant.
Trap: “AlphaFold replaces all experimental structure work.” – AlphaFold predicts but does not supplant experimental validation, especially for complexes or dynamics.
Misleading choice: “Protein function can be inferred solely from primary sequence.” – Requires domain/motif analysis and sometimes structural context.
Wrong answer: “BioCompute Object stores raw sequencing reads.” – BCO stores pipeline metadata, not raw data.
---