Computational Biology Study Guide
📖 Core Concepts
Computational Biology – uses algorithms, mathematical models, and simulations to understand biological systems (cells, organisms, populations).
Bioinformatics – a sub‑field that creates tools for storing, organizing, visualizing, and analyzing biological data.
Evolutionary Computation – computer‑science techniques inspired by natural evolution; can be applied inside computational biology.
Systems Biology – studies interactions among many biological components (genes, proteins, metabolites) to reveal emergent properties.
Genomics – large‑scale study of whole genomes; includes sequence alignment, gene ontology, and 3‑D genome architecture.
Machine‑Learning Techniques – supervised (e.g., decision trees, random forests) vs. unsupervised (e.g., k‑means, k‑medoids) learning for pattern discovery in biological data.
Graph Analytics – represents biological entities as nodes and their interactions as edges; centrality measures identify “hubs” in networks.
---
📌 Must Remember
Distinction: Computational biology = theory + simulation; Bioinformatics = data‑tool development.
k‑means objective: minimize \(\sum_{i=1}^{n}\|x_i - \mu_{c(i)}\|^2\) (assign each point to the nearest cluster mean).
k‑medoids chooses an actual data point as the cluster centre, reducing sensitivity to outliers.
Random Forest = ensemble of many decision trees; predictions are majority‑vote (classification) or average (regression).
Degree centrality = number of edges incident to a node; high degree ⇒ potential biological “hub”.
Human Genome Project (1990–2003) = flagship example of computational biology enabling whole‑genome sequencing.
Sequence alignment detects longest common subsequences → essential for identifying homologous genes/variants.
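The longest-common-subsequence idea behind alignment can be sketched with the standard dynamic-programming recurrence (a minimal illustration, not a production aligner; real tools use scoring matrices and gap penalties):

```python
def lcs_length(a: str, b: str) -> int:
    """Length of the longest common subsequence of a and b."""
    # dp[i][j] = LCS length of the prefixes a[:i] and b[:j]
    dp = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i in range(1, len(a) + 1):
        for j in range(1, len(b) + 1):
            if a[i - 1] == b[j - 1]:
                dp[i][j] = dp[i - 1][j - 1] + 1      # extend a match
            else:
                dp[i][j] = max(dp[i - 1][j], dp[i][j - 1])  # skip a base
    return dp[-1][-1]

print(lcs_length("GATTACA", "GCATGCU"))  # → 4 (e.g. the subsequence "GATC")
```

The same table-filling pattern underlies Needleman–Wunsch and Smith–Waterman alignment.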
---
🔄 Key Processes
k‑means clustering
Choose k and random centroids.
Assign each point to nearest centroid.
Re‑compute centroids as mean of assigned points.
Repeat the assignment and update steps until assignments stop changing.
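The loop above can be sketched in a few lines of pure Python (Lloyd's algorithm on 2-D points; the function name and toy data are illustrative):

```python
import random

def kmeans(points, k, iters=100, seed=0):
    """Lloyd's algorithm: alternate nearest-centroid assignment and mean update."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)  # step 1: random initial centroids
    for _ in range(iters):
        # step 2: assign each point to its nearest centroid (squared distance)
        clusters = [[] for _ in range(k)]
        for p in points:
            idx = min(range(k),
                      key=lambda c: sum((a - b) ** 2 for a, b in zip(p, centroids[c])))
            clusters[idx].append(p)
        # step 3: recompute each centroid as the mean of its assigned points
        new = [tuple(sum(xs) / len(xs) for xs in zip(*cl)) if cl else centroids[i]
               for i, cl in enumerate(clusters)]
        if new == centroids:  # step 4: assignments (and means) have stabilized
            break
        centroids = new
    return centroids

pts = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (10.0, 10.0), (10.0, 11.0), (11.0, 10.0)]
print(sorted(kmeans(pts, 2)))  # two centroids, one per blob
```

Note how the stopping rule matches the outline: iteration ends once recomputed centroids no longer move.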
Decision‑tree training
Start with all labeled training examples.
Pick the feature that best splits the data (e.g., Gini impurity).
Create left/right child nodes; repeat recursively.
Stop when nodes are pure or a stopping criterion is met.
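The "best split" step can be made concrete with Gini impurity on a single numeric feature (a minimal sketch; real implementations repeat this over all features and recurse):

```python
def gini(labels):
    """Gini impurity: 1 minus the sum of squared class proportions."""
    n = len(labels)
    return 1.0 - sum((labels.count(c) / n) ** 2 for c in set(labels))

def best_split(xs, ys):
    """Return (threshold, weighted Gini) minimizing impurity for one feature."""
    best = (None, float("inf"))
    for t in sorted(set(xs)):
        left = [y for x, y in zip(xs, ys) if x <= t]
        right = [y for x, y in zip(xs, ys) if x > t]
        if not left or not right:
            continue  # split must put data on both sides
        score = (len(left) * gini(left) + len(right) * gini(right)) / len(ys)
        if score < best[1]:
            best = (t, score)
    return best

# Perfectly separable toy data: expression level vs. class label
print(best_split([1, 2, 3, 10, 11, 12], ["a", "a", "a", "b", "b", "b"]))  # → (3, 0.0)
```

A weighted impurity of 0.0 means both child nodes are pure, which is exactly the stopping condition in the last step above.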
Random‑forest construction
Draw B bootstrap samples from the training set.
Grow a decision tree on each sample, using a random subset of features at each split.
Aggregate predictions across trees (majority vote or mean).
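The bootstrap-and-vote procedure can be sketched with depth-1 "trees" (decision stumps) on a single feature. This toy omits the per-split feature subsampling a real random forest uses, and all names are illustrative:

```python
import random
from collections import Counter

def majority(labels):
    """Most common label (ties broken by first occurrence)."""
    return Counter(labels).most_common(1)[0][0]

def train_stump(xs, ys):
    """Depth-1 tree: pick the threshold with the fewest misclassifications."""
    best_err = float("inf")
    for t in sorted(set(xs)):
        left = [y for x, y in zip(xs, ys) if x <= t] or ys
        right = [y for x, y in zip(xs, ys) if x > t] or ys
        pred_l, pred_r = majority(left), majority(right)
        err = sum((y != pred_l if x <= t else y != pred_r) for x, y in zip(xs, ys))
        if err < best_err:
            best_err, best_t, best_pl, best_pr = err, t, pred_l, pred_r
    return lambda x: best_pl if x <= best_t else best_pr

def random_forest(xs, ys, n_trees=25, seed=0):
    """Bootstrap-aggregated stumps; predict by majority vote across trees."""
    rng = random.Random(seed)
    trees = []
    for _ in range(n_trees):
        idx = [rng.randrange(len(xs)) for _ in range(len(xs))]  # bootstrap sample
        trees.append(train_stump([xs[i] for i in idx], [ys[i] for i in idx]))
    return lambda x: majority([tree(x) for tree in trees])

predict = random_forest([1, 2, 3, 10, 11, 12], ["a", "a", "a", "b", "b", "b"])
print(predict(2), predict(11))
```

Even when individual bootstrap samples produce poor stumps, the majority vote across trees is robust, which is the point of bagging.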
Graph centrality analysis
Build a network (genes ↔ proteins ↔ metabolites).
Compute degree, betweenness, or eigenvector centrality for each node.
Rank nodes; prioritize high‑centrality genes for functional studies.
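Degree centrality, the simplest of the three measures, is just an incidence count over the edge list (gene names in the toy network are illustrative, not a real dataset):

```python
def degree_centrality(edges):
    """Count edges incident to each node; high-degree nodes are candidate hubs."""
    deg = {}
    for u, v in edges:
        deg[u] = deg.get(u, 0) + 1
        deg[v] = deg.get(v, 0) + 1
    return deg

# Toy undirected interaction network: TP53 touches four partners
edges = [("TP53", "MDM2"), ("TP53", "BAX"), ("TP53", "CDKN1A"),
         ("TP53", "ATM"), ("MDM2", "ATM")]
ranked = sorted(degree_centrality(edges).items(), key=lambda kv: -kv[1])
print(ranked[0])  # → ('TP53', 4)
```

For betweenness or eigenvector centrality on larger networks, a library such as NetworkX is the usual choice rather than hand-rolled code.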
---
🔍 Key Comparisons
Computational Biology vs. Bioinformatics
Goal: theory & simulation vs. data‑tool creation.
Scope: whole‑system modeling vs. data handling & visualization.
k‑means vs. k‑medoids
Centroid: mean of points vs. actual data point.
Outlier sensitivity: high for k‑means, lower for k‑medoids.
Classification tree vs. Regression tree
Output: discrete class labels vs. continuous numeric value.
Supervised vs. Unsupervised Learning
Training data: labeled vs. unlabeled.
Typical use: prediction vs. pattern discovery.
---
⚠️ Common Misunderstandings
“Bioinformatics is just a software tool” – it also includes algorithm development and statistical analysis, not just GUI programs.
“Higher degree centrality always means biologically important” – hubs can be artifacts of noisy data; always validate experimentally.
“k‑means works for any shape of clusters” – it assumes spherical, equally sized clusters; fails on elongated or uneven groups.
“Random forest eliminates overfitting completely” – it reduces variance but can still overfit if trees are too deep or data are noisy.
---
🧠 Mental Models / Intuition
“Biological network as a city map” – nodes = landmarks (genes/proteins), edges = roads (interactions). Centrality tells you which landmarks see the most traffic.
“Clustering = sorting laundry” – k‑means groups similar items (colors) together; k‑medoids picks a real piece of clothing as the reference for each pile.
“Ensemble learning = committee decision” – each tree votes; the group usually makes a smarter, more robust decision than any single member.
---
🚩 Exceptions & Edge Cases
k‑means fails when clusters have different variances or non‑convex shapes.
Degree centrality can be misleading in directed or weighted graphs; consider betweenness or eigenvector centrality instead.
Random forest may struggle with very high‑dimensional sparse genomic data; dimensionality reduction (e.g., PCA) can help.
Sequence alignment: short, highly repetitive regions can produce spurious matches; use masking or specialized algorithms.
---
📍 When to Use Which
Modeling whole‑system dynamics → Systems biology approaches (ODEs, agent‑based models).
Large‑scale gene‑expression classification → Random forest or other ensemble classifiers.
Discovering unknown subpopulations in omics data → Unsupervised clustering (k‑means, hierarchical clustering).
Identifying influential genes in a pathway → Graph centrality analysis (degree/eigenvector).
Building a searchable genome database → Bioinformatics pipelines (alignment, annotation, ontology).
---
👀 Patterns to Recognize
High‑degree nodes repeatedly appearing across multiple networks → likely essential/housekeeping genes.
Clusters that correspond to known tissue types or disease states → biological relevance of unsupervised results.
Decision‑tree splits on a single SNP that separate cases vs. controls → potential biomarker.
Consistent over‑representation of a GO term in a gene set → functional enrichment.
---
🗂️ Exam Traps
Choosing “k‑means” when the data are clearly non‑spherical – the question may list k‑means as an answer; the correct choice is a density‑based method (not in outline) or k‑medoids.
Assuming “higher centrality = causative gene” – distractors often equate centrality with causality; remember validation is required.
Mixing up computational biology vs. bioinformatics definitions – exam items may swap the two; recall computational biology emphasizes modeling/simulation.
Confusing classification vs. regression trees – watch for answer choices that mention “predicting a continuous phenotype” (regression) vs. “predicting disease status” (classification).
Over‑relying on random forest for small sample sizes – the trap is to select random forest as “always best”; small n‑large p data may need regularized linear models.