Phylogenetics - Phylogenetic Methods and Accuracy
Understand the main phylogenetic inference methods, how taxon sampling impacts accuracy, and the statistical tools for assessing and comparing trees.
Summary
Methods of Phylogenetic Inference
Introduction
Phylogenetic inference is the process of reconstructing evolutionary history from biological data. Scientists use various methods to build evolutionary trees that best explain patterns of similarity and difference among organisms. Each method makes different assumptions about how evolution works and uses different criteria to evaluate which tree is "best." Understanding these methods is essential because the choice of method can affect the resulting tree topology and our conclusions about evolutionary relationships.
The Parsimony Principle
Definition: Parsimony is a fundamental principle in phylogenetics that states: the tree requiring the fewest evolutionary changes to explain the observed data is preferred.
The logic behind parsimony is intuitive: if we observe characters (traits, DNA sequences, morphological features) distributed among taxa, we want to find the evolutionary tree that explains this distribution with the minimum number of change events. For example, if a specific mutation appears in only three closely related species, it's more parsimonious to assume that mutation occurred once in their common ancestor rather than three separate times independently.
To apply maximum parsimony, you:
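Consider the candidate tree topologies for your set of taxa (all of them for small datasets; a heuristic search through them for larger ones, because the number of topologies grows explosively with the number of taxa)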
Count the total number of character state changes required for each tree
Select the tree (or trees) with the lowest total number of changes
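The counting step can be sketched with Fitch's algorithm, one common way to score a character on a fixed binary tree. This is a minimal illustration, not a full parsimony search: the four-taxon alignment is made up, the function names are hypothetical, and only the three possible quartet groupings are compared.

```python
def fitch_score(tree, states):
    """Return the minimum number of state changes for one character.

    tree   -- nested 2-tuples of leaf names, e.g. (("A", "B"), ("C", "D"))
    states -- dict mapping each leaf name to its observed character state
    """
    changes = 0

    def state_set(node):
        nonlocal changes
        if isinstance(node, str):          # leaf: just its observed state
            return {states[node]}
        left, right = (state_set(child) for child in node)
        common = left & right
        if common:                         # children can agree: no change here
            return common
        changes += 1                       # children must differ: one change
        return left | right

    state_set(tree)
    return changes


def parsimony_length(tree, alignment):
    """Sum the Fitch scores over all columns of an alignment."""
    n_sites = len(next(iter(alignment.values())))
    return sum(
        fitch_score(tree, {taxon: seq[i] for taxon, seq in alignment.items()})
        for i in range(n_sites)
    )


# Illustrative data only (not from a real study).
alignment = {"A": "AAGT", "B": "AAGC", "C": "CGGT", "D": "CGAC"}
candidates = {
    "((A,B),(C,D))": (("A", "B"), ("C", "D")),
    "((A,C),(B,D))": (("A", "C"), ("B", "D")),
    "((A,D),(B,C))": (("A", "D"), ("B", "C")),
}
lengths = {name: parsimony_length(t, alignment) for name, t in candidates.items()}
best = min(lengths, key=lengths.get)
print(lengths, best)
```

With this toy alignment the three trees need 5, 6, and 7 changes respectively, so ((A,B),(C,D)) is the most parsimonious.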
A key limitation: Parsimony assumes that evolution is relatively slow and that similar character states indicate recent common ancestry. However, under conditions of rapid evolution, parsimony can perform poorly—particularly due to a problem called long-branch attraction, which we'll discuss later.
Distance-Based Methods: Neighbor Joining
Distance-based methods work fundamentally differently from parsimony. Instead of counting character changes, they use overall similarity (or dissimilarity) between taxa to build trees.
The concept: Distance methods calculate a pairwise distance matrix—essentially, a table showing how different each organism is from every other organism. The more different two organisms are, the larger their distance value. The tree is then constructed by clustering organisms based on these distances.
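As a rough illustration of such a matrix, the sketch below uses the uncorrected p-distance (the proportion of aligned sites that differ). That choice is an assumption made for simplicity; real analyses usually apply a model-based correction. The sequences and function names are invented for the example.

```python
from itertools import combinations


def p_distance(seq_a, seq_b):
    """Proportion of aligned positions at which two sequences differ."""
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be aligned to the same length")
    mismatches = sum(a != b for a, b in zip(seq_a, seq_b))
    return mismatches / len(seq_a)


def distance_matrix(alignment):
    """Build a symmetric dict-of-dicts matrix of pairwise p-distances."""
    taxa = sorted(alignment)
    dist = {t: {t: 0.0} for t in taxa}
    for a, b in combinations(taxa, 2):
        dist[a][b] = dist[b][a] = p_distance(alignment[a], alignment[b])
    return dist


# Toy sequences, purely illustrative.
alignment = {"human": "ACGTACGT", "chimp": "ACGTACGA", "mouse": "ACCTTCGA"}
matrix = distance_matrix(alignment)
for taxon, row in matrix.items():
    print(taxon, row)
```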
Neighbor Joining, developed by Saitou and Nei (1987), is the most widely used distance method. Here's how it works:
Begin with all taxa as separate lineages
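Identify the pair of taxa to join. Neighbor joining does not simply pick the smallest raw distance; it picks the pair whose distance is smallest after correcting for how far each taxon is, on average, from all the others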
Join them together with a branch point
Calculate distances from this new node to all remaining taxa
Repeat until all taxa are connected
Neighbor joining has become popular because it doesn't require assuming that evolution proceeds at a constant rate (unlike older distance methods such as UPGMA). This makes it more applicable to real biological data where different lineages often evolve at different speeds.
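A minimal sketch of one joining step, assuming the standard neighbor-joining criterion Q(i,j) = (n - 2) d(i,j) - r_i - r_j, where r_i is the sum of distances from taxon i to all other taxa and the pair with the lowest Q is joined. The distance values are illustrative and the function names are hypothetical; branch-length estimation is omitted.

```python
from itertools import combinations


def nj_pick_pair(dist):
    """Return the pair of taxa that minimizes the Q criterion."""
    taxa = sorted(dist)
    n = len(taxa)
    row_sum = {t: sum(dist[t][u] for u in taxa if u != t) for t in taxa}

    def q(pair):
        i, j = pair
        return (n - 2) * dist[i][j] - row_sum[i] - row_sum[j]

    return min(combinations(taxa, 2), key=q)


def nj_join(dist, i, j, new_name):
    """Replace taxa i and j by a new node and recompute its distances."""
    others = [t for t in dist if t not in (i, j)]
    reduced = {t: {u: dist[t][u] for u in others if u != t} for t in others}
    reduced[new_name] = {}
    for k in others:
        d = (dist[i][k] + dist[j][k] - dist[i][j]) / 2
        reduced[new_name][k] = reduced[k][new_name] = d
    return reduced


# Illustrative distances among five taxa.
pairs = {("a", "b"): 5, ("a", "c"): 9, ("a", "d"): 9, ("a", "e"): 8,
         ("b", "c"): 10, ("b", "d"): 10, ("b", "e"): 9,
         ("c", "d"): 8, ("c", "e"): 7, ("d", "e"): 3}
dist = {}
for (x, y), d in pairs.items():
    dist.setdefault(x, {})[y] = d
    dist.setdefault(y, {})[x] = d

closest_raw = min(pairs, key=pairs.get)   # pair with the smallest raw distance
picked = nj_pick_pair(dist)               # pair chosen by the Q criterion
reduced = nj_join(dist, *picked, new_name="u")
print(closest_raw, picked, reduced["u"])
```

In this example the smallest raw distance is between d and e, yet the criterion joins a and b first. That correction for each taxon's overall divergence is what lets the method cope with unequal rates.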
Maximum Likelihood Methods
The fundamental principle: Maximum likelihood asks: "Which tree makes the observed data most probable?" Rather than assuming the simplest explanation, likelihood methods calculate the probability of seeing your actual data under different tree models and select the tree that makes your data most likely.
To use maximum likelihood, you must:
Assume a model of character change (for DNA sequences, this might specify the rates at which different types of mutations occur)
For each possible tree topology, calculate the probability of observing your data given that tree and your chosen model
Select the tree under which the observed data are most probable, that is, the tree with the highest likelihood (the "maximum likelihood" tree)
Why this matters: Maximum likelihood has a key advantage over parsimony—it explicitly accounts for the evolutionary process. If you know that certain mutations are more common than others (for instance, transitions occur more frequently than transversions in DNA), you can incorporate this knowledge into your analysis. This makes likelihood methods more statistically sophisticated, though also more computationally intensive.
The choice of evolutionary model is critical. Different models make different assumptions about substitution rates, and using an inappropriate model can lead to incorrect tree inference.
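The sketch below illustrates the calculation for the smallest interesting case: four taxa, three possible unrooted topologies. It assumes the Jukes-Cantor (JC69) model and, as a deliberate simplification, fixes all five branches to the same length instead of optimizing them. The alignment columns and function names are hypothetical.

```python
import math
from itertools import product

BASES = "ACGT"


def jc69_prob(x, y, t):
    """JC69 probability of base x ending as base y over branch length t
    (t in expected substitutions per site)."""
    e = math.exp(-4.0 * t / 3.0)
    return 0.25 + 0.75 * e if x == y else 0.25 - 0.25 * e


def site_likelihood(pattern, split, t):
    """Likelihood of one alignment column on the quartet tree ((i,j),(k,l)).

    pattern -- the four observed bases, indexed by taxon
    split   -- ((i, j), (k, l)): taxon indices on each side of the internal branch
    t       -- branch length shared by all five branches (a simplification)
    """
    (i, j), (k, l) = split
    total = 0.0
    for x, y in product(BASES, repeat=2):      # sum over unobserved internal states
        total += (0.25                          # stationary frequency of x
                  * jc69_prob(x, pattern[i], t) * jc69_prob(x, pattern[j], t)
                  * jc69_prob(x, y, t)          # internal branch
                  * jc69_prob(y, pattern[k], t) * jc69_prob(y, pattern[l], t))
    return total


def log_likelihood(columns, split, t=0.1):
    """Log-likelihood of the whole alignment, assuming independent sites."""
    return sum(math.log(site_likelihood(col, split, t)) for col in columns)


# Each string is one alignment column: the bases of taxa 1..4 at that site.
columns = ["AACC", "GGTT", "ACAC", "TTTT", "CCGG"]
splits = {
    "((1,2),(3,4))": ((0, 1), (2, 3)),
    "((1,3),(2,4))": ((0, 2), (1, 3)),
    "((1,4),(2,3))": ((0, 3), (1, 2)),
}
scores = {name: log_likelihood(columns, s) for name, s in splits.items()}
ml_tree = max(scores, key=scores.get)
print(scores, ml_tree)
```

Three of the five columns favour grouping taxa 1 and 2, so ((1,2),(3,4)) receives the highest log-likelihood. A real analysis would also optimize branch lengths and model parameters for every candidate tree.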
Bayesian Inference and Markov Chain Monte Carlo
The Bayesian approach: Bayesian phylogenetics combines prior information about what trees are plausible with the likelihood of the observed data to calculate posterior probabilities—the probability of each tree being correct given your data and your prior assumptions.
Mathematically, this follows Bayes' theorem:
$$\text{Posterior Probability} = \frac{\text{Likelihood} \times \text{Prior Probability}}{\text{Probability of Data}}$$
Unlike maximum likelihood, which finds a single "best" tree, Bayesian methods generate a probability distribution across many possible trees. Trees that are more probable receive higher posterior values.
The computational challenge: With even moderate numbers of taxa, the number of possible trees becomes astronomical. Bayesian phylogenetics solves this using Markov Chain Monte Carlo (MCMC) sampling—a technique that walks through tree space, proposing changes to the current tree and accepting or rejecting these changes based on their probability. Over time, MCMC explores the landscape of reasonable trees and builds up an estimate of the posterior probability distribution.
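To make the accept-or-reject idea concrete, here is a toy Metropolis sampler over just three candidate trees. With so few trees the posterior can be computed exactly, which lets the sampled frequencies be checked against it. The tree labels, the likelihood values, and the uniform prior are all invented for illustration; real software proposes small rearrangements of the current tree rather than jumping between labelled states.

```python
import random
from collections import Counter


def metropolis_sample(weights, n_steps, seed=0):
    """Visit states in proportion to their unnormalized posterior weights."""
    rng = random.Random(seed)
    states = list(weights)
    current = states[0]
    counts = Counter()
    for _ in range(n_steps):
        # Symmetric proposal: jump to one of the other states uniformly.
        proposal = rng.choice([s for s in states if s != current])
        accept_prob = min(1.0, weights[proposal] / weights[current])
        if rng.random() < accept_prob:
            current = proposal
        counts[current] += 1               # record the state after each step
    return {s: counts[s] / n_steps for s in states}


# Hypothetical numbers: a flat prior and made-up likelihoods for three trees.
prior = {"T1": 1 / 3, "T2": 1 / 3, "T3": 1 / 3}
likelihood = {"T1": 0.008, "T2": 0.002, "T3": 0.010}
weights = {t: prior[t] * likelihood[t] for t in prior}      # prior x likelihood

# Exact posterior by Bayes' theorem (feasible only because there are 3 trees).
evidence = sum(weights.values())
exact = {t: w / evidence for t, w in weights.items()}

estimate = metropolis_sample(weights, n_steps=50_000)
print(exact)
print(estimate)
```

The sampler never needs the denominator of Bayes' theorem: only the ratio of weights between the proposed and current tree enters the acceptance step, which is why MCMC remains usable when the full set of trees cannot be enumerated.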
A practical advantage: Bayesian analyses naturally provide measures of uncertainty. Rather than getting a single tree with some bootstrap support values (discussed below), you get a full distribution showing which clades are strongly supported and which are uncertain.
Phenetic Distance Methods and Overall Similarity
Phenetic approaches take a fundamentally different philosophical stance than other methods. Rather than trying to infer evolutionary history explicitly, phenetic methods assume that overall phenotypic similarity (or genetic similarity) correlates with evolutionary closeness.
Phenetic methods:
Calculate overall similarity/dissimilarity between organisms
Assume similarity reflects evolutionary relationships
Construct trees that group similar organisms together
Are computationally simple and fast
While neighbor joining (described above) is sometimes considered phenetic because it uses distance data, pure phenetic methods make fewer assumptions about the evolutionary process and focus simply on clustering based on similarity. These methods fell somewhat out of favor because similarity can be misleading—unrelated organisms might appear similar due to convergent evolution, while closely related organisms might appear different due to rapid change.
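As one concrete example of clustering purely on similarity, the sketch below implements UPGMA (average-linkage clustering), the older distance method mentioned earlier. Choosing UPGMA as the representative phenetic method is an assumption of this example, as are the distances and the function name. UPGMA implicitly assumes a constant rate of change, which is one reason similarity-based trees can mislead.

```python
def upgma(dist):
    """Average-linkage clustering.

    dist -- dict mapping frozenset({a, b}) to the distance between leaves a and b
    Returns the tree as nested 2-tuples of leaf names.
    """
    d = dict(dist)
    size = {leaf: 1 for pair in dist for leaf in pair}   # cluster -> number of leaves
    while len(size) > 1:
        # Join the two clusters that are currently most similar.
        a, b = sorted(min(d, key=d.get), key=str)
        merged = (a, b)
        for c in size:
            if c != a and c != b:
                # Distance to the new cluster: size-weighted average.
                d[frozenset((merged, c))] = (
                    size[a] * d[frozenset((a, c))] + size[b] * d[frozenset((b, c))]
                ) / (size[a] + size[b])
        d = {pair: v for pair, v in d.items() if a not in pair and b not in pair}
        size[merged] = size.pop(a) + size.pop(b)
    return next(iter(size))


# Illustrative distances among four taxa.
raw = {("A", "B"): 2, ("A", "C"): 6, ("B", "C"): 6,
       ("A", "D"): 10, ("B", "D"): 10, ("C", "D"): 10}
tree = upgma({frozenset(pair): v for pair, v in raw.items()})
print(tree)
```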
Taxon Sampling: A Critical Practical Consideration
Definition: Taxon sampling refers to which species or populations you choose to include in your phylogenetic analysis.
For most large clades, you cannot realistically sequence or examine every single species. Instead, you sample representative taxa—a manageable subset intended to capture the diversity of the group. The question is: how does this sampling choice affect your results?
The effect of poor sampling: Inadequate or biased taxon sampling can lead to serious errors in tree inference, most notably:
Long-branch attraction: This occurs when two distantly related lineages are incorrectly grouped together because:
They have accumulated many independent changes since their last common ancestor
These changes make them appear more similar to each other than to their true relatives
This is particularly problematic if intervening taxa (those that would clarify the true relationships) are missing from your sample
For example, suppose four taxa truly evolved as ((A,B),(C,D)), but A and C sit on long branches because they have evolved rapidly. A and C each accumulate many substitutions independently, and some of those substitutions match by chance. An analysis may then mistakenly infer ((A,C),(B,D)), grouping the two long branches together even though they are not each other's closest relatives.
Practical implication: Sampling more taxa, especially taxa that break up long branches (such as key transitional or intermediate forms), generally improves tree accuracy and reduces the risk of artifacts like long-branch attraction.
Assessing Tree Reliability and Support
Once you've inferred a tree, a critical question follows: how confident should you be in this result? Several statistical approaches address this.
Bootstrap Resampling
The bootstrap method, introduced by Felsenstein (1985), provides a statistical measure of how well different parts of your tree are supported by your data.
Here's how it works:
Take your original dataset (e.g., DNA sequence alignment)
Randomly resample characters (nucleotide positions) with replacement, creating a new dataset of the same size
Perform phylogenetic inference on this resampled dataset
Repeat steps 2–3 many times (often 100–1000 times)
For each node in your original tree, count how often that same grouping appears in the bootstrap replicates
The percentage of times a clade appears is its "bootstrap support value"
A bootstrap value of 95% for a particular node means that clade appeared in 95% of the resampled analyses—generally considered strong support. Values below 70% are typically considered weak support.
Why this works: Bootstrap mimics the uncertainty you'd expect from having incomplete data. By resampling characters, you create slightly different datasets that might yield slightly different trees. If a clade appears consistently across these variations, it's robust; if it disappears in many resampled datasets, it's sensitive to minor data variation.
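The resampling loop can be sketched as follows. To keep the example self-contained, full tree inference is replaced by a toy stand-in that reports only the single most similar pair of taxa; that stand-in, the alignment, and all function names are assumptions for illustration, and any real inference method would be plugged in at the same point.

```python
import random


def resample_columns(alignment, rng):
    """Draw alignment columns with replacement, keeping the original length."""
    n_sites = len(next(iter(alignment.values())))
    picks = [rng.randrange(n_sites) for _ in range(n_sites)]
    return {taxon: "".join(seq[i] for i in picks) for taxon, seq in alignment.items()}


def closest_pair(alignment):
    """Toy stand-in for tree inference: the pair of taxa with fewest differences."""
    taxa = sorted(alignment)
    pairs = [(a, b) for i, a in enumerate(taxa) for b in taxa[i + 1:]]

    def hamming(pair):
        return sum(x != y for x, y in zip(alignment[pair[0]], alignment[pair[1]]))

    return frozenset(min(pairs, key=hamming))


def bootstrap_support(alignment, clade, infer=closest_pair, n_reps=200, seed=1):
    """Fraction of bootstrap replicates in which `infer` recovers `clade`."""
    rng = random.Random(seed)
    hits = sum(infer(resample_columns(alignment, rng)) == clade for _ in range(n_reps))
    return hits / n_reps


# Illustrative alignment, not real data.
alignment = {"A": "ACGTACGTAC", "B": "ACGTACGAAC",
             "C": "ACCTTCGAAG", "D": "TCCTTCGTAG"}
support = bootstrap_support(alignment, frozenset({"A", "B"}))
print(f"bootstrap support for (A,B): {support:.0%}")
```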
Alternative Resampling Methods and Tree Comparison
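The jackknife is similar to the bootstrap, but each replicate deletes a random subset of characters (sampling without replacement) rather than resampling with replacement, so every replicate dataset is smaller than the original. Other resampling approaches provide complementary measures of support.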
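A minimal sketch of building one jackknife replicate, assuming a fixed deletion fraction (50% here is an arbitrary choice for the example; the function name and sequences are hypothetical). Support is then tallied across replicates in the same way as for the bootstrap.

```python
import random


def jackknife_columns(alignment, delete_fraction=0.5, seed=0):
    """Return a copy of the alignment with a random subset of columns removed."""
    rng = random.Random(seed)
    n_sites = len(next(iter(alignment.values())))
    n_keep = n_sites - int(round(delete_fraction * n_sites))
    # Sample column indices without replacement; sort to preserve site order.
    keep = sorted(rng.sample(range(n_sites), n_keep))
    return {taxon: "".join(seq[i] for i in keep) for taxon, seq in alignment.items()}


alignment = {"A": "ACGTACGT", "B": "ACGTACGA", "C": "ACCTTCGA"}
replicate = jackknife_columns(alignment)
print(replicate)
```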
Consensus trees combine multiple trees into a single representation:
Strict consensus: Shows only clades that appear in all input trees
Majority-rule consensus: Shows clades appearing in more than 50% of trees
These are useful when you have many equally parsimonious trees or when comparing results across different methods
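Both consensus rules reduce to counting how often each clade occurs. The sketch below assumes each rooted tree has already been reduced to its set of non-trivial clades, with a clade written as a frozenset of taxon names; that representation, the example trees, and the function names are choices made for illustration.

```python
from collections import Counter


def consensus_clades(trees, threshold):
    """Clades found in more than `threshold` (a fraction) of the input trees."""
    counts = Counter(clade for tree in trees for clade in tree)
    return {clade for clade, n in counts.items() if n / len(trees) > threshold}


def strict_consensus(trees):
    """Clades present in every input tree."""
    return set.intersection(*map(set, trees))


def majority_rule_consensus(trees):
    """Clades present in more than half of the input trees."""
    return consensus_clades(trees, 0.5)


AB, CD, ABC, BC = (frozenset(s) for s in ("AB", "CD", "ABC", "BC"))
trees = [
    {AB, ABC},   # (((A,B),C),D)
    {AB, CD},    # ((A,B),(C,D))
    {BC, ABC},   # ((A,(B,C)),D)
]
strict = strict_consensus(trees)
majority = majority_rule_consensus(trees)
print(strict, majority)
```

Here no clade appears in all three trees, so the strict consensus is completely unresolved, while the majority-rule consensus keeps the clades {A,B} and {A,B,C}, each found in two of the three trees.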
Tree distance metrics quantify how different two trees are from each other. The Robinson-Foulds metric counts the number of bipartitions that differ between two labeled trees, providing a numerical measure of topological difference.
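A short sketch of the calculation, assuming trees written as nested tuples of leaf names and treated as unrooted. Each internal branch is converted into the split of taxa it induces, and the distance is the number of splits found in exactly one of the two trees. The function names and example trees are hypothetical.

```python
def bipartitions(tree, all_taxa):
    """Non-trivial splits of the taxa induced by the internal branches of a tree."""
    all_taxa = frozenset(all_taxa)
    splits = set()

    def leaves(node):
        if isinstance(node, str):
            return frozenset([node])
        below = frozenset().union(*(leaves(child) for child in node))
        other = all_taxa - below
        if len(below) > 1 and len(other) > 1:   # skip splits that isolate one leaf
            splits.add(frozenset([below, other]))
        return below

    leaves(tree)
    return splits


def robinson_foulds(tree_a, tree_b, taxa):
    """Number of splits present in exactly one of the two trees."""
    return len(bipartitions(tree_a, taxa) ^ bipartitions(tree_b, taxa))


taxa = "ABCDE"
tree_1 = (("A", "B"), ("C", ("D", "E")))
tree_2 = (("A", "C"), ("B", ("D", "E")))
print(robinson_foulds(tree_1, tree_2, taxa))
```

The two example trees share the split DE|ABC but disagree on the other one (AB|CDE versus AC|BDE), giving a distance of 2.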
Measuring Homoplasy
Homoplasy refers to similar character states that evolved independently rather than being inherited from a common ancestor. High homoplasy in your data makes phylogenetic inference difficult.
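The consistency index, retention index, and rescaled consistency index measure homoplasy levels:
Consistency Index: The ratio of the minimum possible number of changes to the number of changes actually required on the tree. Values closer to 1.0 indicate less homoplasy
Retention Index: The fraction of the potential shared similarity in the data that the tree explains as inheritance from a common ancestor rather than as homoplasy. Values closer to 1.0 indicate less homoplasy
Rescaled Consistency Index: The product of the consistency index and the retention index, which reduces the consistency index's sensitivity to characters that cannot distinguish between trees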
These indices help you evaluate whether your data strongly supports a single tree topology or whether multiple trees might be equally plausible.
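The arithmetic behind these indices is simple once three step counts are known for each character. The sketch assumes the usual ensemble definitions, CI = m/s, RI = (g - s)/(g - m), and RC = CI x RI, where m is the minimum possible number of steps, s the number of steps on the tree being evaluated, and g the maximum possible number of steps on any tree. The symbol names follow common convention, and the step counts are invented.

```python
def homoplasy_indices(min_steps, observed_steps, max_steps):
    """Ensemble consistency, retention, and rescaled consistency indices.

    Each argument is a list with one entry per character.
    Assumes at least one character can vary in fit between trees (g > m).
    """
    m, s, g = sum(min_steps), sum(observed_steps), sum(max_steps)
    ci = m / s
    ri = (g - s) / (g - m)
    return {"CI": ci, "RI": ri, "RC": ci * ri}


# Hypothetical step counts for three characters.
indices = homoplasy_indices(
    min_steps=[1, 1, 2],
    observed_steps=[1, 2, 3],
    max_steps=[3, 4, 4],
)
print({name: round(value, 3) for name, value in indices.items()})
```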
Model Selection for Molecular Data
When analyzing DNA or protein sequences, you must select a substitution model—a mathematical description of how characters change over time. Different models make different assumptions about substitution rates and frequencies.
Information-theoretic criteria guide model selection:
Akaike Information Criterion (AIC): Balances fit to data against model complexity
Bayesian Information Criterion (BIC): Similar to AIC but penalizes complexity more heavily
These approaches help you avoid overfitting (choosing an overly complex model that fits noise rather than signal) while ensuring your model is realistic enough to capture important evolutionary processes.
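Both criteria are computed from a model's maximized log-likelihood and its number of free parameters k: AIC = 2k - 2 ln L and BIC = k ln(n) - 2 ln L, where n is the sample size (taken here as the number of alignment sites, a common but not universal convention). The log-likelihood values below are made up for illustration. The parameter counts cover only the substitution model; branch lengths would add the same constant to every model when the tree is held fixed.

```python
import math


def aic(log_lik, k):
    """Akaike Information Criterion (lower is better)."""
    return 2 * k - 2 * log_lik


def bic(log_lik, k, n):
    """Bayesian Information Criterion (lower is better)."""
    return k * math.log(n) - 2 * log_lik


# model name -> (hypothetical maximized log-likelihood, free model parameters)
candidates = {
    "JC69": (-5120.4, 0),
    "HKY85": (-5048.9, 4),
    "GTR": (-5046.1, 8),
}
n_sites = 1000

aic_scores = {name: aic(ll, k) for name, (ll, k) in candidates.items()}
bic_scores = {name: bic(ll, k, n_sites) for name, (ll, k) in candidates.items()}
best_aic = min(aic_scores, key=aic_scores.get)
best_bic = min(bic_scores, key=bic_scores.get)
print(best_aic, best_bic)
```

With these numbers the most complex model fits slightly better than the intermediate one, but not by enough to justify its extra parameters, so both criteria prefer the intermediate model.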
The choice of substitution model can affect which tree is favored, making model selection an important step in likelihood and Bayesian analyses.
Comparing Methods: Practical Considerations
Different phylogenetic methods can produce different trees from the same dataset. Which should you use?
Parsimony is useful for morphological data and works well when homoplasy is low, but can struggle with long branches and rapid evolution
Distance methods are fast and work well with large datasets, but lose information by reducing each pair of sequences to a single distance value
Likelihood and Bayesian methods are statistically rigorous and account for the evolutionary process explicitly, but are computationally demanding and require model selection
Best practice: Compare results across methods. If different approaches agree, you have stronger confidence in the result. If they disagree, investigate why—the disagreement often reveals important features of your data or limitations of particular methods.
Flashcards
Which tree does the Parsimony Principle prefer when explaining observed character states?
The tree that requires the fewest evolutionary changes.
How does the Maximum Likelihood method evaluate a possible phylogenetic tree?
By calculating the probability of the observed data given a specific model of character change.
Which components are combined in Bayesian inference to generate a posterior probability distribution of trees?
Prior probabilities and a likelihood function.
What sampling technique is used in Bayesian inference to explore the space of possible trees?
Markov chain Monte Carlo (MCMC) sampling.
What is the core assumption of phenetic approaches when constructing trees from similarity matrices?
That overall similarity approximates evolutionary relationship.
How does the bootstrap method assess the reliability of a phylogenetic tree?
By repeatedly sampling characters with replacement.
Who introduced bootstrap resampling to phylogenetics as a measure of clade support?
Joseph Felsenstein (1985), adapting the statistical bootstrap devised by Bradley Efron.
Whose 1950 formalization of cladistics established the modern framework for this field?
Willi Hennig.
What is the definition of taxon sampling in phylogenetics?
Selecting a subset of representative taxa to infer the evolutionary history of a larger clade.
What error can occur when unrelated lineages are incorrectly grouped due to shared homoplastic characters from poor sampling?
Long branch attraction.
What are three common techniques used to combine multiple trees into a single consensus representation?
Strict consensus
Majority-rule consensus
Reduced consensus
Which metric is used to quantify the topological differences between labeled trees?
The Robinson–Foulds metric.
Quiz
Question 1: According to the parsimony principle in phylogenetic inference, which tree is preferred?
- The tree that requires the fewest evolutionary changes (correct)
- The tree with the highest likelihood score
- The tree that includes the most taxa
- The tree with the longest total branch length
Question 2: Who formalized cladistics in 1950, providing the modern framework for phylogenetic systematics?
- Willi Hennig (correct)
- Charles Darwin
- Ernst Haeckel
- Bradley Efron
Question 3: Which metric is used to quantify topological differences between two labeled phylogenetic trees?
- Robinson–Foulds metric (correct)
- Retention index
- Bootstrap proportion
- Likelihood ratio
Question 4: What error can result from inadequate taxon sampling in phylogenetic analysis?
- Long branch attraction, grouping unrelated lineages (correct)
- Inflated likelihood scores for all trees
- Reduced overall tree length
- Higher bootstrap percentages for incorrect clades
Question 5: Which resampling method provides an alternative to bootstrap by repeatedly omitting a random subset of characters?
- Jackknife (correct)
- Monte Carlo sampling
- Permutation test
- Cross‑validation
Question 6: The general bootstrap resampling technique, which Felsenstein (1985) later applied to assessing tree stability, was introduced by which statistician?
- Bradley Efron (correct)
- Ronald Fisher
- Karl Pearson
- John Tukey
Question 7: Phenetic methods for phylogenetic inference assume that overall similarity between taxa reflects what?
- their evolutionary relationship (correct)
- the number of genes sequenced
- the geographic proximity of taxa
- the age of the fossil record
Question 8: Who introduced the maximum‑likelihood approach for DNA sequence data in phylogenetics, and in what year?
- Joe Felsenstein in 1981 (correct)
- James Watson in 1970
- Murray Gell‑Mann in 1985
- Richard Dawkins in 1990
Question 9: Increasing taxon sampling in a phylogenetic study generally leads to which of the following outcomes?
- More accurate inference of relationships (correct)
- Fewer required molecular characters
- Guarantee of monophyly for all groups
- Elimination of the need for outgroup taxa
Question 10: In a strict consensus tree, a clade is included only if it appears in what proportion of the input trees?
- All of the trees (correct)
- At least 75% of the trees
- At least 50% of the trees
- At most 25% of the trees
Question 11: What type of data does the neighbor‑joining method primarily require to construct a phylogenetic tree?
- A matrix of pairwise genetic distances (correct)
- Individual nucleotide sequences
- Character state matrices for parsimony
- Prior probability distributions
Question 12: Which criterion does the maximum parsimony method use to choose among competing phylogenetic trees?
- It selects the tree requiring the fewest evolutionary changes (correct)
- It selects the tree with the highest likelihood under a substitution model
- It selects the tree with the smallest total branch length
- It selects the tree that maximizes an information‑theoretic score
Key Concepts
Phylogenetic Inference Methods
Maximum likelihood (phylogenetics)
Bayesian phylogenetic inference
Markov chain Monte Carlo
Model selection (phylogenetics)
Tree Construction and Evaluation
Parsimony (phylogenetics)
Neighbor joining
Bootstrap (phylogenetics)
Robinson–Foulds metric
Consensus tree
Phylogenetic Errors
Long branch attraction
Definitions
Parsimony (phylogenetics)
A principle that selects the tree requiring the fewest evolutionary changes to explain observed character states.
Maximum likelihood (phylogenetics)
A method that evaluates trees by calculating the probability of the observed data under a specific model of character change.
Bayesian phylogenetic inference
An approach that combines prior probabilities with a likelihood function to produce a posterior distribution of trees.
Markov chain Monte Carlo
A computational technique used in Bayesian phylogenetics to sample from the posterior distribution of trees.
Neighbor joining
A distance‑based algorithm that builds a phylogenetic tree by iteratively joining the pair of taxa that minimizes a rate‑corrected distance criterion, without assuming a constant rate of evolution.
Bootstrap (phylogenetics)
A resampling method that assesses tree reliability by repeatedly sampling characters with replacement.
Long branch attraction
A systematic error where rapidly evolving, distantly related lineages are incorrectly grouped together because independent changes along their long branches coincide by chance.
Robinson–Foulds metric
A distance measure that quantifies topological differences between two labeled phylogenetic trees.
Consensus tree
A summary tree that combines multiple phylogenetic trees into a single representation, such as strict or majority‑rule consensus.
Model selection (phylogenetics)
The process of choosing the best-fit substitution model for molecular data using information‑theoretic criteria.