IIDExponentialBayesian.py
High-level description
The IIDExponentialBayesian
class in cassiopeia/tools/branch_length_estimator/IIDExponentialBayesian.py
implements a Bayesian method for estimating branch lengths in a phylogenetic tree, assuming a subsampled Birth Process for the phylogeny and independent, identically distributed (IID) exponential waiting times for mutations at each site. The estimator calculates the posterior mean branch lengths conditional on the observed tree topology and character data.
Code Structure
The IIDExponentialBayesian
class inherits from the abstract base class BranchLengthEstimator
and implements the estimate_branch_lengths
method. The core logic resides in _populate_attributes_with_cpp_implementation
, which leverages a C++ implementation (_PyInferPosteriorTimes
) for efficient computation of posterior node times and related attributes.
References
cassiopeia.data.CassiopeiaTree
: Used to represent and manipulate phylogenetic trees.._iid_exponential_bayesian._PyInferPosteriorTimes
: C++ module for efficient posterior time inference..BranchLengthEstimator.BranchLengthEstimator
: Abstract base class for branch length estimators.
Symbols
IIDExponentialBayesian
Description
This class implements the IID Exponential Bayesian branch length estimation method. It assumes a subsampled Birth Process for the phylogeny and IID exponential waiting times for mutations.
Inputs
Name | Type | Description |
---|---|---|
mutation_rate | float | The CRISPR/Cas9 mutation rate. |
birth_rate | float | The phylogeny birth rate. |
sampling_probability | float | The probability that a leaf in the ground truth tree was sampled. Must be in (0, 1]. |
discretization_level | int | How many timesteps are used to discretize time. Defaults to 600. |
Outputs
The class doesn’t directly return outputs. Instead, it modifies the input CassiopeiaTree
object in place, populating its branch lengths based on the estimated posterior mean times.
Internal Logic
- Input Validation: Ensures the input tree is valid (binary except for the root) and the sampling probability is within the acceptable range.
- Data Preprocessing: Imputes unambiguous missing states in the character matrix for easier dynamic programming.
- C++ Implementation: Calls the
_PyInferPosteriorTimes
C++ module to efficiently compute the log joint probabilities, posterior means, and posterior distributions of node times. - Branch Length Population: Uses the computed posterior mean times to populate the branch lengths of the input
CassiopeiaTree
.
Side Effects
- Modifies the input
CassiopeiaTree
object in place by populating its branch lengths.
Performance Considerations
- The computational complexity of the branch length estimation is O(discretization_level * tree.n_cell * tree.n_character).
- The C++ implementation (
_PyInferPosteriorTimes
) is used for performance optimization.
estimate_branch_lengths
Description
This method estimates the branch lengths of the provided CassiopeiaTree
using the IID Exponential Bayesian model.
Inputs
Name | Type | Description |
---|---|---|
tree | CassiopeiaTree | The input CassiopeiaTree object for which to estimate branch lengths. |
Outputs
The method doesn’t directly return outputs. Instead, it modifies the input CassiopeiaTree
object in place, populating its branch lengths.
Internal Logic
- Input Validation: Validates the input tree topology.
- Data Preprocessing: Creates a deepcopy of the input tree and imputes deducible missing states.
- Posterior Time Inference: Calls
_populate_attributes_with_cpp_implementation
to infer posterior node times using the C++ implementation. - Branch Length Population: Calls
_populate_branch_lengths
to populate the branch lengths of the original input tree using the inferred posterior means.
Side Effects
- Modifies the input
CassiopeiaTree
object in place by populating its branch lengths.
log_joints
Description
This method returns the log joint probability density of the observed tree topology, state vectors, and all possible times for a given node.
Inputs
Name | Type | Description |
---|---|---|
node | str | The internal node for which to compute the log joint probabilities. |
Outputs
Name | Type | Description |
---|---|---|
log_joints | np.array | An array of log joint probabilities for each discretized time point. |
posterior_time
Description
This method returns the posterior distribution of the time for a given node, conditional on the observed character states and tree topology.
Inputs
Name | Type | Description |
---|---|---|
node | str | The internal node for which to compute the posterior time distribution. |
Outputs
Name | Type | Description |
---|---|---|
posterior_time | np.array | An array representing the posterior time distribution for the given node. |
Dependencies
- numpy
- copy
- typing
- scipy (used in related code snippets)
- networkx (used in related code snippets)
- parameterized (used in related code snippets)
- pytest (used in related code snippets)
- unittest (used in related code snippets)
Error Handling
- Raises
ValueError
for invalid input parameters or tree topology.
TODOs
None found.