High-level description

The IIDExponentialBayesian class in cassiopeia/tools/branch_length_estimator/IIDExponentialBayesian.py implements a Bayesian method for estimating branch lengths in a phylogenetic tree, assuming a subsampled Birth Process for the phylogeny and independent, identically distributed (IID) exponential waiting times for mutations at each site. The estimator calculates the posterior mean branch lengths conditional on the observed tree topology and character data.

Code Structure

The IIDExponentialBayesian class inherits from the abstract base class BranchLengthEstimator and implements the estimate_branch_lengths method. The core logic resides in _populate_attributes_with_cpp_implementation, which leverages a C++ implementation (_PyInferPosteriorTimes) for efficient computation of posterior node times and related attributes.

References

  • cassiopeia.data.CassiopeiaTree: Used to represent and manipulate phylogenetic trees.
  • ._iid_exponential_bayesian._PyInferPosteriorTimes: C++ module for efficient posterior time inference.
  • .BranchLengthEstimator.BranchLengthEstimator: Abstract base class for branch length estimators.

Symbols

IIDExponentialBayesian

Description

This class implements the IID Exponential Bayesian branch length estimation method. It assumes a subsampled Birth Process for the phylogeny and IID exponential waiting times for mutations.

Inputs

NameTypeDescription
mutation_ratefloatThe CRISPR/Cas9 mutation rate.
birth_ratefloatThe phylogeny birth rate.
sampling_probabilityfloatThe probability that a leaf in the ground truth tree was sampled. Must be in (0, 1].
discretization_levelintHow many timesteps are used to discretize time. Defaults to 600.

Outputs

The class doesn’t directly return outputs. Instead, it modifies the input CassiopeiaTree object in place, populating its branch lengths based on the estimated posterior mean times.

Internal Logic

  1. Input Validation: Ensures the input tree is valid (binary except for the root) and the sampling probability is within the acceptable range.
  2. Data Preprocessing: Imputes unambiguous missing states in the character matrix for easier dynamic programming.
  3. C++ Implementation: Calls the _PyInferPosteriorTimes C++ module to efficiently compute the log joint probabilities, posterior means, and posterior distributions of node times.
  4. Branch Length Population: Uses the computed posterior mean times to populate the branch lengths of the input CassiopeiaTree.

Side Effects

  • Modifies the input CassiopeiaTree object in place by populating its branch lengths.

Performance Considerations

  • The computational complexity of the branch length estimation is O(discretization_level * tree.n_cell * tree.n_character).
  • The C++ implementation (_PyInferPosteriorTimes) is used for performance optimization.

estimate_branch_lengths

Description

This method estimates the branch lengths of the provided CassiopeiaTree using the IID Exponential Bayesian model.

Inputs

NameTypeDescription
treeCassiopeiaTreeThe input CassiopeiaTree object for which to estimate branch lengths.

Outputs

The method doesn’t directly return outputs. Instead, it modifies the input CassiopeiaTree object in place, populating its branch lengths.

Internal Logic

  1. Input Validation: Validates the input tree topology.
  2. Data Preprocessing: Creates a deepcopy of the input tree and imputes deducible missing states.
  3. Posterior Time Inference: Calls _populate_attributes_with_cpp_implementation to infer posterior node times using the C++ implementation.
  4. Branch Length Population: Calls _populate_branch_lengths to populate the branch lengths of the original input tree using the inferred posterior means.

Side Effects

  • Modifies the input CassiopeiaTree object in place by populating its branch lengths.

log_joints

Description

This method returns the log joint probability density of the observed tree topology, state vectors, and all possible times for a given node.

Inputs

NameTypeDescription
nodestrThe internal node for which to compute the log joint probabilities.

Outputs

NameTypeDescription
log_jointsnp.arrayAn array of log joint probabilities for each discretized time point.

posterior_time

Description

This method returns the posterior distribution of the time for a given node, conditional on the observed character states and tree topology.

Inputs

NameTypeDescription
nodestrThe internal node for which to compute the posterior time distribution.

Outputs

NameTypeDescription
posterior_timenp.arrayAn array representing the posterior time distribution for the given node.

Dependencies

  • numpy
  • copy
  • typing
  • scipy (used in related code snippets)
  • networkx (used in related code snippets)
  • parameterized (used in related code snippets)
  • pytest (used in related code snippets)
  • unittest (used in related code snippets)

Error Handling

  • Raises ValueError for invalid input parameters or tree topology.

TODOs

None found.