High-level description

The cassiopeia/tools directory contains a collection of utility functions and classes for analyzing phylogenetic trees and performing various calculations related to lineage tracing experiments. This toolset provides functionality for estimating parameters, computing evolutionary metrics, and performing statistical analyses on tree structures.

What does it do?

The tools in this directory perform several key functions:

  1. Autocorrelation analysis: Computes Moran’s I statistic to measure the autocorrelation of numerical data associated with tree leaves, that is, whether leaves that are close together on the tree tend to have similar values.

  2. Branch length estimation: Implements Maximum Likelihood Estimation (MLE) and Bayesian approaches for estimating branch lengths in phylogenetic trees.

  3. Evolutionary coupling: Quantifies how closely related pairs of categories (leaf labels) are, based on the phylogenetic distances between the leaves that belong to each category.

  4. Fitness estimation: Estimates the fitness of nodes in a phylogenetic tree using methods like the Local Branching Index (LBI).

  5. Parameter estimation: Estimates mutation rates and missing data rates from tree structures and character matrices.

  6. Small parsimony analysis: Implements algorithms for ancestral state reconstruction and parsimony scoring.

  7. Topology analysis: Assesses topological properties of trees, such as balance and expansion, and computes metrics like cophenetic correlation.

  8. Tree metrics: Calculates various metrics on phylogenetic trees, including parsimony scores and likelihood under different evolutionary models.

These tools enable researchers to extract meaningful information from lineage tracing data, analyze evolutionary relationships, and make inferences about the underlying biological processes.
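
To make the autocorrelation item above concrete, here is a minimal sketch of Moran's I written with NumPy only. It does not call Cassiopeia; the function name morans_i, the toy leaf values, and the inverse-distance weight matrix are all assumptions made for illustration. The statistic is I = (N / sum of weights) * (sum over i, j of w_ij * z_i * z_j) / (sum over i of z_i^2), where z_i is the mean-centered value at leaf i.

```python
import numpy as np


def morans_i(values, weights):
    """Moran's I for one numeric value per leaf and a leaf-by-leaf weight matrix.

    Hypothetical helper for illustration; not part of the Cassiopeia API.
    """
    x = np.asarray(values, dtype=float)
    w = np.asarray(weights, dtype=float)
    n = x.size
    if w.shape != (n, n):
        raise ValueError("weights must be an n x n matrix")
    z = x - x.mean()                 # mean-centered values
    denom = (z ** 2).sum()           # total variation of the values
    total_weight = w.sum()
    if denom == 0 or total_weight == 0:
        raise ValueError("Moran's I is undefined for constant values or zero weights")
    return float((n / total_weight) * (z @ w @ z) / denom)


# Toy tree with two cherries: leaves 0 and 1 are siblings, as are 2 and 3.
# Weights are inverse tree distances (assumed: 2 within a cherry, 6 across).
weights = np.array([
    [0.0, 1 / 2, 1 / 6, 1 / 6],
    [1 / 2, 0.0, 1 / 6, 1 / 6],
    [1 / 6, 1 / 6, 0.0, 1 / 2],
    [1 / 6, 1 / 6, 1 / 2, 0.0],
])

clustered = morans_i([1, 1, 5, 5], weights)    # siblings share values: about 0.2
alternating = morans_i([1, 5, 1, 5], weights)  # siblings differ: about -0.6
```

Positive values indicate that nearby leaves carry similar values, and negative values indicate that nearby leaves tend to differ.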

Entry points

The main entry points for developers are:

  1. autocorrelation.py: compute_morans_i function for autocorrelation analysis of leaf data on a tree.
  2. branch_length_estimator/: IIDExponentialMLE and IIDExponentialBayesian classes for branch length estimation.
  3. coupling.py: compute_evolutionary_coupling function for category relationship analysis.
  4. fitness_estimator/: FitnessEstimator abstract base class and LBIJungle implementation.
  5. parameter_estimators.py: Functions for estimating mutation rates and missing data rates.
  6. small_parsimony.py: Functions for ancestral state reconstruction and parsimony scoring.
  7. topology.py: Functions for computing expansion p-values and cophenetic correlation.
  8. tree_metrics.py: Functions for calculating parsimony and likelihood scores on trees.
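
As an illustration of the kind of computation behind the parsimony-related entry points, the sketch below scores a single character on a binary tree with the bottom-up pass of Fitch's algorithm. It is a standalone example, not the code in small_parsimony.py or tree_metrics.py: the function name fitch_parsimony_score, the dictionary-based tree encoding, and the node names are assumptions for illustration.

```python
def fitch_parsimony_score(children, root, leaf_states):
    """Minimum number of state changes for one character on a binary tree.

    children: dict mapping each internal node to its two child nodes.
    leaf_states: dict mapping each leaf to its observed state.
    Hypothetical helper for illustration; not part of the Cassiopeia API.
    """
    changes = 0
    state_sets = {}
    # Iterative post-order traversal: a node is finalized after its children.
    stack = [(root, False)]
    while stack:
        node, children_done = stack.pop()
        kids = children.get(node, [])
        if not kids:
            # Leaf: the candidate set is just the observed state.
            state_sets[node] = {leaf_states[node]}
        elif not children_done:
            if len(kids) != 2:
                raise ValueError("this sketch assumes a strictly binary tree")
            stack.append((node, True))
            stack.extend((kid, False) for kid in kids)
        else:
            left, right = (state_sets[kid] for kid in kids)
            common = left & right
            if common:
                state_sets[node] = common
            else:
                # No shared state: take the union and count one change.
                state_sets[node] = left | right
                changes += 1
    return changes


# Toy tree: root -> (a, b), a -> (l1, l2), b -> (l3, l4).
tree = {"root": ["a", "b"], "a": ["l1", "l2"], "b": ["l3", "l4"]}

score_clustered = fitch_parsimony_score(
    tree, "root", {"l1": 1, "l2": 1, "l3": 2, "l4": 2}
)
score_mixed = fitch_parsimony_score(
    tree, "root", {"l1": 1, "l2": 2, "l3": 1, "l4": 2}
)
```

The first labeling needs one change (on a branch below the root), while the second needs two, one inside each cherry. Non-binary trees require the Hartigan generalization, which this sketch deliberately leaves out.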

Dependencies

The tools in this directory rely on several external libraries:

  1. numpy: For numerical computations and array operations.
  2. pandas: For data manipulation and analysis.
  3. scipy: For scientific computing and statistical functions.
  4. networkx: For graph operations and tree representation.
  5. ete3: For tree manipulation and visualization.
  6. cvxpy: Used in the MLE branch length estimator.
  7. Cython: Used for interfacing with C++ code in the Bayesian branch length estimator.
  8. tqdm: For progress bars in long-running computations.

Additionally, the tools depend on other parts of the Cassiopeia package, particularly the cassiopeia.data module for tree data structures and the cassiopeia.mixins module for custom error classes.

Configuration

Most of the tools in this directory do not require explicit configuration files. Instead, they use parameter-based configuration through function arguments and class constructors. Key configuration options include:

  1. Random seeds for reproducibility in stochastic processes.
  2. Thresholds and minimum values for various calculations (e.g., minimum clade size for expansion p-values).
  3. Options for handling missing data and ancestral state reconstruction.
  4. Choices between discrete and continuous models for certain calculations.

Users can customize the behavior of these tools by adjusting the input parameters when calling the functions or instantiating the classes.
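
The pattern described above, where behavior is controlled by arguments rather than configuration files, can be sketched as follows. This is a generic illustration and not a Cassiopeia function: clade_permutation_pvalue and its parameters n_permutations, min_clade_size, and random_state are hypothetical names. It shows a size threshold that skips small clades and a seed that makes a stochastic calculation reproducible.

```python
import numpy as np


def clade_permutation_pvalue(values, in_clade, n_permutations=1000,
                             min_clade_size=3, random_state=None):
    """Permutation p-value for a clade's mean differing from the other leaves.

    Hypothetical helper for illustration; not part of the Cassiopeia API.
    Returns None when the clade (or its complement) is too small to test.
    """
    values = np.asarray(values, dtype=float)
    in_clade = np.asarray(in_clade, dtype=bool)
    # Threshold-style option: skip clades below the minimum size.
    if in_clade.sum() < min_clade_size or (~in_clade).sum() == 0:
        return None
    # Seed-style option: the same random_state yields the same result.
    rng = np.random.default_rng(random_state)
    observed = values[in_clade].mean() - values[~in_clade].mean()
    hits = 0
    for _ in range(n_permutations):
        shuffled = rng.permutation(in_clade)  # reassign clade membership at random
        diff = values[shuffled].mean() - values[~shuffled].mean()
        if abs(diff) >= abs(observed):
            hits += 1
    # Add-one correction keeps the estimate strictly positive.
    return (hits + 1) / (n_permutations + 1)


leaf_values = [5.1, 4.9, 5.3, 1.0, 1.2, 0.9, 1.1, 1.0]
membership = [True, True, True, False, False, False, False, False]

p_first = clade_permutation_pvalue(leaf_values, membership, random_state=0)
p_second = clade_permutation_pvalue(leaf_values, membership, random_state=0)
skipped = clade_permutation_pvalue(leaf_values, membership, min_clade_size=4)
```

Calling the function twice with the same random_state gives identical p-values, and raising min_clade_size above the clade's size causes the clade to be skipped.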

Together, these tools cover the main tree-level analyses for lineage tracing data in the Cassiopeia framework, from parameter and branch length estimation to parsimony scoring and topology statistics.