High-level description

The cassiopeia/tools/fitness_estimator directory contains implementations of fitness estimation algorithms for phylogenetic trees, specifically designed for use with the Cassiopeia library. The main components include an abstract base class FitnessEstimator and a concrete implementation LBIJungle that uses the Lineage Branching Index (LBI) method.

What does it do?

This module provides tools for estimating the fitness of nodes in phylogenetic trees. The main functionality includes:

  1. Defining an interface for fitness estimation algorithms through the abstract FitnessEstimator class.
  2. Implementing the LBI fitness estimator using the jungle package, which calculates fitness based on the branching patterns of the tree.
  3. Converting Cassiopeia trees to Newick format for compatibility with external libraries.
  4. Annotating nodes in the tree with fitness values.
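
The Newick conversion in item 3 can be illustrated with a minimal, self-contained sketch. It does not use Cassiopeia's own conversion routine; the dictionary-based tree layout and the names to_newick, children, and branch_lengths are assumptions made for illustration only.

```python
from typing import Dict, List


def to_newick(children: Dict[str, List[str]],
              branch_lengths: Dict[str, float],
              root: str) -> str:
    """Serialize a rooted tree to a Newick string with branch lengths.

    `children` maps each internal node to its child nodes, and
    `branch_lengths` maps each non-root node to the length of the edge
    joining it to its parent. Both layouts are hypothetical.
    """
    def render(node: str) -> str:
        kids = children.get(node, [])
        text = node
        if kids:
            # Internal node: children in parentheses, then the node's own name.
            text = "(" + ",".join(render(k) for k in kids) + ")" + node
        if node != root:
            # The root has no parent edge, so it carries no branch length.
            text += ":" + format(branch_lengths[node], "g")
        return text

    return render(root) + ";"


# Toy tree: root -> (a, b), a -> (c, d)
children = {"root": ["a", "b"], "a": ["c", "d"]}
branch_lengths = {"a": 1.0, "b": 2.0, "c": 0.5, "d": 0.5}
newick = to_newick(children, branch_lengths, "root")
print(newick)
```

The recursion keeps the sketch short; a very deep tree would need an iterative traversal to stay within Python's recursion limit.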

Fitness estimates help characterize the evolutionary dynamics of the lineages in the tree. Under LBI, a node scores higher when the tree branches densely around it, which suggests a more successful or more rapidly expanding lineage.
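
To make the idea behind LBI concrete, the following is a rough sketch of one common way to compute it: for every node, sum the tree length around that node, discounted exponentially with distance on a scale tau, using one leaf-to-root pass and one root-to-leaf pass. This is not the jungle package's code, and the normalization may differ from the one jungle uses. The function name lbi_sketch, the dictionary tree layout, and the choice of tau are all assumptions.

```python
import math
from typing import Dict, List


def lbi_sketch(children: Dict[str, List[str]],
               branch_lengths: Dict[str, float],
               root: str,
               tau: float) -> Dict[str, float]:
    """Return, for each node, the integral of exp(-distance / tau) over the tree."""
    # Visit parents before children; the reversed list visits children first.
    order: List[str] = []
    stack = [root]
    while stack:
        node = stack.pop()
        order.append(node)
        stack.extend(children.get(node, []))

    # up[n]: discounted length of n's parent edge plus n's subtree,
    # measured from the parent end of that edge.
    up: Dict[str, float] = {}
    for node in reversed(order):
        below = sum(up[c] for c in children.get(node, []))
        decay = math.exp(-branch_lengths.get(node, 0.0) / tau)  # root: no edge
        up[node] = tau * (1.0 - decay) + decay * below

    # down[n]: discounted length of everything outside n's subtree,
    # measured from n itself.
    down: Dict[str, float] = {root: 0.0}
    for node in order:
        kids = children.get(node, [])
        below = sum(up[c] for c in kids)
        for c in kids:
            decay = math.exp(-branch_lengths[c] / tau)
            siblings = below - up[c]
            down[c] = tau * (1.0 - decay) + decay * (down[node] + siblings)

    return {n: down[n] + sum(up[c] for c in children.get(n, [])) for n in order}


# Node "a" sits in a bushier part of the tree than leaf "b".
children = {"root": ["a", "b"], "a": ["c", "d", "e"]}
branch_lengths = {"a": 1.0, "b": 1.0, "c": 1.0, "d": 1.0, "e": 1.0}
scores = lbi_sketch(children, branch_lengths, "root", tau=1.0)
print(scores["a"] > scores["b"])
```

In this toy tree, node a has three descendants close by while b is an isolated leaf, so a receives the higher score. Smaller values of tau make the score more local, and larger values let distant branches contribute more.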

Entry points

The main entry points for using this module are:

  1. FitnessEstimator: An abstract base class that defines the interface for all fitness estimation algorithms in Cassiopeia.
  2. LBIJungle: A concrete implementation of the FitnessEstimator class that uses the Lineage Branching Index method for fitness estimation.

Developers can use these classes to estimate fitness for CassiopeiaTree objects, which represent phylogenetic trees in the Cassiopeia library.
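
The relationship between the two entry points follows the usual abstract-base-class pattern, sketched below with the standard library only. ToyTree, FitnessEstimatorSketch, and LeafCountEstimator are hypothetical stand-ins, not Cassiopeia classes, and the leaf-count "fitness" is a placeholder rather than LBI. The sketch assumes only what is described here: a single estimate_fitness method that annotates each node in place with a 'fitness' attribute.

```python
from abc import ABC, abstractmethod
from typing import Dict, List


class ToyTree:
    """Hypothetical stand-in for a CassiopeiaTree: topology plus node attributes."""

    def __init__(self, children: Dict[str, List[str]], root: str) -> None:
        self.children = children
        self.root = root
        self._attributes: Dict[str, Dict[str, float]] = {}

    def set_attribute(self, node: str, name: str, value: float) -> None:
        self._attributes.setdefault(node, {})[name] = value

    def get_attribute(self, node: str, name: str) -> float:
        return self._attributes[node][name]


class FitnessEstimatorSketch(ABC):
    """Interface: subclasses must implement estimate_fitness."""

    @abstractmethod
    def estimate_fitness(self, tree: ToyTree) -> None:
        """Annotate every node of `tree` in place with a 'fitness' attribute."""


class LeafCountEstimator(FitnessEstimatorSketch):
    """Toy estimator: a node's fitness is the number of leaves beneath it."""

    def estimate_fitness(self, tree: ToyTree) -> None:
        def count(node: str) -> int:
            kids = tree.children.get(node, [])
            leaves = 1 if not kids else sum(count(k) for k in kids)
            tree.set_attribute(node, "fitness", float(leaves))
            return leaves

        count(tree.root)


tree = ToyTree({"root": ["a", "b"], "a": ["c", "d"]}, "root")
result = LeafCountEstimator().estimate_fitness(tree)
print(tree.get_attribute("a", "fitness"))
```

Because the method returns nothing and writes onto the tree, callers read results back from node attributes; swapping in a different estimator requires no change to the calling code.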

Key Files

  1. _FitnessEstimator.py: Defines the abstract FitnessEstimator class and the FitnessEstimatorError exception.
  2. _lbi_jungle.py: Implements the LBIJungle class, which uses the jungle package to estimate fitness using the LBI method.
  3. __init__.py: Serves as the top-level entry point for the module, exposing the main components.

Dependencies

The module relies on several external libraries:

  1. jungle: A wrapper around Neher et al.’s original code for LBI calculations.
  2. networkx: Used for representing and manipulating tree topologies.
  3. numpy: Used for random number generation and array manipulation.
  4. ete3: For phylogenetic tree manipulation and visualization (used in the _jungle subdirectory).
  5. Bio.Phylo: For interfacing with Biopython’s phylogenetic tree representation (used in the _jungle subdirectory).
  6. scipy: For various scientific computing tasks and statistical functions (used in the _jungle subdirectory).
  7. pandas: For data manipulation and analysis (used in the _jungle subdirectory).
  8. matplotlib: For visualization of results and trees (used in the _jungle subdirectory).

Configuration

The main classes use constructor parameters and method arguments for configuration:

  • LBIJungle:
    • random_seed: Optional integer to set the random seed for reproducibility.
  • estimate_fitness method:
    • Takes a CassiopeiaTree object as input and modifies it in place by adding a ‘fitness’ attribute to each node.

Of the options described here, random_seed is the only one users can tune; estimate_fitness takes only the tree itself and writes its results onto that tree.

The _jungle subdirectory contains additional classes and functions for more advanced phylogenetic analysis, including:

  • Forest: For managing collections of phylogenetic trees.
  • Tree: For analyzing individual phylogenetic trees.
  • SFS: For calculating and analyzing Site Frequency Spectra.
  • SizeMatchedModel: For statistical modeling based on data size.

Together, these components support broader analysis of evolutionary fitness and phylogenetic relationships than the single per-node fitness estimate exposed by LBIJungle.