High-level description

This directory contains the core components of Cassiopeia, a Python framework for single-cell lineage tracing and phylogenetic analysis. It covers data preprocessing, tree reconstruction, simulation, visualization, and a set of analytical utilities.

What does it do?

Cassiopeia provides an end-to-end pipeline for analyzing single-cell lineage tracing experiments:

  1. Preprocessing: Converts raw sequencing data into formats suitable for phylogenetic analysis, including filtering, error correction, and allele calling.

  2. Tree Reconstruction: Implements various algorithms (greedy, distance-based, spectral, and integer linear programming) to reconstruct phylogenetic trees from mutation data.

  3. Simulation: Generates synthetic phylogenetic trees and associated data for testing and validating analysis methods.

  4. Visualization: Renders phylogenetic trees and associated data either locally or through the cloud-based iTOL service.

  5. Analysis Tools: Provides utilities for parameter estimation, fitness calculation, evolutionary coupling analysis, and various tree metrics.

  6. Tree Comparison: Implements methods for comparing phylogenetic trees, such as the Robinson-Foulds distance and triplets-correct accuracy.
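
The Robinson-Foulds comparison named in step 6 can be illustrated with a minimal sketch. It assumes rooted trees written as nested tuples of leaf names and counts the clades that appear in only one of the two trees. The `clades` and `robinson_foulds` helpers are hypothetical names for this illustration, not Cassiopeia's API, and the package's own implementation may differ (for example in how it treats rooting).

```python
from typing import FrozenSet, Set, Tuple, Union

# A tree is either a leaf name or a tuple of subtrees (illustrative encoding).
Tree = Union[str, Tuple["Tree", ...]]


def clades(tree: Tree) -> Set[FrozenSet[str]]:
    """Return the leaf set below every internal node of a rooted tree."""
    found: Set[FrozenSet[str]] = set()

    def collect(node: Tree) -> FrozenSet[str]:
        if isinstance(node, str):  # a leaf contributes only itself
            return frozenset([node])
        leaves = frozenset().union(*(collect(child) for child in node))
        found.add(leaves)
        return leaves

    collect(tree)
    return found


def robinson_foulds(tree_a: Tree, tree_b: Tree) -> int:
    """Count the clades present in exactly one of the two trees."""
    return len(clades(tree_a) ^ clades(tree_b))


tree_a = ((("a", "b"), "c"), "d")
tree_b = ((("a", "c"), "b"), "d")
print(robinson_foulds(tree_a, tree_b))  # 2: {a, b} and {a, c} are unshared
```

A distance of 0 means the two trees contain exactly the same clades; larger values mean more disagreement.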

Entry points

The main entry points for the Cassiopeia package are:

  1. cassiopeia/preprocess: For preprocessing raw sequencing data.
  2. cassiopeia/solver: For reconstructing phylogenetic trees from processed data.
  3. cassiopeia/simulator: For generating synthetic phylogenetic data.
  4. cassiopeia/plotting: For visualizing phylogenetic trees and associated data.
  5. cassiopeia/tools: For various analytical tools and utilities.
  6. cassiopeia/critique: For comparing and analyzing phylogenetic trees.

The cassiopeia/__init__.py file serves as the package's top-level interface: it imports these submodules and exposes their key functionality.
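
As a rough illustration of the kind of output the simulator entry point is meant to produce, the sketch below grows a random binary topology by repeatedly splitting a randomly chosen leaf. It uses only the standard library. The function names and the child-to-parent dictionary are assumptions made for this example and do not reflect the classes in cassiopeia/simulator.

```python
import random
from typing import Dict, List, Optional


def simulate_topology(n_leaves: int, seed: int = 0) -> Dict[str, Optional[str]]:
    """Grow a binary tree by repeatedly splitting a random leaf.

    Returns a child -> parent mapping; the root maps to None.
    """
    rng = random.Random(seed)  # seeded so that runs are reproducible
    parent: Dict[str, Optional[str]] = {"n0": None}
    leaves: List[str] = ["n0"]
    while len(leaves) < n_leaves:
        # Replace one randomly chosen leaf with two new child leaves.
        split = leaves.pop(rng.randrange(len(leaves)))
        for _ in range(2):
            child = f"n{len(parent)}"
            parent[child] = split
            leaves.append(child)
    return parent


def leaves_of(parent: Dict[str, Optional[str]]) -> List[str]:
    """Return the nodes that never appear as a parent."""
    internal = set(parent.values())
    return sorted(node for node in parent if node not in internal)


tree = simulate_topology(n_leaves=5, seed=42)
print(len(tree), "nodes,", len(leaves_of(tree)), "leaves")
```

Each split adds two nodes and one net leaf, so a binary tree with n leaves built this way always has 2n - 1 nodes.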

Key Files

  1. cassiopeia/data/CassiopeiaTree.py: Defines the core data structure for representing phylogenetic trees with associated mutation data.
  2. cassiopeia/solver/CassiopeiaSolver.py: Provides the base class for tree reconstruction algorithms.
  3. cassiopeia/simulator/TreeSimulator.py: Offers the base class for tree simulation models.
  4. cassiopeia/plotting/local.py and cassiopeia/plotting/itol_utilities.py: Implement local and cloud-based (iTOL) tree visualization, respectively.
  5. cassiopeia/tools/parameter_estimators.py: Contains functions for estimating key parameters from tree data.
  6. cassiopeia/critique/compare.py: Implements tree comparison algorithms.
  7. build.py: Handles the building and compilation of Cython extensions for performance-critical components.
  8. README.md: Provides an overview of the project, installation instructions, and links to documentation and tutorials.
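
The base-class pattern described for the solver can be sketched as follows. This is a toy under stated assumptions, not the contents of CassiopeiaSolver.py: `ToySolver`, `ToyGreedySolver`, and the dictionary-based character matrix are hypothetical, state 0 is assumed to mean "unedited", and missing data is not handled. The greedy rule shown (split on the most frequent mutation) is only loosely in the spirit of the greedy solvers listed above.

```python
from abc import ABC, abstractmethod
from typing import Dict, List, Tuple, Union

Tree = Union[str, Tuple["Tree", ...]]  # leaf name or tuple of subtrees
CharacterMatrix = Dict[str, List[int]]  # cell -> character states, 0 = unedited


class ToySolver(ABC):
    """Hypothetical base class: subclasses decide how the tree is built."""

    @abstractmethod
    def solve(self, matrix: CharacterMatrix) -> Tree:
        """Return a tree over the cells in `matrix`."""


class ToyGreedySolver(ToySolver):
    """Recursively split cells on the most frequent mutation."""

    def solve(self, matrix: CharacterMatrix) -> Tree:
        cells = sorted(matrix)
        if len(cells) == 1:
            return cells[0]
        # Count how many cells carry each (character index, state) mutation.
        counts: Dict[Tuple[int, int], int] = {}
        for cell in cells:
            for index, state in enumerate(matrix[cell]):
                if state != 0:
                    key = (index, state)
                    counts[key] = counts.get(key, 0) + 1
        # A mutation shared by every cell cannot separate them.
        informative = {m: n for m, n in counts.items() if n < len(cells)}
        if not informative:
            return tuple(cells)  # leave the group as an unresolved polytomy
        # Sorting first makes tie-breaking deterministic.
        index, state = max(sorted(informative), key=informative.get)
        carriers = {c: matrix[c] for c in cells if matrix[c][index] == state}
        others = {c: matrix[c] for c in cells if matrix[c][index] != state}
        return (self.solve(carriers), self.solve(others))


matrix = {"a": [1, 0, 2], "b": [1, 0, 0], "c": [0, 3, 0], "d": [0, 3, 0]}
print(ToyGreedySolver().solve(matrix))  # (('a', 'b'), ('c', 'd'))
```

Keeping `solve` abstract lets each reconstruction algorithm share one calling convention while supplying its own splitting or joining logic.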

Dependencies

Cassiopeia relies on several external libraries:

  1. numpy and pandas: For data manipulation and numerical computations.
  2. networkx: For graph operations and tree manipulations.
  3. scipy: For various scientific computing tasks.
  4. ete3: For phylogenetic tree manipulation and visualization.
  5. matplotlib and plotly: For local plotting.
  6. pysam: For handling sequencing data formats.
  7. cvxpy and gurobipy: For optimization problems in some solvers.
  8. Cython: For compiling performance-critical components.

Configuration

Configuration is split between runtime parameters and repository-level files:

  1. Preprocessing pipeline: Configured using an INI-format file specifying parameters for each preprocessing step.
  2. Solvers: Customizable through parameters passed during initialization.
  3. Simulators: Configurable via parameters set during object creation.
  4. Visualization: Customizable through function parameters for both local and cloud-based plotting.
  5. pyproject.toml: Defines project metadata, dependencies, and build settings.
  6. .readthedocs.yml: Configures the documentation build process on Read the Docs.
  7. codecov.yml: Sets coverage requirements for the project.
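
An INI-format file such as the one used by the preprocessing pipeline can be read with Python's standard `configparser` module. The sketch below shows the general idea only: the section names (`general`, `filter`) and keys are placeholders chosen for illustration, not the pipeline's documented options.

```python
import configparser

# Hypothetical configuration: one section per preprocessing step.
EXAMPLE_CONFIG = """
[general]
output_directory = ./results
n_threads = 4

[filter]
min_umi_count = 2
"""

parser = configparser.ConfigParser()
parser.read_string(EXAMPLE_CONFIG)


def step_parameters(config: configparser.ConfigParser, step: str) -> dict:
    """Return the raw (string-valued) parameters for one step."""
    return dict(config[step])


# Typed getters convert values on access; plain lookups return strings.
n_threads = parser.getint("general", "n_threads")
print(n_threads)
print(step_parameters(parser, "filter"))
```

Because `configparser` stores every value as a string, numeric options need an explicit conversion such as `getint` or `getfloat` before use.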

The project also includes a test suite in the test directory that exercises the framework's modules.

Together, these components cover each step of a single-cell lineage tracing analysis, from raw data processing to phylogenetic analysis and visualization.