High-level description

The cassiopeia/data directory houses the core data structures and utility functions for representing and manipulating lineage tracing data in the Cassiopeia framework. Chief among these is the CassiopeiaTree class, which encapsulates a phylogenetic tree together with its character matrices, metadata, and methods for tree manipulation and analysis. The directory also provides utility functions for data conversion, dissimilarity computation, bootstrapping, and tree traversal.

What does it do?

This directory provides the building blocks for working with lineage tracing data in Cassiopeia. It allows researchers to:

  1. Represent phylogenetic trees: The CassiopeiaTree class stores the tree topology, character matrix (mutations observed in cells), and metadata associated with cells and characters.
  2. Manipulate tree data: The CassiopeiaTree class offers methods for modifying the tree structure, reconstructing ancestral states, and accessing tree properties.
  3. Compute dissimilarity: Utility functions calculate pairwise dissimilarities between cells based on their character states, enabling distance-based analyses.
  4. Bootstrap data: Functions like sample_bootstrap_character_matrices generate bootstrapped versions of character matrices for statistical analysis.
  5. Convert data formats: Utilities handle the conversion between different data representations, such as from allele tables to character matrices.
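
As a rough illustration of item 3, a pairwise dissimilarity between cells can be computed directly from their character-state vectors. The sketch below is not Cassiopeia's implementation: the function names hamming_with_missing and pairwise_dissimilarity, the use of -1 as the missing-state indicator, and the normalization by the number of comparable characters are all assumptions made for illustration.

```python
from itertools import combinations

MISSING = -1  # assumed missing-state indicator for this sketch


def hamming_with_missing(a, b, missing=MISSING):
    """Fraction of characters that differ, skipping positions missing in either cell."""
    compared = 0
    mismatches = 0
    for x, y in zip(a, b):
        if x == missing or y == missing:
            continue  # ignore characters with no observed state
        compared += 1
        if x != y:
            mismatches += 1
    # With no comparable characters, report zero dissimilarity (a sketch-level choice).
    return mismatches / compared if compared else 0.0


def pairwise_dissimilarity(character_matrix, missing=MISSING):
    """Return {(cell_i, cell_j): dissimilarity} for every unordered pair of cells."""
    return {
        (i, j): hamming_with_missing(character_matrix[i], character_matrix[j], missing)
        for i, j in combinations(sorted(character_matrix), 2)
    }


# Toy character matrix: rows are cells, columns are characters, 0 = unedited.
character_matrix = {
    "cell_a": [1, 0, 2, -1],
    "cell_b": [1, 3, 2, 0],
    "cell_c": [0, 3, -1, 0],
}
dissimilarities = pairwise_dissimilarity(character_matrix)
```

Skipping missing positions keeps cells with sparse data from looking artificially distant; other weightings are possible and may suit a given experiment better.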

Entry points

The main entry point for users is the CassiopeiaTree class (cassiopeia/data/CassiopeiaTree.py), which serves as the primary interface for interacting with lineage tracing data. The utility functions in cassiopeia/data/utilities.py are typically called by the CassiopeiaTree methods or used directly for specific data manipulation tasks.

Key files

cassiopeia/data/CassiopeiaTree.py

This file defines the CassiopeiaTree class, the fundamental data structure in Cassiopeia. It stores and manages:

  • Tree topology (as a NetworkX DiGraph)
  • Character matrix (mutations observed in each cell)
  • Cell and character metadata
  • Dissimilarity map (pairwise distances between cells)

The class provides methods for:

  • Accessing and modifying tree components
  • Reconstructing ancestral character states
  • Computing dissimilarity maps
  • Exporting the tree in various formats
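
To make the ancestral-state reconstruction listed above concrete, one common rule under irreversible editing is to give each internal node the states shared by all of its children and to treat any disagreement as unedited. The sketch below applies that rule to a toy tree. It illustrates the general idea only and is not the CassiopeiaTree method itself; the names lca_states and reconstruct_ancestral_states, the parent-to-children dictionary, and the state encodings (0 for unedited, -1 for missing) are assumptions.

```python
MISSING = -1  # assumed missing-state indicator
UNEDITED = 0  # assumed unedited state


def lca_states(child_vectors, missing=MISSING):
    """Combine child state vectors into a parent vector, one character at a time."""
    parent = []
    for states in zip(*child_vectors):
        observed = {s for s in states if s != missing}
        if not observed:
            parent.append(missing)  # no child has data for this character
        elif len(observed) == 1:
            parent.append(observed.pop())  # all observed children agree
        else:
            parent.append(UNEDITED)  # children disagree, so the edits arose later
    return parent


def reconstruct_ancestral_states(children, leaf_states, root):
    """Fill in internal-node states bottom-up via a post-order traversal."""
    states = dict(leaf_states)

    def visit(node):
        if node not in states:
            states[node] = lca_states([visit(c) for c in children[node]])
        return states[node]

    visit(root)
    return states


# Toy topology as a parent -> children mapping (a directed graph in miniature).
children = {"root": ["n1", "c"], "n1": ["a", "b"]}
leaf_states = {"a": [1, 2, 0], "b": [1, -1, 3], "c": [1, 0, 0]}
states = reconstruct_ancestral_states(children, leaf_states, "root")
```

Here node n1 inherits state 2 for the second character because its only child with data carries it, while the root falls back to unedited for that character because n1 and c disagree.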

cassiopeia/data/utilities.py

This file contains a collection of utility functions for:

  • Converting between different data formats (e.g., allele tables to character matrices)
  • Computing dissimilarity maps using various distance metrics
  • Bootstrapping character matrices
  • Performing tree traversals and manipulations
  • Calculating phylogenetic distances and weights

These functions are used internally by CassiopeiaTree methods and can also be called directly for specific data manipulation tasks.
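
Bootstrapping a character matrix usually means resampling its characters (columns) with replacement while keeping every cell. The sketch below shows that idea with plain lists. It does not reproduce the signature or behavior of sample_bootstrap_character_matrices; the helper name bootstrap_columns, its parameters, and the seed handling are assumptions made for illustration.

```python
import random


def bootstrap_columns(character_matrix, num_bootstraps, seed=None):
    """Yield matrices whose columns are drawn with replacement from the original."""
    rng = random.Random(seed)  # local generator so results are reproducible
    n_characters = len(next(iter(character_matrix.values())))
    for _ in range(num_bootstraps):
        # Sample column indices with replacement; the same index may repeat.
        columns = [rng.randrange(n_characters) for _ in range(n_characters)]
        yield {
            cell: [states[c] for c in columns]
            for cell, states in character_matrix.items()
        }


# Toy character matrix: rows are cells, columns are characters.
character_matrix = {
    "cell_a": [1, 0, 2, -1],
    "cell_b": [1, 3, 2, 0],
}
replicates = list(bootstrap_columns(character_matrix, num_bootstraps=3, seed=0))
```

Each replicate keeps the same cells and the same number of characters, so it can be passed to the same downstream analysis as the original matrix.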

cassiopeia/data/Layers.py

This file defines the Layers class, which allows storing and managing multiple versions of character matrices within a CassiopeiaTree object. This is useful for:

  • Simulations involving imputation of missing data
  • Storing different representations or versions of the character matrix

The Layers class behaves like a dictionary: each key names a layer, and each value is the corresponding character matrix.
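
A dictionary-like container of this kind can be sketched with collections.abc.MutableMapping. The class below is a hypothetical stand-in, not the actual Layers implementation: the name MatrixLayers, the constructor argument, and the check that every layer covers the same cells are assumptions added for illustration.

```python
from collections.abc import MutableMapping


class MatrixLayers(MutableMapping):
    """Hypothetical dict-like store of named character matrices."""

    def __init__(self, cell_names):
        self._cells = list(cell_names)
        self._layers = {}

    def __getitem__(self, key):
        return self._layers[key]

    def __setitem__(self, key, matrix):
        # Assumed validation: every layer must describe the same set of cells.
        if set(matrix) != set(self._cells):
            raise ValueError(f"Layer {key!r} does not match the expected cells.")
        self._layers[key] = matrix

    def __delitem__(self, key):
        del self._layers[key]

    def __iter__(self):
        return iter(self._layers)

    def __len__(self):
        return len(self._layers)


layers = MatrixLayers(["cell_a", "cell_b"])
layers["observed"] = {"cell_a": [1, -1], "cell_b": [0, 2]}
# A second layer holding the same cells after imputing the missing state.
layers["imputed"] = {"cell_a": [1, 2], "cell_b": [0, 2]}
```

Subclassing MutableMapping provides methods such as keys(), items(), and update() for free, which is one reasonable way to get dictionary behavior while still validating each layer on assignment.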

Dependencies

The cassiopeia/data directory depends on several external libraries:

  • pandas: For data manipulation and storage of character matrices and metadata.
  • numpy: For numerical operations, particularly in dissimilarity calculations.
  • networkx: For representing and manipulating the tree topology.
  • ete3: For parsing and manipulating trees in Newick format.
  • scipy: For scientific computing, including distance calculations and phylogenetic analysis.

It also relies on other modules within the Cassiopeia package, such as cassiopeia.mixins for error handling and cassiopeia.solver.solver_utilities for solver-related utilities.