Here’s a high-level description of the utilities.py file in the cassiopeia/data directory:
This file contains utility functions for working with Cassiopeia datasets, particularly for manipulating and analyzing character matrices, allele tables, and phylogenetic trees. It includes functions for bootstrapping, converting between different data formats, computing dissimilarity maps, and various other operations on tree structures and genetic data.
Code Structure
The file consists of standalone functions that can be grouped into several categories:
- Bootstrapping functions
- Data conversion functions
- Tree manipulation functions
- Dissimilarity and distance computation functions
- Utility functions for working with character states and indels
These functions are used throughout the Cassiopeia package to support various data processing and analysis tasks.
Symbols
Here are some of the key functions in this file:
sample_bootstrap_character_matrices
Description
Creates bootstrap samples of character matrices by sampling characters with replacement.
Inputs
- character_matrix: The original character matrix
- prior_probabilities: Optional prior probabilities for characters
- num_bootstraps: Number of bootstrap samples to create
- random_state: Optional random state for reproducibility
Outputs
A list of tuples, each containing a bootstrapped character matrix and corresponding priors.
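The resampling step described above can be sketched as follows. This is a minimal illustration and not the Cassiopeia implementation: the function name `bootstrap_character_matrices`, the column naming scheme, and the layout of the priors (a dict keyed by column position) are assumptions made for the example.

```python
import numpy as np
import pandas as pd


def bootstrap_character_matrices(
    character_matrix, prior_probabilities=None, num_bootstraps=10, random_state=None
):
    """Hypothetical helper: resample characters (columns) with replacement."""
    rng = random_state if random_state is not None else np.random.RandomState()
    n_characters = character_matrix.shape[1]
    samples = []
    for _ in range(num_bootstraps):
        # Draw column positions with replacement; duplicates are expected.
        picked = rng.choice(n_characters, size=n_characters, replace=True)
        resampled = character_matrix.iloc[:, picked].copy()
        # Rename columns so repeated characters stay distinguishable.
        resampled.columns = [f"random_character{i}" for i in range(n_characters)]
        # Assumption: priors are keyed by column position, so re-key them
        # to follow the resampled column order.
        new_priors = None
        if prior_probabilities is not None:
            new_priors = {
                new_i: prior_probabilities[int(old_i)]
                for new_i, old_i in enumerate(picked)
            }
        samples.append((resampled, new_priors))
    return samples


cm = pd.DataFrame(
    [[1, 0, 2], [1, 1, 0], [0, 1, 2]],
    index=["cell_a", "cell_b", "cell_c"],
    columns=["r1", "r2", "r3"],
)
example_priors = {0: {1: 1.0}, 1: {1: 1.0}, 2: {2: 1.0}}
boot = bootstrap_character_matrices(
    cm, example_priors, num_bootstraps=5, random_state=np.random.RandomState(0)
)
```

Each bootstrap keeps the same cells (rows) and the same number of characters, so downstream tree inference can run on every sample without reshaping.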
convert_alleletable_to_character_matrix
Description
Converts an allele table to a character matrix format.
Inputs
- alleletable: The input allele table
- Various optional parameters for filtering and processing
Outputs
A tuple containing the character matrix, prior probabilities, and a mapping of states to indels.
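As a rough illustration of what such a conversion involves, the sketch below encodes each distinct indel at a cut site as an integer state. It is not the library's code. The column names (`cellBC`, `intBC`, `r1`, `r2`), the convention that an indel string containing `None` means "uncut" (state 0), and the use of -1 for missing data are all assumptions for the example; priors are omitted.

```python
import pandas as pd


def alleletable_to_character_matrix(
    alleletable, cut_sites=("r1", "r2"), uncut_marker="None"
):
    """Hypothetical helper: one column per (intBC, cut site), integer states."""
    rows = {}            # cell -> {character name -> integer state}
    indel_to_state = {}  # character name -> {indel string -> integer state}
    state_to_indel = {}  # character name -> {integer state -> indel string}
    for record in alleletable.itertuples(index=False):
        for site in cut_sites:
            character = f"{record.intBC}_{site}"
            indel = getattr(record, site)
            if uncut_marker in indel:
                state = 0  # assumed encoding for an unedited site
            else:
                known = indel_to_state.setdefault(character, {})
                if indel not in known:
                    # First time this indel is seen at this character.
                    known[indel] = len(known) + 1
                    state_to_indel.setdefault(character, {})[known[indel]] = indel
                state = known[indel]
            rows.setdefault(record.cellBC, {})[character] = state
    # Characters never observed in a cell are marked as missing (-1).
    matrix = pd.DataFrame.from_dict(rows, orient="index").fillna(-1).astype(int)
    return matrix, state_to_indel


alleletable = pd.DataFrame(
    [
        ["A", "X", "ACGT[None]", "TT[5:2D]"],
        ["B", "X", "ACGT[3:1I]", "TT[5:2D]"],
        ["B", "Y", "GG[None]", "CC[None]"],
        ["C", "X", "ACGT[None]", "TT[7:1I]"],
    ],
    columns=["cellBC", "intBC", "r1", "r2"],
)
matrix, state_to_indel = alleletable_to_character_matrix(alleletable)
```

Returning the state-to-indel mapping alongside the matrix lets integer states be traced back to the original edits.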
compute_phylogenetic_weight_matrix
Description
Computes a phylogenetic weight matrix based on the distances between leaves in a tree.
Inputs
- tree: A CassiopeiaTree object
- inverse: Whether to compute inverse weights
- inverse_fn: Function to use for inverse computation
Outputs
A pandas DataFrame representing the phylogenetic weight matrix.
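The idea of a leaf-by-leaf weight matrix can be sketched with plain networkx instead of a CassiopeiaTree. In this illustration the tree is assumed to be a `networkx.DiGraph` whose edges carry a `length` attribute, and the function name `phylogenetic_weight_matrix` is hypothetical; the actual function may treat the diagonal and zero distances differently.

```python
import networkx as nx
import numpy as np
import pandas as pd


def phylogenetic_weight_matrix(tree, inverse=False, inverse_fn=lambda d: 1.0 / d):
    """Hypothetical helper: pairwise leaf distances, optionally inverted."""
    leaves = [n for n in tree.nodes if tree.out_degree(n) == 0]
    # Path lengths between leaves ignore edge direction.
    undirected = tree.to_undirected()
    weights = pd.DataFrame(
        np.zeros((len(leaves), len(leaves))), index=leaves, columns=leaves
    )
    for source in leaves:
        dist = nx.single_source_dijkstra_path_length(
            undirected, source, weight="length"
        )
        for target in leaves:
            if source == target:
                continue  # keep the diagonal at zero
            d = dist[target]
            weights.loc[source, target] = inverse_fn(d) if inverse else d
    return weights


tree = nx.DiGraph()
tree.add_weighted_edges_from(
    [
        ("root", "a", 1.0),
        ("root", "internal", 0.5),
        ("internal", "b", 1.0),
        ("internal", "c", 2.0),
    ],
    weight="length",
)
W = phylogenetic_weight_matrix(tree)
W_inv = phylogenetic_weight_matrix(tree, inverse=True)
```

With `inverse=True`, closely related leaves receive larger weights, which is the form typically wanted for autocorrelation-style statistics.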
net_relatedness_index
Description
Computes the net relatedness index between two groups of indices in a dissimilarity map.
Inputs
- dissimilarity_map: A numpy array of dissimilarities
- indices_1: First group of indices
- indices_2: Second group of indices
Outputs
The computed net relatedness index as a float.
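A small worked example may help. The sketch below assumes the index is the mean pairwise dissimilarity between the two groups, that is, the sum of `dissimilarity_map[i, j]` over all `i` in `indices_1` and `j` in `indices_2`, divided by the number of pairs. The function name `mean_between_group_dissimilarity` is hypothetical.

```python
import numpy as np


def mean_between_group_dissimilarity(dissimilarity_map, indices_1, indices_2):
    """Hypothetical helper: average dissimilarity over all cross-group pairs."""
    # np.ix_ selects the sub-block with rows from group 1 and columns from group 2.
    block = dissimilarity_map[np.ix_(indices_1, indices_2)]
    return float(block.sum() / (len(indices_1) * len(indices_2)))


D = np.array(
    [
        [0.0, 1.0, 4.0, 5.0],
        [1.0, 0.0, 3.0, 6.0],
        [4.0, 3.0, 0.0, 2.0],
        [5.0, 6.0, 2.0, 0.0],
    ]
)
# Cross-group entries are 4, 5, 3 and 6, so the mean is 18 / 4.
nri = mean_between_group_dissimilarity(D, np.array([0, 1]), np.array([2, 3]))
print(nri)  # 4.5
```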
Dependencies
The file relies on several external libraries, including:
- numpy
- pandas
- networkx
- scipy
- ete3
It also imports from other parts of the Cassiopeia package, particularly from the mixins and preprocess modules.
This file is central to many data processing tasks in Cassiopeia, providing essential utilities for working with genetic data and phylogenetic trees.