High-level description of the utilities.py file in the cassiopeia/data directory:

This file contains utility functions for working with Cassiopeia datasets, particularly for manipulating and analyzing character matrices, allele tables, and phylogenetic trees. It includes functions for bootstrapping, converting between different data formats, computing dissimilarity maps, and various other operations on tree structures and genetic data.

Code Structure

The file consists of standalone functions that can be grouped into several categories:

  1. Bootstrapping functions
  2. Data conversion functions
  3. Tree manipulation functions
  4. Dissimilarity and distance computation functions
  5. Utility functions for working with character states and indels

These functions are used throughout the Cassiopeia package to support various data processing and analysis tasks.

Symbols

Here are some of the key functions in this file:

sample_bootstrap_character_matrices

Description

Creates bootstrap samples of character matrices by sampling characters with replacement.

Inputs

  • character_matrix: The original character matrix
  • prior_probabilities: Optional prior probabilities for characters
  • num_bootstraps: Number of bootstrap samples to create
  • random_state: Optional random state for reproducibility

Outputs

A list of tuples, each containing a bootstrapped character matrix and corresponding priors.
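
As a rough illustration of the resampling step, the sketch below draws characters (columns) with replacement and remaps the priors to the new column positions. It is not the Cassiopeia implementation: the helper name bootstrap_characters, the toy matrix, and the assumed prior layout (character index mapped to a dictionary of state probabilities) are all hypothetical.

```python
import numpy as np
import pandas as pd


def bootstrap_characters(character_matrix, prior_probabilities=None,
                         num_bootstraps=10, random_state=None):
    """Hypothetical helper: resample characters (columns) with replacement."""
    rng = random_state if random_state is not None else np.random.default_rng()
    n_chars = character_matrix.shape[1]
    samples = []
    for _ in range(num_bootstraps):
        # Draw column positions with replacement.
        cols = rng.integers(0, n_chars, size=n_chars)
        sample = character_matrix.iloc[:, cols].copy()
        # Relabel columns so repeated characters stay distinguishable.
        sample.columns = range(n_chars)
        priors = None
        if prior_probabilities is not None:
            # Assumed layout: {character index: {state: probability}}.
            priors = {new: prior_probabilities[int(old)]
                      for new, old in enumerate(cols)}
        samples.append((sample, priors))
    return samples


# Toy data: three cells, three characters.
cm = pd.DataFrame(
    [[1, 0, 2], [1, 1, 0], [0, 1, 2]],
    index=["cell_a", "cell_b", "cell_c"],
    columns=[0, 1, 2],
)
toy_priors = {0: {1: 1.0}, 1: {1: 1.0}, 2: {2: 1.0}}
boots = bootstrap_characters(
    cm, toy_priors, num_bootstraps=5, random_state=np.random.default_rng(0)
)
```

Each element of boots is a (matrix, priors) pair with the same cells and the same number of characters as the input.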

convert_alleletable_to_character_matrix

Description

Converts an allele table to a character matrix format.

Inputs

  • alleletable: The input allele table
  • Various optional parameters for filtering and processing

Outputs

A tuple containing the character matrix, prior probabilities, and a mapping of states to indels.
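
The general idea of the conversion can be sketched as a pivot followed by integer encoding. Everything here is an assumption for illustration: the helper alleles_to_character_matrix, the column names cellBC, intBC, and allele, the "NONE" marker for an uncut site, and the use of -1 for missing data. The sketch also omits the prior probabilities that the real function returns.

```python
import pandas as pd


def alleles_to_character_matrix(allele_table, missing_state=-1, uncut="NONE"):
    """Hypothetical helper: pivot a long allele table into a cell x character matrix."""
    # One row per cell, one column per character, indel strings as values.
    wide = allele_table.pivot(index="cellBC", columns="intBC", values="allele")
    encoded = pd.DataFrame(missing_state, index=wide.index, columns=wide.columns)
    state_to_indel = {}
    for char_idx, char in enumerate(wide.columns):
        indel_to_state = {}
        for cell, indel in wide[char].items():
            if pd.isna(indel):
                continue  # cell has no observation: keep the missing state
            if indel == uncut:
                encoded.loc[cell, char] = 0  # state 0 means unedited
                continue
            if indel not in indel_to_state:
                # Assign states 1, 2, ... in order of first appearance.
                indel_to_state[indel] = len(indel_to_state) + 1
            encoded.loc[cell, char] = indel_to_state[indel]
        state_to_indel[char_idx] = {s: i for i, s in indel_to_state.items()}
    return encoded, state_to_indel


# Toy allele table; cell c3 has no observation for character B.
table = pd.DataFrame({
    "cellBC": ["c1", "c1", "c2", "c2", "c3"],
    "intBC": ["A", "B", "A", "B", "A"],
    "allele": ["NONE", "10:2D", "5:1I", "10:2D", "5:1I"],
})
matrix, state_to_indel = alleles_to_character_matrix(table)
```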

compute_phylogenetic_weight_matrix

Description

Computes a phylogenetic weight matrix based on the distances between leaves in a tree.

Inputs

  • tree: A CassiopeiaTree object
  • inverse: Whether to compute inverse weights
  • inverse_fn: Function to use for inverse computation

Outputs

A pandas DataFrame representing the phylogenetic weight matrix.
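
To make the shape of the output concrete, here is a minimal sketch that builds a leaf-by-leaf matrix of path distances with networkx. It works on a plain networkx.DiGraph rather than a CassiopeiaTree, and the helper name leaf_distance_matrix, the "length" edge attribute, and the default reciprocal inverse function are assumptions, not the library's API.

```python
import networkx as nx
import pandas as pd


def leaf_distance_matrix(tree, inverse=False, inverse_fn=lambda x: 1.0 / x):
    """Hypothetical helper: pairwise path distances between the leaves of a tree."""
    leaves = sorted(n for n in tree.nodes if tree.out_degree(n) == 0)
    undirected = tree.to_undirected()  # leaf-to-leaf paths pass through ancestors
    weights = pd.DataFrame(0.0, index=leaves, columns=leaves)
    for a in leaves:
        dist = nx.single_source_dijkstra_path_length(undirected, a, weight="length")
        for b in leaves:
            if a != b:
                d = dist[b]
                # The diagonal stays 0; off-diagonal entries are optionally inverted.
                weights.loc[a, b] = inverse_fn(d) if inverse else d
    return weights


# Toy tree: root -> x -> {a, b}, root -> c, with branch lengths.
tree = nx.DiGraph()
tree.add_edge("root", "x", length=1.0)
tree.add_edge("x", "a", length=1.0)
tree.add_edge("x", "b", length=2.0)
tree.add_edge("root", "c", length=3.0)

W = leaf_distance_matrix(tree)
W_inv = leaf_distance_matrix(tree, inverse=True)
```

With inverse=True, closely related leaves receive larger weights than distant ones, which is the usual reason for inverting distances before using them as weights.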

net_relatedness_index

Description

Computes the net relatedness index between two groups of indices in a dissimilarity map.

Inputs

  • dissimilarity_map: A numpy array of dissimilarities
  • indices_1: First group of indices
  • indices_2: Second group of indices

Outputs

The computed net relatedness index as a float.
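
A minimal sketch of this computation, assuming the index is the mean pairwise dissimilarity between members of the two groups. That definition is an assumption made for illustration, so check the source for the exact formula. The helper name mean_between_group_dissimilarity is hypothetical.

```python
import numpy as np


def mean_between_group_dissimilarity(dissimilarity_map, indices_1, indices_2):
    """Hypothetical helper: average dissimilarity over all cross-group pairs."""
    # np.ix_ selects the sub-block with rows from group 1 and columns from group 2.
    block = dissimilarity_map[np.ix_(indices_1, indices_2)]
    return float(block.mean())


# Toy symmetric dissimilarity map over four samples.
D = np.array([
    [0.0, 1.0, 4.0, 5.0],
    [1.0, 0.0, 3.0, 6.0],
    [4.0, 3.0, 0.0, 2.0],
    [5.0, 6.0, 2.0, 0.0],
])
nri = mean_between_group_dissimilarity(D, [0, 1], [2, 3])
```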

Dependencies

The file relies on several external libraries, including:

  • numpy
  • pandas
  • networkx
  • scipy
  • ete3

It also imports from other parts of the Cassiopeia package, particularly from the mixins and preprocess modules.

These utilities underpin many of Cassiopeia's data processing tasks that involve genetic data and phylogenetic trees.