High-level description

The GreedySolver class in Cassiopeia is an abstract base class for implementing top-down greedy algorithms for phylogenetic tree reconstruction. It provides a framework for recursively splitting a set of samples based on a split criterion, building the tree from the root down to the leaves. Subclasses of GreedySolver implement the specific logic for performing the split.

Code Structure

The GreedySolver class defines the overall structure of a greedy solver, including methods for performing splits, computing mutation frequencies, and adding duplicate samples to the tree. The perform_split method is an abstract method that must be implemented by subclasses to define the specific split criterion.

References

  • CassiopeiaSolver: The GreedySolver class inherits from the CassiopeiaSolver abstract base class.
  • solver_utilities: The GreedySolver class uses utility functions from the solver_utilities module, such as node_name_generator and transform_priors.
  • missing_data_methods: The GreedySolver class uses functions from the missing_data_methods module to handle missing data during character splits.
  • find_duplicate_groups: The GreedySolver class uses the find_duplicate_groups function from the cassiopeia.mixins module to identify duplicate samples.

Symbols

GreedySolver

Description

This class represents a greedy solver for phylogenetic tree reconstruction. It provides a framework for recursively splitting a set of samples based on a split criterion, building the tree from the root down to the leaves.

Inputs

NameTypeDescription
prior_transformationstrFunction to use when transforming priors into weights. Defaults to “negative_log”.

Outputs

None

Internal Logic

  1. Transform the priors into weights using the specified prior_transformation.
  2. Recursively split the set of samples using the perform_split method: a. Create an ancestral node for each split. b. Place the resulting partitions as daughter clades.
  3. Continue splitting until each partition contains only a single sample.
  4. If a split cannot be performed, create a polytomy.
  5. Add duplicate samples to the tree.
  6. Collapse mutationless edges if specified.

perform_split

Description

This abstract method defines the logic for partitioning a set of samples. It must be implemented by subclasses of GreedySolver.

Inputs

NameTypeDescription
character_matrixpd.DataFrameCharacter matrix
samplesList[int]A list of samples to partition
weightsOptional[Dict[int, Dict[int, float]]]Weighting of each (character, state) pair. Typically a transformation of the priors.
missing_state_indicatorintCharacter representing missing data.

Outputs

NameTypeDescription
left_partitionList[Union[int, str]]The left partition group
right_partitionList[Union[int, str]]The right partition group

compute_mutation_frequencies

Description

This method computes the frequencies of each character/state pair in a given set of samples.

Inputs

NameTypeDescription
samplesList[str]The set of samples to consider
unique_character_matrixpd.DataFrameThe character matrix
missing_state_indicatorintCharacter representing missing data.

Outputs

NameTypeDescription
freq_dictDict[int, Dict[int, int]]A dictionary mapping each character to a dictionary of state/sample frequency pairs

Internal Logic

  1. Iterate over each character in the character matrix.
  2. For each character, count the occurrences of each state for the given samples.
  3. Store the frequencies in a dictionary:
    • Keys are characters
    • Values are dictionaries mapping states to their frequencies

__add_duplicates_to_tree

Description

This private method adds duplicate samples to the tree.

Inputs

NameTypeDescription
treenx.DiGraphThe tree to add duplicates to
character_matrixpd.DataFrameCharacter matrix
node_name_generatorGenerator[str, None, None]Generator for creating unique node names

Outputs

NameTypeDescription
treenx.DiGraphThe tree with duplicates added

Internal Logic

  1. Identify duplicate samples using the find_duplicate_groups function.
  2. For each group of duplicates: a. Create a new internal node. b. Connect the new node to the original sample. c. Connect the new node to all duplicates of the original sample.

Side Effects

The solve method modifies the input CassiopeiaTree object by populating its tree attribute.

Dependencies

  • networkx: Used for representing and manipulating the tree.
  • numpy: Used for numerical operations.
  • pandas: Used for representing the character matrix.