High-level description

The PercolationSolver class implements a top-down, percolation-based algorithm for reconstructing phylogenetic trees. It recursively partitions a set of samples based on their similarity in observed mutations, creating a tree structure where similar samples are grouped closer together.

Code Structure

The PercolationSolver class is initialized with a joining solver, prior transformation function, similarity function, and a threshold. The solve method orchestrates the tree reconstruction process by recursively partitioning the sample set using the percolate method. The percolate method constructs a similarity graph, removes edges based on similarity, and partitions the graph into clades. If more than two clades are formed, the joining_solver is used to cluster them. Finally, the __add_duplicates_to_tree method adds duplicate samples to the tree.

References

This code references the following symbols:

  • CassiopeiaSolver (from cassiopeia.solver)
  • dissimilarity_functions (from cassiopeia.solver)
  • solver_utilities (from cassiopeia.solver)
  • CassiopeiaTree (from cassiopeia.data)
  • data_utilities (from cassiopeia.data.utilities)

Symbols

PercolationSolver

Description

This class implements the Percolation algorithm for phylogenetic tree reconstruction. It takes a character matrix and optional priors as input and builds a tree by recursively partitioning the sample set based on their similarity in observed mutations.

Inputs

NameTypeDescription
joining_solverCassiopeiaSolver.CassiopeiaSolverThe CassiopeiaSolver used to cluster groups of samples if percolation produces more than two groups.
prior_transformationstrA string specifying the transformation to apply to the priors. Defaults to “negative_log”.
similarity_functionOptional[Callable[[np.array, np.array, int, Optional[Dict[int, Dict[int, float]]]], float]]A function that calculates the similarity score between two samples. Defaults to hamming_similarity_without_missing.
thresholdOptional[int]The minimum similarity threshold for adding an edge in the similarity graph. Defaults to 0.

Outputs

This class doesn’t directly return any output. It modifies the input CassiopeiaTree object to store the reconstructed tree.

Internal Logic

  1. Initialization: The class is initialized with a joining solver, prior transformation function, similarity function, and a threshold.
  2. Solve: The solve method orchestrates the tree reconstruction process.
    • It first extracts the character matrix and transforms the priors into weights if provided.
    • It then creates a NetworkX graph and recursively partitions the sample set using the percolate method.
    • Finally, it adds duplicate samples to the tree and populates the input CassiopeiaTree object.
  3. Percolate: The percolate method partitions a set of samples into two groups based on their similarity.
    • It constructs a similarity graph with samples as nodes and edges weighted by their similarity.
    • It then iteratively removes the minimum weighted edges until the graph is disconnected into multiple components.
    • If more than two components are formed, it clusters them using the joining_solver and returns the two resulting groups.
  4. Add Duplicates: The __add_duplicates_to_tree method adds duplicate samples to the tree by finding groups of samples with identical character states and placing them as sister nodes under a new internal node.

Side Effects

  • Modifies the input CassiopeiaTree object to store the reconstructed tree.

percolate

Description

Partitions a set of samples into two groups based on their similarity in the character matrix.

Inputs

NameTypeDescription
character_matrixpd.DataFrameThe character matrix representing the mutation profiles of the samples.
samplesList[str]A list of sample names to be partitioned.
priorsOptional[Dict[int, Dict[int, float]]]A dictionary of prior probabilities for each character/state pair.
weightsOptional[Dict[int, Dict[int, float]]]A dictionary of weights for each character/state pair, typically derived from the priors.
missing_state_indicatorintThe character representing missing data in the character matrix. Defaults to -1.

Outputs

NameTypeDescription
partition_namedTuple[List[str], List[str]]A tuple containing two lists of sample names, representing the left and right partitions.

Internal Logic

  1. Similarity Graph: Constructs a weighted graph with samples as nodes and edges weighted by their similarity calculated using the provided similarity_function.
  2. Percolation: Iteratively removes the minimum weighted edges until the graph is disconnected into multiple connected components.
  3. Component Merging (if necessary): If more than two connected components are formed, it clusters them using the joining_solver based on the LCA of each component and returns the two resulting groups.
  4. Return Partitions: Returns a tuple containing two lists of sample names, representing the left and right partitions.

__add_duplicates_to_tree

Description

Adds duplicate samples to the tree.

Inputs

NameTypeDescription
treenx.DiGraphThe tree to which duplicate samples need to be added.
character_matrixpd.DataFrameThe character matrix representing the mutation profiles of the samples.
node_name_generatorGenerator[str, None, None]A generator object that produces unique node names.

Outputs

NameTypeDescription
treenx.DiGraphThe tree with duplicate samples added.

Internal Logic

  1. Identify Duplicates: Identifies groups of samples with identical character states in the character matrix.
  2. Add Duplicates: For each group of duplicates:
    • Creates a new internal node as the parent of the duplicates.
    • Connects the original sample and all its duplicates as children of this new internal node.
  3. Return Tree: Returns the modified tree with the duplicate samples added.