PercolationSolver.py
High-level description
The PercolationSolver
class implements a top-down, percolation-based algorithm for reconstructing phylogenetic trees. It recursively partitions a set of samples based on their similarity in observed mutations, creating a tree structure where similar samples are grouped closer together.
Code Structure
The PercolationSolver
class is initialized with a joining solver, prior transformation function, similarity function, and a threshold. The solve
method orchestrates the tree reconstruction process by recursively partitioning the sample set using the percolate
method. The percolate
method constructs a similarity graph, removes edges based on similarity, and partitions the graph into clades. If more than two clades are formed, the joining_solver
is used to cluster them. Finally, the __add_duplicates_to_tree
method adds duplicate samples to the tree.
References
This code references the following symbols:
CassiopeiaSolver
(fromcassiopeia.solver
)dissimilarity_functions
(fromcassiopeia.solver
)solver_utilities
(fromcassiopeia.solver
)CassiopeiaTree
(fromcassiopeia.data
)data_utilities
(fromcassiopeia.data.utilities
)
Symbols
PercolationSolver
Description
This class implements the Percolation algorithm for phylogenetic tree reconstruction. It takes a character matrix and optional priors as input and builds a tree by recursively partitioning the sample set based on their similarity in observed mutations.
Inputs
Name | Type | Description |
---|---|---|
joining_solver | CassiopeiaSolver.CassiopeiaSolver | The CassiopeiaSolver used to cluster groups of samples if percolation produces more than two groups. |
prior_transformation | str | A string specifying the transformation to apply to the priors. Defaults to “negative_log”. |
similarity_function | Optional[Callable[[np.array, np.array, int, Optional[Dict[int, Dict[int, float]]]], float]] | A function that calculates the similarity score between two samples. Defaults to hamming_similarity_without_missing . |
threshold | Optional[int] | The minimum similarity threshold for adding an edge in the similarity graph. Defaults to 0. |
Outputs
This class doesn’t directly return any output. It modifies the input CassiopeiaTree
object to store the reconstructed tree.
Internal Logic
- Initialization: The class is initialized with a joining solver, prior transformation function, similarity function, and a threshold.
- Solve: The
solve
method orchestrates the tree reconstruction process.- It first extracts the character matrix and transforms the priors into weights if provided.
- It then creates a NetworkX graph and recursively partitions the sample set using the
percolate
method. - Finally, it adds duplicate samples to the tree and populates the input
CassiopeiaTree
object.
- Percolate: The
percolate
method partitions a set of samples into two groups based on their similarity.- It constructs a similarity graph with samples as nodes and edges weighted by their similarity.
- It then iteratively removes the minimum weighted edges until the graph is disconnected into multiple components.
- If more than two components are formed, it clusters them using the
joining_solver
and returns the two resulting groups.
- Add Duplicates: The
__add_duplicates_to_tree
method adds duplicate samples to the tree by finding groups of samples with identical character states and placing them as sister nodes under a new internal node.
Side Effects
- Modifies the input
CassiopeiaTree
object to store the reconstructed tree.
percolate
Description
Partitions a set of samples into two groups based on their similarity in the character matrix.
Inputs
Name | Type | Description |
---|---|---|
character_matrix | pd.DataFrame | The character matrix representing the mutation profiles of the samples. |
samples | List[str] | A list of sample names to be partitioned. |
priors | Optional[Dict[int, Dict[int, float]]] | A dictionary of prior probabilities for each character/state pair. |
weights | Optional[Dict[int, Dict[int, float]]] | A dictionary of weights for each character/state pair, typically derived from the priors. |
missing_state_indicator | int | The character representing missing data in the character matrix. Defaults to -1. |
Outputs
Name | Type | Description |
---|---|---|
partition_named | Tuple[List[str], List[str]] | A tuple containing two lists of sample names, representing the left and right partitions. |
Internal Logic
- Similarity Graph: Constructs a weighted graph with samples as nodes and edges weighted by their similarity calculated using the provided
similarity_function
. - Percolation: Iteratively removes the minimum weighted edges until the graph is disconnected into multiple connected components.
- Component Merging (if necessary): If more than two connected components are formed, it clusters them using the
joining_solver
based on the LCA of each component and returns the two resulting groups. - Return Partitions: Returns a tuple containing two lists of sample names, representing the left and right partitions.
__add_duplicates_to_tree
Description
Adds duplicate samples to the tree.
Inputs
Name | Type | Description |
---|---|---|
tree | nx.DiGraph | The tree to which duplicate samples need to be added. |
character_matrix | pd.DataFrame | The character matrix representing the mutation profiles of the samples. |
node_name_generator | Generator[str, None, None] | A generator object that produces unique node names. |
Outputs
Name | Type | Description |
---|---|---|
tree | nx.DiGraph | The tree with duplicate samples added. |
Internal Logic
- Identify Duplicates: Identifies groups of samples with identical character states in the character matrix.
- Add Duplicates: For each group of duplicates:
- Creates a new internal node as the parent of the duplicates.
- Connects the original sample and all its duplicates as children of this new internal node.
- Return Tree: Returns the modified tree with the duplicate samples added.