HybridSolver.py
High-level description
The HybridSolver
class in Cassiopeia implements a hybrid approach to phylogenetic tree reconstruction. It combines a top-down greedy solver with a more precise bottom-up solver to efficiently infer relationships in large datasets. The greedy solver is applied initially to cluster cells until a specified cutoff is reached (either based on the number of cells or the distance to the latest common ancestor). Then, the bottom solver is used to reconstruct subtrees for each cluster, resulting in a complete tree.
Code Structure
The HybridSolver
class inherits from the CassiopeiaSolver
class and utilizes two solver instances: top_solver
(a subclass of GreedySolver
) and bottom_solver
(any subclass of CassiopeiaSolver
). The solve
method orchestrates the hybrid solving process, while apply_top_solver
and apply_bottom_solver
handle the respective solver applications. assess_cutoff
determines when to switch from the top solver to the bottom solver.
References
GreedySolver
: TheHybridSolver
relies on aGreedySolver
instance for the initial top-down clustering.CassiopeiaSolver
: Bothtop_solver
andbottom_solver
are subclasses ofCassiopeiaSolver
, providing a common interface for tree reconstruction.CassiopeiaTree
: TheHybridSolver
operates on aCassiopeiaTree
object, which stores the character matrix, priors, and the resulting tree.
Symbols
HybridSolver
Description
This class implements the hybrid phylogenetic tree reconstruction algorithm. It combines a greedy top-down solver with a more precise bottom-up solver to handle large datasets efficiently.
Inputs
Name | Type | Description |
---|---|---|
top_solver | GreedySolver.GreedySolver | An algorithm to be applied at the top of the tree. Must be a subclass of GreedySolver. |
bottom_solver | CassiopeiaSolver.CassiopeiaSolver | An algorithm to be applied at the bottom of the tree. Must be a subclass of CassiopeiaSolver. |
lca_cutoff | float | Distance to the latest-common-ancestor (LCA) of a subclade to be used as a cutoff for transitioning to the bottom solver. |
cell_cutoff | int | Number of cells in a subclade to be used as a cutoff for transitioning to the bottom solver. |
threads | int | Number of threads to be used. This corresponds to the number of subproblems to be run concurrently with the bottom solver. |
prior_transformation | str | Function to use when transforming priors into weights. |
progress_bar | bool | Indicates if a progress bar should be shown when solve is called. |
Outputs
The HybridSolver
does not directly return any outputs. It modifies the input CassiopeiaTree
object by populating it with the reconstructed tree.
Internal Logic
- Apply Top Solver: The
apply_top_solver
method is used to recursively cluster cells using thetop_solver
until the specified cutoff is reached (eitherlca_cutoff
orcell_cutoff
). - Generate Subproblems: As the top solver clusters cells, it generates subproblems, each consisting of a subtree root and a list of samples within that subtree.
- Apply Bottom Solver: The
apply_bottom_solver
method is applied to each subproblem, using thebottom_solver
to reconstruct the subtree for the corresponding samples. - Combine Subtrees: The resulting subtrees are combined into a single tree, forming the complete reconstructed phylogeny.
- Add Duplicates and Prune: Duplicate samples are added to the tree, and any spurious leaves (not present in the original character matrix) are pruned.
- Populate CassiopeiaTree: The final reconstructed tree is used to populate the input
CassiopeiaTree
object.
Side Effects
- Modifies the input
CassiopeiaTree
object by populating it with the reconstructed tree. - Creates log files for each subproblem solved by the bottom solver.
apply_top_solver
Description
This method recursively applies the top solver to cluster cells until the specified cutoff is reached.
Inputs
Name | Type | Description |
---|---|---|
character_matrix | pd.DataFrame | Character matrix |
samples | List[str] | Samples in the subclade of interest. |
tree | nx.DiGraph | In progress tree for the HybridSolver. |
node_name_generator | Generator[str, None, None] | Generator for creating unique node names while applying the top-solver. |
weights | Optional[Dict[int, Dict[int, float]]] | Weights of character-state combinations, derived from priors if these are available. |
missing_state_indicator | int | Indicator for missing data |
root | Optional[int] | Node ID of the root in the subtree containing the samples. |
Outputs
Name | Type | Description |
---|---|---|
root | int | The ID of the node serving as the root of the tree containing the samples. |
subproblems | List[Tuple[int, List[str]]] | A list of subproblems in the form [subtree-root, subtree-samples]. |
tree | nx.DiGraph | The updated tree with the top solver applied. |
Internal Logic
- Base Case: If there is only one sample, return the sample ID, an empty list of subproblems, and the unmodified tree.
- Perform Split: Use the
top_solver
to partition the samples based on the character matrix and weights. - Create Root: Generate a unique node name for the root of the current subtree.
- Handle Single Clade: If the split results in a single clade, connect all samples in the clade to the root and return.
- Recursive Application: For each clade resulting from the split:
- If the cutoff is reached, add the clade as a subproblem.
- Otherwise, recursively apply
apply_top_solver
to the clade, connect the resulting subtree to the current root, and add any new subproblems.
- Return Results: Return the root ID, the list of subproblems, and the updated tree.
apply_bottom_solver
Description
This method applies the bottom solver to a given subproblem, reconstructing the subtree for the corresponding samples.
Inputs
Name | Type | Description |
---|---|---|
cassiopeia_tree | CassiopeiaTree | CassiopeiaTree for the entire dataset. This will be subsetted with respect to the samples specified. |
root | int | Identifier of the root in the master tree |
samples | List[str] | A list of samples for which to infer a tree. |
logfile | str | Base location for logging output. A specific logfile will be created from this base logfile name. |
layer | Optional[str] | Layer storing the character matrix for solving. If None, the default character matrix is used in the CassiopeiaTree. |
Outputs
Name | Type | Description |
---|---|---|
subproblem_tree | nx.DiGraph | A tree in the form of a Networkx graph for the subproblem. |
root | int | The original root identifier. |
Internal Logic
- Base Case: If there is only one sample, create a simple tree with the root connected to the sample and return.
- Subset Character Matrix: Extract the relevant portion of the character matrix corresponding to the input samples.
- Create Subtree: Create a new
CassiopeiaTree
object for the subproblem, using the subsetted character matrix and priors from the original tree. - Solve Subproblem: Apply the
bottom_solver
to the subtree, reconstructing the phylogeny for the subproblem samples. - Connect to Root: Connect the root of the reconstructed subtree to the input root ID.
- Return Results: Return the reconstructed subtree and the original root ID.
assess_cutoff
Description
This method determines whether the specified cutoff has been reached for a given set of samples.
Inputs
Name | Type | Description |
---|---|---|
samples | List[str] | A list of samples in a clade. |
character_matrix | pd.DataFrame | Character matrix |
missing_state_indicator | int | Indicator for missing data. |
Outputs
Name | Type | Description |
---|---|---|
cutoff_reached | bool | True if the cutoff is reached, False if not. |
Internal Logic
- Cell Cutoff: If
cell_cutoff
is specified, check if the number of samples is less than or equal to the cutoff. If so, return True. - LCA Cutoff: If
lca_cutoff
is specified:- Calculate the LCA character states for the samples.
- Calculate the Hamming distance between each sample’s character states and the LCA character states.
- If the maximum distance is less than or equal to the
lca_cutoff
, return True.
- Return False: If neither cutoff is reached, return False.
__add_duplicates_to_tree_and_remove_spurious_leaves
Description
This private method adds duplicate samples to the tree and removes any spurious leaves that are not present in the original character matrix.
Inputs
Name | Type | Description |
---|---|---|
tree | nx.DiGraph | The tree after solving. |
character_matrix | pd.DataFrame | Character matrix |
node_name_generator | Generator[str, None, None] | Generator for creating unique node names. |
Outputs
Name | Type | Description |
---|---|---|
tree | nx.DiGraph | The tree with duplicates added and spurious leaves pruned. |
Internal Logic
- Add Duplicates:
- Identify groups of duplicate samples based on their character states.
- For each group, create a new internal node and connect it to the original sample and all its duplicates.
- Remove Spurious Leaves:
- Identify leaves in the tree that are not present in the character matrix.
- For each spurious leaf, remove it and prune its lineage until a node with at least two children is reached.
- Return Tree: Return the updated tree.
Dependencies
networkx
: Used for representing and manipulating the tree structure.pandas
: Used for handling the character matrix and dissimilarity map.multiprocessing
: Used for parallelizing the bottom solver application.tqdm
: Used for displaying progress bars.
Error Handling
HybridSolverError
is raised if no cutoff is specified (eitherlca_cutoff
orcell_cutoff
).
Logging
The solve
method creates log files for each subproblem solved by the bottom solver. The base file name for the log files is specified by the logfile
argument, and a unique identifier is appended to each file name.
TODOs
None.