HybridSolver.py - Glide Demo

High-level description

The HybridSolver class in Cassiopeia implements a hybrid approach to phylogenetic tree reconstruction. It combines a top-down greedy solver with a more precise bottom-up solver to efficiently infer relationships in large datasets. The greedy solver is applied initially to cluster cells until a specified cutoff is reached (either based on the number of cells or the distance to the latest common ancestor). Then, the bottom solver is used to reconstruct subtrees for each cluster, resulting in a complete tree.

Code Structure

The HybridSolver class inherits from the CassiopeiaSolver class and utilizes two solver instances: top_solver (a subclass of GreedySolver) and bottom_solver (any subclass of CassiopeiaSolver). The solve method orchestrates the hybrid solving process, while apply_top_solver and apply_bottom_solver handle the respective solver applications. assess_cutoff determines when to switch from the top solver to the bottom solver.

References

GreedySolver: The HybridSolver relies on a GreedySolver instance for the initial top-down clustering.
CassiopeiaSolver: Both top_solver and bottom_solver are subclasses of CassiopeiaSolver, providing a common interface for tree reconstruction.
CassiopeiaTree: The HybridSolver operates on a CassiopeiaTree object, which stores the character matrix, priors, and the resulting tree.

Symbols

`HybridSolver`

Description

This class implements the hybrid phylogenetic tree reconstruction algorithm. It combines a greedy top-down solver with a more precise bottom-up solver to handle large datasets efficiently.

Inputs

Name	Type	Description
top_solver	GreedySolver.GreedySolver	An algorithm to be applied at the top of the tree. Must be a subclass of GreedySolver.
bottom_solver	CassiopeiaSolver.CassiopeiaSolver	An algorithm to be applied at the bottom of the tree. Must be a subclass of CassiopeiaSolver.
lca_cutoff	float	Distance to the latest-common-ancestor (LCA) of a subclade to be used as a cutoff for transitioning to the bottom solver.
cell_cutoff	int	Number of cells in a subclade to be used as a cutoff for transitioning to the bottom solver.
threads	int	Number of threads to be used. This corresponds to the number of subproblems to be run concurrently with the bottom solver.
prior_transformation	str	Function to use when transforming priors into weights.
progress_bar	bool	Indicates if a progress bar should be shown when `solve` is called.

Outputs

The HybridSolver does not directly return any outputs. It modifies the input CassiopeiaTree object by populating it with the reconstructed tree.

Internal Logic

Apply Top Solver: The apply_top_solver method is used to recursively cluster cells using the top_solver until the specified cutoff is reached (either lca_cutoff or cell_cutoff).
Generate Subproblems: As the top solver clusters cells, it generates subproblems, each consisting of a subtree root and a list of samples within that subtree.
Apply Bottom Solver: The apply_bottom_solver method is applied to each subproblem, using the bottom_solver to reconstruct the subtree for the corresponding samples.
Combine Subtrees: The resulting subtrees are combined into a single tree, forming the complete reconstructed phylogeny.
Add Duplicates and Prune: Duplicate samples are added to the tree, and any spurious leaves (not present in the original character matrix) are pruned.
Populate CassiopeiaTree: The final reconstructed tree is used to populate the input CassiopeiaTree object.

Side Effects

Modifies the input CassiopeiaTree object by populating it with the reconstructed tree.
Creates log files for each subproblem solved by the bottom solver.

`apply_top_solver`

Description

This method recursively applies the top solver to cluster cells until the specified cutoff is reached.

Inputs

Name	Type	Description
character_matrix	pd.DataFrame	Character matrix
samples	List[str]	Samples in the subclade of interest.
tree	nx.DiGraph	In progress tree for the HybridSolver.
node_name_generator	Generator[str, None, None]	Generator for creating unique node names while applying the top-solver.
weights	Optional[Dict[int, Dict[int, float]]]	Weights of character-state combinations, derived from priors if these are available.
missing_state_indicator	int	Indicator for missing data
root	Optional[int]	Node ID of the root in the subtree containing the samples.

Outputs

Name	Type	Description
root	int	The ID of the node serving as the root of the tree containing the samples.
subproblems	List[Tuple[int, List[str]]]	A list of subproblems in the form [subtree-root, subtree-samples].
tree	nx.DiGraph	The updated tree with the top solver applied.

Internal Logic

Base Case: If there is only one sample, return the sample ID, an empty list of subproblems, and the unmodified tree.
Perform Split: Use the top_solver to partition the samples based on the character matrix and weights.
Create Root: Generate a unique node name for the root of the current subtree.
Handle Single Clade: If the split results in a single clade, connect all samples in the clade to the root and return.
Recursive Application: For each clade resulting from the split:
- If the cutoff is reached, add the clade as a subproblem.
- Otherwise, recursively apply apply_top_solver to the clade, connect the resulting subtree to the current root, and add any new subproblems.
Return Results: Return the root ID, the list of subproblems, and the updated tree.

`apply_bottom_solver`

Description

This method applies the bottom solver to a given subproblem, reconstructing the subtree for the corresponding samples.

Inputs

Name	Type	Description
cassiopeia_tree	CassiopeiaTree	CassiopeiaTree for the entire dataset. This will be subsetted with respect to the samples specified.
root	int	Identifier of the root in the master tree
samples	List[str]	A list of samples for which to infer a tree.
logfile	str	Base location for logging output. A specific logfile will be created from this base logfile name.
layer	Optional[str]	Layer storing the character matrix for solving. If None, the default character matrix is used in the CassiopeiaTree.

Outputs

Name	Type	Description
subproblem_tree	nx.DiGraph	A tree in the form of a Networkx graph for the subproblem.
root	int	The original root identifier.

Internal Logic

Base Case: If there is only one sample, create a simple tree with the root connected to the sample and return.
Subset Character Matrix: Extract the relevant portion of the character matrix corresponding to the input samples.
Create Subtree: Create a new CassiopeiaTree object for the subproblem, using the subsetted character matrix and priors from the original tree.
Solve Subproblem: Apply the bottom_solver to the subtree, reconstructing the phylogeny for the subproblem samples.
Connect to Root: Connect the root of the reconstructed subtree to the input root ID.
Return Results: Return the reconstructed subtree and the original root ID.

`assess_cutoff`

Description

This method determines whether the specified cutoff has been reached for a given set of samples.

Inputs

Name	Type	Description
samples	List[str]	A list of samples in a clade.
character_matrix	pd.DataFrame	Character matrix
missing_state_indicator	int	Indicator for missing data.

Outputs

Name	Type	Description
cutoff_reached	bool	True if the cutoff is reached, False if not.

Internal Logic

Cell Cutoff: If cell_cutoff is specified, check if the number of samples is less than or equal to the cutoff. If so, return True.
LCA Cutoff: If lca_cutoff is specified:
- Calculate the LCA character states for the samples.
- Calculate the Hamming distance between each sample’s character states and the LCA character states.
- If the maximum distance is less than or equal to the lca_cutoff, return True.
Return False: If neither cutoff is reached, return False.

`__add_duplicates_to_tree_and_remove_spurious_leaves`

Description

This private method adds duplicate samples to the tree and removes any spurious leaves that are not present in the original character matrix.

Inputs

Name	Type	Description
tree	nx.DiGraph	The tree after solving.
character_matrix	pd.DataFrame	Character matrix
node_name_generator	Generator[str, None, None]	Generator for creating unique node names.

Outputs

Name	Type	Description
tree	nx.DiGraph	The tree with duplicates added and spurious leaves pruned.

Internal Logic

Add Duplicates:
- Identify groups of duplicate samples based on their character states.
- For each group, create a new internal node and connect it to the original sample and all its duplicates.
Remove Spurious Leaves:
- Identify leaves in the tree that are not present in the character matrix.
- For each spurious leaf, remove it and prune its lineage until a node with at least two children is reached.
Return Tree: Return the updated tree.

Dependencies

networkx: Used for representing and manipulating the tree structure.
pandas: Used for handling the character matrix and dissimilarity map.
multiprocessing: Used for parallelizing the bottom solver application.
tqdm: Used for displaying progress bars.

Error Handling

HybridSolverError is raised if no cutoff is specified (either lca_cutoff or cell_cutoff).

Logging

The solve method creates log files for each subproblem solved by the bottom solver. The base file name for the log files is specified by the logfile argument, and a unique identifier is appended to each file name.

TODOs

None.

Project Root

​High-level description

​Code Structure

​References

​Symbols

​HybridSolver

​Description

​Inputs

​Outputs

​Internal Logic

​Side Effects

​apply_top_solver

​Description

​Inputs

​Outputs

​Internal Logic

​apply_bottom_solver

​Description

​Inputs

​Outputs

​Internal Logic

​assess_cutoff

​Description

​Inputs

​Outputs

​Internal Logic

​__add_duplicates_to_tree_and_remove_spurious_leaves

​Description

​Inputs

​Outputs

​Internal Logic

​Dependencies

​Error Handling

​Logging

​TODOs

High-level description

Code Structure

References

Symbols

`HybridSolver`

Description

Inputs

Outputs

Internal Logic

Side Effects

`apply_top_solver`

Description

Inputs

Outputs

Internal Logic

`apply_bottom_solver`

Description

Inputs

Outputs

Internal Logic

`assess_cutoff`

Description

Inputs

Outputs

Internal Logic

`__add_duplicates_to_tree_and_remove_spurious_leaves`

Description

Inputs

Outputs

Internal Logic

Dependencies

Error Handling

Logging

TODOs