High-level description

The HybridSolver class in Cassiopeia implements a hybrid approach to phylogenetic tree reconstruction. It combines a top-down greedy solver with a more precise bottom-up solver to efficiently infer relationships in large datasets. The greedy solver is applied initially to cluster cells until a specified cutoff is reached (either based on the number of cells or the distance to the latest common ancestor). Then, the bottom solver is used to reconstruct subtrees for each cluster, resulting in a complete tree.

Code Structure

The HybridSolver class inherits from the CassiopeiaSolver class and utilizes two solver instances: top_solver (a subclass of GreedySolver) and bottom_solver (any subclass of CassiopeiaSolver). The solve method orchestrates the hybrid solving process, while apply_top_solver and apply_bottom_solver handle the respective solver applications. assess_cutoff determines when to switch from the top solver to the bottom solver.

References

  • GreedySolver: The HybridSolver relies on a GreedySolver instance for the initial top-down clustering.
  • CassiopeiaSolver: Both top_solver and bottom_solver are subclasses of CassiopeiaSolver, providing a common interface for tree reconstruction.
  • CassiopeiaTree: The HybridSolver operates on a CassiopeiaTree object, which stores the character matrix, priors, and the resulting tree.

Symbols

HybridSolver

Description

This class implements the hybrid phylogenetic tree reconstruction algorithm. It combines a greedy top-down solver with a more precise bottom-up solver to handle large datasets efficiently.

Inputs

NameTypeDescription
top_solverGreedySolver.GreedySolverAn algorithm to be applied at the top of the tree. Must be a subclass of GreedySolver.
bottom_solverCassiopeiaSolver.CassiopeiaSolverAn algorithm to be applied at the bottom of the tree. Must be a subclass of CassiopeiaSolver.
lca_cutofffloatDistance to the latest-common-ancestor (LCA) of a subclade to be used as a cutoff for transitioning to the bottom solver.
cell_cutoffintNumber of cells in a subclade to be used as a cutoff for transitioning to the bottom solver.
threadsintNumber of threads to be used. This corresponds to the number of subproblems to be run concurrently with the bottom solver.
prior_transformationstrFunction to use when transforming priors into weights.
progress_barboolIndicates if a progress bar should be shown when solve is called.

Outputs

The HybridSolver does not directly return any outputs. It modifies the input CassiopeiaTree object by populating it with the reconstructed tree.

Internal Logic

  1. Apply Top Solver: The apply_top_solver method is used to recursively cluster cells using the top_solver until the specified cutoff is reached (either lca_cutoff or cell_cutoff).
  2. Generate Subproblems: As the top solver clusters cells, it generates subproblems, each consisting of a subtree root and a list of samples within that subtree.
  3. Apply Bottom Solver: The apply_bottom_solver method is applied to each subproblem, using the bottom_solver to reconstruct the subtree for the corresponding samples.
  4. Combine Subtrees: The resulting subtrees are combined into a single tree, forming the complete reconstructed phylogeny.
  5. Add Duplicates and Prune: Duplicate samples are added to the tree, and any spurious leaves (not present in the original character matrix) are pruned.
  6. Populate CassiopeiaTree: The final reconstructed tree is used to populate the input CassiopeiaTree object.

Side Effects

  • Modifies the input CassiopeiaTree object by populating it with the reconstructed tree.
  • Creates log files for each subproblem solved by the bottom solver.

apply_top_solver

Description

This method recursively applies the top solver to cluster cells until the specified cutoff is reached.

Inputs

NameTypeDescription
character_matrixpd.DataFrameCharacter matrix
samplesList[str]Samples in the subclade of interest.
treenx.DiGraphIn progress tree for the HybridSolver.
node_name_generatorGenerator[str, None, None]Generator for creating unique node names while applying the top-solver.
weightsOptional[Dict[int, Dict[int, float]]]Weights of character-state combinations, derived from priors if these are available.
missing_state_indicatorintIndicator for missing data
rootOptional[int]Node ID of the root in the subtree containing the samples.

Outputs

NameTypeDescription
rootintThe ID of the node serving as the root of the tree containing the samples.
subproblemsList[Tuple[int, List[str]]]A list of subproblems in the form [subtree-root, subtree-samples].
treenx.DiGraphThe updated tree with the top solver applied.

Internal Logic

  1. Base Case: If there is only one sample, return the sample ID, an empty list of subproblems, and the unmodified tree.
  2. Perform Split: Use the top_solver to partition the samples based on the character matrix and weights.
  3. Create Root: Generate a unique node name for the root of the current subtree.
  4. Handle Single Clade: If the split results in a single clade, connect all samples in the clade to the root and return.
  5. Recursive Application: For each clade resulting from the split:
    • If the cutoff is reached, add the clade as a subproblem.
    • Otherwise, recursively apply apply_top_solver to the clade, connect the resulting subtree to the current root, and add any new subproblems.
  6. Return Results: Return the root ID, the list of subproblems, and the updated tree.

apply_bottom_solver

Description

This method applies the bottom solver to a given subproblem, reconstructing the subtree for the corresponding samples.

Inputs

NameTypeDescription
cassiopeia_treeCassiopeiaTreeCassiopeiaTree for the entire dataset. This will be subsetted with respect to the samples specified.
rootintIdentifier of the root in the master tree
samplesList[str]A list of samples for which to infer a tree.
logfilestrBase location for logging output. A specific logfile will be created from this base logfile name.
layerOptional[str]Layer storing the character matrix for solving. If None, the default character matrix is used in the CassiopeiaTree.

Outputs

NameTypeDescription
subproblem_treenx.DiGraphA tree in the form of a Networkx graph for the subproblem.
rootintThe original root identifier.

Internal Logic

  1. Base Case: If there is only one sample, create a simple tree with the root connected to the sample and return.
  2. Subset Character Matrix: Extract the relevant portion of the character matrix corresponding to the input samples.
  3. Create Subtree: Create a new CassiopeiaTree object for the subproblem, using the subsetted character matrix and priors from the original tree.
  4. Solve Subproblem: Apply the bottom_solver to the subtree, reconstructing the phylogeny for the subproblem samples.
  5. Connect to Root: Connect the root of the reconstructed subtree to the input root ID.
  6. Return Results: Return the reconstructed subtree and the original root ID.

assess_cutoff

Description

This method determines whether the specified cutoff has been reached for a given set of samples.

Inputs

NameTypeDescription
samplesList[str]A list of samples in a clade.
character_matrixpd.DataFrameCharacter matrix
missing_state_indicatorintIndicator for missing data.

Outputs

NameTypeDescription
cutoff_reachedboolTrue if the cutoff is reached, False if not.

Internal Logic

  1. Cell Cutoff: If cell_cutoff is specified, check if the number of samples is less than or equal to the cutoff. If so, return True.
  2. LCA Cutoff: If lca_cutoff is specified:
    • Calculate the LCA character states for the samples.
    • Calculate the Hamming distance between each sample’s character states and the LCA character states.
    • If the maximum distance is less than or equal to the lca_cutoff, return True.
  3. Return False: If neither cutoff is reached, return False.

__add_duplicates_to_tree_and_remove_spurious_leaves

Description

This private method adds duplicate samples to the tree and removes any spurious leaves that are not present in the original character matrix.

Inputs

NameTypeDescription
treenx.DiGraphThe tree after solving.
character_matrixpd.DataFrameCharacter matrix
node_name_generatorGenerator[str, None, None]Generator for creating unique node names.

Outputs

NameTypeDescription
treenx.DiGraphThe tree with duplicates added and spurious leaves pruned.

Internal Logic

  1. Add Duplicates:
    • Identify groups of duplicate samples based on their character states.
    • For each group, create a new internal node and connect it to the original sample and all its duplicates.
  2. Remove Spurious Leaves:
    • Identify leaves in the tree that are not present in the character matrix.
    • For each spurious leaf, remove it and prune its lineage until a node with at least two children is reached.
  3. Return Tree: Return the updated tree.

Dependencies

  • networkx: Used for representing and manipulating the tree structure.
  • pandas: Used for handling the character matrix and dissimilarity map.
  • multiprocessing: Used for parallelizing the bottom solver application.
  • tqdm: Used for displaying progress bars.

Error Handling

  • HybridSolverError is raised if no cutoff is specified (either lca_cutoff or cell_cutoff).

Logging

The solve method creates log files for each subproblem solved by the bottom solver. The base file name for the log files is specified by the logfile argument, and a unique identifier is appended to each file name.

TODOs

None.