High-level description

The NeighborJoiningSolver class implements the Neighbor-Joining algorithm for phylogenetic tree reconstruction. It iteratively joins pairs of samples (or cherries) that minimize the Q-criterion based on a dissimilarity map, ultimately building an unrooted tree. The class provides options for rooting the tree and using fast implementations from the CCPhylo library.

Code Structure

The NeighborJoiningSolver class inherits from the DistanceSolver class and overrides several methods to implement the Neighbor-Joining algorithm. The main methods are find_cherry, update_dissimilarity_map, and root_tree. The find_cherry method identifies the pair of samples to join based on the Q-criterion. The update_dissimilarity_map method updates the dissimilarity map after joining a cherry. The root_tree method roots the tree at a specified sample.

References

  • DistanceSolver: The parent class that defines the general structure for distance-based solvers.
  • dissimilarity_functions: A module containing various dissimilarity functions, including the weighted Hamming distance used by default.
  • solver_utilities: A module containing utility functions for tree solvers.

Symbols

NeighborJoiningSolver

Description

This class implements the Neighbor-Joining algorithm for phylogenetic tree reconstruction. It inherits from the DistanceSolver class and overrides several methods to implement the specific logic of the Neighbor-Joining algorithm.

Inputs

NameTypeDescription
dissimilarity_functionOptional[Callable[[np.array, np.array, int, Dict[int, Dict[int, float]]], float]]A function to compute the dissimilarity between samples. Defaults to weighted_hamming_distance.
add_rootboolWhether to add an implicit root to the tree. Defaults to False.
prior_transformationstrThe transformation to apply to priors when calculating weights. Defaults to "negative_log".
fastboolWhether to use a fast implementation of Neighbor-Joining. Defaults to False.
implementationstrThe fast implementation to use. Options are: "ccphylo_dnj", "ccphylo_hnj", "ccphylo_nj". Defaults to "ccphylo_dnj".
threadsintThe number of threads to use for the solver. Defaults to 1.

Outputs

The solve method populates the tree attribute of the input CassiopeiaTree object with the reconstructed tree.

Internal Logic

The solve method iteratively joins pairs of samples (cherries) that minimize the Q-criterion, updating the dissimilarity map at each step. The process continues until only two samples remain, which are then joined to form the final unrooted tree. The tree can be rooted using the root_tree method.

root_tree

Description

This method roots the tree produced by the Neighbor-Joining algorithm at the specified root sample.

Inputs

NameTypeDescription
treenx.GraphThe unrooted tree topology represented as a NetworkX graph.
root_samplestrThe sample to treat as the root.
remaining_samplesList[str]The last two unjoined nodes in the tree.

Outputs

NameTypeDescription
rooted_treenx.DiGraphThe rooted tree topology represented as a NetworkX directed graph.

Internal Logic

The method first joins the last two remaining samples in the tree. Then, it performs a depth-first search from the specified root sample and creates a directed graph based on the edges traversed.

find_cherry

Description

This method finds a pair of samples to join into a cherry by minimizing the Q-criterion.

Inputs

NameTypeDescription
dissimilarity_matrixnp.arrayA sample x sample dissimilarity matrix.

Outputs

NameTypeDescription
cherryTuple[int, int]A tuple of integers representing the indices of the two samples to join in the dissimilarity matrix.

Internal Logic

The method calculates the Q-criterion for every pair of samples using the compute_q method. The pair with the minimum Q-criterion is selected as the cherry.

compute_q

Description

This static method computes the Q-criterion for every pair of samples in the dissimilarity map.

Inputs

NameTypeDescription
dissimilarity_mapnp.array(int)A sample x sample dissimilarity map.

Outputs

NameTypeDescription
qnp.arrayA matrix storing the Q-criterion for every pair of samples.

Internal Logic

The method iterates through all pairs of samples and calculates the Q-criterion using the formula: Q(i,j) = d(i, j) - 1/(n-2) (sum(d(i, :)) + sum(d(j,:))), where d(i,j) is the dissimilarity between samples i and j, and n is the number of samples.

update_dissimilarity_map

Description

This method updates the dissimilarity map after joining two nodes into a cherry.

Inputs

NameTypeDescription
dissimilarity_mappd.DataFrameThe dissimilarity map to update.
cherryTuple[str, str]A tuple of strings representing the names of the two samples joined in the cherry.
new_nodestrThe name of the new node representing the cherry.

Outputs

NameTypeDescription
dissimilarity_mappd.DataFrameThe updated dissimilarity map.

Internal Logic

The method calculates the new dissimilarities between the new node and all other nodes using the formula: d’(m, v) = 0.5 * (d(v, m1) + d(v, m2) - d(m1, m2)), where m is the new node, v is any other node, and m1 and m2 are the two samples joined in the cherry. The original cherry nodes are then removed from the dissimilarity map.

__update_dissimilarity_map_numba

Description

This private static method provides a faster, Numba-optimized implementation for updating the dissimilarity map.

Inputs

NameTypeDescription
dissimilarity_mapnp.arrayThe dissimilarity map to update, represented as a NumPy array.
cherry_iintThe index of the first sample in the cherry.
cherry_jintThe index of the second sample in the cherry.

Outputs

NameTypeDescription
updated_mapnp.arrayThe updated dissimilarity map.

Internal Logic

The method adds a new row and column to the dissimilarity map for the new node and calculates the new dissimilarities using the same formula as update_dissimilarity_map.

setup_root_finder

Description

This method defines the implicit rooting strategy for the NeighborJoiningSolver.

Inputs

NameTypeDescription
cassiopeia_treeCassiopeiaTreeThe input CassiopeiaTree object.

Outputs

This method modifies the cassiopeia_tree object in place.

Internal Logic

The method adds an implicit root with all-zero character states to the character matrix and recalculates the dissimilarity map. If a dissimilarity map already exists, only the new dissimilarities involving the root are calculated.

Dependencies

  • networkx: For representing and manipulating tree topologies.
  • numba: For optimizing performance-critical code sections.
  • pandas: For handling data in tabular format.
  • ete3: For parsing and manipulating trees in Newick format (used in the _ccphylo_solve method).
  • configparser: For reading configuration settings (used in the _setup_ccphylo method).
  • subprocess: For running external commands (used in the _ccphylo_solve method).

Configuration

The _ccphylo_solve method requires the path to the CCPhylo executable to be specified in the config.ini file under the Paths section with the key ccphylo_path.

Error Handling

The NeighborJoiningSolver class raises DistanceSolverError exceptions for various error conditions, such as missing dissimilarity function, invalid rooting parameters, or errors encountered during the execution of CCPhylo.