High-level description

The SpectralNeighborJoiningSolver class in Cassiopeia implements a phylogenetic tree reconstruction algorithm based on the spectral neighbor-joining method. It iteratively joins pairs of nodes (or groups of nodes) based on the second singular value of a specific matrix derived from the pairwise similarities between nodes. This process continues until a complete tree is formed.

Code Structure

The SpectralNeighborJoiningSolver class inherits from the DistanceSolver class. It overrides the find_cherry and update_dissimilarity_map methods to implement the spectral neighbor-joining logic. The _compute_svd2 method is a helper function used to calculate the second singular value of the relevant matrix.

References

This solver references the dissimilarity_functions module for calculating pairwise similarities between nodes and the solver_utilities module for helper functions like generating unique node names.

Symbols

SpectralNeighborJoiningSolver

Description

This class implements the spectral neighbor-joining algorithm for phylogenetic tree reconstruction. It uses a spectral method to identify pairs of nodes to join based on their pairwise similarities.

Inputs

NameTypeDescription
similarity_functionOptional[Callable[[np.ndarray, np.ndarray, int, Dict[int, Dict[int, float]]], float]]A function to compute the pairwise similarity between nodes. Defaults to exponential_negative_hamming_distance.
add_rootboolWhether to add an implicit root to the tree. Defaults to False.
prior_transformationstrThe transformation to apply to priors when calculating weights. Defaults to "negative_log".

Outputs

The solver populates the tree attribute of the input CassiopeiaTree object with the reconstructed tree.

Internal Logic

The solver iteratively performs the following steps:

  1. Calculate the dissimilarity map: This map stores the pairwise dissimilarities between all nodes.
  2. Find a cherry: Using the find_cherry method, the solver identifies the pair of nodes with the smallest second singular value of their corresponding RA matrix (see _compute_svd2 for details).
  3. Update the dissimilarity map: The update_dissimilarity_map method updates the dissimilarity map to reflect the newly joined cherry.
  4. Repeat steps 2-3 until only two nodes remain.
  5. Root the tree: If add_root is True, the solver adds an implicit root to the tree.

get_dissimilarity_map

Description

This method obtains or generates the initial dissimilarity map used by the solver. It calls the parent class’s get_dissimilarity_map method and then constructs the initial lambda matrix, where each subset is a singleton leaf node.

Inputs

NameTypeDescription
cassiopeia_treeCassiopeiaTreeThe CassiopeiaTree object passed into solve.
layerOptional[str]Layer storing the character matrix for solving. If None, the default character matrix is used.

Outputs

NameTypeDescription
lambda_matrix_dfpd.DataFrameThe initial lambda matrix, structured similarly to the dissimilarity_map in DistanceSolver.

Internal Logic

  1. Obtain the similarity map: Calls the parent class’s get_dissimilarity_map method to get the pairwise similarities between nodes.
  2. Prepare for the lambda matrix: Initializes the _lambda_indices attribute, which stores the subsets of nodes being considered. Initially, each subset is a singleton leaf node.
  3. Generate the lambda matrix: Iterates through all pairs of subsets and calculates the second singular value of their corresponding RA matrix using the _compute_svd2 method.
  4. Convert to DataFrame: Converts the lambda matrix from a NumPy array to a Pandas DataFrame.

find_cherry

Description

This method finds the pair of nodes (or subsets of nodes) to join into a cherry based on the lambda matrix. The cherry is identified as the pair with the smallest second singular value in the lambda matrix.

Inputs

NameTypeDescription
dissimilarity_mapnp.ndarrayThe lambda matrix.

Outputs

NameTypeDescription
(i, j)Tuple[int, int]A tuple of indices representing the rows in the lambda matrix corresponding to the nodes to be joined.

_compute_svd2

Description

This method calculates the second largest singular value of the RA matrix for a given pair of subsets of nodes. The RA matrix is constructed based on the pairwise similarities between nodes in the subsets and their complements.

Inputs

NameTypeDescription
pairTuple[int, int]A tuple of indices representing the rows in the lambda matrix corresponding to the two subsets of nodes.
lambda_indicesList[List[int]]The list of subsets of nodes being considered.

Outputs

NameTypeDescription
svd2_valfloatThe second largest singular value of the RA matrix for the given pair of subsets.

Internal Logic

  1. Get the combined subset: Combines the two subsets of nodes specified by the pair argument.
  2. Get the complement: Identifies the nodes that are not in the combined subset.
  3. Reconstruct the RA matrix: Constructs the RA matrix based on the pairwise similarities between nodes in the combined subset and its complement.
  4. Calculate the second largest singular value: Computes the singular value decomposition of the RA matrix and returns the second largest singular value.

update_dissimilarity_map

Description

This method updates the lambda matrix after a cherry is formed. It removes the rows and columns corresponding to the joined nodes and adds a new row and column for the newly formed cherry. The entries in the new row and column are calculated using the _compute_svd2 method.

Inputs

NameTypeDescription
similarity_mappd.DataFrameThe lambda matrix to update.
cherryTuple[str, str]The names of the two nodes that were joined into a cherry.
new_nodestrThe name of the newly formed cherry node.

Outputs

NameTypeDescription
lambda_matrix_dfpd.DataFrameThe updated lambda matrix.

Internal Logic

  1. Identify cherry nodes: Finds the indices of the cherry nodes in the lambda matrix.
  2. Modify names and indices: Updates the node names and _lambda_indices attribute to reflect the newly formed cherry.
  3. Extract the old lambda matrix: Removes the rows and columns corresponding to the joined nodes from the lambda matrix.
  4. Add new row and column: Adds a new row and column to the lambda matrix for the newly formed cherry.
  5. Compute new singular values: Calculates the second singular values for the new row and column using the _compute_svd2 method.
  6. Regenerate lambda matrix: Constructs a new DataFrame from the updated lambda matrix.

setup_root_finder

Description

This method defines the implicit rooting strategy for the solver. It adds an implicit root with all-zero character states to the character matrix and recomputes the dissimilarity map.

Inputs

NameTypeDescription
cassiopeia_treeCassiopeiaTreeThe CassiopeiaTree object passed into solve.

Internal Logic

  1. Add implicit root: Adds a new row to the character matrix representing the implicit root with all-zero character states.
  2. Recompute dissimilarity map: Computes the dissimilarity map using the updated character matrix.

root_tree

Description

This method roots the tree by adding an edge between the last two remaining nodes and converting the undirected graph to a directed graph with respect to the specified root sample.

Inputs

NameTypeDescription
treenx.GraphThe undirected tree topology.
root_samplestrThe name of the sample to treat as the root.
remaining_samplesList[str]The names of the last two unjoined nodes.

Outputs

NameTypeDescription
rooted_treenx.DiGraphThe rooted tree.

Dependencies

DependencyPurpose
networkxRepresenting and manipulating the tree topology.
numpyNumerical operations and array manipulation.
pandasData manipulation and representation of the dissimilarity map.
scipySingular value decomposition calculation.
ete3Tree manipulation and visualization (used in _ccphylo_solve).
configparserReading configuration settings (used in _setup_ccphylo).
subprocessRunning external commands (used in _ccphylo_solve).
tempfileCreating temporary files and directories (used in _ccphylo_solve).

Configuration

The _ccphylo_solve method uses the ccphylo_path setting from the config.ini file to locate the CCPhylo executable.

Error Handling

The DistanceSolverError exception is raised for various error conditions, such as missing rooting parameters or invalid dissimilarity functions.

Logging

Currently, no logging mechanisms are implemented in this code.