SpectralNeighborJoiningSolver.py
High-level description
The SpectralNeighborJoiningSolver
class in Cassiopeia implements a phylogenetic tree reconstruction algorithm based on the spectral neighbor-joining method. It iteratively joins pairs of nodes (or groups of nodes) based on the second singular value of a specific matrix derived from the pairwise similarities between nodes. This process continues until a complete tree is formed.
Code Structure
The SpectralNeighborJoiningSolver
class inherits from the DistanceSolver
class. It overrides the find_cherry
and update_dissimilarity_map
methods to implement the spectral neighbor-joining logic. The _compute_svd2
method is a helper function used to calculate the second singular value of the relevant matrix.
References
This solver references the dissimilarity_functions
module for calculating pairwise similarities between nodes and the solver_utilities
module for helper functions like generating unique node names.
Symbols
SpectralNeighborJoiningSolver
Description
This class implements the spectral neighbor-joining algorithm for phylogenetic tree reconstruction. It uses a spectral method to identify pairs of nodes to join based on their pairwise similarities.
Inputs
Name | Type | Description |
---|---|---|
similarity_function | Optional[Callable[[np.ndarray, np.ndarray, int, Dict[int, Dict[int, float]]], float]] | A function to compute the pairwise similarity between nodes. Defaults to exponential_negative_hamming_distance . |
add_root | bool | Whether to add an implicit root to the tree. Defaults to False . |
prior_transformation | str | The transformation to apply to priors when calculating weights. Defaults to "negative_log" . |
Outputs
The solver populates the tree
attribute of the input CassiopeiaTree
object with the reconstructed tree.
Internal Logic
The solver iteratively performs the following steps:
- Calculate the dissimilarity map: This map stores the pairwise dissimilarities between all nodes.
- Find a cherry: Using the
find_cherry
method, the solver identifies the pair of nodes with the smallest second singular value of their corresponding RA matrix (see_compute_svd2
for details). - Update the dissimilarity map: The
update_dissimilarity_map
method updates the dissimilarity map to reflect the newly joined cherry. - Repeat steps 2-3 until only two nodes remain.
- Root the tree: If
add_root
isTrue
, the solver adds an implicit root to the tree.
get_dissimilarity_map
Description
This method obtains or generates the initial dissimilarity map used by the solver. It calls the parent class’s get_dissimilarity_map
method and then constructs the initial lambda matrix, where each subset is a singleton leaf node.
Inputs
Name | Type | Description |
---|---|---|
cassiopeia_tree | CassiopeiaTree | The CassiopeiaTree object passed into solve . |
layer | Optional[str] | Layer storing the character matrix for solving. If None, the default character matrix is used. |
Outputs
Name | Type | Description |
---|---|---|
lambda_matrix_df | pd.DataFrame | The initial lambda matrix, structured similarly to the dissimilarity_map in DistanceSolver . |
Internal Logic
- Obtain the similarity map: Calls the parent class’s
get_dissimilarity_map
method to get the pairwise similarities between nodes. - Prepare for the lambda matrix: Initializes the
_lambda_indices
attribute, which stores the subsets of nodes being considered. Initially, each subset is a singleton leaf node. - Generate the lambda matrix: Iterates through all pairs of subsets and calculates the second singular value of their corresponding RA matrix using the
_compute_svd2
method. - Convert to DataFrame: Converts the lambda matrix from a NumPy array to a Pandas DataFrame.
find_cherry
Description
This method finds the pair of nodes (or subsets of nodes) to join into a cherry based on the lambda matrix. The cherry is identified as the pair with the smallest second singular value in the lambda matrix.
Inputs
Name | Type | Description |
---|---|---|
dissimilarity_map | np.ndarray | The lambda matrix. |
Outputs
Name | Type | Description |
---|---|---|
(i, j) | Tuple[int, int] | A tuple of indices representing the rows in the lambda matrix corresponding to the nodes to be joined. |
_compute_svd2
Description
This method calculates the second largest singular value of the RA matrix for a given pair of subsets of nodes. The RA matrix is constructed based on the pairwise similarities between nodes in the subsets and their complements.
Inputs
Name | Type | Description |
---|---|---|
pair | Tuple[int, int] | A tuple of indices representing the rows in the lambda matrix corresponding to the two subsets of nodes. |
lambda_indices | List[List[int]] | The list of subsets of nodes being considered. |
Outputs
Name | Type | Description |
---|---|---|
svd2_val | float | The second largest singular value of the RA matrix for the given pair of subsets. |
Internal Logic
- Get the combined subset: Combines the two subsets of nodes specified by the
pair
argument. - Get the complement: Identifies the nodes that are not in the combined subset.
- Reconstruct the RA matrix: Constructs the RA matrix based on the pairwise similarities between nodes in the combined subset and its complement.
- Calculate the second largest singular value: Computes the singular value decomposition of the RA matrix and returns the second largest singular value.
update_dissimilarity_map
Description
This method updates the lambda matrix after a cherry is formed. It removes the rows and columns corresponding to the joined nodes and adds a new row and column for the newly formed cherry. The entries in the new row and column are calculated using the _compute_svd2
method.
Inputs
Name | Type | Description |
---|---|---|
similarity_map | pd.DataFrame | The lambda matrix to update. |
cherry | Tuple[str, str] | The names of the two nodes that were joined into a cherry. |
new_node | str | The name of the newly formed cherry node. |
Outputs
Name | Type | Description |
---|---|---|
lambda_matrix_df | pd.DataFrame | The updated lambda matrix. |
Internal Logic
- Identify cherry nodes: Finds the indices of the cherry nodes in the lambda matrix.
- Modify names and indices: Updates the node names and
_lambda_indices
attribute to reflect the newly formed cherry. - Extract the old lambda matrix: Removes the rows and columns corresponding to the joined nodes from the lambda matrix.
- Add new row and column: Adds a new row and column to the lambda matrix for the newly formed cherry.
- Compute new singular values: Calculates the second singular values for the new row and column using the
_compute_svd2
method. - Regenerate lambda matrix: Constructs a new DataFrame from the updated lambda matrix.
setup_root_finder
Description
This method defines the implicit rooting strategy for the solver. It adds an implicit root with all-zero character states to the character matrix and recomputes the dissimilarity map.
Inputs
Name | Type | Description |
---|---|---|
cassiopeia_tree | CassiopeiaTree | The CassiopeiaTree object passed into solve . |
Internal Logic
- Add implicit root: Adds a new row to the character matrix representing the implicit root with all-zero character states.
- Recompute dissimilarity map: Computes the dissimilarity map using the updated character matrix.
root_tree
Description
This method roots the tree by adding an edge between the last two remaining nodes and converting the undirected graph to a directed graph with respect to the specified root sample.
Inputs
Name | Type | Description |
---|---|---|
tree | nx.Graph | The undirected tree topology. |
root_sample | str | The name of the sample to treat as the root. |
remaining_samples | List[str] | The names of the last two unjoined nodes. |
Outputs
Name | Type | Description |
---|---|---|
rooted_tree | nx.DiGraph | The rooted tree. |
Dependencies
Dependency | Purpose |
---|---|
networkx | Representing and manipulating the tree topology. |
numpy | Numerical operations and array manipulation. |
pandas | Data manipulation and representation of the dissimilarity map. |
scipy | Singular value decomposition calculation. |
ete3 | Tree manipulation and visualization (used in _ccphylo_solve ). |
configparser | Reading configuration settings (used in _setup_ccphylo ). |
subprocess | Running external commands (used in _ccphylo_solve ). |
tempfile | Creating temporary files and directories (used in _ccphylo_solve ). |
Configuration
The _ccphylo_solve
method uses the ccphylo_path
setting from the config.ini
file to locate the CCPhylo executable.
Error Handling
The DistanceSolverError
exception is raised for various error conditions, such as missing rooting parameters or invalid dissimilarity functions.
Logging
Currently, no logging mechanisms are implemented in this code.