graph_utilities.py
High-level description
This code defines functions for constructing and manipulating graphs used in phylogenetic tree reconstruction. It includes functions for building connectivity graphs based on shared mutations, similarity graphs based on a given similarity function, and optimizing graph partitions using hill-climbing procedures.
Code Structure
The code consists of several functions that are used to construct and manipulate graphs for phylogenetic tree reconstruction. The construct_connectivity_graph
and construct_similarity_graph
functions are used to build different types of graphs, while max_cut_improve_cut
and spectral_improve_cut
are used to optimize partitions on these graphs. The check_if_cut
function is a utility function used by the optimization functions.
References
This code references the solver_utilities
module for converting sample names to indices.
Symbols
check_if_cut
Description
This function checks if two nodes are on opposite sides of a graph partition.
Inputs
Name | Type | Description |
---|---|---|
u | int | The index of the first node. |
v | int | The index of the second node. |
cut | List[int] | A list of node indices representing one side of the partition. |
Outputs
Name | Type | Description |
---|---|---|
bool | True if nodes u and v are on opposite sides of the partition, False otherwise. |
Internal Logic
The function simply checks if one node is in the cut
list and the other is not.
construct_connectivity_graph
Description
This function constructs a connectivity graph for a set of samples based on their observed mutations. The edge weights represent the tendency of two samples to be grouped together or separated based on their shared and differing mutations.
Inputs
Name | Type | Description |
---|---|---|
character_matrix | pd.DataFrame | The character matrix of observed character states for all samples. |
mutation_frequencies | Dict[int, Dict[int, int]] | A dictionary containing the frequencies of each character/state pair. |
missing_state_indicator | int | The character representing missing values. |
samples | List[str] | A list of sample names to include in the graph. |
weights | Optional[Dict[int, Dict[int, float]]] | Optional weights for character/state pairs. |
Outputs
Name | Type | Description |
---|---|---|
G_names | nx.Graph | The constructed connectivity graph with sample names as nodes. |
Internal Logic
The function iterates through all pairs of samples and calculates a score based on their shared and differing mutations. The score is positive if the samples share mutations and negative if they have different mutations. The absolute value of the score is higher for mutations that are less frequent in the dataset. The scores are then used as edge weights in the graph.
max_cut_improve_cut
Description
This function implements a greedy hill-climbing procedure to optimize a partition for the max-cut criterion on a graph.
Inputs
Name | Type | Description |
---|---|---|
G | nx.Graph | The graph to optimize the partition on. |
cut | List[str] | A list of nodes representing one side of the initial partition. |
Outputs
Name | Type | Description |
---|---|---|
new_cut | List[str] | A new partition that is a local maximum to the max-cut criterion. |
Internal Logic
The function iteratively moves nodes between the two sides of the partition, choosing the move that maximizes the weight of edges across the cut. The procedure terminates when no further improvement can be made or a maximum number of iterations is reached.
construct_similarity_graph
Description
This function constructs a similarity graph for a set of samples based on a given similarity function.
Inputs
Name | Type | Description |
---|---|---|
character_matrix | pd.DataFrame | The character matrix of observed character states for all samples. |
missing_state_indicator | int | The character representing missing values. |
samples | List[str] | A list of sample names to include in the graph. |
similarity_function | Callable | A function that calculates a similarity score between two samples. |
threshold | int | A minimum similarity threshold to include an edge in the graph. |
weights | Optional[Dict[int, Dict[int, float]]] | Optional weights for character/state pairs. |
Outputs
Name | Type | Description |
---|---|---|
G_names | nx.Graph | The constructed similarity graph with sample names as nodes. |
Internal Logic
The function iterates through all pairs of samples and calculates their similarity using the provided similarity_function
. If the similarity score is above the given threshold
, an edge is added between the corresponding nodes in the graph with the weight equal to the similarity score.
spectral_improve_cut
Description
This function implements a greedy hill-climbing procedure to optimize a partition for a modified normalized cut criterion on a graph.
Inputs
Name | Type | Description |
---|---|---|
G | nx.Graph | The graph to optimize the partition on. |
cut | List[str] | A list of nodes representing one side of the initial partition. |
Outputs
Name | Type | Description |
---|---|---|
new_cut | List[str] | A new partition that is a local minimum to the objective function. |
Internal Logic
The function iteratively moves nodes between the two sides of the partition, choosing the move that minimizes the ratio of the weight of edges across the cut to the minimum of the sum of edge weights within each side of the cut. The procedure terminates when no further improvement can be made or a maximum number of iterations is reached.