High-level description

This code defines functions for constructing and manipulating graphs used in phylogenetic tree reconstruction. It includes functions for building connectivity graphs based on shared mutations, similarity graphs based on a given similarity function, and optimizing graph partitions using hill-climbing procedures.

Code Structure

The code consists of several functions that are used to construct and manipulate graphs for phylogenetic tree reconstruction. The construct_connectivity_graph and construct_similarity_graph functions are used to build different types of graphs, while max_cut_improve_cut and spectral_improve_cut are used to optimize partitions on these graphs. The check_if_cut function is a utility function used by the optimization functions.

References

This code references the solver_utilities module for converting sample names to indices.

Symbols

check_if_cut

Description

This function checks if two nodes are on opposite sides of a graph partition.

Inputs

NameTypeDescription
uintThe index of the first node.
vintThe index of the second node.
cutList[int]A list of node indices representing one side of the partition.

Outputs

NameTypeDescription
boolTrue if nodes u and v are on opposite sides of the partition, False otherwise.

Internal Logic

The function simply checks if one node is in the cut list and the other is not.

construct_connectivity_graph

Description

This function constructs a connectivity graph for a set of samples based on their observed mutations. The edge weights represent the tendency of two samples to be grouped together or separated based on their shared and differing mutations.

Inputs

NameTypeDescription
character_matrixpd.DataFrameThe character matrix of observed character states for all samples.
mutation_frequenciesDict[int, Dict[int, int]]A dictionary containing the frequencies of each character/state pair.
missing_state_indicatorintThe character representing missing values.
samplesList[str]A list of sample names to include in the graph.
weightsOptional[Dict[int, Dict[int, float]]]Optional weights for character/state pairs.

Outputs

NameTypeDescription
G_namesnx.GraphThe constructed connectivity graph with sample names as nodes.

Internal Logic

The function iterates through all pairs of samples and calculates a score based on their shared and differing mutations. The score is positive if the samples share mutations and negative if they have different mutations. The absolute value of the score is higher for mutations that are less frequent in the dataset. The scores are then used as edge weights in the graph.

max_cut_improve_cut

Description

This function implements a greedy hill-climbing procedure to optimize a partition for the max-cut criterion on a graph.

Inputs

NameTypeDescription
Gnx.GraphThe graph to optimize the partition on.
cutList[str]A list of nodes representing one side of the initial partition.

Outputs

NameTypeDescription
new_cutList[str]A new partition that is a local maximum to the max-cut criterion.

Internal Logic

The function iteratively moves nodes between the two sides of the partition, choosing the move that maximizes the weight of edges across the cut. The procedure terminates when no further improvement can be made or a maximum number of iterations is reached.

construct_similarity_graph

Description

This function constructs a similarity graph for a set of samples based on a given similarity function.

Inputs

NameTypeDescription
character_matrixpd.DataFrameThe character matrix of observed character states for all samples.
missing_state_indicatorintThe character representing missing values.
samplesList[str]A list of sample names to include in the graph.
similarity_functionCallableA function that calculates a similarity score between two samples.
thresholdintA minimum similarity threshold to include an edge in the graph.
weightsOptional[Dict[int, Dict[int, float]]]Optional weights for character/state pairs.

Outputs

NameTypeDescription
G_namesnx.GraphThe constructed similarity graph with sample names as nodes.

Internal Logic

The function iterates through all pairs of samples and calculates their similarity using the provided similarity_function. If the similarity score is above the given threshold, an edge is added between the corresponding nodes in the graph with the weight equal to the similarity score.

spectral_improve_cut

Description

This function implements a greedy hill-climbing procedure to optimize a partition for a modified normalized cut criterion on a graph.

Inputs

NameTypeDescription
Gnx.GraphThe graph to optimize the partition on.
cutList[str]A list of nodes representing one side of the initial partition.

Outputs

NameTypeDescription
new_cutList[str]A new partition that is a local minimum to the objective function.

Internal Logic

The function iteratively moves nodes between the two sides of the partition, choosing the move that minimizes the ratio of the weight of edges across the cut to the minimum of the sum of edge weights within each side of the cut. The procedure terminates when no further improvement can be made or a maximum number of iterations is reached.