dissimilarity_functions.py

High-level description

This file provides a collection of dissimilarity and similarity functions used for comparing phylogenetic samples. These functions calculate the distance or similarity between pairs of samples based on their character states, considering factors like missing data, state weights, and linkage methods.

Code Structure

This code defines several functions for calculating dissimilarity and similarity between phylogenetic samples. Some functions are optimized using the numba library. The cluster_dissimilarity function utilizes other dissimilarity functions and linkage functions to handle ambiguous character strings.

Symbols

`weighted_hamming_distance`

Description

Calculates the weighted Hamming distance between two phylogenetic samples. It considers shared indel states and their probabilities, incrementing dissimilarity for differing states and decrementing for identical states based on their weights.

Inputs

Name	Type	Description
s1	List[int]	Character states of the first sample
s2	List[int]	Character states of the second sample
missing_state_indicator	int	The character representing missing values (default: -1)
weights	Optional[Dict[int, Dict[int, float]]]	A nested dictionary storing state weights for each character, derived from state priors (default: None)

Outputs

Name	Type	Description
distance	float	The weighted Hamming distance between the two samples

Internal Logic

The function iterates through each character of the samples. If both states are identical and not missing, it decrements the dissimilarity by the probability of them occurring independently. If the states disagree, it increments the dissimilarity by the probability of those states occurring. The final dissimilarity is normalized by the number of non-missing characters shared by the samples.

`hamming_similarity_without_missing`

Description

Calculates the number of shared (non-missing) character/state mutations between two samples.

Inputs

Name	Type	Description
s1	List[int]	Character states of the first sample
s2	List[int]	Character states of the second sample
missing_state_indicator	int	The character representing missing values
weights	Optional[Dict[int, Dict[int, float]]]	Optional weights to weight the similarity of a mutation (default: None)

Outputs

Name	Type	Description
similarity	float	The number of shared mutations, weighted or unweighted

Internal Logic

The function iterates through each character, ignoring missing states and uncut states (state 0). For each matching state, it increments the similarity by 1 or by the corresponding weight if provided.

`hamming_similarity_normalized_over_missing`

Description

Calculates the number of shared (non-missing) character/state mutations between two samples, normalized by the number of missing data events.

Inputs

Name	Type	Description
s1	List[int]	Character states of the first sample
s2	List[int]	Character states of the second sample
missing_state_indicator	int	The character representing missing values
weights	Optional[Dict[int, Dict[int, float]]]	Optional weights to weight the similarity of a mutation (default: None)

Outputs

Name	Type	Description
similarity	float	The normalized number of shared mutations, weighted or unweighted

Internal Logic

Similar to hamming_similarity_without_missing, but it also tracks the number of non-missing characters (num_present). The final similarity score is divided by num_present to normalize for missing data.

`hamming_distance`

Description

Calculates the vanilla Hamming distance between two samples, optionally ignoring missing data.

Inputs

Name	Type	Description
s1	np.array(int)	The first sample
s2	np.array(int)	The second sample
ignore_missing_state	bool	Whether to ignore missing states in the comparison (default: False)
missing_state_indicator	int	The character representing missing values (default: -1)

Outputs

Name	Type	Description
distance	int	The Hamming distance between the two samples

Internal Logic

The function iterates through each character, incrementing the distance for each mismatch unless both states are missing and ignore_missing_state is True.

`weighted_hamming_similarity`

Description

Calculates the weighted number of shared (non-missing) character/state mutations between two samples.

Inputs

Name	Type	Description
s1	List[int]	Character states of the first sample
s2	List[int]	Character states of the second sample
missing_state_indicator	int	The character representing missing values
weights	Optional[Dict[int, Dict[int, float]]]	Optional weights to weight the similarity of a mutation (default: None)

Outputs

Name	Type	Description
similarity	float	The weighted number of shared mutations

Internal Logic

The function iterates through each character, ignoring missing states. For each matching state, it increments the similarity by 2 times the corresponding weight (or 2 if no weights are provided) if the state is not 0, and by 1 if the state is 0 and no weights are provided. The final similarity is normalized by the number of non-missing characters.

`exponential_negative_hamming_distance`

Description

Calculates a similarity score based on the inverse of the weighted Hamming distance. It uses the formula exp(-d(i,j)), where d is the weighted Hamming distance.

Inputs

Name	Type	Description
s1	List[int]	Character states of the first sample
s2	List[int]	Character states of the second sample
missing_state_indicator	int	The character representing missing values (default: -1)
weights	Optional[Dict[int, Dict[int, float]]]	A nested dictionary storing state weights for each character (default: None)

Outputs

Name	Type	Description
similarity	float	The similarity score

Internal Logic

The function first calculates the weighted Hamming distance (weighted_hamm_dist) using the same logic as weighted_hamming_distance. Then, it returns the exponential of the negative of this distance.

`cluster_dissimilarity`

Description

Calculates the dissimilarity between two potentially ambiguous character strings. An ambiguous character string can have multiple possible states for each character.

Inputs

Name	Type	Description
dissimilarity_function	Callable[[List[int], List[int], int, Dict[int, Dict[int, float]]], float]	The dissimilarity function to use for pairwise comparisons
s1	Union[List[int], List[Tuple[int, …]]]	The first (possibly ambiguous) sample
s2	Union[List[int], List[Tuple[int, …]]]	The second (possibly ambiguous) sample
missing_state_indicator	int	The character representing missing values
weights	Optional[Dict[int, Dict[int, float]]]	Optional weights for the dissimilarity function (default: None)
linkage_function	Callable[[Union[np.array, List[float]]], float]	The linkage function to aggregate dissimilarities (default: np.mean)
normalize	bool	Whether to normalize the dissimilarity by the proportion of shared non-missing positions (default: True)

Outputs

Name	Type	Description
dissimilarity	float	The dissimilarity between the two samples

Internal Logic

The function iterates through each character of the input strings. If a character is ambiguous (represented as a tuple), it calculates the dissimilarity for all possible combinations of states using the provided dissimilarity_function and aggregates them using the linkage_function. The final dissimilarity is the sum of individual character dissimilarities, optionally normalized by the number of shared non-missing positions.

`cluster_dissimilarity_weighted_hamming_distance_min_linkage`

Description

Calculates the dissimilarity between two potentially ambiguous character strings using the weighted Hamming distance and minimum linkage.

Inputs

Name	Type	Description
s1	Union[List[int], List[Tuple[int, …]]]	The first (possibly ambiguous) sample
s2	Union[List[int], List[Tuple[int, …]]]	The second (possibly ambiguous) sample
missing_state_indicator	int	The character representing missing values
weights	Optional[Dict[int, Dict[int, float]]]	Optional weights for the weighted Hamming distance (default: None)

Outputs

Name	Type	Description
dissimilarity	float	The dissimilarity between the two samples

Internal Logic

This function is similar to cluster_dissimilarity but specifically uses the weighted Hamming distance as the dissimilarity function and minimum linkage (np.min) to aggregate dissimilarities. It iterates through each character, calculates the weighted Hamming distance for all possible state combinations if the character is ambiguous, and takes the minimum distance. The final dissimilarity is normalized by the number of shared non-missing positions.

TODOs

hamming_similarity_without_missing: Optimize this using masks
hamming_similarity_normalized_over_missing: Optimize this using masks

Project Root

​High-level description

​Code Structure

​Symbols

​weighted_hamming_distance

​Description

​Inputs

​Outputs

​Internal Logic

​hamming_similarity_without_missing

​Description

​Inputs

​Outputs

​Internal Logic

​hamming_similarity_normalized_over_missing

​Description

​Inputs

​Outputs

​Internal Logic

​hamming_distance

​Description

​Inputs

​Outputs

​Internal Logic

​weighted_hamming_similarity

​Description

​Inputs

​Outputs

​Internal Logic

​exponential_negative_hamming_distance

​Description

​Inputs

​Outputs

​Internal Logic

​cluster_dissimilarity

​Description

​Inputs

​Outputs

​Internal Logic

​cluster_dissimilarity_weighted_hamming_distance_min_linkage

​Description

​Inputs

​Outputs

​Internal Logic

​TODOs

High-level description

Code Structure

Symbols

`weighted_hamming_distance`

Description

Inputs

Outputs

Internal Logic

`hamming_similarity_without_missing`

Description

Inputs

Outputs

Internal Logic

`hamming_similarity_normalized_over_missing`

Description

Inputs

Outputs

Internal Logic

`hamming_distance`

Description

Inputs

Outputs

Internal Logic

`weighted_hamming_similarity`

Description

Inputs

Outputs

Internal Logic

`exponential_negative_hamming_distance`

Description

Inputs

Outputs

Internal Logic

`cluster_dissimilarity`

Description

Inputs

Outputs

Internal Logic

`cluster_dissimilarity_weighted_hamming_distance_min_linkage`

Description

Inputs

Outputs

Internal Logic

TODOs