dissimilarity_functions.py
High-level description
This file provides a collection of dissimilarity and similarity functions used for comparing phylogenetic samples. These functions calculate the distance or similarity between pairs of samples based on their character states, considering factors like missing data, state weights, and linkage methods.
Code Structure
This code defines several functions for calculating dissimilarity and similarity between phylogenetic samples. Some functions are optimized using the numba
library. The cluster_dissimilarity
function utilizes other dissimilarity functions and linkage functions to handle ambiguous character strings.
Symbols
weighted_hamming_distance
Description
Calculates the weighted Hamming distance between two phylogenetic samples. It considers shared indel states and their probabilities, incrementing dissimilarity for differing states and decrementing for identical states based on their weights.
Inputs
Name | Type | Description |
---|---|---|
s1 | List[int] | Character states of the first sample |
s2 | List[int] | Character states of the second sample |
missing_state_indicator | int | The character representing missing values (default: -1) |
weights | Optional[Dict[int, Dict[int, float]]] | A nested dictionary storing state weights for each character, derived from state priors (default: None) |
Outputs
Name | Type | Description |
---|---|---|
distance | float | The weighted Hamming distance between the two samples |
Internal Logic
The function iterates through each character of the samples. If both states are identical and not missing, it decrements the dissimilarity by the probability of them occurring independently. If the states disagree, it increments the dissimilarity by the probability of those states occurring. The final dissimilarity is normalized by the number of non-missing characters shared by the samples.
hamming_similarity_without_missing
Description
Calculates the number of shared (non-missing) character/state mutations between two samples.
Inputs
Name | Type | Description |
---|---|---|
s1 | List[int] | Character states of the first sample |
s2 | List[int] | Character states of the second sample |
missing_state_indicator | int | The character representing missing values |
weights | Optional[Dict[int, Dict[int, float]]] | Optional weights to weight the similarity of a mutation (default: None) |
Outputs
Name | Type | Description |
---|---|---|
similarity | float | The number of shared mutations, weighted or unweighted |
Internal Logic
The function iterates through each character, ignoring missing states and uncut states (state 0). For each matching state, it increments the similarity by 1 or by the corresponding weight if provided.
hamming_similarity_normalized_over_missing
Description
Calculates the number of shared (non-missing) character/state mutations between two samples, normalized by the number of missing data events.
Inputs
Name | Type | Description |
---|---|---|
s1 | List[int] | Character states of the first sample |
s2 | List[int] | Character states of the second sample |
missing_state_indicator | int | The character representing missing values |
weights | Optional[Dict[int, Dict[int, float]]] | Optional weights to weight the similarity of a mutation (default: None) |
Outputs
Name | Type | Description |
---|---|---|
similarity | float | The normalized number of shared mutations, weighted or unweighted |
Internal Logic
Similar to hamming_similarity_without_missing
, but it also tracks the number of non-missing characters (num_present
). The final similarity score is divided by num_present
to normalize for missing data.
hamming_distance
Description
Calculates the vanilla Hamming distance between two samples, optionally ignoring missing data.
Inputs
Name | Type | Description |
---|---|---|
s1 | np.array(int) | The first sample |
s2 | np.array(int) | The second sample |
ignore_missing_state | bool | Whether to ignore missing states in the comparison (default: False) |
missing_state_indicator | int | The character representing missing values (default: -1) |
Outputs
Name | Type | Description |
---|---|---|
distance | int | The Hamming distance between the two samples |
Internal Logic
The function iterates through each character, incrementing the distance for each mismatch unless both states are missing and ignore_missing_state
is True.
weighted_hamming_similarity
Description
Calculates the weighted number of shared (non-missing) character/state mutations between two samples.
Inputs
Name | Type | Description |
---|---|---|
s1 | List[int] | Character states of the first sample |
s2 | List[int] | Character states of the second sample |
missing_state_indicator | int | The character representing missing values |
weights | Optional[Dict[int, Dict[int, float]]] | Optional weights to weight the similarity of a mutation (default: None) |
Outputs
Name | Type | Description |
---|---|---|
similarity | float | The weighted number of shared mutations |
Internal Logic
The function iterates through each character, ignoring missing states. For each matching state, it increments the similarity by 2 times the corresponding weight (or 2 if no weights are provided) if the state is not 0, and by 1 if the state is 0 and no weights are provided. The final similarity is normalized by the number of non-missing characters.
exponential_negative_hamming_distance
Description
Calculates a similarity score based on the inverse of the weighted Hamming distance. It uses the formula exp(-d(i,j)), where d is the weighted Hamming distance.
Inputs
Name | Type | Description |
---|---|---|
s1 | List[int] | Character states of the first sample |
s2 | List[int] | Character states of the second sample |
missing_state_indicator | int | The character representing missing values (default: -1) |
weights | Optional[Dict[int, Dict[int, float]]] | A nested dictionary storing state weights for each character (default: None) |
Outputs
Name | Type | Description |
---|---|---|
similarity | float | The similarity score |
Internal Logic
The function first calculates the weighted Hamming distance (weighted_hamm_dist
) using the same logic as weighted_hamming_distance
. Then, it returns the exponential of the negative of this distance.
cluster_dissimilarity
Description
Calculates the dissimilarity between two potentially ambiguous character strings. An ambiguous character string can have multiple possible states for each character.
Inputs
Name | Type | Description |
---|---|---|
dissimilarity_function | Callable[[List[int], List[int], int, Dict[int, Dict[int, float]]], float] | The dissimilarity function to use for pairwise comparisons |
s1 | Union[List[int], List[Tuple[int, …]]] | The first (possibly ambiguous) sample |
s2 | Union[List[int], List[Tuple[int, …]]] | The second (possibly ambiguous) sample |
missing_state_indicator | int | The character representing missing values |
weights | Optional[Dict[int, Dict[int, float]]] | Optional weights for the dissimilarity function (default: None) |
linkage_function | Callable[[Union[np.array, List[float]]], float] | The linkage function to aggregate dissimilarities (default: np.mean) |
normalize | bool | Whether to normalize the dissimilarity by the proportion of shared non-missing positions (default: True) |
Outputs
Name | Type | Description |
---|---|---|
dissimilarity | float | The dissimilarity between the two samples |
Internal Logic
The function iterates through each character of the input strings. If a character is ambiguous (represented as a tuple), it calculates the dissimilarity for all possible combinations of states using the provided dissimilarity_function
and aggregates them using the linkage_function
. The final dissimilarity is the sum of individual character dissimilarities, optionally normalized by the number of shared non-missing positions.
cluster_dissimilarity_weighted_hamming_distance_min_linkage
Description
Calculates the dissimilarity between two potentially ambiguous character strings using the weighted Hamming distance and minimum linkage.
Inputs
Name | Type | Description |
---|---|---|
s1 | Union[List[int], List[Tuple[int, …]]] | The first (possibly ambiguous) sample |
s2 | Union[List[int], List[Tuple[int, …]]] | The second (possibly ambiguous) sample |
missing_state_indicator | int | The character representing missing values |
weights | Optional[Dict[int, Dict[int, float]]] | Optional weights for the weighted Hamming distance (default: None) |
Outputs
Name | Type | Description |
---|---|---|
dissimilarity | float | The dissimilarity between the two samples |
Internal Logic
This function is similar to cluster_dissimilarity
but specifically uses the weighted Hamming distance as the dissimilarity function and minimum linkage (np.min) to aggregate dissimilarities. It iterates through each character, calculates the weighted Hamming distance for all possible state combinations if the character is ambiguous, and takes the minimum distance. The final dissimilarity is normalized by the number of shared non-missing positions.
TODOs
hamming_similarity_without_missing
: Optimize this using maskshamming_similarity_normalized_over_missing
: Optimize this using masks