High-level description

This file provides a collection of dissimilarity and similarity functions used for comparing phylogenetic samples. These functions calculate the distance or similarity between pairs of samples based on their character states, considering factors like missing data, state weights, and linkage methods.

Code Structure

The file consists of standalone functions, some of which are accelerated with the numba library. The cluster_dissimilarity function combines any of the pairwise dissimilarity functions with a linkage function to handle ambiguous character strings.

Symbols

weighted_hamming_distance

Description

Calculates the weighted Hamming distance between two phylogenetic samples. It considers shared indel states and their probabilities, incrementing dissimilarity for differing states and decrementing for identical states based on their weights.

Inputs

| Name | Type | Description |
| --- | --- | --- |
| s1 | List[int] | Character states of the first sample |
| s2 | List[int] | Character states of the second sample |
| missing_state_indicator | int | The character representing missing values (default: -1) |
| weights | Optional[Dict[int, Dict[int, float]]] | A nested dictionary storing state weights for each character, derived from state priors (default: None) |

Outputs

| Name | Type | Description |
| --- | --- | --- |
| distance | float | The weighted Hamming distance between the two samples |

Internal Logic

The function iterates through each character of the samples. If both states are identical and not missing, it decrements the dissimilarity by the probability of them occurring independently. If the states disagree, it increments the dissimilarity by the probability of those states occurring. The final dissimilarity is normalized by the number of non-missing characters shared by the samples.
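The exact per-character increment and decrement depend on how the weights are derived from the state priors, which this summary does not spell out. The sketch below therefore covers only the two steps stated above: skipping positions that are missing in either sample, and normalizing the accumulated total by the number of shared non-missing characters. The `per_character_score` callback, the `unit_mismatch` helper, and the zero return for samples with no shared characters are hypothetical stand-ins, not part of the documented function.

```python
from typing import Callable, Sequence


def normalized_score_sketch(
    s1: Sequence[int],
    s2: Sequence[int],
    per_character_score: Callable[[int, int, int], float],  # hypothetical hook
    missing_state_indicator: int = -1,
) -> float:
    """Accumulate a per-character score and normalize by shared characters."""
    total = 0.0
    num_present = 0
    for i, (a, b) in enumerate(zip(s1, s2)):
        # Skip positions that are missing in either sample.
        if a == missing_state_indicator or b == missing_state_indicator:
            continue
        num_present += 1
        total += per_character_score(i, a, b)
    # Assumption: samples with no shared non-missing characters score 0.0.
    if num_present == 0:
        return 0.0
    return total / num_present


def unit_mismatch(i: int, a: int, b: int) -> float:
    """Hypothetical scoring rule: 1.0 for a mismatch, 0.0 for a match."""
    return 0.0 if a == b else 1.0


# Two of the four positions are shared, and one of those two disagrees.
example = normalized_score_sketch([1, 0, -1, 2], [1, 3, 2, -1], unit_mismatch)
print(example)  # 0.5
```

A weighted rule would replace `unit_mismatch` with a function that looks up the weights for character `i`.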

hamming_similarity_without_missing

Description

Calculates the number of shared (non-missing) character/state mutations between two samples.

Inputs

| Name | Type | Description |
| --- | --- | --- |
| s1 | List[int] | Character states of the first sample |
| s2 | List[int] | Character states of the second sample |
| missing_state_indicator | int | The character representing missing values |
| weights | Optional[Dict[int, Dict[int, float]]] | Optional weights to weight the similarity of a mutation (default: None) |

Outputs

| Name | Type | Description |
| --- | --- | --- |
| similarity | float | The number of shared mutations, weighted or unweighted |

Internal Logic

The function iterates through each character, ignoring missing states and uncut states (state 0). For each matching state, it increments the similarity by 1 or by the corresponding weight if provided.
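The loop described above can be sketched as follows. This is an illustrative reimplementation rather than the library's code; the function name `shared_mutations_sketch` and the `Weights` alias are assumptions introduced for the example.

```python
from typing import Dict, Optional, Sequence

Weights = Dict[int, Dict[int, float]]  # character index -> state -> weight


def shared_mutations_sketch(
    s1: Sequence[int],
    s2: Sequence[int],
    missing_state_indicator: int,
    weights: Optional[Weights] = None,
) -> float:
    """Count (or weight) mutations shared by two samples."""
    similarity = 0.0
    for i, (a, b) in enumerate(zip(s1, s2)):
        # Ignore positions that are missing in either sample.
        if a == missing_state_indicator or b == missing_state_indicator:
            continue
        # Ignore uncut positions (state 0); they are not mutations.
        if a == 0 or b == 0:
            continue
        if a == b:
            similarity += weights[i][a] if weights else 1.0
    return similarity


s1 = [1, 0, 2, -1, 3]
s2 = [1, 0, 5, -1, 3]
print(shared_mutations_sketch(s1, s2, -1))  # 2.0
print(shared_mutations_sketch(s1, s2, -1, {0: {1: 0.5}, 4: {3: 2.0}}))  # 2.5
```

Only positions 0 and 4 contribute: position 1 is uncut in both samples, position 2 disagrees, and position 3 is missing.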

hamming_similarity_normalized_over_missing

Description

Calculates the number of shared (non-missing) character/state mutations between two samples, normalized by the number of characters that are not missing in the comparison, so that samples with more missing data are not penalized.

Inputs

| Name | Type | Description |
| --- | --- | --- |
| s1 | List[int] | Character states of the first sample |
| s2 | List[int] | Character states of the second sample |
| missing_state_indicator | int | The character representing missing values |
| weights | Optional[Dict[int, Dict[int, float]]] | Optional weights to weight the similarity of a mutation (default: None) |

Outputs

| Name | Type | Description |
| --- | --- | --- |
| similarity | float | The normalized number of shared mutations, weighted or unweighted |

Internal Logic

Similar to hamming_similarity_without_missing, but it also tracks the number of non-missing characters (num_present). The final similarity score is divided by num_present to normalize for missing data.
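A minimal sketch of the normalized variant is shown below. It assumes that num_present counts every position that is non-missing in both samples (uncut positions included) and that the result is 0.0 when no such position exists; neither detail is confirmed by this summary. The function name is hypothetical.

```python
from typing import Dict, Optional, Sequence

Weights = Dict[int, Dict[int, float]]  # character index -> state -> weight


def normalized_shared_mutations_sketch(
    s1: Sequence[int],
    s2: Sequence[int],
    missing_state_indicator: int,
    weights: Optional[Weights] = None,
) -> float:
    """Shared mutations divided by the number of shared non-missing characters."""
    similarity = 0.0
    num_present = 0
    for i, (a, b) in enumerate(zip(s1, s2)):
        if a == missing_state_indicator or b == missing_state_indicator:
            continue
        # Assumption: uncut positions still count toward num_present.
        num_present += 1
        if a == 0 or b == 0:
            continue
        if a == b:
            similarity += weights[i][a] if weights else 1.0
    # Assumption: avoid dividing by zero when nothing is shared.
    if num_present == 0:
        return 0.0
    return similarity / num_present


# Four positions are present in both samples; two of them share a mutation.
print(normalized_shared_mutations_sketch([1, 0, 2, -1, 3], [1, 0, 5, -1, 3], -1))  # 0.5
```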

hamming_distance

Description

Calculates the vanilla Hamming distance between two samples, optionally ignoring missing data.

Inputs

| Name | Type | Description |
| --- | --- | --- |
| s1 | np.array(int) | The first sample |
| s2 | np.array(int) | The second sample |
| ignore_missing_state | bool | Whether to ignore missing states in the comparison (default: False) |
| missing_state_indicator | int | The character representing missing values (default: -1) |

Outputs

| Name | Type | Description |
| --- | --- | --- |
| distance | int | The Hamming distance between the two samples |

Internal Logic

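The function iterates through each character and increments the distance for each mismatch. When ignore_missing_state is True, a mismatch is not counted if either of the two states is missing. (Two missing states compare equal, so they never count as a mismatch.)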
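The sketch below illustrates this counting rule in plain Python. It assumes that, with ignore_missing_state enabled, a position is skipped whenever either sample is missing there, since two missing states already compare equal. The function name is hypothetical, and the sketch accepts any integer sequence, including NumPy arrays.

```python
from typing import Sequence


def hamming_distance_sketch(
    s1: Sequence[int],
    s2: Sequence[int],
    ignore_missing_state: bool = False,
    missing_state_indicator: int = -1,
) -> int:
    """Count positions at which the two samples differ."""
    dist = 0
    for a, b in zip(s1, s2):
        if a == b:
            continue
        either_missing = (
            a == missing_state_indicator or b == missing_state_indicator
        )
        # Assumption: a mismatch that involves a missing state is skipped
        # only when the caller asked to ignore missing data.
        if ignore_missing_state and either_missing:
            continue
        dist += 1
    return dist


s1 = [1, -1, 2, 0]
s2 = [1, 3, 5, -1]
print(hamming_distance_sketch(s1, s2))  # 3
print(hamming_distance_sketch(s1, s2, ignore_missing_state=True))  # 1
```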

weighted_hamming_similarity

Description

Calculates the weighted number of shared (non-missing) character/state mutations between two samples.

Inputs

| Name | Type | Description |
| --- | --- | --- |
| s1 | List[int] | Character states of the first sample |
| s2 | List[int] | Character states of the second sample |
| missing_state_indicator | int | The character representing missing values |
| weights | Optional[Dict[int, Dict[int, float]]] | Optional weights to weight the similarity of a mutation (default: None) |

Outputs

| Name | Type | Description |
| --- | --- | --- |
| similarity | float | The weighted number of shared mutations |

Internal Logic

The function iterates through each character, ignoring missing states. For each position where the two states match, a non-zero state adds twice its weight to the similarity (or 2 if no weights are provided), while the uncut state 0 adds 1, and only when no weights are provided. The final similarity is normalized by the number of non-missing characters.
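The scoring rule above can be sketched as follows. The function name and the zero return for samples with no shared characters are assumptions made for the example; the increments follow the description in this section.

```python
from typing import Dict, Optional, Sequence

Weights = Dict[int, Dict[int, float]]  # character index -> state -> weight


def weighted_hamming_similarity_sketch(
    s1: Sequence[int],
    s2: Sequence[int],
    missing_state_indicator: int,
    weights: Optional[Weights] = None,
) -> float:
    """Weighted count of matching states, normalized by shared characters."""
    similarity = 0.0
    num_present = 0
    for i, (a, b) in enumerate(zip(s1, s2)):
        if a == missing_state_indicator or b == missing_state_indicator:
            continue
        num_present += 1
        if a != b:
            continue
        if a != 0:
            # A shared mutation counts double, scaled by its weight if given.
            similarity += 2.0 * weights[i][a] if weights else 2.0
        elif not weights:
            # A shared uncut state counts once, but only when unweighted.
            similarity += 1.0
    # Assumption: return 0.0 when no characters are shared.
    if num_present == 0:
        return 0.0
    return similarity / num_present


s1 = [1, 0, 2, -1]
s2 = [1, 0, 3, -1]
print(weighted_hamming_similarity_sketch(s1, s2, -1))  # (2 + 1) / 3 = 1.0
```

With weights `{0: {1: 0.25}}` the same pair scores `(2 * 0.25) / 3`, because the shared uncut state no longer contributes.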

exponential_negative_hamming_distance

Description

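Calculates a similarity score as the negative exponential of the weighted Hamming distance, exp(-d(i,j)), where d(i,j) is the weighted Hamming distance between samples i and j. Identical samples (d = 0) score 1, and the score decays toward 0 as the distance grows.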

Inputs

| Name | Type | Description |
| --- | --- | --- |
| s1 | List[int] | Character states of the first sample |
| s2 | List[int] | Character states of the second sample |
| missing_state_indicator | int | The character representing missing values (default: -1) |
| weights | Optional[Dict[int, Dict[int, float]]] | A nested dictionary storing state weights for each character (default: None) |

Outputs

| Name | Type | Description |
| --- | --- | --- |
| similarity | float | The similarity score |

Internal Logic

The function first calculates the weighted Hamming distance (weighted_hamm_dist) using the same logic as weighted_hamming_distance. Then, it returns the exponential of the negative of this distance.
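The final transformation is small enough to show in isolation. The helper below is a hypothetical illustration that takes an already computed distance as input rather than the two samples.

```python
import math


def exp_negative_similarity_sketch(distance: float) -> float:
    """Map a non-negative distance to a similarity in (0, 1]."""
    return math.exp(-distance)


for d in (0.0, 0.5, 2.0):
    print(d, round(exp_negative_similarity_sketch(d), 4))
# 0.0 1.0
# 0.5 0.6065
# 2.0 0.1353
```

A distance of zero maps to a similarity of exactly 1, and larger distances shrink the similarity monotonically.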

cluster_dissimilarity

Description

Calculates the dissimilarity between two potentially ambiguous character strings. An ambiguous character string can have multiple possible states for each character.

Inputs

| Name | Type | Description |
| --- | --- | --- |
| dissimilarity_function | Callable[[List[int], List[int], int, Dict[int, Dict[int, float]]], float] | The dissimilarity function to use for pairwise comparisons |
| s1 | Union[List[int], List[Tuple[int, …]]] | The first (possibly ambiguous) sample |
| s2 | Union[List[int], List[Tuple[int, …]]] | The second (possibly ambiguous) sample |
| missing_state_indicator | int | The character representing missing values |
| weights | Optional[Dict[int, Dict[int, float]]] | Optional weights for the dissimilarity function (default: None) |
| linkage_function | Callable[[Union[np.array, List[float]]], float] | The linkage function to aggregate dissimilarities (default: np.mean) |
| normalize | bool | Whether to normalize the dissimilarity by the proportion of shared non-missing positions (default: True) |

Outputs

| Name | Type | Description |
| --- | --- | --- |
| dissimilarity | float | The dissimilarity between the two samples |

Internal Logic

The function iterates through each character of the input strings. If a character is ambiguous (represented as a tuple), it calculates the dissimilarity for all possible combinations of states using the provided dissimilarity_function and aggregates them using the linkage_function. The final dissimilarity is the sum of individual character dissimilarities, optionally normalized by the number of shared non-missing positions.
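The following sketch shows one way the expansion and aggregation could work. It is not the library's implementation: it uses `statistics.mean` in place of `np.mean` to stay dependency-free, re-keys the per-character weights to position 0 because the pairwise function sees one character at a time, and treats a position as shared when at least one state pair is non-missing. All three choices, along with the `mismatch_count` helper, are assumptions.

```python
import itertools
from statistics import mean
from typing import Callable, Dict, List, Optional, Sequence, Tuple, Union

State = Union[int, Tuple[int, ...]]
Weights = Dict[int, Dict[int, float]]


def cluster_dissimilarity_sketch(
    dissimilarity_function: Callable[..., float],
    s1: Sequence[State],
    s2: Sequence[State],
    missing_state_indicator: int,
    weights: Optional[Weights] = None,
    linkage_function: Callable[[List[float]], float] = mean,
    normalize: bool = True,
) -> float:
    total = 0.0
    num_present = 0
    for i, (c1, c2) in enumerate(zip(s1, s2)):
        # Wrap unambiguous states so every character expands the same way.
        c1 = c1 if isinstance(c1, tuple) else (c1,)
        c2 = c2 if isinstance(c2, tuple) else (c2,)
        # Assumption: weights for character i are re-keyed to position 0.
        char_weights = {0: weights.get(i, {})} if weights else None
        scores = []
        present = False
        for a, b in itertools.product(c1, c2):
            scores.append(
                dissimilarity_function(
                    [a], [b], missing_state_indicator, char_weights
                )
            )
            if a != missing_state_indicator and b != missing_state_indicator:
                present = True
        if scores:
            total += linkage_function(scores)
            if present:
                num_present += 1
    if not normalize:
        return total
    return total / num_present if num_present else 0.0


def mismatch_count(s1, s2, missing_state_indicator, weights=None) -> float:
    """Hypothetical pairwise function: mismatches over non-missing positions."""
    return float(
        sum(
            a != b
            for a, b in zip(s1, s2)
            if a != missing_state_indicator and b != missing_state_indicator
        )
    )


a = [(1, 2), 0, -1]
b = [1, (0, 3), 2]
print(cluster_dissimilarity_sketch(mismatch_count, a, b, -1))  # 0.5
print(cluster_dissimilarity_sketch(mismatch_count, a, b, -1, normalize=False))  # 1.0
print(cluster_dissimilarity_sketch(mismatch_count, a, b, -1, linkage_function=min))  # 0.0
```

In the example, each of the first two characters has one matching and one mismatching state pair, so mean linkage gives 0.5 per character while minimum linkage gives 0.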

cluster_dissimilarity_weighted_hamming_distance_min_linkage

Description

Calculates the dissimilarity between two potentially ambiguous character strings using the weighted Hamming distance and minimum linkage.

Inputs

| Name | Type | Description |
| --- | --- | --- |
| s1 | Union[List[int], List[Tuple[int, …]]] | The first (possibly ambiguous) sample |
| s2 | Union[List[int], List[Tuple[int, …]]] | The second (possibly ambiguous) sample |
| missing_state_indicator | int | The character representing missing values |
| weights | Optional[Dict[int, Dict[int, float]]] | Optional weights for the weighted Hamming distance (default: None) |

Outputs

| Name | Type | Description |
| --- | --- | --- |
| dissimilarity | float | The dissimilarity between the two samples |

Internal Logic

This function is similar to cluster_dissimilarity but specifically uses the weighted Hamming distance as the dissimilarity function and minimum linkage (np.min) to aggregate dissimilarities. It iterates through each character, calculates the weighted Hamming distance for all possible state combinations if the character is ambiguous, and takes the minimum distance. The final dissimilarity is normalized by the number of shared non-missing positions.
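To show what minimum linkage does for an ambiguous character, here is a compact sketch. The `pair_distance` helper is a hypothetical stand-in for the weighted Hamming distance of a single state pair (each differing non-zero state adds its weight, or 1.0 when unweighted); it is an assumption for the example and not the library's definition. Dropping state pairs that involve a missing value before taking the minimum is also an assumption.

```python
import itertools
from typing import Dict, Optional, Sequence, Tuple, Union

State = Union[int, Tuple[int, ...]]
Weights = Dict[int, Dict[int, float]]


def pair_distance(a: int, b: int, w: Optional[Dict[int, float]]) -> float:
    """Hypothetical per-pair distance: 0 for a match, else summed weights."""
    if a == b:
        return 0.0
    # Assumption: only non-zero (mutated) states contribute to the distance.
    return sum((w.get(s, 1.0) if w else 1.0) for s in (a, b) if s != 0)


def min_linkage_sketch(
    s1: Sequence[State],
    s2: Sequence[State],
    missing_state_indicator: int,
    weights: Optional[Weights] = None,
) -> float:
    total = 0.0
    num_present = 0
    for i, (c1, c2) in enumerate(zip(s1, s2)):
        c1 = c1 if isinstance(c1, tuple) else (c1,)
        c2 = c2 if isinstance(c2, tuple) else (c2,)
        # Assumption: state pairs involving a missing value are dropped.
        pairs = [
            (a, b)
            for a, b in itertools.product(c1, c2)
            if a != missing_state_indicator and b != missing_state_indicator
        ]
        if not pairs:
            continue
        w = weights.get(i) if weights else None
        # Minimum linkage: keep the closest pair of candidate states.
        total += min(pair_distance(a, b, w) for a, b in pairs)
        num_present += 1
    return total / num_present if num_present else 0.0


# Character 0 is ambiguous in s1; the candidate state 2 matches s2 exactly.
print(min_linkage_sketch([(1, 2), 3, -1, 0], [2, 4, 5, 7], -1))  # 1.0
```

Minimum linkage is the most permissive choice: an ambiguous character contributes no dissimilarity as long as any one of its candidate states agrees with the other sample.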

TODOs

  • hamming_similarity_without_missing: Optimize this using masks
  • hamming_similarity_normalized_over_missing: Optimize this using masks