High-level description

The assign_missing_average function is a missing data imputation method used in the Cassiopeia-Greedy algorithm. It aims to assign samples with missing data at a particular character to the most similar partition (left or right) based on the average number of shared mutations. This function is particularly useful for handling ambiguous character states, such as those arising from uncertainties in single-cell sequencing data.

Symbols

assign_missing_average

Description

This function takes a character matrix, partition information, and a list of samples with missing data. It calculates a score for each missing sample against both the left and right partitions, based on the average number of shared mutations. The sample is then assigned to the partition with the higher score.

Inputs

NameTypeDescription
character_matrixpandas.DataFrameThe character matrix containing observed character states for all samples.
missing_state_indicatorintThe character representing missing values in the character matrix.
left_setList[str]A list of sample names currently assigned to the left partition.
right_setList[str]A list of sample names currently assigned to the right partition.
missingList[str]A list of sample names with missing data at the current character being considered.
weightsOptional[Dict[int, Dict[int, float]]]Optional weights for character/state mutation pairs, derived from priors.

Outputs

NameTypeDescription
left_setList[str]The updated left partition with imputed missing samples.
right_setList[str]The updated right partition with imputed missing samples.

Internal Logic

  1. Convert sample names to indices: For efficient indexing, the function first converts the sample names in left_set, right_set, and missing to their corresponding integer indices in the character_matrix.
  2. Define score_side helper function: This function calculates the similarity score between a missing sample and a given partition (left or right). It iterates through each character, comparing the states of the missing sample to the states present in the partition. The score is calculated based on the number of shared states, optionally weighted by the provided weights.
  3. Calculate scores for each missing sample: For each sample in missing, the function calculates its similarity score to both the left and right partitions using the score_side function.
  4. Assign missing samples based on scores: The missing sample is assigned to the partition with the higher average similarity score. The average score is calculated by dividing the total score by the number of samples in the respective partition.
  5. Return updated partitions: The function returns the updated left_set and right_set lists, now including the imputed missing samples.

References

  • solver_utilities.convert_sample_names_to_indices: Used to convert sample names to their corresponding indices in the character matrix.
  • unravel_ambiguous_states: Imported from cassiopeia.mixins and used to handle ambiguous character states, expanding them into a list of all possible states.

Side Effects

This function modifies the input left_set and right_set lists by adding the imputed missing samples to the appropriate partition.

Performance Considerations

The function first converts sample names to indices to improve the efficiency of accessing elements in the character matrix. The use of NumPy arrays for storing and comparing character states further enhances performance.