missing_data_methods.py
High-level description
The assign_missing_average
function is a missing data imputation method used in the Cassiopeia-Greedy algorithm. It aims to assign samples with missing data at a particular character to the most similar partition (left or right) based on the average number of shared mutations. This function is particularly useful for handling ambiguous character states, such as those arising from uncertainties in single-cell sequencing data.
Symbols
assign_missing_average
Description
This function takes a character matrix, partition information, and a list of samples with missing data. It calculates a score for each missing sample against both the left and right partitions, based on the average number of shared mutations. The sample is then assigned to the partition with the higher score.
Inputs
Name | Type | Description |
---|---|---|
character_matrix | pandas.DataFrame | The character matrix containing observed character states for all samples. |
missing_state_indicator | int | The character representing missing values in the character matrix. |
left_set | List[str] | A list of sample names currently assigned to the left partition. |
right_set | List[str] | A list of sample names currently assigned to the right partition. |
missing | List[str] | A list of sample names with missing data at the current character being considered. |
weights | Optional[Dict[int, Dict[int, float]]] | Optional weights for character/state mutation pairs, derived from priors. |
Outputs
Name | Type | Description |
---|---|---|
left_set | List[str] | The updated left partition with imputed missing samples. |
right_set | List[str] | The updated right partition with imputed missing samples. |
Internal Logic
- Convert sample names to indices: For efficient indexing, the function first converts the sample names in
left_set
,right_set
, andmissing
to their corresponding integer indices in thecharacter_matrix
. - Define
score_side
helper function: This function calculates the similarity score between a missing sample and a given partition (left or right). It iterates through each character, comparing the states of the missing sample to the states present in the partition. The score is calculated based on the number of shared states, optionally weighted by the providedweights
. - Calculate scores for each missing sample: For each sample in
missing
, the function calculates its similarity score to both the left and right partitions using thescore_side
function. - Assign missing samples based on scores: The missing sample is assigned to the partition with the higher average similarity score. The average score is calculated by dividing the total score by the number of samples in the respective partition.
- Return updated partitions: The function returns the updated
left_set
andright_set
lists, now including the imputed missing samples.
References
solver_utilities.convert_sample_names_to_indices
: Used to convert sample names to their corresponding indices in the character matrix.unravel_ambiguous_states
: Imported fromcassiopeia.mixins
and used to handle ambiguous character states, expanding them into a list of all possible states.
Side Effects
This function modifies the input left_set
and right_set
lists by adding the imputed missing samples to the appropriate partition.
Performance Considerations
The function first converts sample names to indices to improve the efficiency of accessing elements in the character matrix. The use of NumPy arrays for storing and comparing character states further enhances performance.