map_utils.py
High-level description
This code defines a function, map_intbcs, that processes a molecule table to resolve allele ambiguity for each cellBC/intBC pair. It identifies the most frequent allele based on read count and UMI count, then filters out alignments with other alleles for each pair.
Symbols
map_intbcs
Description
This function takes a molecule table as input and returns a filtered molecule table in which each cellBC/intBC pair is associated with a single allele. It groups the input table by cellBC, intBC, and allele, ranks the alleles within each cellBC/intBC pair by UMI count and summed read count, and keeps the top-ranked allele; the exact sort order is given under Internal Logic. Alignments with any other allele for the same cellBC/intBC pair are removed.
Inputs
Name | Type | Description |
---|---|---|
molecule_table | pandas.DataFrame | A molecule table containing cellBC, intBC, allele, readCount, and UMI information. |
Outputs
Name | Type | Description |
---|---|---|
mapped_table | pandas.DataFrame | The filtered molecule table, in which each cellBC/intBC pair is associated with a single allele. |
Internal Logic
- Drops rows with missing intBC values from the input molecule_table.
- Groups the remaining rows by cellBC, intBC, and allele.
- Aggregates the groups by summing readCount and counting UMI.
- Sorts the aggregated table first by UMI in descending order, then by readCount in descending order.
- Identifies duplicate cellBC/intBC pairs, keeping only the first occurrence (the allele with the highest UMI count, with ties broken by read count).
- Creates a set of tuples representing the selected cellBC, intBC, and allele combinations.
- Filters the molecule_table to keep only rows whose cellBC, intBC, and allele combination exists in the set created in the previous step.
- Logs the number of removed alleles and UMIs during the filtering process.
- Returns the filtered molecule_table.
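The steps above can be sketched in pandas as follows. This is a minimal illustration written from the description in this section, not Cassiopeia's source: the name map_intbcs_sketch and the toy data are assumptions, and the logging step is omitted.

```python
import pandas as pd


def map_intbcs_sketch(molecule_table: pd.DataFrame) -> pd.DataFrame:
    """Illustrative re-implementation of the listed steps (hypothetical name)."""
    # Drop rows with a missing intBC.
    table = molecule_table.dropna(subset=["intBC"])

    # Group by cellBC/intBC/allele; sum readCount and count UMIs per group.
    allele_table = (
        table.groupby(["cellBC", "intBC", "allele"])
        .agg(readCount=("readCount", "sum"), UMI=("UMI", "count"))
        .reset_index()
    )

    # Sort by UMI count first, then read count, both descending.
    allele_table = allele_table.sort_values(["UMI", "readCount"], ascending=False)

    # Keep only the first (top-ranked) row for each cellBC/intBC pair.
    top = allele_table.drop_duplicates(["cellBC", "intBC"], keep="first")

    # Build the set of selected (cellBC, intBC, allele) tuples.
    selected = set(zip(top["cellBC"], top["intBC"], top["allele"]))

    # Keep only molecules whose combination is in the selected set.
    keys = zip(table["cellBC"], table["intBC"], table["allele"])
    mask = pd.Series([key in selected for key in keys], index=table.index)
    return table[mask]


# Toy molecule table (made-up values, for illustration only).
molecules = pd.DataFrame(
    {
        "cellBC": ["A", "A", "A", "B", "B"],
        "intBC": ["X", "X", "X", "Y", None],
        "allele": ["a1", "a1", "a2", "a3", "a4"],
        "UMI": ["u1", "u2", "u3", "u4", "u5"],
        "readCount": [10, 5, 50, 7, 3],
    }
)
mapped = map_intbcs_sketch(molecules)
```

In this toy table, allele a1 is kept for the pair A/X because it has two UMIs, even though a2 has more reads; that follows from sorting by UMI before readCount. The row with a missing intBC is dropped in the first step.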
References
- logger: Used for logging debug messages. Imported from cassiopeia.mixins.
- utilities.log_molecule_table: A decorator function used to log statistics of the molecule table before and after the function execution. Imported from cassiopeia.preprocess.
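To illustrate the general pattern of a decorator that logs table statistics before and after a function runs, here is a hedged sketch. It is not the implementation of utilities.log_molecule_table: the decorator name log_table_stats, the logger name, the logged statistic (row count), and the helper drop_missing_intbcs are all hypothetical.

```python
import functools
import logging

import pandas as pd

# Stand-in logger; the real module imports its logger from cassiopeia.mixins.
logger = logging.getLogger("map_utils_sketch")


def log_table_stats(func):
    """Hypothetical decorator: log row counts before and after func runs."""

    @functools.wraps(func)
    def wrapper(molecule_table: pd.DataFrame, *args, **kwargs):
        logger.debug("%s input: %d rows", func.__name__, len(molecule_table))
        result = func(molecule_table, *args, **kwargs)
        logger.debug("%s output: %d rows", func.__name__, len(result))
        return result

    return wrapper


@log_table_stats
def drop_missing_intbcs(molecule_table: pd.DataFrame) -> pd.DataFrame:
    # Hypothetical helper used only to exercise the decorator.
    return molecule_table.dropna(subset=["intBC"])


table = pd.DataFrame({"intBC": ["X", None, "Y"], "UMI": ["u1", "u2", "u3"]})
filtered = drop_missing_intbcs(table)
```

Using functools.wraps keeps the decorated function's name and docstring intact, which matters when the name appears in log messages.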
Side Effects
- Logs debug messages indicating the number of alleles and UMIs removed during the filtering process.
Performance Considerations
The function uses pandas groupby and apply operations, which can be slow on large molecule tables. If this becomes a bottleneck, consider replacing them with vectorized pandas operations or alternative data structures.
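One possible vectorized alternative to a tuple-membership filter is an inner merge against the selected rows. The sketch below assumes the selected cellBC/intBC/allele combinations are available as a DataFrame (here called selected, a hypothetical name) and uses made-up data; whether it is actually faster depends on the dataset and has not been benchmarked here.

```python
import pandas as pd

# Toy molecule table (made-up values, for illustration only).
molecules = pd.DataFrame(
    {
        "cellBC": ["A", "A", "A", "B"],
        "intBC": ["X", "X", "X", "Y"],
        "allele": ["a1", "a1", "a2", "a3"],
        "UMI": ["u1", "u2", "u3", "u4"],
        "readCount": [10, 5, 50, 7],
    }
)

# One selected allele per cellBC/intBC pair (assumed to be computed earlier).
selected = pd.DataFrame(
    {"cellBC": ["A", "B"], "intBC": ["X", "Y"], "allele": ["a1", "a3"]}
)

# An inner merge keeps only molecules whose key combination appears in
# `selected`. Note that merge returns a new DataFrame with a fresh index.
key_columns = ["cellBC", "intBC", "allele"]
filtered = molecules.merge(selected, on=key_columns, how="inner")
```

Because selected holds each key combination at most once, the merge cannot duplicate molecule rows; if that uniqueness is not guaranteed, deduplicate selected first.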