High-level description
This code defines a function, `map_intbcs`, that processes a molecule table to resolve allele ambiguity for each cellBC/intBC pair. It identifies the most frequent allele based on read count and UMI count, then filters out alignments with other alleles for each pair.
Symbols
map_intbcs
Description
This function takes a molecule table as input and returns a filtered molecule table in which each cellBC/intBC pair is associated with a single allele. It does this by grouping the input table by cellBC, intBC, and allele, ranking the alleles observed for each cellBC/intBC pair by read count and UMI count, and keeping the top-ranked one. Alignments with other alleles for the same cellBC/intBC pair are removed.
Inputs
Name | Type | Description |
---|---|---|
molecule_table | pandas.DataFrame | A molecule table containing cellBC, intBC, allele, readCount, and UMI information. |
Outputs
Name | Type | Description |
---|---|---|
mapped_table | pandas.DataFrame | The input molecule table, filtered so that each cellBC/intBC pair is associated with a single allele. |
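The output contract above can be checked with a small helper. This is a minimal sketch, not part of the documented code: the name `has_single_allele_per_pair` is hypothetical, and it assumes the table uses the column names `cellBC`, `intBC`, and `allele` listed under Inputs.

```python
import pandas as pd


def has_single_allele_per_pair(table: pd.DataFrame) -> bool:
    """Return True if every cellBC/intBC pair maps to exactly one allele.

    Hypothetical helper for illustration only. Rows whose cellBC or intBC
    is missing are ignored, because groupby drops missing keys by default.
    """
    alleles_per_pair = table.groupby(["cellBC", "intBC"])["allele"].nunique()
    return bool((alleles_per_pair == 1).all())
```

A check like this could be run on the returned table in a test, for example `assert has_single_allele_per_pair(mapped_table)`.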
Internal Logic
1. Drops rows with missing `intBC` values from the input `molecule_table`.
2. Groups the remaining rows by `cellBC`, `intBC`, and `allele`.
3. Aggregates the groups by summing `readCount` and counting `UMI`.
4. Sorts the aggregated table first by `UMI` in descending order, then by `readCount` in descending order.
5. Identifies duplicate `cellBC`/`intBC` pairs, keeping only the first occurrence (which corresponds to the highest read and UMI count).
6. Creates a set of tuples representing the selected `cellBC`, `intBC`, and `allele` combinations.
7. Filters the original `molecule_table` to keep only rows where the `cellBC`, `intBC`, and `allele` combination exists in the set created in step 6.
8. Logs the number of removed alleles and UMIs during the filtering process.
9. Returns the filtered `molecule_table`.
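The steps above can be sketched in pandas as follows. This is an illustrative reconstruction from the description, not the library's source: the function name `map_intbcs_sketch` is hypothetical, the logging step is omitted, and the final filter is written as a plain membership test against the set of selected tuples.

```python
import numpy as np
import pandas as pd


def map_intbcs_sketch(molecule_table: pd.DataFrame) -> pd.DataFrame:
    """Keep one allele per cellBC/intBC pair (illustrative sketch only)."""
    # Drop rows with a missing intBC.
    table = molecule_table.dropna(subset=["intBC"])

    # One row per (cellBC, intBC, allele): summed reads and number of UMIs.
    allele_table = (
        table.groupby(["cellBC", "intBC", "allele"])
        .agg(readCount=("readCount", "sum"), UMI=("UMI", "count"))
        .reset_index()
    )

    # Rank by UMI count, then by read count, both descending.
    allele_table = allele_table.sort_values(["UMI", "readCount"], ascending=False)

    # The first row seen for each cellBC/intBC pair is its top-ranked allele.
    top = allele_table.drop_duplicates(["cellBC", "intBC"], keep="first")

    # Set of selected (cellBC, intBC, allele) tuples.
    selected = set(
        top[["cellBC", "intBC", "allele"]].itertuples(index=False, name=None)
    )

    # Keep only molecules whose combination was selected.
    mask = np.array(
        [
            key in selected
            for key in zip(table["cellBC"], table["intBC"], table["allele"])
        ],
        dtype=bool,
    )
    return table[mask]
```

In this sketch, an allele supported by three UMIs beats an allele supported by one UMI even if the latter has more reads, because UMI count is the first sort key.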
References
- `logger`: Used for logging debug messages. Imported from `cassiopeia.mixins`.
- `utilities.log_molecule_table`: A decorator function used to log statistics of the molecule table before and after the function execution. Imported from `cassiopeia.preprocess`.
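To illustrate the general pattern of a decorator that logs table statistics before and after a call, here is a minimal sketch. It is not the implementation of `log_molecule_table`: the decorator name `log_table_stats`, the logger name, and the choice of row count as the logged statistic are all assumptions made for this example.

```python
import functools
import logging

import pandas as pd

# Stand-in logger for the example; not the logger from cassiopeia.mixins.
logger = logging.getLogger("example")


def log_table_stats(func):
    """Hypothetical decorator: log row counts of the table before and after."""

    @functools.wraps(func)
    def wrapper(table: pd.DataFrame, *args, **kwargs):
        logger.debug("%s input: %d rows", func.__name__, len(table))
        result = func(table, *args, **kwargs)
        logger.debug("%s output: %d rows", func.__name__, len(result))
        return result

    return wrapper


@log_table_stats
def drop_missing_intbc(table: pd.DataFrame) -> pd.DataFrame:
    """Example function to decorate: drop rows with a missing intBC."""
    return table.dropna(subset=["intBC"])
```

Using `functools.wraps` keeps the decorated function's name and docstring intact, which matters when the log messages include `func.__name__`.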
Side Effects
- Logs debug messages indicating the number of alleles and UMIs removed during the filtering process.
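One plausible way to compute the counts that such a debug message reports is sketched below. The helper `count_removed` is hypothetical, and two assumptions are made here that the description does not confirm: each row of the molecule table is treated as one UMI, and an "allele" is counted as a distinct `cellBC`/`intBC`/`allele` combination.

```python
import logging

import pandas as pd

# Stand-in logger for the example; not the logger from cassiopeia.mixins.
logger = logging.getLogger("example")

KEYS = ["cellBC", "intBC", "allele"]


def count_removed(before: pd.DataFrame, after: pd.DataFrame) -> tuple:
    """Return (alleles_removed, umis_removed) between two molecule tables.

    Assumes one row per UMI and counts an allele as a distinct
    cellBC/intBC/allele combination (both are assumptions of this sketch).
    """
    n_alleles = len(before.drop_duplicates(KEYS)) - len(after.drop_duplicates(KEYS))
    n_umis = len(before) - len(after)
    logger.debug("Removed %d alleles and %d UMIs.", n_alleles, n_umis)
    return n_alleles, n_umis
```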