High-level description
This file provides utility functions for the Cassiopeia-Preprocess pipeline, including filtering cells and UMIs, error correcting integration barcodes, converting allele tables to character matrices and lineage profiles, and computing empirical indel priors. These functions are used to process and prepare sequencing data for phylogenetic inference.
Code Structure
The file defines several functions that operate on Pandas DataFrames representing molecule tables and allele tables. These functions perform filtering, error correction, and conversion operations to prepare the data for downstream analysis. Some functions are decorated with log_molecule_table
and log_runtime
decorators to provide logging and runtime information.
References
This file references symbols from the ngs_tools
library for sequence manipulation and analysis, as well as the cassiopeia.mixins
module for logging and warnings.
Symbols
filter_cells
Description
Filters out cell barcodes that have too few UMIs or too few reads per UMI, based on user-defined thresholds.
Name | Type | Description |
---|
molecule_table | pd.DataFrame | A molecule table of cellBC-UMI pairs to be filtered |
min_umi_per_cell | int | Minimum number of UMIs per cell for cell to not be filtered. Defaults to 10. |
min_avg_reads_per_umi | float | Minimum average reads per UMI in a cell needed in order for that cell not to be filtered. Defaults to 2.0. |
Outputs
Name | Type | Description |
---|
filtered_molecule_table | pd.DataFrame | A filtered molecule table |
Internal Logic
- Groups the molecule table by cell barcode.
- Calculates the number of UMIs and average reads per UMI for each cell.
- Filters out cells that do not meet the minimum thresholds for both UMIs per cell and average reads per UMI.
filter_umis
Description
Filters out UMIs with too few reads, based on a user-defined threshold.
Name | Type | Description |
---|
molecule_table | pd.DataFrame | A molecule table of cellBC-UMI pairs to be filtered |
min_reads_per_umi | int | The minimum read count needed for a UMI to not be filtered. Defaults to 100. |
Outputs
Name | Type | Description |
---|
filtered_molecule_table | pd.DataFrame | A filtered molecule table |
Internal Logic
- Filters the molecule table to keep only UMIs with a read count greater than or equal to the minimum threshold.
error_correct_intbc
Description
Error corrects close integration barcodes (intBCs) with small enough unique UMI counts.
Name | Type | Description |
---|
molecule_table | pd.DataFrame | A molecule table of cellBC-UMI pairs to be filtered |
prop | float | Proportion by which to filter integration barcodes. Defaults to 0.5. |
umi_count_thresh | int | Maximum UMI count for which to correct barcodes. Defaults to 10. |
dist_thresh | int | Barcode distance threshold, to decide what’s similar enough to error correct. Defaults to 1. |
Outputs
Name | Type | Description |
---|
corrected_molecule_table | pd.DataFrame | Filtered molecule table with error corrected intBCs |
Internal Logic
- Groups the molecule table by cellBC, intBC, and allele.
- Iterates through pairs of intBCs within each cellBC-allele group.
- Corrects an intBC to another if:
- They have the same allele.
- The Levenshtein distance between their sequences is less than or equal to
dist_thresh
.
- The number of UMIs of the intBC to be changed is less than or equal to
umi_count_thresh
.
- The proportion of the UMI count in the intBC to be changed out of the total UMI count in both intBCs is less than
prop
.
record_stats
Description
Records the number of UMIs per intBC and per cellBC, as well as the read counts for each alignment.
Name | Type | Description |
---|
molecule_table | pd.DataFrame | A DataFrame of alignments |
Outputs
Name | Type | Description |
---|
read_counts | np.array | Read counts for each alignment |
umis_per_intBC | np.array | Number of unique UMIs per intBC |
umis_per_cellBC | np.array | Number of UMIs per cellBC |
Internal Logic
- Groups the molecule table by cellBC and intBC to count UMIs per intBC.
- Groups the molecule table by cellBC to count UMIs per cellBC.
- Extracts the read counts from the ‘readCount’ column.
convert_bam_to_df
Description
Converts a BAM file to a Pandas DataFrame.
Name | Type | Description |
---|
data_fp | str | The input filepath for the BAM file to be converted |
Outputs
Name | Type | Description |
---|
df | pd.DataFrame | A Pandas dataframe containing the BAM information |
Internal Logic
- Opens the BAM file using
pysam.AlignmentFile
.
- Iterates through each alignment in the BAM file.
- Extracts relevant information from the alignment, including cellBC, UMI, readCount, grpFlag, sequence, quality scores, and read name.
- Creates a Pandas DataFrame from the extracted information.
convert_alleletable_to_character_matrix
Description
Converts an AlleleTable into a character matrix, suitable for input into a CassiopeiaSolver object.
Name | Type | Description |
---|
alleletable | pd.DataFrame | Allele Table to be converted into a character matrix |
ignore_intbcs | List[str] | A set of intBCs to ignore |
allele_rep_thresh | float | A threshold for removing target sites that have an allele represented by this proportion |
missing_data_allele | Optional[str] | Value in the allele table that indicates that the cut-site is missing. This will be converted into missing_data_state |
missing_data_state | int | A state to use for missing data. Defaults to -1. |
mutation_priors | Optional[pd.DataFrame] | A table storing the prior probability of a mutation occurring. This table is used to create a character matrix-specific probability dictionary for reconstruction. |
cut_sites | Optional[List[str]] | Columns in the AlleleTable to treat as cut sites. If None, we assume that the cut-sites are denoted by columns of the form “r” (e.g. “r1”) |
collapse_duplicates | bool | Whether or not to collapse duplicate character states present for a single cellBC-intBC pair. This option has no effect if there are no allele conflicts. Defaults to True. |
Outputs
Name | Type | Description |
---|
character_matrix | pd.DataFrame | A character matrix |
prior_probs | Dict[int, Dict[int, float]] | A probability dictionary |
indel_to_charstate | Dict[int, Dict[int, str]] | A dictionary mapping states to the original mutation |
Internal Logic
- Filters out samples with intBCs in
ignore_intbcs
.
- Iterates through each cellBC and intBC-cut_site combination.
- Assigns a unique integer state to each unique indel observed at a cut site.
- Constructs the character matrix by mapping cellBCs to their corresponding character states.
- Optionally collapses duplicate character states for a single cellBC-intBC pair.
- Optionally calculates prior probabilities for each character state based on the provided
mutation_priors
.
convert_alleletable_to_lineage_profile
Description
Converts an AlleleTable to a lineage profile, which is a pivot table over the cellBC / intBCs.
Name | Type | Description |
---|
allele_table | pd.DataFrame | AlleleTable |
cut_sites | Optional[List[str]] | Columns in the AlleleTable to treat as cut sites. If None, we assume that the cut-sites are denoted by columns of the form “r” (e.g. “r1”) |
collapse_duplicates | bool | Whether or not to collapse duplicate character states present for a single cellBC-intBC pair. This option has no effect if there are no allele conflicts. Defaults to True. |
Outputs
Name | Type | Description |
---|
lineage_profile | pd.DataFrame | An NxM lineage profile |
Internal Logic
- Groups the allele table by cellBC and intBC.
- Creates a multi-index DataFrame with cellBCs as rows and intBC-cut_site combinations as columns.
- Fills the DataFrame with the observed indels for each cellBC-intBC-cut_site combination.
- Optionally collapses duplicate character states for a single cellBC-intBC pair.
- Reorders the columns based on the frequency of intBCs.
convert_lineage_profile_to_character_matrix
Description
Converts a lineage profile to a character matrix, where indels are abstracted into integers.
Name | Type | Description |
---|
lineage_profile | pd.DataFrame | Lineage profile |
indel_priors | Optional[pd.DataFrame] | Dataframe mapping indels to prior probabilities |
missing_allele_indicator | Optional[str] | An allele that is being used to represent missing data |
missing_state_indicator | int | State to indicate missing data. Defaults to -1. |
Outputs
Name | Type | Description |
---|
character_matrix | pd.DataFrame | A character matrix |
prior_probs | Dict[int, Dict[int, float]] | Prior probability dictionary |
indel_to_charstate | Dict[int, Dict[int, str]] | Mapping from character/state pairs to indel identities |
Internal Logic
- Replaces missing values with “Missing” and optionally replaces a specific missing allele indicator with “Missing”.
- Assigns a unique integer state to each unique indel observed at a cut site.
- Constructs the character matrix by mapping cellBCs to their corresponding character states.
- Optionally calculates prior probabilities for each character state based on the provided
indel_priors
.
compute_empirical_indel_priors
Description
Computes indel prior probabilities from the input allele table.
Name | Type | Description |
---|
allele_table | pd.DataFrame | AlleleTable |
grouping_variables | List[str] | Variables to stratify data by, to treat as independent groups in counting indel occurrences. These must be columns in the allele table. Defaults to [“intBC”]. |
cut_sites | Optional[List[str]] | Columns in the AlleleTable to treat as cut sites. If None, we assume that the cut-sites are denoted by columns of the form “r” (e.g. “r1”) |
Outputs
Name | Type | Description |
---|
indel_priors | pd.DataFrame | A DataFrame mapping indel identities to the probability |
Internal Logic
- Groups the allele table by the specified
grouping_variables
.
- Counts the number of times each unique indel occurs in each group.
- Calculates the frequency of each indel by dividing its count by the total number of groups.
get_default_cut_site_columns
Description
Retrieves the default cut-sites columns of an AlleleTable, assuming the AlleleTable was created using the Cassiopeia pipeline.
Name | Type | Description |
---|
allele_table | pd.DataFrame | AlleleTable |
Outputs
Name | Type | Description |
---|
cut_sites | List[str] | Columns in the AlleleTable corresponding to the cut sites |
Internal Logic
- Selects columns from the allele table that match the pattern “r” (e.g., “r1”, “r2”).
Dependencies
Dependency | Purpose |
---|
ngs_tools | Sequence manipulation and analysis |
pandas | Data manipulation and analysis |
numpy | Numerical operations |
pylab | Plotting and visualization |
pysam | Reading and manipulating BAM files |
re | Regular expression operations |
tqdm | Progress bar visualization |
Error Handling
The