utilities.py - Stanza Demo

High-level description

This file provides utility functions for the Cassiopeia-Preprocess pipeline, including filtering cells and UMIs, error correcting integration barcodes, converting allele tables to character matrices and lineage profiles, and computing empirical indel priors. These functions are used to process and prepare sequencing data for phylogenetic inference.

Code Structure

The file defines several functions that operate on Pandas DataFrames representing molecule tables and allele tables. These functions perform filtering, error correction, and conversion operations to prepare the data for downstream analysis. Some functions carry the log_molecule_table and log_runtime decorators, which provide logging and runtime information.

References

This file references symbols from the ngs_tools library for sequence manipulation and analysis, as well as the cassiopeia.mixins module for logging and warnings.

Symbols

`filter_cells`

Description

Filters out cell barcodes that have too few UMIs or too few reads per UMI, based on user-defined thresholds.

Inputs

Name	Type	Description
molecule_table	pd.DataFrame	A molecule table of cellBC-UMI pairs to be filtered
min_umi_per_cell	int	Minimum number of UMIs per cell for cell to not be filtered. Defaults to 10.
min_avg_reads_per_umi	float	Minimum average reads per UMI in a cell needed in order for that cell not to be filtered. Defaults to 2.0.

Outputs

Name	Type	Description
filtered_molecule_table	pd.DataFrame	A filtered molecule table

Internal Logic

Groups the molecule table by cell barcode.
Calculates the number of UMIs and average reads per UMI for each cell.
Filters out cells that fail either minimum threshold (UMIs per cell or average reads per UMI), so only cells meeting both are kept.
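The grouping and thresholding steps above can be sketched with pandas. This is a minimal illustration rather than the Cassiopeia implementation: the function name `filter_cells_sketch` is hypothetical, the table is assumed to have `cellBC`, `UMI`, and `readCount` columns with one row per cellBC-UMI pair, and the use of `>=` for both comparisons is an assumption.

```python
import pandas as pd


def filter_cells_sketch(
    molecule_table: pd.DataFrame,
    min_umi_per_cell: int = 10,
    min_avg_reads_per_umi: float = 2.0,
) -> pd.DataFrame:
    """Keep only cells that pass both thresholds (illustrative sketch)."""
    # Assumed columns: cellBC, UMI, readCount (one row per cellBC-UMI pair).
    per_cell = molecule_table.groupby("cellBC").agg(
        n_umis=("UMI", "nunique"),
        total_reads=("readCount", "sum"),
    )
    per_cell["avg_reads_per_umi"] = per_cell["total_reads"] / per_cell["n_umis"]

    # A cell survives only if it meets both minimums.
    passing = per_cell[
        (per_cell["n_umis"] >= min_umi_per_cell)
        & (per_cell["avg_reads_per_umi"] >= min_avg_reads_per_umi)
    ].index
    return molecule_table[molecule_table["cellBC"].isin(passing)]


toy = pd.DataFrame(
    {
        "cellBC": ["A", "A", "A", "B", "B", "C"],
        "UMI": ["u1", "u2", "u3", "u1", "u2", "u1"],
        "readCount": [5, 4, 3, 1, 1, 20],
    }
)
kept = filter_cells_sketch(toy, min_umi_per_cell=2, min_avg_reads_per_umi=2.0)
```

In this toy table, cell A has three UMIs averaging four reads each and is kept; cell B fails the average-reads threshold and cell C fails the UMI-count threshold.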

`filter_umis`

Description

Filters out UMIs with too few reads, based on a user-defined threshold.

Inputs

Name	Type	Description
molecule_table	pd.DataFrame	A molecule table of cellBC-UMI pairs to be filtered
min_reads_per_umi	int	The minimum read count needed for a UMI to not be filtered. Defaults to 100.

Outputs

Name	Type	Description
filtered_molecule_table	pd.DataFrame	A filtered molecule table

Internal Logic

Filters the molecule table to keep only UMIs with a read count greater than or equal to the minimum threshold.
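As a minimal sketch of this filter, assuming the read count of each UMI is stored in a `readCount` column (the function name `filter_umis_sketch` is hypothetical):

```python
import pandas as pd


def filter_umis_sketch(
    molecule_table: pd.DataFrame, min_reads_per_umi: int = 100
) -> pd.DataFrame:
    """Keep UMIs whose read count is at least the threshold (illustrative sketch)."""
    # Assumed column: readCount holds the number of reads supporting each UMI.
    return molecule_table[molecule_table["readCount"] >= min_reads_per_umi]


toy = pd.DataFrame({"UMI": ["u1", "u2", "u3"], "readCount": [150, 100, 99]})
kept = filter_umis_sketch(toy)
```

With the default threshold, the UMI with 99 reads is dropped while the one with exactly 100 reads is kept.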

`error_correct_intbc`

Description

Error-corrects similar integration barcodes (intBCs) that have sufficiently small unique UMI counts.

Inputs

Name	Type	Description
molecule_table	pd.DataFrame	A molecule table of cellBC-UMI pairs to be filtered
prop	float	Proportion by which to filter integration barcodes. Defaults to 0.5.
umi_count_thresh	int	Maximum UMI count for which to correct barcodes. Defaults to 10.
dist_thresh	int	Barcode distance threshold, to decide what’s similar enough to error correct. Defaults to 1.

Outputs

Name	Type	Description
corrected_molecule_table	pd.DataFrame	Filtered molecule table with error corrected intBCs

Internal Logic

Groups the molecule table by cellBC, intBC, and allele.
Iterates through pairs of intBCs within each cellBC-allele group.
Corrects an intBC to another if:
- They have the same allele.
- The Levenshtein distance between their sequences is less than or equal to dist_thresh.
- The number of UMIs of the intBC to be changed is less than or equal to umi_count_thresh.
- The proportion of the UMI count in the intBC to be changed out of the total UMI count in both intBCs is less than prop.
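The four conditions above can be sketched in plain Python for a single cell. This is an illustration under stated assumptions, not the library's code: the helper names are hypothetical, the input is assumed to be a mapping from (intBC, allele) to UMI count, and pairs are visited from the highest count downward so that the lower-count barcode is always the one considered for correction.

```python
from itertools import combinations
from typing import Dict, Tuple


def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            current.append(
                min(
                    previous[j] + 1,  # deletion
                    current[j - 1] + 1,  # insertion
                    previous[j - 1] + (char_a != char_b),  # substitution
                )
            )
        previous = current
    return previous[-1]


def correct_intbcs_sketch(
    umi_counts: Dict[Tuple[str, str], int],
    prop: float = 0.5,
    umi_count_thresh: int = 10,
    dist_thresh: int = 1,
) -> Dict[Tuple[str, str], str]:
    """Map each corrected (intBC, allele) key to the intBC it is merged into."""
    corrections: Dict[Tuple[str, str], str] = {}
    # Sort by descending UMI count so the second item of a pair is the candidate.
    ranked = sorted(umi_counts.items(), key=lambda item: -item[1])
    for (key1, count1), (key2, count2) in combinations(ranked, 2):
        if key1 in corrections or key2 in corrections:
            continue  # already merged into another barcode
        (bc1, allele1), (bc2, allele2) = key1, key2
        if (
            allele1 == allele2
            and levenshtein(bc1, bc2) <= dist_thresh
            and count2 <= umi_count_thresh
            and count2 / (count1 + count2) < prop
        ):
            corrections[key2] = bc1
    return corrections


toy_counts = {
    ("ACGT", "a1"): 20,
    ("ACGA", "a1"): 3,
    ("ACGG", "a2"): 2,
    ("TTTT", "a1"): 4,
}
corrections = correct_intbcs_sketch(toy_counts)
```

Here only `ACGA` is corrected to `ACGT`: `ACGG` carries a different allele, and `TTTT` is too far from every other barcode.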

`record_stats`

Description

Records the number of UMIs per intBC and per cellBC, as well as the read counts for each alignment.

Inputs

Name	Type	Description
molecule_table	pd.DataFrame	A DataFrame of alignments

Outputs

Name	Type	Description
read_counts	np.array	Read counts for each alignment
umis_per_intBC	np.array	Number of unique UMIs per intBC
umis_per_cellBC	np.array	Number of UMIs per cellBC

Internal Logic

Groups the molecule table by cellBC and intBC to count UMIs per intBC.
Groups the molecule table by cellBC to count UMIs per cellBC.
Extracts the read counts from the ‘readCount’ column.
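A rough pandas equivalent of these three steps is shown below. It assumes `cellBC`, `intBC`, `UMI`, and `readCount` columns and counts distinct UMIs in both groupings; the function name `record_stats_sketch` is hypothetical.

```python
import pandas as pd


def record_stats_sketch(molecule_table: pd.DataFrame):
    """Return (read_counts, umis_per_intBC, umis_per_cellBC) as NumPy arrays."""
    # Distinct UMIs for every (cellBC, intBC) pair.
    umis_per_intbc = (
        molecule_table.groupby(["cellBC", "intBC"])["UMI"].nunique().to_numpy()
    )
    # Distinct UMIs for every cellBC.
    umis_per_cellbc = molecule_table.groupby("cellBC")["UMI"].nunique().to_numpy()
    # Read counts, one entry per row of the table.
    read_counts = molecule_table["readCount"].to_numpy()
    return read_counts, umis_per_intbc, umis_per_cellbc


toy = pd.DataFrame(
    {
        "cellBC": ["A", "A", "A", "B"],
        "intBC": ["i1", "i1", "i2", "i1"],
        "UMI": ["u1", "u2", "u3", "u1"],
        "readCount": [5, 3, 2, 7],
    }
)
read_counts, umis_per_intbc, umis_per_cellbc = record_stats_sketch(toy)
```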

`convert_bam_to_df`

Description

Converts a BAM file to a Pandas DataFrame.

Inputs

Name	Type	Description
data_fp	str	The input filepath for the BAM file to be converted

Outputs

Name	Type	Description
df	pd.DataFrame	A Pandas dataframe containing the BAM information

Internal Logic

Opens the BAM file using pysam.AlignmentFile.
Iterates through each alignment in the BAM file.
Extracts relevant information from the alignment, including cellBC, UMI, readCount, grpFlag, sequence, quality scores, and read name.
Creates a Pandas DataFrame from the extracted information.

`convert_alleletable_to_character_matrix`

Description

Converts an AlleleTable into a character matrix, suitable for input into a CassiopeiaSolver object.

Inputs

Name	Type	Description
alleletable	pd.DataFrame	Allele Table to be converted into a character matrix
ignore_intbcs	List[str]	A list of intBCs to ignore
allele_rep_thresh	float	A threshold for removing target sites that have an allele represented by this proportion
missing_data_allele	Optional[str]	Value in the allele table that indicates that the cut-site is missing. This will be converted into `missing_data_state`
missing_data_state	int	A state to use for missing data. Defaults to -1.
mutation_priors	Optional[pd.DataFrame]	A table storing the prior probability of a mutation occurring. This table is used to create a character matrix-specific probability dictionary for reconstruction.
cut_sites	Optional[List[str]]	Columns in the AlleleTable to treat as cut sites. If None, we assume that the cut-sites are denoted by columns of the form “r” followed by an integer (e.g. “r1”)
collapse_duplicates	bool	Whether or not to collapse duplicate character states present for a single cellBC-intBC pair. This option has no effect if there are no allele conflicts. Defaults to True.

Outputs

Name	Type	Description
character_matrix	pd.DataFrame	A character matrix
prior_probs	Dict[int, Dict[int, float]]	A probability dictionary
indel_to_charstate	Dict[int, Dict[int, str]]	A dictionary mapping states to the original mutation

Internal Logic

Filters out samples with intBCs in ignore_intbcs.
Iterates through each cellBC and intBC-cut_site combination.
Assigns a unique integer state to each unique indel observed at a cut site.
Constructs the character matrix by mapping cellBCs to their corresponding character states.
Optionally collapses duplicate character states for a single cellBC-intBC pair.
Optionally calculates prior probabilities for each character state based on the provided mutation_priors.

`convert_alleletable_to_lineage_profile`

Description

Converts an AlleleTable to a lineage profile, which is a pivot table over cellBCs and intBCs.

Inputs

Name	Type	Description
allele_table	pd.DataFrame	AlleleTable
cut_sites	Optional[List[str]]	Columns in the AlleleTable to treat as cut sites. If None, we assume that the cut-sites are denoted by columns of the form “r” followed by an integer (e.g. “r1”)
collapse_duplicates	bool	Whether or not to collapse duplicate character states present for a single cellBC-intBC pair. This option has no effect if there are no allele conflicts. Defaults to True.

Outputs

Name	Type	Description
lineage_profile	pd.DataFrame	An NxM lineage profile

Internal Logic

Groups the allele table by cellBC and intBC.
Creates a multi-index DataFrame with cellBCs as rows and intBC-cut_site combinations as columns.
Fills the DataFrame with the observed indels for each cellBC-intBC-cut_site combination.
Optionally collapses duplicate character states for a single cellBC-intBC pair.
Reorders the columns based on the frequency of intBCs.
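The pivot described above can be approximated with a groupby followed by an unstack. The sketch below is a simplification under stated assumptions: the function name is hypothetical, duplicates are resolved by taking the first allele seen for each cellBC-intBC pair, intBC frequency is measured as the number of allele-table rows carrying that intBC, and columns are labelled `<intBC>_<cut_site>`.

```python
import pandas as pd


def lineage_profile_sketch(allele_table: pd.DataFrame, cut_sites: list) -> pd.DataFrame:
    """Pivot an allele table into a cell-by-(intBC, cut site) table."""
    # One row per (cellBC, intBC); keep the first allele observed at each cut site.
    grouped = allele_table.groupby(["cellBC", "intBC"])[cut_sites].first()
    # Move intBC into the columns: rows are cells, columns are (cut_site, intBC).
    profile = grouped.unstack("intBC")
    # Order intBCs from most to least frequent in the allele table.
    intbc_order = allele_table["intBC"].value_counts().index
    ordered = [(site, intbc) for intbc in intbc_order for site in cut_sites]
    profile = profile[ordered]
    # Flatten the two-level column index into single labels.
    profile.columns = [f"{intbc}_{site}" for site, intbc in profile.columns]
    return profile


toy = pd.DataFrame(
    {
        "cellBC": ["c1", "c1", "c2", "c3"],
        "intBC": ["X", "Y", "X", "X"],
        "r1": ["1:2D", "None", "1:2D", "None"],
        "r2": ["None", "5:1I", "None", "None"],
    }
)
profile = lineage_profile_sketch(toy, ["r1", "r2"])
```

Cells that lack a given intBC end up with missing values in that intBC's columns, which is why a later conversion step needs a missing-data state.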

`convert_lineage_profile_to_character_matrix`

Description

Converts a lineage profile to a character matrix, where indels are abstracted into integers.

Inputs

Name	Type	Description
lineage_profile	pd.DataFrame	Lineage profile
indel_priors	Optional[pd.DataFrame]	Dataframe mapping indels to prior probabilities
missing_allele_indicator	Optional[str]	An allele that is being used to represent missing data
missing_state_indicator	int	State to indicate missing data. Defaults to -1.

Outputs

Name	Type	Description
character_matrix	pd.DataFrame	A character matrix
prior_probs	Dict[int, Dict[int, float]]	Prior probability dictionary
indel_to_charstate	Dict[int, Dict[int, str]]	Mapping from character/state pairs to indel identities

Internal Logic

Replaces missing values with “Missing” and optionally replaces a specific missing allele indicator with “Missing”.
Assigns a unique integer state to each unique indel observed at a cut site.
Constructs the character matrix by mapping cellBCs to their corresponding character states.
Optionally calculates prior probabilities for each character state based on the provided indel_priors.
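To make the encoding concrete, here is a small sketch of the steps above. Several details are assumptions made for illustration: the function name is hypothetical, uncut sites (alleles containing "none") are given state 0, edited states are numbered from 1 in order of first appearance within each column, and `indel_priors` is assumed to be indexed by indel with a `freq` column.

```python
import pandas as pd


def encode_profile_sketch(lineage_profile, indel_priors=None, missing_state_indicator=-1):
    """Abstract indel strings into per-character integer states."""
    filled = lineage_profile.fillna("Missing")
    columns, prior_probs, indel_to_charstate = {}, {}, {}
    for char_index, column in enumerate(filled.columns):
        state_of = {}  # indel string -> integer state for this character
        encoded = []
        for indel in filled[column]:
            if indel == "Missing":
                encoded.append(missing_state_indicator)
            elif "none" in indel.lower():
                encoded.append(0)  # assumed convention: uncut site
            else:
                if indel not in state_of:
                    state_of[indel] = len(state_of) + 1
                encoded.append(state_of[indel])
        columns[column] = encoded
        indel_to_charstate[char_index] = {s: indel for indel, s in state_of.items()}
        if indel_priors is not None:
            # Look up the prior of every indel that received a state.
            prior_probs[char_index] = {
                s: float(indel_priors.loc[indel, "freq"])
                for indel, s in state_of.items()
            }
    character_matrix = pd.DataFrame(columns, index=filled.index)
    return character_matrix, prior_probs, indel_to_charstate


toy_profile = pd.DataFrame(
    {"X_r1": ["1:2D", "1:2D", "None"], "X_r2": ["None", "7:1I", None]},
    index=["c1", "c2", "c3"],
)
toy_priors = pd.DataFrame({"freq": [0.6, 0.2]}, index=["1:2D", "7:1I"])
character_matrix, prior_probs, indel_to_charstate = encode_profile_sketch(
    toy_profile, toy_priors
)
```

States are local to each character, so state 1 in `X_r1` and state 1 in `X_r2` refer to different indels; `indel_to_charstate` records which.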

`compute_empirical_indel_priors`

Description

Computes indel prior probabilities from the input allele table.

Inputs

Name	Type	Description
allele_table	pd.DataFrame	AlleleTable
grouping_variables	List[str]	Variables to stratify data by, to treat as independent groups in counting indel occurrences. These must be columns in the allele table. Defaults to [“intBC”].
cut_sites	Optional[List[str]]	Columns in the AlleleTable to treat as cut sites. If None, we assume that the cut-sites are denoted by columns of the form “r” followed by an integer (e.g. “r1”)

Outputs

Name	Type	Description
indel_priors	pd.DataFrame	A DataFrame mapping indel identities to the probability

Internal Logic

Groups the allele table by the specified grouping_variables.
Counts, for each unique indel, the number of groups in which it occurs.
Calculates the frequency of each indel by dividing that count by the total number of groups.
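One reading of these steps, in which each indel is counted at most once per group, can be sketched as follows. The function name is hypothetical, and both the exclusion of uncut alleles (those containing "none") and the `count` and `freq` column names are assumptions; the sketch also assumes the cut-site columns contain only strings.

```python
from collections import Counter

import pandas as pd


def empirical_indel_priors_sketch(
    allele_table: pd.DataFrame,
    grouping_variables=("intBC",),
    cut_sites=("r1", "r2"),
) -> pd.DataFrame:
    """Estimate indel priors as the fraction of groups containing each indel."""
    groups = allele_table.groupby(list(grouping_variables))
    counts = Counter()
    for _, group in groups:
        # Collect the distinct alleles seen anywhere in this group.
        observed = set(group[list(cut_sites)].to_numpy().ravel())
        # Each indel adds one to its count per group; uncut sites are skipped.
        counts.update(indel for indel in observed if "none" not in indel.lower())
    priors = pd.DataFrame({"count": pd.Series(dict(counts), dtype=int)})
    priors["freq"] = priors["count"] / groups.ngroups
    return priors


toy = pd.DataFrame(
    {
        "intBC": ["X", "X", "Y", "Z"],
        "r1": ["1:2D", "1:2D", "1:2D", "None"],
        "r2": ["None", "7:1I", "None", "None"],
    }
)
priors = empirical_indel_priors_sketch(toy)
```

In the toy table, `1:2D` appears in two of three intBC groups and `7:1I` in one, giving frequencies of 2/3 and 1/3. Treating each intBC as an independent group means an indel seen many times within one intBC still contributes only once.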

`get_default_cut_site_columns`

Description

Retrieves the default cut-sites columns of an AlleleTable, assuming the AlleleTable was created using the Cassiopeia pipeline.

Inputs

Name	Type	Description
allele_table	pd.DataFrame	AlleleTable

Outputs

Name	Type	Description
cut_sites	List[str]	Columns in the AlleleTable corresponding to the cut sites

Internal Logic

Selects columns from the allele table whose names match the pattern “r” followed by an integer (e.g., “r1”, “r2”).
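A minimal sketch of this column selection using the standard `re` module is shown below. The exact pattern is an assumption: this version requires the whole column name to be "r" followed by digits, which may be stricter than the check the library performs, and the function name is hypothetical.

```python
import re
from typing import Iterable, List


def default_cut_site_columns_sketch(columns: Iterable[str]) -> List[str]:
    """Return column names of the form r<int>, preserving their order."""
    return [column for column in columns if re.fullmatch(r"r\d+", column)]


cut_sites = default_cut_site_columns_sketch(
    ["cellBC", "intBC", "allele", "r1", "r2", "r10", "readCount"]
)
```

Using a full match keeps columns such as `readCount` out of the result even though they begin with the letter "r".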

Dependencies

Dependency	Purpose
ngs_tools	Sequence manipulation and analysis
pandas	Data manipulation and analysis
numpy	Numerical operations
pylab	Plotting and visualization
pysam	Reading and manipulating BAM files
re	Regular expression operations
tqdm	Progress bar visualization

Error Handling

The
