High-level description

This file provides utility functions for the Cassiopeia-Preprocess pipeline, including filtering cells and UMIs, error correcting integration barcodes, converting allele tables to character matrices and lineage profiles, and computing empirical indel priors. These functions are used to process and prepare sequencing data for phylogenetic inference.

Code Structure

The file defines several functions that operate on Pandas DataFrames representing molecule tables and allele tables. These functions perform filtering, error correction, and conversion operations to prepare the data for downstream analysis. Some functions are decorated with log_molecule_table and log_runtime decorators to provide logging and runtime information.

References

This file references symbols from the ngs_tools library for sequence manipulation and analysis, as well as the cassiopeia.mixins module for logging and warnings.

Symbols

filter_cells

Description

Filters out cell barcodes that have too few UMIs or too few reads per UMI, based on user-defined thresholds.

Inputs

NameTypeDescription
molecule_tablepd.DataFrameA molecule table of cellBC-UMI pairs to be filtered
min_umi_per_cellintMinimum number of UMIs per cell for cell to not be filtered. Defaults to 10.
min_avg_reads_per_umifloatMinimum average reads per UMI in a cell needed in order for that cell not to be filtered. Defaults to 2.0.

Outputs

NameTypeDescription
filtered_molecule_tablepd.DataFrameA filtered molecule table

Internal Logic

  1. Groups the molecule table by cell barcode.
  2. Calculates the number of UMIs and average reads per UMI for each cell.
  3. Filters out cells that do not meet the minimum thresholds for both UMIs per cell and average reads per UMI.

filter_umis

Description

Filters out UMIs with too few reads, based on a user-defined threshold.

Inputs

NameTypeDescription
molecule_tablepd.DataFrameA molecule table of cellBC-UMI pairs to be filtered
min_reads_per_umiintThe minimum read count needed for a UMI to not be filtered. Defaults to 100.

Outputs

NameTypeDescription
filtered_molecule_tablepd.DataFrameA filtered molecule table

Internal Logic

  1. Filters the molecule table to keep only UMIs with a read count greater than or equal to the minimum threshold.

error_correct_intbc

Description

Error corrects close integration barcodes (intBCs) with small enough unique UMI counts.

Inputs

NameTypeDescription
molecule_tablepd.DataFrameA molecule table of cellBC-UMI pairs to be filtered
propfloatProportion by which to filter integration barcodes. Defaults to 0.5.
umi_count_threshintMaximum UMI count for which to correct barcodes. Defaults to 10.
dist_threshintBarcode distance threshold, to decide what’s similar enough to error correct. Defaults to 1.

Outputs

NameTypeDescription
corrected_molecule_tablepd.DataFrameFiltered molecule table with error corrected intBCs

Internal Logic

  1. Groups the molecule table by cellBC, intBC, and allele.
  2. Iterates through pairs of intBCs within each cellBC-allele group.
  3. Corrects an intBC to another if:
    • They have the same allele.
    • The Levenshtein distance between their sequences is less than or equal to dist_thresh.
    • The number of UMIs of the intBC to be changed is less than or equal to umi_count_thresh.
    • The proportion of the UMI count in the intBC to be changed out of the total UMI count in both intBCs is less than prop.

record_stats

Description

Records the number of UMIs per intBC and per cellBC, as well as the read counts for each alignment.

Inputs

NameTypeDescription
molecule_tablepd.DataFrameA DataFrame of alignments

Outputs

NameTypeDescription
read_countsnp.arrayRead counts for each alignment
umis_per_intBCnp.arrayNumber of unique UMIs per intBC
umis_per_cellBCnp.arrayNumber of UMIs per cellBC

Internal Logic

  1. Groups the molecule table by cellBC and intBC to count UMIs per intBC.
  2. Groups the molecule table by cellBC to count UMIs per cellBC.
  3. Extracts the read counts from the ‘readCount’ column.

convert_bam_to_df

Description

Converts a BAM file to a Pandas DataFrame.

Inputs

NameTypeDescription
data_fpstrThe input filepath for the BAM file to be converted

Outputs

NameTypeDescription
dfpd.DataFrameA Pandas dataframe containing the BAM information

Internal Logic

  1. Opens the BAM file using pysam.AlignmentFile.
  2. Iterates through each alignment in the BAM file.
  3. Extracts relevant information from the alignment, including cellBC, UMI, readCount, grpFlag, sequence, quality scores, and read name.
  4. Creates a Pandas DataFrame from the extracted information.

convert_alleletable_to_character_matrix

Description

Converts an AlleleTable into a character matrix, suitable for input into a CassiopeiaSolver object.

Inputs

NameTypeDescription
alleletablepd.DataFrameAllele Table to be converted into a character matrix
ignore_intbcsList[str]A set of intBCs to ignore
allele_rep_threshfloatA threshold for removing target sites that have an allele represented by this proportion
missing_data_alleleOptional[str]Value in the allele table that indicates that the cut-site is missing. This will be converted into missing_data_state
missing_data_stateintA state to use for missing data. Defaults to -1.
mutation_priorsOptional[pd.DataFrame]A table storing the prior probability of a mutation occurring. This table is used to create a character matrix-specific probability dictionary for reconstruction.
cut_sitesOptional[List[str]]Columns in the AlleleTable to treat as cut sites. If None, we assume that the cut-sites are denoted by columns of the form “r” (e.g. “r1”)
collapse_duplicatesboolWhether or not to collapse duplicate character states present for a single cellBC-intBC pair. This option has no effect if there are no allele conflicts. Defaults to True.

Outputs

NameTypeDescription
character_matrixpd.DataFrameA character matrix
prior_probsDict[int, Dict[int, float]]A probability dictionary
indel_to_charstateDict[int, Dict[int, str]]A dictionary mapping states to the original mutation

Internal Logic

  1. Filters out samples with intBCs in ignore_intbcs.
  2. Iterates through each cellBC and intBC-cut_site combination.
  3. Assigns a unique integer state to each unique indel observed at a cut site.
  4. Constructs the character matrix by mapping cellBCs to their corresponding character states.
  5. Optionally collapses duplicate character states for a single cellBC-intBC pair.
  6. Optionally calculates prior probabilities for each character state based on the provided mutation_priors.

convert_alleletable_to_lineage_profile

Description

Converts an AlleleTable to a lineage profile, which is a pivot table over the cellBC / intBCs.

Inputs

NameTypeDescription
allele_tablepd.DataFrameAlleleTable
cut_sitesOptional[List[str]]Columns in the AlleleTable to treat as cut sites. If None, we assume that the cut-sites are denoted by columns of the form “r” (e.g. “r1”)
collapse_duplicatesboolWhether or not to collapse duplicate character states present for a single cellBC-intBC pair. This option has no effect if there are no allele conflicts. Defaults to True.

Outputs

NameTypeDescription
lineage_profilepd.DataFrameAn NxM lineage profile

Internal Logic

  1. Groups the allele table by cellBC and intBC.
  2. Creates a multi-index DataFrame with cellBCs as rows and intBC-cut_site combinations as columns.
  3. Fills the DataFrame with the observed indels for each cellBC-intBC-cut_site combination.
  4. Optionally collapses duplicate character states for a single cellBC-intBC pair.
  5. Reorders the columns based on the frequency of intBCs.

convert_lineage_profile_to_character_matrix

Description

Converts a lineage profile to a character matrix, where indels are abstracted into integers.

Inputs

NameTypeDescription
lineage_profilepd.DataFrameLineage profile
indel_priorsOptional[pd.DataFrame]Dataframe mapping indels to prior probabilities
missing_allele_indicatorOptional[str]An allele that is being used to represent missing data
missing_state_indicatorintState to indicate missing data. Defaults to -1.

Outputs

NameTypeDescription
character_matrixpd.DataFrameA character matrix
prior_probsDict[int, Dict[int, float]]Prior probability dictionary
indel_to_charstateDict[int, Dict[int, str]]Mapping from character/state pairs to indel identities

Internal Logic

  1. Replaces missing values with “Missing” and optionally replaces a specific missing allele indicator with “Missing”.
  2. Assigns a unique integer state to each unique indel observed at a cut site.
  3. Constructs the character matrix by mapping cellBCs to their corresponding character states.
  4. Optionally calculates prior probabilities for each character state based on the provided indel_priors.

compute_empirical_indel_priors

Description

Computes indel prior probabilities from the input allele table.

Inputs

NameTypeDescription
allele_tablepd.DataFrameAlleleTable
grouping_variablesList[str]Variables to stratify data by, to treat as independent groups in counting indel occurrences. These must be columns in the allele table. Defaults to [“intBC”].
cut_sitesOptional[List[str]]Columns in the AlleleTable to treat as cut sites. If None, we assume that the cut-sites are denoted by columns of the form “r” (e.g. “r1”)

Outputs

NameTypeDescription
indel_priorspd.DataFrameA DataFrame mapping indel identities to the probability

Internal Logic

  1. Groups the allele table by the specified grouping_variables.
  2. Counts the number of times each unique indel occurs in each group.
  3. Calculates the frequency of each indel by dividing its count by the total number of groups.

get_default_cut_site_columns

Description

Retrieves the default cut-sites columns of an AlleleTable, assuming the AlleleTable was created using the Cassiopeia pipeline.

Inputs

NameTypeDescription
allele_tablepd.DataFrameAlleleTable

Outputs

NameTypeDescription
cut_sitesList[str]Columns in the AlleleTable corresponding to the cut sites

Internal Logic

  1. Selects columns from the allele table that match the pattern “r” (e.g., “r1”, “r2”).

Dependencies

DependencyPurpose
ngs_toolsSequence manipulation and analysis
pandasData manipulation and analysis
numpyNumerical operations
pylabPlotting and visualization
pysamReading and manipulating BAM files
reRegular expression operations
tqdmProgress bar visualization

Error Handling

The