High-level description

This file contains functions for preprocessing sequencing data into a character matrix, which is suitable for phylogenetic inference. It provides a series of functions that perform various steps of the preprocessing pipeline, including converting FASTQ files to BAM, filtering reads, collapsing UMIs, aligning sequences, calling alleles, error correcting barcodes and UMIs, filtering molecule tables, and calling lineage groups.

Code Structure

The code defines several functions that are used sequentially in a preprocessing pipeline. The functions are designed to be modular and can be invoked individually or as part of a larger pipeline. The main functions include convert_fastqs_to_unmapped_bam, filter_bam, error_correct_cellbcs_to_whitelist, collapse_umis, resolve_umi_sequence, align_sequences, call_alleles, error_correct_intbcs_to_whitelist, error_correct_umis, filter_molecule_table, and call_lineage_groups. Each function performs a specific step in the preprocessing pipeline and takes a DataFrame or a file path as input and returns a processed DataFrame or a file path.

References

This file references constants defined in cassiopeia.preprocess.constants, utility functions from cassiopeia.preprocess.utilities, and alignment utilities from cassiopeia.preprocess.alignment_utilities. It also uses functions from external libraries such as ngs_tools, pysam, and pandas.

Symbols

convert_fastqs_to_unmapped_bam

Description

Converts FASTQ files to an unmapped BAM file based on the specified chemistry. It adds appropriate BAM tags based on the chemistry used.

Inputs

NameTypeDescription
fastq_fpsList[str]List of paths to FASTQ files.
chemistryLiteral[“dropseq”, “10xv2”, “10xv3”, “indropsv3”, “slideseq2”]Sample-prep/sequencing chemistry used.
output_directorystrThe output directory where the unmapped BAM will be written.
nameOptional[str]Name of the reads in the FASTQs.
n_threadsintNumber of threads to use. Defaults to 1.

Outputs

NameTypeDescription
bam_fpstrPath to the written BAM file.

Internal Logic

The function first checks if the provided chemistry is supported. It then retrieves the appropriate BAM tags for the chemistry and converts the FASTQ files to an unmapped BAM file using the ngs.fastq.fastqs_to_bam_with_chemistry function from the ngs_tools library.

filter_bam

Description

Filters reads in a BAM file that have low quality barcode or UMIs.

Inputs

NameTypeDescription
bam_fpstrInput BAM filepath containing reads to filter.
output_directorystrThe output directory where the filtered BAM will be written.
quality_thresholdintEvery base of the barcode and UMI sequence for a given read must have at least this PHRED quality score to pass filtering. Defaults to 10.
n_threadsintNumber of threads to use. Defaults to 1.

Outputs

NameTypeDescription
filtered_fpstrPath to the filtered BAM file.

Internal Logic

The function defines a filter function that checks the quality scores of the barcode and UMI sequences. It then uses the ngs.bam.filter_bam function from the ngs_tools library to filter the input BAM file based on the filter function.

error_correct_cellbcs_to_whitelist

Description

Error-corrects cell barcodes in the input BAM file to a provided whitelist.

Inputs

NameTypeDescription
bam_fpstrInput BAM filepath containing raw barcodes.
whitelistUnion[str, List[str]]Either a path to a plaintext file containing the barcode whitelist or a list of barcodes.
output_directorystrThe output directory where the corrected BAM will be written.
n_threadsintNumber of threads to use. Defaults to 1.

Outputs

NameTypeDescription
corrected_fpstrPath to the corrected BAM file.

Internal Logic

The function first reads the whitelist from the provided file or list. It then extracts the raw barcodes and their qualities from the input BAM file. It uses the ngs.sequence.correct_sequences_to_whitelist function from the ngs_tools library to correct the barcodes to the whitelist. Finally, it writes the corrected barcodes to a new BAM file.

collapse_umis

Description

Collapses close UMIs together from a BAM file.

Inputs

NameTypeDescription
bam_fpstrFile path of the BAM file.
output_directorystrThe output directory where the sorted BAM directory, the collapsed BAM directory, and the final collapsed table are written.
max_hq_mismatchesintA threshold specifying the maximum number of high quality mismatches between the sequences of two aligned segments to be collapsed. Defaults to 3.
max_indelsintA threshold specifying the maximum number of differing indels allowed between the sequences of two aligned segments to be collapsed. Defaults to 2.
methodLiteral[“cutoff”, “likelihood”]Which method to use to form initial sequence clusters. Defaults to “cutoff”.
n_threadsintNumber of threads to use. Defaults to 1.

Outputs

NameTypeDescription
dfpd.DataFrameA DataFrame of collapsed reads.

Internal Logic

The function first sorts the input BAM file by cell barcode and UMI using the sort_bam function. It then forms collapsed clusters of UMIs using the form_collapsed_clusters function. Finally, it converts the collapsed BAM file to a DataFrame using the utilities.convert_bam_to_df function.

resolve_umi_sequence

Description

Resolves a consensus sequence for each UMI.

Inputs

NameTypeDescription
molecule_tablepd.DataFrameMolecule table to resolve.
output_directorystrDirectory to store results.
min_umi_per_cellintThe threshold specifying the minimum number of UMIs in a cell needed to be retained during filtering. Defaults to 10.
min_avg_reads_per_umifloatThe threshold specifying the minimum coverage (i.e. average) reads per UMI in a cell needed for that cell to be retained during filtering. Defaults to 2.0.
plotboolWhether to plot the number of sequences per UMI and diagnostics after resolving. Defaults to True.

Outputs

NameTypeDescription
filt_molecule_tablepd.DataFrameA molecule table with unique mappings between cellBC-UMI pairs.

Internal Logic

The function first groups the molecule table by cell barcode and UMI. For each group, it selects the most abundant sequence as the consensus sequence and marks the remaining sequences for filtering. It then filters out the marked sequences and applies cell filtering based on the minimum number of UMIs per cell and the minimum average reads per UMI.

align_sequences

Description

Aligns reads to the TargetSite reference.

Inputs

NameTypeDescription
queriespd.DataFrameDataFrame storing a list of sequences to align.
ref_filepathOptional[str]Filepath to the reference FASTA.
refOptional[str]Reference sequence.
gap_open_penaltyfloatGap open penalty. Defaults to 20.
gap_extend_penaltyfloatGap extension penalty. Defaults to 1.
methodLiteral[“local”, “global”]What alignment algorithm to use. Defaults to “local”.
n_threadsintNumber of threads to use. Defaults to 1.

Outputs

NameTypeDescription
alignment_dfpd.DataFrameA DataFrame mapping each sequence name to the CIGAR string, quality, and original query sequence.

Internal Logic

The function first reads the reference sequence from the provided file or string. It then aligns all unique sequences to the reference using either local or global alignment, based on the method argument. The alignments are performed using the alignment_utilities.align_local or alignment_utilities.align_global functions. Finally, the alignments are merged into the input DataFrame.

call_alleles

Description

Calls indels from CIGAR strings.

Inputs

NameTypeDescription
alignmentspd.DataFrameAlignments provided in DataFrame.
ref_filepathOptional[str]Filepath to the reference sequence.
refOptional[str]Nucleotide sequence of the reference.
barcode_intervalTuple[int, int]Interval in reference corresponding to the integration barcode. Defaults to (20, 34).
cutsite_locationsList[int]A list of all cutsite positions in the reference. Defaults to [112, 166, 220].
cutsite_widthintNumber of nucleotides left and right of cutsite location that indels can appear in. Defaults to 12.
contextboolInclude sequence context around indels. Defaults to True.
context_sizeintNumber of bases to the right and left to include as context. Defaults to 5.

Outputs

NameTypeDescription
alignmentspd.DataFrameA DataFrame mapping each sequence alignment to the called indels.

Internal Logic

The function first reads the reference sequence from the provided file or string. It then iterates through the alignments and parses the CIGAR strings using the alignment_utilities.parse_cigar function to extract the indels at each cutsite. The extracted indels are then added to the input DataFrame.

error_correct_intbcs_to_whitelist

Description

Corrects all intBCs to the provided whitelist.

Inputs

NameTypeDescription
input_dfpd.DataFrameInput DataFrame of alignments.
whitelistUnion[str, List[str]]Either a path to a plaintext file containing the barcode whitelist or a list of barcodes.
intbc_dist_threshintThe threshold specifying the maximum Levenshtein distance between the read sequence and whitelist to be corrected. Defaults to 1.

Outputs

NameTypeDescription
input_dfpd.DataFrameA DataFrame of error corrected intBCs.

Internal Logic

The function first reads the whitelist from the provided file or list. It then iterates through the unique intBCs in the input DataFrame and calculates the Levenshtein distance to each whitelist intBC. If the distance is less than or equal to the threshold and there is only one matching whitelist intBC, the intBC is corrected.

error_correct_umis

Description

Within cellBC-intBC pairs, collapses UMIs that have close sequences.

Inputs

NameTypeDescription
input_dfpd.DataFrameInput DataFrame of alignments.
max_umi_distanceintThe threshold specifying the Maximum Hamming distance between UMIs for one to be corrected to another. Defaults to 2.
allow_allele_conflictsboolWhether or not to include the allele when splitting UMIs into allele groups. Defaults to False.
n_threadsintNumber of threads to use. Defaults to 1.

Outputs

NameTypeDescription
alignment_dfpd.DataFrameA DataFrame of error corrected UMIs.

Internal Logic

The function first groups the input DataFrame by cell barcode, intBC, and optionally allele. For each group, it corrects UMIs that have a Hamming distance less than or equal to the threshold using the UMI_utils.correct_umis_in_group function. The corrected UMIs are then merged back into the input DataFrame.

filter_molecule_table

Description

Filters and corrects a molecule table of cellBC-UMI pairs.

Inputs

NameTypeDescription
input_dfpd.DataFrameA molecule table, i.e. cellBC-UMI pairs.
output_directorystrThe output directory path to store plots.
min_umi_per_cellintThe threshold specifying the minimum number of UMIs in a