High-level description

This file provides functions for collapsing and preprocessing Unique Molecular Identifiers (UMIs) in sequencing data. It handles tasks like grouping reads by cell barcodes and UMIs, forming consensus sequences for each UMI cluster, and error-correcting UMI sequences.

Code Structure

The code defines several functions that work together to perform UMI collapsing and error correction. The main functions are sort_bam, form_collapsed_clusters, form_clusters, form_clusters_likelihood, and correct_umis_in_group. sort_bam sorts the input BAM file by cell barcode and UMI. form_collapsed_clusters groups reads with the same cell barcode and UMI, and then calls either form_clusters or form_clusters_likelihood to form consensus sequences for each UMI cluster. correct_umis_in_group further groups reads by cell barcode, UMI, and optionally allele, and then performs error correction on the UMI sequences within each group.

References

This file references the following symbols:

  • cassiopeia.preprocess.constants.BAM_CONSTANTS
  • cassiopeia.preprocess.utilities.convert_bam_to_df
  • cassiopeia.mixins.logger
  • cassiopeia.mixins.PreprocessError
  • cassiopeia.mixins.PreprocessWarning

Symbols

detect_cell_bc_tag

Description

This function determines the appropriate BAM tag for cell barcodes by checking if the corrected cell barcode tag exists in the BAM file. If it does, it returns the corrected tag; otherwise, it returns the raw cell barcode tag.

Inputs

NameTypeDescription
bam_fpstrPath to the BAM file.

Outputs

NameTypeDescription
strThe BAM tag to use as the cell barcode.

group_bam_by_key

Description

This generator function iterates through a sorted BAM file and yields groups of alignments that share the same key, defined by the sort_key function.

Inputs

NameTypeDescription
sorted_fnstrPath to the sorted BAM file.
sort_keyCallableFunction that takes an alignment and returns the key to group by.
n_threadsintNumber of threads to use.

Outputs

NameTypeDescription
Generator[Tuple[Union[str, Tuple[str, …]], List[pysam.AlignedSegment]], None, None]A generator yielding tuples of two elements: the grouping key and a list of alignments with that key.

sort_bam

Description

This function sorts a BAM file by a specified key and filters alignments based on a filter function. It splits the BAM file into chunks, sorts and filters each chunk, and then merges the sorted chunks into a final sorted BAM file.

Inputs

NameTypeDescription
bam_fpstrPath to the input BAM file.
sorted_fnstrPath to the output sorted BAM file.
sort_keyCallableFunction that takes an alignment and returns the key to sort by.
filter_funcCallableFunction that takes an alignment and returns True if the alignment should be kept, False otherwise.
n_threadsintNumber of threads to use.

Outputs

NameTypeDescription
max_read_lengthintThe maximum read length observed in the BAM file.
total_reads_outintThe total number of reads written to the sorted BAM file.

form_collapsed_clusters

Description

This function groups reads with the same cell barcode and UMI, and then forms consensus sequences for each UMI cluster. It uses either a hard cutoff method or a likelihood-based method to form the initial sequence clusters, and then attempts to merge clusters with similar sequences.

Inputs

NameTypeDescription
sorted_fnstrPath to the sorted BAM file.
collapsed_fnstrPath to the output collapsed BAM file.
max_hq_mismatchesintMaximum number of high-quality mismatches allowed between sequences to be collapsed.
max_indelsintMaximum number of indels allowed between sequences to be collapsed.
cell_keyCallableFunction that takes an alignment and returns the cell barcode.
UMI_keyCallableFunction that takes an alignment and returns the UMI sequence.
methodLiteral[“cutoff”, “likelihood”]Method to use for forming initial sequence clusters.
n_threadsintNumber of threads to use.

Outputs

NameTypeDescription
NoneThis function writes the collapsed BAM file to disk.

form_clusters

Description

This function implements the hard cutoff method for forming consensus sequences for UMI clusters. It iteratively finds the most frequent sequence in the remaining alignments and groups all sequences within a certain Hamming distance from it.

Inputs

NameTypeDescription
alsList[pysam.AlignedSegment]List of alignments to cluster.
max_read_lengthintMaximum read length in the dataset.
max_hq_mismatchesintMaximum number of high-quality mismatches allowed between sequences to be collapsed.

Outputs

NameTypeDescription
List[pysam.AlignedSegment]A list of annotated aligned segments representing the consensus of each cluster.

form_clusters_likelihood

Description

This function implements the likelihood-based method for forming consensus sequences for UMI clusters. It uses the quality scores of the bases to calculate the probability of each base being correct and forms consensus sequences based on these probabilities.

Inputs

NameTypeDescription
alsList[pysam.AlignedSegment]List of alignments to cluster.
q_thresholdOptional[int]Optional PHRED quality threshold.
proportionfloatProportion of each sequence to allow mismatched bases to be above q_threshold.

Outputs

NameTypeDescription
List[pysam.AlignedSegment]A list of annotated aligned segments representing the consensus of each cluster.

align_clusters

Description

This function calculates the number of indels and high-quality mismatches between two aligned segments using the Smith-Waterman algorithm.

Inputs

NameTypeDescription
firstpysam.AlignedSegmentThe first aligned segment.
secondpysam.AlignedSegmentThe second aligned segment.

Outputs

NameTypeDescription
indelsintThe number of differing indels between the two sequences.
num_hq_mismatchesintThe number of high-quality mismatches between the two sequences.

within_radius_of_seed

Description

This function takes a seed sequence and a list of alignments and returns two lists: one containing alignments whose sequences are within a certain Hamming distance from the seed sequence, and another containing the remaining alignments.

Inputs

NameTypeDescription
seedstrThe seed sequence.
alsList[pysam.AlignedSegment]List of alignments to compare to the seed sequence.
max_hq_mismatchesintMaximum number of high-quality mismatches allowed between a sequence and the seed sequence.

Outputs

NameTypeDescription
near_seedList[pysam.AlignedSegment]Alignments within radius of the seed.
remainingList[pysam.AlignedSegment]Alignments not within radius of the seed.

propose_seed

Description

This function proposes a seed sequence for clustering by finding the most frequent sequence in a list of alignments. If there is no most frequent sequence, it calls call_consensus to generate a consensus sequence.

Inputs

NameTypeDescription
alsList[pysam.AlignedSegment]List of alignments to propose a seed sequence from.
max_read_lengthintMaximum read length in the dataset.

Outputs

NameTypeDescription
strThe proposed seed sequence.

make_singleton_cluster

Description

This function creates a singleton cluster from a single alignment.

Inputs

NameTypeDescription
alpysam.AlignedSegmentThe alignment to create a singleton cluster from.

Outputs

NameTypeDescription
pysam.AlignedSegmentAn annotated aligned segment representing the singleton cluster.

call_consensus

Description

This function generates a consensus sequence from a list of alignments by taking the most frequent high-quality base at each position.

Inputs

NameTypeDescription
alsList[pysam.AlignedSegment]List of alignments to generate a consensus sequence from.
max_read_lengthintMaximum read length in the dataset.

Outputs

NameTypeDescription
pysam.AlignedSegmentA consensus annotated aligned segment.

merge_annotated_clusters

Description

This function merges two annotated clusters by adding the read count of the smaller cluster to the larger cluster and updating the cluster ID.

Inputs

NameTypeDescription
biggestpysam.AlignedSegmentThe larger of the two clusters.
otherpysam.AlignedSegmentThe smaller of the two clusters.

Outputs

NameTypeDescription
pysam.AlignedSegmentThe annotated aligned segment representing the merged cluster.

correct_umis_in_group

Description

This function performs error correction on UMI sequences within a group of alignments that share the same cell barcode and integration barcode. It uses a Hamming distance threshold to determine which UMIs should be merged.

Inputs

NameTypeDescription
cell_grouppd.DataFrameDataFrame of alignments for a single cell barcode and integration barcode.
max_umi_distanceintMaximum Hamming distance between UMIs for error correction.

Outputs

NameTypeDescription
corrected_cell_grouppd.DataFrameDataFrame of alignments with error-corrected UMIs.
len(corrections)intNumber of corrected UMIs.

Internal Logic

The function first calculates the Hamming distance between all pairs of UMIs in the group. It then identifies UMIs that are within the specified distance threshold and merges them into the more abundant UMI.

Side Effects

This function modifies the input DataFrame in place.

Performance Considerations

The function uses a Cython implementation of the Hamming distance calculation to improve performance.