UMI_utils.py - Stanza Demo

High-level description

This file provides functions for collapsing and preprocessing Unique Molecular Identifiers (UMIs) in sequencing data. It handles tasks like grouping reads by cell barcodes and UMIs, forming consensus sequences for each UMI cluster, and error-correcting UMI sequences.

Code Structure

The code defines several functions that work together to perform UMI collapsing and error correction. The main functions are sort_bam, form_collapsed_clusters, form_clusters, form_clusters_likelihood, and correct_umis_in_group. sort_bam sorts the input BAM file by cell barcode and UMI. form_collapsed_clusters groups reads with the same cell barcode and UMI, and then calls either form_clusters or form_clusters_likelihood to form consensus sequences for each UMI cluster. correct_umis_in_group further groups reads by cell barcode, UMI, and optionally allele, and then performs error correction on the UMI sequences within each group.

References

This file references the following symbols:

cassiopeia.preprocess.constants.BAM_CONSTANTS
cassiopeia.preprocess.utilities.convert_bam_to_df
cassiopeia.mixins.logger
cassiopeia.mixins.PreprocessError
cassiopeia.mixins.PreprocessWarning

Symbols

`detect_cell_bc_tag`

Description

This function determines the appropriate BAM tag for cell barcodes by checking if the corrected cell barcode tag exists in the BAM file. If it does, it returns the corrected tag; otherwise, it returns the raw cell barcode tag.

Inputs

Name	Type	Description
bam_fp	str	Path to the BAM file.

Outputs

Name	Type	Description
	str	The BAM tag to use as the cell barcode.

`group_bam_by_key`

Description

This generator function iterates through a sorted BAM file and yields groups of alignments that share the same key, defined by the sort_key function.

Inputs

Name	Type	Description
sorted_fn	str	Path to the sorted BAM file.
sort_key	Callable	Function that takes an alignment and returns the key to group by.
n_threads	int	Number of threads to use.

Outputs

Name	Type	Description
	Generator[Tuple[Union[str, Tuple[str, …]], List[pysam.AlignedSegment]], None, None]	A generator yielding tuples of two elements: the grouping key and a list of alignments with that key.

`sort_bam`

Description

This function sorts a BAM file by a specified key and filters alignments based on a filter function. It splits the BAM file into chunks, sorts and filters each chunk, and then merges the sorted chunks into a final sorted BAM file.

Inputs

Name	Type	Description
bam_fp	str	Path to the input BAM file.
sorted_fn	str	Path to the output sorted BAM file.
sort_key	Callable	Function that takes an alignment and returns the key to sort by.
filter_func	Callable	Function that takes an alignment and returns True if the alignment should be kept, False otherwise.
n_threads	int	Number of threads to use.

Outputs

Name	Type	Description
max_read_length	int	The maximum read length observed in the BAM file.
total_reads_out	int	The total number of reads written to the sorted BAM file.

`form_collapsed_clusters`

Description

This function groups reads with the same cell barcode and UMI, and then forms consensus sequences for each UMI cluster. It uses either a hard cutoff method or a likelihood-based method to form the initial sequence clusters, and then attempts to merge clusters with similar sequences.

Inputs

Name	Type	Description
sorted_fn	str	Path to the sorted BAM file.
collapsed_fn	str	Path to the output collapsed BAM file.
max_hq_mismatches	int	Maximum number of high-quality mismatches allowed between sequences to be collapsed.
max_indels	int	Maximum number of indels allowed between sequences to be collapsed.
cell_key	Callable	Function that takes an alignment and returns the cell barcode.
UMI_key	Callable	Function that takes an alignment and returns the UMI sequence.
method	Literal[“cutoff”, “likelihood”]	Method to use for forming initial sequence clusters.
n_threads	int	Number of threads to use.

Outputs

Name	Type	Description
	None	This function writes the collapsed BAM file to disk.

`form_clusters`

Description

This function implements the hard cutoff method for forming consensus sequences for UMI clusters. It iteratively finds the most frequent sequence in the remaining alignments and groups all sequences within a certain Hamming distance from it.

Inputs

Name	Type	Description
als	List[pysam.AlignedSegment]	List of alignments to cluster.
max_read_length	int	Maximum read length in the dataset.
max_hq_mismatches	int	Maximum number of high-quality mismatches allowed between sequences to be collapsed.

Outputs

Name	Type	Description
	List[pysam.AlignedSegment]	A list of annotated aligned segments representing the consensus of each cluster.

`form_clusters_likelihood`

Description

This function implements the likelihood-based method for forming consensus sequences for UMI clusters. It uses the quality scores of the bases to calculate the probability of each base being correct and forms consensus sequences based on these probabilities.

Inputs

Name	Type	Description
als	List[pysam.AlignedSegment]	List of alignments to cluster.
q_threshold	Optional[int]	Optional PHRED quality threshold.
proportion	float	Proportion of each sequence to allow mismatched bases to be above `q_threshold`.

Outputs

Name	Type	Description
	List[pysam.AlignedSegment]	A list of annotated aligned segments representing the consensus of each cluster.

`align_clusters`

Description

This function calculates the number of indels and high-quality mismatches between two aligned segments using the Smith-Waterman algorithm.

Inputs

Name	Type	Description
first	pysam.AlignedSegment	The first aligned segment.
second	pysam.AlignedSegment	The second aligned segment.

Outputs

Name	Type	Description
indels	int	The number of differing indels between the two sequences.
num_hq_mismatches	int	The number of high-quality mismatches between the two sequences.

`within_radius_of_seed`

Description

This function takes a seed sequence and a list of alignments and returns two lists: one containing alignments whose sequences are within a certain Hamming distance from the seed sequence, and another containing the remaining alignments.

Inputs

Name	Type	Description
seed	str	The seed sequence.
als	List[pysam.AlignedSegment]	List of alignments to compare to the seed sequence.
max_hq_mismatches	int	Maximum number of high-quality mismatches allowed between a sequence and the seed sequence.

Outputs

Name	Type	Description
near_seed	List[pysam.AlignedSegment]	Alignments within radius of the seed.
remaining	List[pysam.AlignedSegment]	Alignments not within radius of the seed.

`propose_seed`

Description

This function proposes a seed sequence for clustering by finding the most frequent sequence in a list of alignments. If there is no most frequent sequence, it calls call_consensus to generate a consensus sequence.

Inputs

Name	Type	Description
als	List[pysam.AlignedSegment]	List of alignments to propose a seed sequence from.
max_read_length	int	Maximum read length in the dataset.

Outputs

Name	Type	Description
	str	The proposed seed sequence.

`make_singleton_cluster`

Description

This function creates a singleton cluster from a single alignment.

Inputs

Name	Type	Description
al	pysam.AlignedSegment	The alignment to create a singleton cluster from.

Outputs

Name	Type	Description
	pysam.AlignedSegment	An annotated aligned segment representing the singleton cluster.

`call_consensus`

Description

This function generates a consensus sequence from a list of alignments by taking the most frequent high-quality base at each position.

Inputs

Name	Type	Description
als	List[pysam.AlignedSegment]	List of alignments to generate a consensus sequence from.
max_read_length	int	Maximum read length in the dataset.

Outputs

Name	Type	Description
	pysam.AlignedSegment	A consensus annotated aligned segment.

`merge_annotated_clusters`

Description

This function merges two annotated clusters by adding the read count of the smaller cluster to the larger cluster and updating the cluster ID.

Inputs

Name	Type	Description
biggest	pysam.AlignedSegment	The larger of the two clusters.
other	pysam.AlignedSegment	The smaller of the two clusters.

Outputs

Name	Type	Description
	pysam.AlignedSegment	The annotated aligned segment representing the merged cluster.

`correct_umis_in_group`

Description

This function performs error correction on UMI sequences within a group of alignments that share the same cell barcode and integration barcode. It uses a Hamming distance threshold to determine which UMIs should be merged.

Inputs

Name	Type	Description
cell_group	pd.DataFrame	DataFrame of alignments for a single cell barcode and integration barcode.
max_umi_distance	int	Maximum Hamming distance between UMIs for error correction.

Outputs

Name	Type	Description
corrected_cell_group	pd.DataFrame	DataFrame of alignments with error-corrected UMIs.
len(corrections)	int	Number of corrected UMIs.

Internal Logic

The function first calculates the Hamming distance between all pairs of UMIs in the group. It then identifies UMIs that are within the specified distance threshold and merges them into the more abundant UMI.

Side Effects

This function modifies the input DataFrame in place.

Performance Considerations

The function uses a Cython implementation of the Hamming distance calculation to improve performance.

Project Root

​High-level description

​Code Structure

​References

​Symbols

​detect_cell_bc_tag

​Description

​Inputs

​Outputs

​group_bam_by_key

​Description

​Inputs

​Outputs

​sort_bam

​Description

​Inputs

​Outputs

​form_collapsed_clusters

​Description

​Inputs

​Outputs

​form_clusters

​Description

​Inputs

​Outputs

​form_clusters_likelihood

​Description

​Inputs

​Outputs

​align_clusters

​Description

​Inputs

​Outputs

​within_radius_of_seed

​Description

​Inputs

​Outputs

​propose_seed

​Description

​Inputs

​Outputs

​make_singleton_cluster

​Description

​Inputs

​Outputs

​call_consensus

​Description

​Inputs

​Outputs

​merge_annotated_clusters

​Description

​Inputs

​Outputs

​correct_umis_in_group

​Description

​Inputs

​Outputs

​Internal Logic

​Side Effects

​Performance Considerations

High-level description

Code Structure

References

Symbols

`detect_cell_bc_tag`

Description

Inputs

Outputs

`group_bam_by_key`

Description

Inputs

Outputs

`sort_bam`

Description

Inputs

Outputs

`form_collapsed_clusters`

Description

Inputs

Outputs

`form_clusters`

Description

Inputs

Outputs

`form_clusters_likelihood`

Description

Inputs

Outputs

`align_clusters`

Description

Inputs

Outputs

`within_radius_of_seed`

Description

Inputs

Outputs

`propose_seed`

Description

Inputs

Outputs

`make_singleton_cluster`

Description

Inputs

Outputs

`call_consensus`

Description

Inputs

Outputs

`merge_annotated_clusters`

Description

Inputs

Outputs

`correct_umis_in_group`

Description

Inputs

Outputs

Internal Logic

Side Effects

Performance Considerations