High-level description
This file provides functions for collapsing and preprocessing Unique Molecular Identifiers (UMIs) in sequencing data. It handles tasks like grouping reads by cell barcodes and UMIs, forming consensus sequences for each UMI cluster, and error-correcting UMI sequences.
Code Structure
The code defines several functions that work together to perform UMI collapsing and error correction. The main functions are sort_bam
, form_collapsed_clusters
, form_clusters
, form_clusters_likelihood
, and correct_umis_in_group
. sort_bam
sorts the input BAM file by cell barcode and UMI. form_collapsed_clusters
groups reads with the same cell barcode and UMI, and then calls either form_clusters
or form_clusters_likelihood
to form consensus sequences for each UMI cluster. correct_umis_in_group
further groups reads by cell barcode, UMI, and optionally allele, and then performs error correction on the UMI sequences within each group.
References
This file references the following symbols:
cassiopeia.preprocess.constants.BAM_CONSTANTS
cassiopeia.preprocess.utilities.convert_bam_to_df
cassiopeia.mixins.logger
cassiopeia.mixins.PreprocessError
cassiopeia.mixins.PreprocessWarning
Symbols
detect_cell_bc_tag
Description
This function determines the appropriate BAM tag for cell barcodes by checking if the corrected cell barcode tag exists in the BAM file. If it does, it returns the corrected tag; otherwise, it returns the raw cell barcode tag.
Name | Type | Description |
---|
bam_fp | str | Path to the BAM file. |
Outputs
Name | Type | Description |
---|
| str | The BAM tag to use as the cell barcode. |
group_bam_by_key
Description
This generator function iterates through a sorted BAM file and yields groups of alignments that share the same key, defined by the sort_key
function.
Name | Type | Description |
---|
sorted_fn | str | Path to the sorted BAM file. |
sort_key | Callable | Function that takes an alignment and returns the key to group by. |
n_threads | int | Number of threads to use. |
Outputs
Name | Type | Description |
---|
| Generator[Tuple[Union[str, Tuple[str, …]], List[pysam.AlignedSegment]], None, None] | A generator yielding tuples of two elements: the grouping key and a list of alignments with that key. |
sort_bam
Description
This function sorts a BAM file by a specified key and filters alignments based on a filter function. It splits the BAM file into chunks, sorts and filters each chunk, and then merges the sorted chunks into a final sorted BAM file.
Name | Type | Description |
---|
bam_fp | str | Path to the input BAM file. |
sorted_fn | str | Path to the output sorted BAM file. |
sort_key | Callable | Function that takes an alignment and returns the key to sort by. |
filter_func | Callable | Function that takes an alignment and returns True if the alignment should be kept, False otherwise. |
n_threads | int | Number of threads to use. |
Outputs
Name | Type | Description |
---|
max_read_length | int | The maximum read length observed in the BAM file. |
total_reads_out | int | The total number of reads written to the sorted BAM file. |
Description
This function groups reads with the same cell barcode and UMI, and then forms consensus sequences for each UMI cluster. It uses either a hard cutoff method or a likelihood-based method to form the initial sequence clusters, and then attempts to merge clusters with similar sequences.
Name | Type | Description |
---|
sorted_fn | str | Path to the sorted BAM file. |
collapsed_fn | str | Path to the output collapsed BAM file. |
max_hq_mismatches | int | Maximum number of high-quality mismatches allowed between sequences to be collapsed. |
max_indels | int | Maximum number of indels allowed between sequences to be collapsed. |
cell_key | Callable | Function that takes an alignment and returns the cell barcode. |
UMI_key | Callable | Function that takes an alignment and returns the UMI sequence. |
method | Literal[“cutoff”, “likelihood”] | Method to use for forming initial sequence clusters. |
n_threads | int | Number of threads to use. |
Outputs
Name | Type | Description |
---|
| None | This function writes the collapsed BAM file to disk. |
Description
This function implements the hard cutoff method for forming consensus sequences for UMI clusters. It iteratively finds the most frequent sequence in the remaining alignments and groups all sequences within a certain Hamming distance from it.
Name | Type | Description |
---|
als | List[pysam.AlignedSegment] | List of alignments to cluster. |
max_read_length | int | Maximum read length in the dataset. |
max_hq_mismatches | int | Maximum number of high-quality mismatches allowed between sequences to be collapsed. |
Outputs
Name | Type | Description |
---|
| List[pysam.AlignedSegment] | A list of annotated aligned segments representing the consensus of each cluster. |
Description
This function implements the likelihood-based method for forming consensus sequences for UMI clusters. It uses the quality scores of the bases to calculate the probability of each base being correct and forms consensus sequences based on these probabilities.
Name | Type | Description |
---|
als | List[pysam.AlignedSegment] | List of alignments to cluster. |
q_threshold | Optional[int] | Optional PHRED quality threshold. |
proportion | float | Proportion of each sequence to allow mismatched bases to be above q_threshold . |
Outputs
Name | Type | Description |
---|
| List[pysam.AlignedSegment] | A list of annotated aligned segments representing the consensus of each cluster. |
align_clusters
Description
This function calculates the number of indels and high-quality mismatches between two aligned segments using the Smith-Waterman algorithm.
Name | Type | Description |
---|
first | pysam.AlignedSegment | The first aligned segment. |
second | pysam.AlignedSegment | The second aligned segment. |
Outputs
Name | Type | Description |
---|
indels | int | The number of differing indels between the two sequences. |
num_hq_mismatches | int | The number of high-quality mismatches between the two sequences. |
within_radius_of_seed
Description
This function takes a seed sequence and a list of alignments and returns two lists: one containing alignments whose sequences are within a certain Hamming distance from the seed sequence, and another containing the remaining alignments.
Name | Type | Description |
---|
seed | str | The seed sequence. |
als | List[pysam.AlignedSegment] | List of alignments to compare to the seed sequence. |
max_hq_mismatches | int | Maximum number of high-quality mismatches allowed between a sequence and the seed sequence. |
Outputs
Name | Type | Description |
---|
near_seed | List[pysam.AlignedSegment] | Alignments within radius of the seed. |
remaining | List[pysam.AlignedSegment] | Alignments not within radius of the seed. |
propose_seed
Description
This function proposes a seed sequence for clustering by finding the most frequent sequence in a list of alignments. If there is no most frequent sequence, it calls call_consensus
to generate a consensus sequence.
Name | Type | Description |
---|
als | List[pysam.AlignedSegment] | List of alignments to propose a seed sequence from. |
max_read_length | int | Maximum read length in the dataset. |
Outputs
Name | Type | Description |
---|
| str | The proposed seed sequence. |
make_singleton_cluster
Description
This function creates a singleton cluster from a single alignment.
Name | Type | Description |
---|
al | pysam.AlignedSegment | The alignment to create a singleton cluster from. |
Outputs
Name | Type | Description |
---|
| pysam.AlignedSegment | An annotated aligned segment representing the singleton cluster. |
call_consensus
Description
This function generates a consensus sequence from a list of alignments by taking the most frequent high-quality base at each position.
Name | Type | Description |
---|
als | List[pysam.AlignedSegment] | List of alignments to generate a consensus sequence from. |
max_read_length | int | Maximum read length in the dataset. |
Outputs
Name | Type | Description |
---|
| pysam.AlignedSegment | A consensus annotated aligned segment. |
merge_annotated_clusters
Description
This function merges two annotated clusters by adding the read count of the smaller cluster to the larger cluster and updating the cluster ID.
Name | Type | Description |
---|
biggest | pysam.AlignedSegment | The larger of the two clusters. |
other | pysam.AlignedSegment | The smaller of the two clusters. |
Outputs
Name | Type | Description |
---|
| pysam.AlignedSegment | The annotated aligned segment representing the merged cluster. |
correct_umis_in_group
Description
This function performs error correction on UMI sequences within a group of alignments that share the same cell barcode and integration barcode. It uses a Hamming distance threshold to determine which UMIs should be merged.
Name | Type | Description |
---|
cell_group | pd.DataFrame | DataFrame of alignments for a single cell barcode and integration barcode. |
max_umi_distance | int | Maximum Hamming distance between UMIs for error correction. |
Outputs
Name | Type | Description |
---|
corrected_cell_group | pd.DataFrame | DataFrame of alignments with error-corrected UMIs. |
len(corrections) | int | Number of corrected UMIs. |
Internal Logic
The function first calculates the Hamming distance between all pairs of UMIs in the group. It then identifies UMIs that are within the specified distance threshold and merges them into the more abundant UMI.
Side Effects
This function modifies the input DataFrame in place.
The function uses a Cython implementation of the Hamming distance calculation to improve performance.