Overview - Stanza Demo

High-level description

This directory contains the preprocessing module for Cassiopeia, a tool for phylogenetic inference from single-cell sequencing data. The preprocessing pipeline handles tasks such as converting raw sequencing data to BAM format, filtering reads, error-correcting barcodes and UMIs, collapsing UMIs, aligning sequences, calling alleles, and preparing data for lineage reconstruction.

What does it do?

The preprocessing pipeline takes raw sequencing data as input and performs the following main steps:

Converts FASTQ files to BAM format
Filters low-quality reads
Error-corrects cell barcodes and UMIs
Collapses UMIs to reduce noise
Aligns sequences to a reference
Calls alleles (indels) at target sites
Filters and processes the resulting molecule table
Calls lineage groups
Converts the processed data into formats suitable for phylogenetic inference

These steps prepare the raw sequencing data for downstream analysis, including the construction of lineage trees.

Entry points

The main entry point for the preprocessing pipeline is the cassiopeia_preprocess.py script. This script orchestrates the execution of the various preprocessing steps based on a user-provided configuration file. The pipeline.py file contains the core functions for each preprocessing step, which are called by the main script. The setup_utilities.py file provides functions for parsing the configuration file and setting up the pipeline.

Key Files

UMI_utils.py: Contains functions for collapsing and preprocessing Unique Molecular Identifiers (UMIs).
alignment_utilities.py: Provides utilities for sequence alignment and CIGAR string parsing.
constants.py: Defines constants and default parameters used throughout the preprocessing pipeline.
doublet_utils.py: Contains functions for identifying and filtering potential cell doublets.
lineage_utils.py: Provides functions for calling lineage groups and processing lineage-related data.
map_utils.py: Contains functions for resolving allele ambiguity in molecule tables.
utilities.py: Offers various utility functions for data filtering, conversion, and manipulation.

Dependencies

The preprocessing module relies on several external libraries:

pandas: For data manipulation and analysis
numpy: For numerical operations
pysam: For reading and manipulating BAM files
ngs_tools: For sequence manipulation and analysis
pylab: For plotting and visualization
tqdm: For progress bar visualization

Configuration

The preprocessing pipeline is configured using an INI-format configuration file. This file specifies parameters for each preprocessing step, including input/output paths, filtering thresholds, and algorithm-specific settings. The DEFAULT_PIPELINE_PARAMETERS dictionary in constants.py defines the default values for these parameters. Key configurable parameters include:

Input/output file paths
Sequencing chemistry (e.g., 10xv2, 10xv3)
Quality thresholds for filtering
UMI collapsing parameters
Alignment settings
Allele calling parameters
Cell and UMI filtering thresholds

Users can customize these parameters in the configuration file to adapt the pipeline to their specific experimental setup and data characteristics. In summary, the Cassiopeia preprocessing module provides a comprehensive and flexible pipeline for processing single-cell lineage tracing data, preparing it for downstream phylogenetic analysis.

Project Root

​High-level description

​What does it do?

​Entry points

​Key Files

​Dependencies