High-level description

This directory contains the preprocessing module for Cassiopeia, a tool for phylogenetic inference from single-cell lineage tracing data. The pipeline converts raw sequencing data to BAM format, filters reads, error-corrects cell barcodes and UMIs, collapses UMIs, aligns sequences, calls alleles, and prepares the data for lineage reconstruction.

What does it do?

The preprocessing pipeline takes raw sequencing data as input and performs the following main steps:

  1. Converts FASTQ files to BAM format
  2. Filters low-quality reads
  3. Error-corrects cell barcodes and UMIs
  4. Collapses UMIs to reduce noise
  5. Aligns sequences to a reference
  6. Calls alleles (indels) at target sites
  7. Filters and processes the resulting molecule table
  8. Calls lineage groups
  9. Converts the processed data into formats suitable for phylogenetic inference

These steps prepare the raw sequencing data for downstream analysis, including the construction of lineage trees.
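
As an illustration of one of these steps, UMI collapsing can be thought of as grouping reads by cell barcode and UMI and keeping one representative sequence per group. The sketch below is a simplified, standard-library stand-in and not the module's actual implementation: the function name collapse_umis, the input tuple layout, and the output field names are assumptions made for this example.

```python
from collections import Counter, defaultdict


def collapse_umis(reads):
    """Collapse reads sharing a (cell barcode, UMI) pair into one molecule.

    `reads` is an iterable of (cell_barcode, umi, sequence) tuples.
    Returns one dict per molecule with the most frequent sequence and
    the number of reads supporting that barcode/UMI pair.
    """
    # Count how often each sequence is observed for every barcode/UMI pair.
    groups = defaultdict(Counter)
    for cell_barcode, umi, sequence in reads:
        groups[(cell_barcode, umi)][sequence] += 1

    molecules = []
    for (cell_barcode, umi), counts in sorted(groups.items()):
        # Keep the most abundant sequence as the representative.
        consensus, _ = counts.most_common(1)[0]
        molecules.append(
            {
                "cellBC": cell_barcode,  # illustrative field names
                "UMI": umi,
                "seq": consensus,
                "readCount": sum(counts.values()),
            }
        )
    return molecules
```

A real implementation would typically also tolerate sequencing errors within the UMI and in the read itself; this sketch only merges exact barcode/UMI matches.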

Entry points

The main entry point for the preprocessing pipeline is the cassiopeia_preprocess.py script. This script orchestrates the execution of the various preprocessing steps based on a user-provided configuration file.

The pipeline.py file contains the core functions for each preprocessing step, which are called by the main script.

The setup_utilities.py file provides functions for parsing the configuration file and setting up the pipeline.
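
The division of labor described above (a driver script that runs stage functions according to a configuration) can be sketched roughly as follows. This is a hypothetical outline, not the code in cassiopeia_preprocess.py: run_pipeline, the stage names, and the two stand-in stage functions are all invented for illustration.

```python
def run_pipeline(stages, stage_functions, data, parameters):
    """Run the named stages in order, passing each stage's output to the next."""
    for name in stages:
        if name not in stage_functions:
            raise KeyError(f"unknown stage: {name}")
        # Each stage receives its own keyword arguments from the configuration.
        data = stage_functions[name](data, **parameters.get(name, {}))
    return data


# Hypothetical stand-ins for real preprocessing steps.
def filter_reads(reads, min_quality=10):
    return [read for read in reads if read["quality"] >= min_quality]


def count_reads(reads):
    return len(reads)


STAGE_FUNCTIONS = {"filter": filter_reads, "count": count_reads}
```

For example, run_pipeline(["filter", "count"], STAGE_FUNCTIONS, reads, {"filter": {"min_quality": 20}}) filters the reads first and then counts what remains. Keeping the stage order in data rather than in code is one way to let a configuration file select which steps to run.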

Key files

  • UMI_utils.py: Contains functions for collapsing and preprocessing Unique Molecular Identifiers (UMIs).
  • alignment_utilities.py: Provides utilities for sequence alignment and CIGAR string parsing.
  • constants.py: Defines constants and default parameters used throughout the preprocessing pipeline.
  • doublet_utils.py: Contains functions for identifying and filtering potential cell doublets.
  • lineage_utils.py: Provides functions for calling lineage groups and processing lineage-related data.
  • map_utils.py: Contains functions for resolving allele ambiguity in molecule tables.
  • utilities.py: Offers various utility functions for data filtering, conversion, and manipulation.
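
To make the CIGAR parsing mentioned for alignment_utilities.py more concrete, here is a minimal sketch of how insertions and deletions can be read off a CIGAR string. It follows the standard SAM operation codes but is not the module's own function: parse_cigar and find_indels are names assumed for this example.

```python
import re

# One CIGAR operation: a length followed by a SAM operation code.
CIGAR_OP = re.compile(r"(\d+)([MIDNSHP=X])")


def parse_cigar(cigar):
    """Split a CIGAR string into (length, operation) pairs."""
    return [(int(length), op) for length, op in CIGAR_OP.findall(cigar)]


def find_indels(cigar, ref_start=0):
    """Return (type, reference position, length) for each insertion or deletion."""
    indels = []
    ref_pos = ref_start
    for length, op in parse_cigar(cigar):
        if op == "I":
            indels.append(("insertion", ref_pos, length))
        elif op == "D":
            indels.append(("deletion", ref_pos, length))
        # Only these operations advance the position on the reference.
        if op in "MDN=X":
            ref_pos += length
    return indels
```

With the CIGAR string "10M2I5M3D20M" and ref_start=0, this reports an insertion of length 2 at reference position 10 and a deletion of length 3 at position 15.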

Dependencies

The preprocessing module relies on several external libraries:

  • pandas: For data manipulation and analysis
  • numpy: For numerical operations
  • pysam: For reading and manipulating BAM files
  • ngs_tools: For sequence manipulation and analysis
  • pylab (part of matplotlib): For plotting and visualization
  • tqdm: For progress bar visualization

Configuration

The preprocessing pipeline is configured using an INI-format configuration file. This file specifies parameters for each preprocessing step, including input/output paths, filtering thresholds, and algorithm-specific settings. The DEFAULT_PIPELINE_PARAMETERS dictionary in constants.py defines the default values for these parameters.
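
The interplay between an INI file and a dictionary of defaults can be sketched with Python's standard configparser module. Everything specific here is an assumption for illustration: the section names, the parameter names, the DEFAULTS dictionary (a stand-in for DEFAULT_PIPELINE_PARAMETERS), and the load_parameters helper do not come from the module itself.

```python
import configparser

# Hypothetical defaults, standing in for DEFAULT_PIPELINE_PARAMETERS.
DEFAULTS = {
    "filter": {"quality_threshold": 10},
    "collapse": {"max_hq_mismatches": 3},
}

# Hypothetical configuration file contents.
EXAMPLE_CONFIG = """
[general]
output_directory = ./out

[filter]
quality_threshold = 20
"""


def load_parameters(text, defaults):
    """Parse INI text and overlay it on a copy of the default parameters."""
    parser = configparser.ConfigParser()
    parser.read_string(text)

    # Copy the defaults so the caller's dictionary is left untouched.
    parameters = {section: dict(values) for section, values in defaults.items()}
    for section in parser.sections():
        parameters.setdefault(section, {})
        for key, raw in parser[section].items():
            default = parameters[section].get(key)
            # INI values are strings; coerce to the type of the default if any.
            if isinstance(default, bool):
                value = parser.getboolean(section, key)
            elif isinstance(default, (int, float)):
                value = type(default)(raw)
            else:
                value = raw
            parameters[section][key] = value
    return parameters
```

Calling load_parameters(EXAMPLE_CONFIG, DEFAULTS) keeps the default for any parameter the file omits and replaces the rest with the user's values, converted to the type of the corresponding default.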

Key configurable parameters include:

  • Input/output file paths
  • Sequencing chemistry (e.g., 10xv2, 10xv3)
  • Quality thresholds for filtering
  • UMI collapsing parameters
  • Alignment settings
  • Allele calling parameters
  • Cell and UMI filtering thresholds

Users can customize these parameters in the configuration file to adapt the pipeline to their specific experimental setup and data characteristics.

In summary, the Cassiopeia preprocessing module takes single-cell lineage tracing data from raw sequencing reads to the formats required for downstream phylogenetic analysis.