High-level description

This directory contains unit tests for the preprocessing module of the Cassiopeia library. The tests cover various aspects of the preprocessing pipeline, including sequence alignment, allele calling, lineage group assignment, character matrix formation, UMI collapsing, and configuration parsing.

What does it do?

The tests in this directory verify the correctness of different preprocessing steps in the Cassiopeia pipeline. These steps include:

  1. Aligning sequences to a reference
  2. Calling alleles from aligned sequences
  3. Assigning lineage groups to cells
  4. Converting allele tables to character matrices
  5. Collapsing UMIs (Unique Molecular Identifiers)
  6. Parsing configuration files for the pipeline
  7. Converting FASTQ files to unmapped BAM files
  8. Error-correcting cell barcodes and integration barcodes
  9. Filtering BAM files and molecule tables
  10. Resolving UMI sequences

Each test file targets one step of the preprocessing pipeline and checks that its functions behave as expected on typical inputs and on edge cases.
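As a rough illustration of that pattern, the sketch below tests a toy barcode-correction helper with unittest. The function names (hamming_distance, correct_to_whitelist), the whitelist, and the one-mismatch rule are all assumptions made for this example; they are not Cassiopeia's API or its actual correction logic.

```python
import io
import unittest


def hamming_distance(a, b):
    """Count mismatching positions between two equal-length strings."""
    if len(a) != len(b):
        raise ValueError("sequences must have the same length")
    return sum(x != y for x, y in zip(a, b))


def correct_to_whitelist(barcode, whitelist, max_distance=1):
    """Return the unique whitelist entry within max_distance, else None.

    Hypothetical stand-in written for this example only.
    """
    if barcode in whitelist:
        return barcode
    matches = [
        entry
        for entry in whitelist
        if len(entry) == len(barcode)
        and hamming_distance(barcode, entry) <= max_distance
    ]
    # Reject barcodes that are equally close to several whitelist entries.
    return matches[0] if len(matches) == 1 else None


class CorrectToWhitelistTest(unittest.TestCase):
    def setUp(self):
        self.whitelist = ["AAAA", "CCCC", "AATT"]

    def test_exact_match_is_unchanged(self):
        self.assertEqual(correct_to_whitelist("CCCC", self.whitelist), "CCCC")

    def test_single_mismatch_is_corrected(self):
        self.assertEqual(correct_to_whitelist("CCCA", self.whitelist), "CCCC")

    def test_ambiguous_barcode_is_rejected(self):
        # "AAAT" is one mismatch away from both "AAAA" and "AATT".
        self.assertIsNone(correct_to_whitelist("AAAT", self.whitelist))

    def test_distant_barcode_is_rejected(self):
        self.assertIsNone(correct_to_whitelist("GGGG", self.whitelist))


# Run the test case programmatically and keep the report in memory.
suite = unittest.defaultTestLoader.loadTestsFromTestCase(CorrectToWhitelistTest)
result = unittest.TextTestRunner(stream=io.StringIO()).run(suite)
```

The ambiguous and distant cases are the kind of edge cases such a test file would be expected to pin down alongside the ordinary path.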

Entry points

The main entry points for developers working on the preprocessing module tests are:

  1. align_sequence_test.py: Tests for sequence alignment functionality
  2. call_alleles_test.py: Tests for allele calling from aligned sequences
  3. call_lineage_groups_test.py: Tests for lineage group assignment
  4. character_matrix_test.py: Tests for converting allele tables to character matrices
  5. collapse_umi_test.py: Tests for UMI collapsing functionality
  6. config_parser_test.py: Tests for parsing configuration files

These files contain the core tests for the main preprocessing steps. The remaining test files cover narrower functionality or edge cases within the pipeline.
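Because the files are named with a _test.py suffix, unittest's default discovery pattern (test*.py) would not pick them up. The sketch below shows one way to collect such files with an explicit pattern. It runs against a throwaway directory containing a single toy test file, since the real directory path and the project's preferred test runner are not stated here; discover_tests and toy_example_test.py are names invented for this example.

```python
import io
import tempfile
import textwrap
import unittest
from pathlib import Path


def discover_tests(start_dir, pattern="*_test.py"):
    """Collect tests from files whose names end in `_test.py`."""
    return unittest.TestLoader().discover(start_dir=str(start_dir), pattern=pattern)


# Demonstrate on a temporary directory holding one toy test file.
with tempfile.TemporaryDirectory() as tmp:
    toy_source = textwrap.dedent(
        """
        import unittest

        class ToyTest(unittest.TestCase):
            def test_addition(self):
                self.assertEqual(1 + 1, 2)
        """
    )
    Path(tmp, "toy_example_test.py").write_text(toy_source)

    suite = discover_tests(tmp)
    n_tests = suite.countTestCases()
    # Write the runner's report to memory instead of stderr.
    result = unittest.TextTestRunner(stream=io.StringIO()).run(suite)

print(n_tests, result.wasSuccessful())
```

From a shell, the equivalent is python -m unittest discover with the -p "*_test.py" option, run from the directory that holds the test files.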

Key files

  1. align_sequence_test.py: Tests sequence alignment with different parameters and methods.
  2. call_alleles_test.py: Verifies correct allele calling from CIGAR strings and aligned sequences.
  3. call_lineage_groups_test.py: Checks lineage group assignment, including handling of doublets and reassignment.
  4. character_matrix_test.py: Tests conversion of allele tables to character matrices and lineage profiles.
  5. collapse_umi_test.py: Verifies UMI collapsing for different sequencing chemistries.
  6. config_parser_test.py: Ensures correct parsing of configuration files and pipeline setup.
  7. convert_fastqs_to_unmapped_bam_test.py: Tests conversion of FASTQ files to unmapped BAM files.
  8. error_correct_cellbcs_to_whitelist_test.py: Checks error correction of cell barcodes against a whitelist.
  9. error_correct_intbcs_to_whitelist_test.py: Verifies error correction of integration barcodes.
  10. error_correct_umi_test.py: Tests UMI error correction functionality.
  11. filter_bam_test.py: Checks filtering of BAM files based on read quality.
  12. filter_molecule_table_test.py: Tests filtering of molecule tables based on various criteria.
  13. resolve_umi_sequence_test.py: Verifies UMI sequence resolution and cell filtering.
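Item 2 above refers to CIGAR strings, the compact alignment encoding defined by the SAM format. To make that input concrete, here is a minimal sketch that parses a CIGAR string and reports the insertions and deletions it encodes. parse_cigar and find_indels are hypothetical helpers written for this example; they do not reproduce how Cassiopeia calls alleles.

```python
import re

# One CIGAR token is a run length followed by a single operation code.
CIGAR_TOKEN = re.compile(r"(\d+)([MIDNSHP=X])")


def parse_cigar(cigar):
    """Split a CIGAR string into (length, operation) pairs."""
    tokens = CIGAR_TOKEN.findall(cigar)
    # Reject strings that contain anything other than valid tokens.
    if "".join(length + op for length, op in tokens) != cigar:
        raise ValueError(f"malformed CIGAR string: {cigar!r}")
    return [(int(length), op) for length, op in tokens]


def find_indels(cigar, ref_start=0):
    """Return (kind, reference_position, length) for each I or D operation."""
    indels = []
    ref_pos = ref_start
    for length, op in parse_cigar(cigar):
        if op == "I":
            indels.append(("insertion", ref_pos, length))
        elif op == "D":
            indels.append(("deletion", ref_pos, length))
        # M, D, N, = and X consume reference bases; I, S, H and P do not.
        if op in "MDN=X":
            ref_pos += length
    return indels


print(find_indels("10M2D5M3I4M"))
# prints [('deletion', 10, 2), ('insertion', 17, 3)]
```

A test for allele calling would typically assert on output like this for a handful of hand-written CIGAR strings.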

Dependencies

The test files rely on the following main dependencies:

  1. unittest: Python’s built-in unit testing framework
  2. numpy: For numerical operations
  3. pandas: For data manipulation and analysis
  4. pysam: For reading and manipulating SAM/BAM files
  5. cassiopeia: The main package being tested

Additional dependencies include:

  • os, shutil, tempfile (standard library): For file and directory operations
  • pathlib (standard library): For handling file paths
  • ngs_tools (third-party): For FASTQ file handling

Configuration

Many test files use configuration parameters to set up test scenarios. These configurations are typically defined within the test methods or in the setUp methods of test classes. Some tests also read configuration files to verify the correct parsing of pipeline settings.
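The setUp-based arrangement described above might look like the following sketch, which writes a small configuration file to a temporary directory before each test and removes it afterwards. The INI-style format, the section and key names, and the load_settings helper are assumptions for illustration; the standard-library configparser stands in for whatever parser the pipeline actually uses.

```python
import configparser
import io
import shutil
import tempfile
import unittest
from pathlib import Path


def load_settings(path):
    """Hypothetical parser: read an INI-style file into nested dicts."""
    parser = configparser.ConfigParser()
    with open(path) as handle:
        parser.read_file(handle)
    return {section: dict(parser[section]) for section in parser.sections()}


class LoadSettingsTest(unittest.TestCase):
    def setUp(self):
        # Each test gets a fresh directory and config file.
        self.tmp_dir = Path(tempfile.mkdtemp())
        self.config_path = self.tmp_dir / "pipeline.cfg"
        self.config_path.write_text(
            "[general]\n"
            "name = toy_run\n"
            "\n"
            "[filter]\n"
            "min_umi_per_cell = 10\n"
        )

    def tearDown(self):
        shutil.rmtree(self.tmp_dir)

    def test_sections_are_parsed(self):
        settings = load_settings(self.config_path)
        self.assertEqual(set(settings), {"general", "filter"})

    def test_values_are_strings_until_converted(self):
        settings = load_settings(self.config_path)
        self.assertEqual(settings["filter"]["min_umi_per_cell"], "10")
        self.assertEqual(int(settings["filter"]["min_umi_per_cell"]), 10)

    def test_missing_file_raises(self):
        with self.assertRaises(FileNotFoundError):
            load_settings(self.tmp_dir / "absent.cfg")


suite = unittest.defaultTestLoader.loadTestsFromTestCase(LoadSettingsTest)
result = unittest.TextTestRunner(stream=io.StringIO()).run(suite)
```

Creating the file in setUp and deleting it in tearDown keeps the tests independent of each other and of the working directory.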

The tests cover various configuration scenarios, including:

  • Different sequencing chemistries (e.g., 10x Genomics v2/v3, Drop-seq, inDrops v3, Slide-seq v2)
  • Various alignment parameters and methods
  • Different filtering thresholds for UMIs, cell barcodes, and read counts
  • Error correction settings for cell barcodes, integration barcodes, and UMIs

By exercising these configurations, the test suite checks that the Cassiopeia preprocessing pipeline handles a range of input data types and user-defined settings.
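One common way to cover several configurations in a single test is unittest's subTest context manager, sketched below for two chemistries. The chemistry labels, the READ_LAYOUTS table, and split_barcode_and_umi are hypothetical and exist only for this example; the lengths shown are illustrative values, and the actual tests may organize their scenarios differently.

```python
import io
import unittest

# Illustrative read layouts: (cell barcode length, UMI length).
READ_LAYOUTS = {
    "10xv2": (16, 10),
    "10xv3": (16, 12),
}


def split_barcode_and_umi(read, chemistry):
    """Slice the leading barcode and UMI out of a read (toy helper)."""
    barcode_len, umi_len = READ_LAYOUTS[chemistry]
    if len(read) < barcode_len + umi_len:
        raise ValueError("read is too short for this chemistry")
    return read[:barcode_len], read[barcode_len:barcode_len + umi_len]


class SplitBarcodeAndUmiTest(unittest.TestCase):
    def test_each_chemistry(self):
        read = "ACGT" * 7  # 28 bases, long enough for both layouts
        for chemistry, (barcode_len, umi_len) in READ_LAYOUTS.items():
            # A failure is reported per chemistry instead of stopping the loop.
            with self.subTest(chemistry=chemistry):
                barcode, umi = split_barcode_and_umi(read, chemistry)
                self.assertEqual(len(barcode), barcode_len)
                self.assertEqual(len(umi), umi_len)
                self.assertEqual(barcode + umi, read[:barcode_len + umi_len])

    def test_short_read_is_rejected(self):
        for chemistry in READ_LAYOUTS:
            with self.subTest(chemistry=chemistry):
                with self.assertRaises(ValueError):
                    split_barcode_and_umi("ACGT", chemistry)


suite = unittest.defaultTestLoader.loadTestsFromTestCase(SplitBarcodeAndUmiTest)
result = unittest.TextTestRunner(stream=io.StringIO()).run(suite)
```

Adding another configuration then means adding one entry to the table, with no new test method.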