High-level description

This directory contains algorithms and tools for analyzing evolutionary fitness and phylogenetic relationships in biological sequences. The main components are a Forest class representing a collection of phylogenetic trees, a Tree class for individual phylogenetic trees, an SFS (Site Frequency Spectrum) class for analyzing genetic diversity, and a SizeMatchedModel class for statistical modeling that depends on data size. A script for generating and annotating reference phylogenetic trees is also included.

What does it do?

The code in this directory performs several key functions:

  1. Manages collections of phylogenetic trees:

    • Loads trees from, and saves them to, several formats (Newick, pickle, tar.gz).
    • Generates trees through simulation.
    • Concatenates forests of trees.
  2. Analyzes individual phylogenetic trees:

    • Annotates nodes with features like depth, number of descendants, imbalance, and Colless index.
    • Calculates tree statistics such as total branch length and site frequency spectrum.
    • Resolves polytomies and rescales trees.
  3. Calculates and analyzes Site Frequency Spectra (SFS):

    • Computes SFS from tree structures.
    • Calculates various genetic diversity statistics (Tajima’s D, Fay and Wu’s H, Zeng’s E, Ferretti’s L).
    • Bins SFS data for visualization and analysis.
  4. Provides statistical modeling based on data size:

    • Implements a model where parameters vary based on input data size.
    • Calculates p-values and model means for given sizes.
  5. Generates reference data:

    • Creates a collection of phylogenetic trees with specified parameters.
    • Annotates the trees with various features.
    • Saves the annotated forest to a compressed pickle file for later use.
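
As an illustration of the tree and SFS computations listed above, the following is a minimal, self-contained sketch rather than the code in tree.py or sfs.py. It assumes a tree encoded as nested Python tuples with string leaves, strictly binary splits, and unit branch lengths (one mutation per branch). The function names leaf_count, colless_index, sfs_from_tree, and tajimas_d are hypothetical. Tajima's D is computed with the standard formula from the unfolded SFS.

```python
from math import sqrt


def leaf_count(node):
    """Number of leaves below a node; leaves are strings, internal nodes are tuples."""
    if isinstance(node, str):
        return 1
    return sum(leaf_count(child) for child in node)


def colless_index(node):
    """Sum of |left - right| leaf counts over all internal nodes (binary tree assumed)."""
    if isinstance(node, str):
        return 0
    left, right = node
    imbalance = abs(leaf_count(left) - leaf_count(right))
    return imbalance + colless_index(left) + colless_index(right)


def sfs_from_tree(tree):
    """Unfolded SFS under the assumption of one mutation per branch.

    Entry i-1 counts the branches that subtend exactly i leaves, i = 1..n-1.
    The tree must be an internal node (at least two leaves).
    """
    n = leaf_count(tree)
    sfs = [0] * (n - 1)
    stack = list(tree)  # start at the root's children; the root has no branch
    while stack:
        node = stack.pop()
        sfs[leaf_count(node) - 1] += 1
        if not isinstance(node, str):
            stack.extend(node)
    return sfs


def tajimas_d(sfs):
    """Tajima's D from an unfolded SFS of length n-1 (requires n >= 4)."""
    n = len(sfs) + 1
    s = sum(sfs)  # number of segregating sites
    if n < 4 or s == 0:
        raise ValueError("Tajima's D needs n >= 4 and at least one segregating site")
    a1 = sum(1 / i for i in range(1, n))
    a2 = sum(1 / i**2 for i in range(1, n))
    # Mean pairwise diversity and Watterson's estimator
    pi = sum(i * (n - i) * x for i, x in enumerate(sfs, start=1)) / (n * (n - 1) / 2)
    theta_w = s / a1
    # Variance constants of the standard Tajima's D normalization
    b1 = (n + 1) / (3 * (n - 1))
    b2 = 2 * (n * n + n + 3) / (9 * n * (n - 1))
    c1 = b1 - 1 / a1
    c2 = b2 - (n + 2) / (a1 * n) + a2 / a1**2
    e1 = c1 / a1
    e2 = c2 / (a1**2 + a2)
    return (pi - theta_w) / sqrt(e1 * s + e2 * s * (s - 1))


# Hypothetical five-leaf example tree
example_tree = ((("a", "b"), "c"), ("d", "e"))
example_sfs = sfs_from_tree(example_tree)
```

For this five-leaf example tree the sketch gives a Colless index of 2 and an SFS of [5, 2, 1, 0]. The classes in this directory presumably work with real branch lengths, so their values will generally differ from this unit-length illustration.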

Entry points

The main entry points for using this codebase are:

  1. Forest class: For managing collections of phylogenetic trees.
  2. Tree class: For analyzing individual phylogenetic trees.
  3. SFS class: For calculating and analyzing Site Frequency Spectra.
  4. SizeMatchedModel class: For statistical modeling based on data size.
  5. generate_annotate_forest.py: Script for generating reference data.

Key Files

  1. forest.py: Defines the Forest class for managing collections of trees.
  2. tree.py: Defines the Tree class for individual tree analysis.
  3. sfs.py: Defines the SFS class for Site Frequency Spectrum analysis.
  4. size_matched_model.py: Defines the SizeMatchedModel class for size-based statistical modeling.
  5. generate_annotate_forest.py: Script for generating and annotating reference data.

Dependencies

The codebase relies on the following third-party libraries (items 1 to 6) and Python standard-library modules (items 7 to 11):

  1. ete3: For phylogenetic tree manipulation and visualization.
  2. Bio.Phylo: For interfacing with Biopython’s phylogenetic tree representation.
  3. numpy: For numerical computations and array manipulations.
  4. scipy: For various scientific computing tasks and statistical functions.
  5. pandas: For data manipulation and analysis.
  6. matplotlib: For visualization of results and trees.
  7. sys: For accessing command line arguments.
  8. time: For tracking script execution time.
  9. uuid: For generating unique identifiers.
  10. pickle: For serializing and saving objects.
  11. gzip: For compressing output files.

Additionally, the code uses custom modules for specific tasks like beta-tree simulation and fitness inference.

Configuration

The main classes (Forest, Tree, SFS, SizeMatchedModel) use constructor parameters and method arguments for configuration. Key parameters include:

  • For Forest and Tree:

    • Paths to input files (Newick, pickle, tar.gz).
    • Parameters for tree generation (e.g., number of leaves, alpha for beta-tree model).
    • Options for tree manipulation (e.g., resolving polytomies, rescaling).
  • For SFS:

    • SFS data as numpy array.
    • Binning parameters for SFS analysis.
  • For SizeMatchedModel:

    • Bins and parameters for size-based modeling.
    • Statistical distribution for calculations.
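
One way to picture a size-dependent model is a lookup from a size bin to the parameters of a distribution. The sketch below illustrates that idea under stated assumptions and is not the SizeMatchedModel implementation: the class name SizeBinnedNormalModel, its constructor arguments, the choice of a normal distribution, and the upper-tail p-value are all hypothetical.

```python
from bisect import bisect_right
from statistics import NormalDist


class SizeBinnedNormalModel:
    """Hypothetical stand-in: one normal distribution per size bin."""

    def __init__(self, bin_edges, params):
        # bin_edges: sorted boundaries between bins.
        # params: one (mean, standard deviation) pair per bin, so one more
        # entry than there are boundaries.
        if len(params) != len(bin_edges) + 1:
            raise ValueError("need exactly len(bin_edges) + 1 parameter pairs")
        self.bin_edges = list(bin_edges)
        self.dists = [NormalDist(mu, sigma) for mu, sigma in params]

    def _dist(self, size):
        # A size equal to a boundary falls into the bin above it.
        return self.dists[bisect_right(self.bin_edges, size)]

    def mean(self, size):
        """Model mean for data of the given size."""
        return self._dist(size).mean

    def pvalue(self, value, size):
        """Upper-tail p-value of an observed value at the given size."""
        return 1.0 - self._dist(size).cdf(value)


# Three bins: size < 10, 10 <= size < 100, size >= 100
model = SizeBinnedNormalModel([10, 100], [(0.0, 1.0), (1.0, 2.0), (5.0, 0.5)])
```

Here model.mean(50) looks up the middle bin, and model.pvalue(x, 50) evaluates x against that bin's distribution. A real model might interpolate parameters between bins or use a different distribution family.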

The generate_annotate_forest.py script takes command-line arguments for configuration:

  • n_leaves: Number of leaves (tips) in each generated tree.
  • n_trees: Number of trees to generate for the forest.
  • alpha: Shape parameter influencing the tree structure.
  • output_dir: Path to the directory where the output file will be saved.
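
The command-line flow described above can be sketched with the standard-library modules listed under Dependencies. This is a hedged outline, not the contents of generate_annotate_forest.py: the positional argument order, the helper names parse_args, save_compressed, and main, the output file naming, and the placeholder dictionary standing in for an annotated Forest are all assumptions.

```python
import gzip
import os
import pickle
import time
import uuid


def parse_args(argv):
    """Parse [script, n_leaves, n_trees, alpha, output_dir]; this order is assumed."""
    if len(argv) != 5:
        raise SystemExit(
            "usage: generate_annotate_forest.py n_leaves n_trees alpha output_dir"
        )
    return {
        "n_leaves": int(argv[1]),
        "n_trees": int(argv[2]),
        "alpha": float(argv[3]),
        "output_dir": argv[4],
    }


def save_compressed(obj, output_dir, prefix="forest"):
    """Pickle obj into a gzip-compressed file with a unique (hypothetical) name."""
    os.makedirs(output_dir, exist_ok=True)
    path = os.path.join(output_dir, f"{prefix}_{uuid.uuid4().hex}.pkl.gz")
    with gzip.open(path, "wb") as handle:
        pickle.dump(obj, handle, protocol=pickle.HIGHEST_PROTOCOL)
    return path


def main(argv):
    args = parse_args(argv)
    start = time.time()
    # Placeholder: the real script would generate and annotate a Forest here.
    forest = {"params": args, "trees": []}
    path = save_compressed(forest, args["output_dir"])
    print(f"saved {path} in {time.time() - start:.2f} s")
    return path
```

A script built this way would call main(sys.argv) from its entry point. The saved file can be read back with gzip.open(path, "rb") followed by pickle.load.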

Adjusting these parameters changes the size, number, and shape of the generated reference trees, so the reference data can be matched to the trees under study.