High-level description

This directory contains tools for inferring evolutionary fitness and analyzing phylogenetic relationships in biological sequences. It has two main components:

  1. A fitness inference module based on the shape of genealogical trees.
  2. A beta coalescent tree simulator and Site Frequency Spectrum (SFS) calculator.

These tools are designed to predict evolution, understand the shape of genealogical trees, and analyze genetic diversity in populations.

What does it do?

The code in this directory performs several key functions:

  1. Infers fitness from genealogical tree shapes:

    • Ranks sequences in multiple sequence alignments based on inferred fitness.
    • Calculates fitness distributions for nodes in phylogenetic trees.
    • Reconstructs ancestral sequences.
    • Analyzes and visualizes phylogenetic trees.
  2. Simulates and analyzes beta coalescent trees:

    • Generates genealogical trees representing ancestral relationships in a population sample.
    • Calculates the Site Frequency Spectrum (SFS) to summarize genetic variation.
  3. Provides tools for evolutionary analysis:

    • Implements the Local Branching Index (LBI) for sequence ranking.
    • Performs full fitness inference on sequences.
    • Simulates adapting populations (in a toy data subdirectory).
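
To illustrate the Local Branching Index mentioned above, here is a minimal sketch. It is not the code in prediction_src/. It assumes that the LBI of a node is the tree length surrounding that node, with every piece of branch discounted exponentially by its distance from the node on the time scale tau, and it computes this with two sweeps over the tree. The Node class and the compute_lbi function are hypothetical names introduced only for this example.

```python
import math


class Node:
    """Minimal tree node; branch_length is the length of the branch to the parent."""

    def __init__(self, name, branch_length=0.0, children=None):
        self.name = name
        self.branch_length = branch_length
        self.children = children or []
        self.up = 0.0    # discounted tree length in this subtree, seen from the parent
        self.down = 0.0  # discounted tree length outside this subtree, seen from this node
        self.lbi = 0.0


def postorder(node):
    for child in node.children:
        yield from postorder(child)
    yield node


def preorder(node):
    yield node
    for child in node.children:
        yield from preorder(child)


def compute_lbi(root, tau):
    """Return {node name: LBI} under the assumptions stated in the lead-in."""
    # Sweep 1, leaves to root: what each subtree contributes to its parent.
    for node in postorder(root):
        decay = math.exp(-node.branch_length / tau)
        below = sum(child.up for child in node.children)
        node.up = tau * (1.0 - decay) + decay * below

    # Sweep 2, root to leaves: what the rest of the tree contributes to each node.
    root.down = 0.0
    for node in preorder(root):
        total_up = sum(child.up for child in node.children)
        for child in node.children:
            decay = math.exp(-child.branch_length / tau)
            outside = node.down + total_up - child.up
            child.down = tau * (1.0 - decay) + decay * outside
        node.lbi = node.down + total_up

    return {node.name: node.lbi for node in preorder(root)}


# Toy tree: a root with two leaves, each on a branch of length 1.
toy_root = Node("root", 0.0, [Node("A", 1.0), Node("B", 1.0)])
toy_lbi = compute_lbi(toy_root, tau=1.0)
```

Nodes can then be ranked by sorting on the returned values. A node in a densely branching part of the tree accumulates more nearby tree length and therefore a higher index.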

Entry points

The main entry points for using this codebase are:

  1. rank_sequences.py: Ranks sequences in a multiple sequence alignment using the Local Branching Index (LBI).
  2. infer_fitness.py: Performs full fitness inference on sequences in a multiple sequence alignment.

Both scripts take a multiple sequence alignment and the name of an outgroup sequence as input. They write several output files, including the ranked sequences, the reconstructed trees, and the inferred ancestral sequences.

Key files

  1. FitnessInference directory:

    • prediction_src/: Core implementation of fitness inference and sequence ranking algorithms.
    • rank_sequences.py: Script for ranking sequences using the Local Branching Index.
    • infer_fitness.py: Script for performing full fitness inference on sequences.
  2. betatree directory:

    • betatree.py: Implements the betatree class for simulating beta coalescent trees.
    • sfs.py and sfs_py3.py: Define the SFS class for calculating the Site Frequency Spectrum.

Dependencies

The codebase relies on several external libraries:

  1. Biopython: For handling biological sequences, alignments, and phylogenetic trees.
  2. NumPy: For numerical computations and array manipulations.
  3. SciPy: For special mathematical functions and other scientific computing routines.
  4. Matplotlib: For visualization of phylogenetic trees, results, and SFS plots.

The code may also call external tools such as fasttree to construct phylogenetic trees.

Configuration

The main scripts use command-line arguments for configuration. Key parameters include:

  • --aln: Path to the input alignment file.
  • --outgroup: Name of the outgroup sequence.
  • --eps_branch: Minimal branch length for inference.
  • --tau: Time scale for local tree length estimation (for LBI).
  • --diffusion: Fitness diffusion coefficient (for full inference).
  • --gamma: Scale factor applied to the time scale.
  • --omega: Approximate sampling fraction divided by the standard deviation of fitness.
  • --collapse: Option to collapse internal branches with identical sequences.
  • --plot: Option to plot trees.
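
The options above could be declared with argparse roughly as follows. This is a hypothetical sketch, not the parser used by rank_sequences.py or infer_fitness.py: the flag names come from the list above, but the types, the choice of required options, the absence of defaults, and the example values are all assumptions.

```python
import argparse


def build_parser():
    """Hypothetical parser mirroring the documented flags; types are assumed."""
    parser = argparse.ArgumentParser(
        description="Sketch of the command-line options listed above")
    parser.add_argument("--aln", required=True,
                        help="path to the input alignment file")
    parser.add_argument("--outgroup", required=True,
                        help="name of the outgroup sequence")
    parser.add_argument("--eps_branch", type=float,
                        help="minimal branch length for inference")
    parser.add_argument("--tau", type=float,
                        help="time scale for local tree length estimation (LBI)")
    parser.add_argument("--diffusion", type=float,
                        help="fitness diffusion coefficient (full inference)")
    parser.add_argument("--gamma", type=float,
                        help="scale factor applied to the time scale")
    parser.add_argument("--omega", type=float,
                        help="sampling fraction divided by fitness standard deviation")
    # Assumed to be simple on/off switches; the real scripts may expect a value.
    parser.add_argument("--collapse", action="store_true",
                        help="collapse internal branches with identical sequences")
    parser.add_argument("--plot", action="store_true",
                        help="plot trees")
    return parser


# Example invocation with made-up file and sequence names and an arbitrary tau.
args = build_parser().parse_args(
    ["--aln", "alignment.fasta", "--outgroup", "outgroup_name", "--tau", "0.0625"])
```

Options that are not given stay at None (or False for the switches), so a script built this way can tell an unset parameter from an explicit value.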

For the beta coalescent simulator and SFS calculator, key parameters are passed as arguments to class constructors:

  • sample_size: The number of individuals in the sample.
  • alpha: The alpha parameter of the beta coalescent model.
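
To make the role of alpha concrete, the sketch below computes merger rates for a beta coalescent. It assumes the common Beta(2 - alpha, alpha) parameterization with 0 < alpha < 2, in which a given set of k out of b lineages merges at rate B(k - alpha, b - k + alpha) / B(2 - alpha, alpha). Whether betatree.py uses exactly this convention is an assumption, and the function names are hypothetical.

```python
import math


def beta_function(a, b):
    """Euler beta function B(a, b), computed through log-gamma for stability."""
    return math.exp(math.lgamma(a) + math.lgamma(b) - math.lgamma(a + b))


def merger_rate(b, k, alpha):
    """Rate at which one specific set of k out of b lineages merges."""
    if not 2 <= k <= b:
        raise ValueError("need 2 <= k <= b")
    if not 0.0 < alpha < 2.0:
        raise ValueError("need 0 < alpha < 2")
    return beta_function(k - alpha, b - k + alpha) / beta_function(2.0 - alpha, alpha)


def total_merger_rates(b, alpha):
    """Rate of any k-merger among b lineages, for each k from 2 to b."""
    return {k: math.comb(b, k) * merger_rate(b, k, alpha) for k in range(2, b + 1)}


# With alpha = 1 and three lineages, pair and triple mergers of a given set
# of lineages are equally likely.
rates_alpha_1 = {k: merger_rate(3, k, 1.0) for k in (2, 3)}
```

Under this parameterization, alpha close to 2 suppresses multiple mergers and approaches the Kingman coalescent, while smaller alpha makes large simultaneous mergers more frequent.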

The SFS class allows for configuration of the SFS calculation and binning process through method parameters such as ntrees, mode, and bins.
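
As a rough illustration of what an SFS calculation over several trees involves, the sketch below derives a branch-length spectrum from toy trees and averages it. It does not use the SFS class. The nested (branch_length, children) tree encoding and the function names are assumptions made for this example, and binning is left out.

```python
def _accumulate(node, sfs):
    """Add each branch length to the entry for the number of leaves below it."""
    branch_length, children = node
    n_leaves = 1 if not children else sum(_accumulate(c, sfs) for c in children)
    if n_leaves < len(sfs):  # skip the root, whose branch is above the whole sample
        sfs[n_leaves] += branch_length
    return n_leaves


def tree_sfs(root, sample_size):
    """Branch-length SFS of one tree; entry k holds the total length of branches
    with exactly k descendant leaves (entry 0 is unused)."""
    sfs = [0.0] * sample_size
    _accumulate(root, sfs)
    return sfs


def average_sfs(trees, sample_size):
    """Average the per-tree spectra, as one would over many simulated trees."""
    total = [0.0] * sample_size
    for tree in trees:
        for k, length in enumerate(tree_sfs(tree, sample_size)):
            total[k] += length
    return [value / len(trees) for value in total]


# Toy tree ((A:1, B:1):1, C:2) with three leaves.
toy_tree = (0.0, [(1.0, [(1.0, []), (1.0, [])]), (2.0, [])])
toy_spectrum = tree_sfs(toy_tree, sample_size=3)
```

If mutations fall uniformly along branches, the expected number of mutations carried by k sampled individuals is proportional to entry k, which is why averaging over more trees gives a smoother spectrum.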

Adjust these parameters to tailor the analysis to a particular data set or population model.