High-level description

This directory contains code for inferring fitness from the shape of genealogical trees, as described in the manuscript “Predicting evolution from the shape of genealogical trees” by Neher, Russell, and Shraiman. The code implements algorithms for fitness inference, sequence ranking, and phylogenetic tree analysis.

What does it do?

The code in this directory performs several key functions:

  1. Ranks sequences in a multiple sequence alignment based on their inferred fitness.
  2. Infers fitness distributions for nodes in phylogenetic trees.
  3. Reconstructs ancestral sequences.
  4. Analyzes and visualizes phylogenetic trees.
  5. Simulates adapting populations (in the toy_data subdirectory).

The main workflows include:

  1. Using the Local Branching Index (LBI) to rank sequences by a tree-based proxy for their fitness.
  2. Performing full fitness inference to calculate the posterior mean and variance of fitness for sequences.
  3. Reconstructing phylogenetic trees from sequence alignments.
  4. Visualizing trees with fitness information.
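
The Local Branching Index from workflow 1 can be illustrated with a short sketch. It assumes the usual description of the LBI as the tree length surrounding a node, exponentially discounted with distance from that node on a scale tau, computed with one upward and one downward pass over the tree. The Node class and compute_lbi function are hypothetical names introduced for illustration; this is not the code in prediction_src/.

```python
import math


class Node:
    """Minimal tree node (hypothetical helper, not the repository's class).

    branch_length is the length of the branch leading to the parent.
    """

    def __init__(self, name, branch_length=0.0, children=None):
        self.name = name
        self.branch_length = branch_length
        self.children = children or []
        self.up = 0.0    # discounted tree length below the top of this node's branch
        self.down = 0.0  # discounted tree length from the rest of the tree


def postorder(node):
    """Yield children before their parent (recursive; fine for small trees)."""
    for child in node.children:
        yield from postorder(child)
    yield node


def preorder(node):
    """Yield each parent before its children."""
    yield node
    for child in node.children:
        yield from preorder(child)


def compute_lbi(root, tau):
    """Return {node name: LBI} for every node in the tree rooted at `root`."""
    # Upward pass: each node sends a message to its parent.
    for node in postorder(root):
        decay = math.exp(-node.branch_length / tau)
        node.up = tau * (1.0 - decay) + decay * sum(c.up for c in node.children)

    # Downward pass: each child receives the parent's message plus its siblings'.
    root.down = 0.0
    for node in preorder(root):
        total_up = sum(c.up for c in node.children)
        for child in node.children:
            decay = math.exp(-child.branch_length / tau)
            from_rest = node.down + total_up - child.up
            child.down = tau * (1.0 - decay) + decay * from_rest

    # The LBI of a node combines the messages arriving from all directions.
    return {
        node.name: node.down + sum(c.up for c in node.children)
        for node in preorder(root)
    }


# Two leaves attached to the root by branches of length 1.
tree = Node("root", children=[Node("A", 1.0), Node("B", 1.0)])
lbi = compute_lbi(tree, tau=1.0)
```

For this two-leaf tree with tau = 1, the LBI of each leaf equals 1 - exp(-2), the discounted length of the path from that leaf to the other. Sequences would then be ranked by sorting on these values, highest first.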

Entry points

The main entry points for using this codebase are:

  1. rank_sequences.py: A script that ranks sequences in a multiple sequence alignment using the Local Branching Index (LBI).

  2. infer_fitness.py: A script that performs full fitness inference on sequences in a multiple sequence alignment.

Both scripts take a multiple sequence alignment and an outgroup sequence as input, and produce various output files including ranked sequences, reconstructed trees, and inferred ancestral sequences.

Key Files

  1. prediction_src/: A directory containing the core implementation of fitness inference and sequence ranking algorithms. Key modules include:

    • sequence_ranking.py: Combines sequence alignment processing with node ranking and fitness inference.
    • fitness_inference.py: Implements the core algorithm for inferring fitness distributions on phylogenetic trees.
    • node_ranking.py: Extends fitness inference functionality to provide methods for ranking and coloring tree nodes.
    • ancestral.py: Implements maximum likelihood estimation for ancestral sequence reconstruction.
    • tree_utils.py: Provides utility functions for manipulating and visualizing phylogenetic trees.

  2. rank_sequences.py: Script for ranking sequences using the Local Branching Index.

  3. infer_fitness.py: Script for performing full fitness inference on sequences.
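
The full inference is described above as producing a fitness distribution per node, summarized by a mean and a variance. As a reading aid, the sketch below shows how such a summary follows from a distribution tabulated on a discrete fitness grid. The grid representation and the function name summarize_distribution are assumptions made for illustration; how fitness_inference.py stores its distributions is not specified here.

```python
def summarize_distribution(fitness_grid, weights):
    """Return (mean, variance) of a distribution given on a discrete grid.

    fitness_grid: fitness values at which the distribution is tabulated.
    weights: non-negative, possibly unnormalized, probabilities at those values.
    """
    total = sum(weights)
    if total <= 0:
        raise ValueError("weights must have a positive sum")

    # Normalize so that the weights sum to one.
    probs = [w / total for w in weights]

    mean = sum(p * x for p, x in zip(probs, fitness_grid))
    variance = sum(p * (x - mean) ** 2 for p, x in zip(probs, fitness_grid))
    return mean, variance


# Symmetric toy distribution centered at zero.
mean, variance = summarize_distribution([-1.0, 0.0, 1.0], [1.0, 2.0, 1.0])
```

In this toy case the mean is 0 and the variance is 0.5.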

Dependencies

The codebase relies on several external libraries:

  1. Biopython: For handling biological sequences, alignments, and phylogenetic trees.
  2. NumPy: For numerical computations and array manipulations.
  3. SciPy: For various scientific computing tasks.
  4. Matplotlib: For visualization of phylogenetic trees and results.

Additionally, the code may use external tools like fasttree for phylogenetic tree construction.
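
Before running the scripts, it can help to confirm that these dependencies are available. The sketch below checks whether each Python package can be imported and whether a tree-building executable is on the PATH. The executable names fasttree and FastTree are assumptions; the name expected by the scripts may differ on a given system.

```python
import importlib.util
import shutil

# Display name -> importable module name.
PYTHON_MODULES = {
    "Biopython": "Bio",
    "NumPy": "numpy",
    "SciPy": "scipy",
    "Matplotlib": "matplotlib",
}

# Assumed executable names for the external tree builder.
TREE_BUILDER_NAMES = ("fasttree", "FastTree")


def check_dependencies():
    """Return {dependency name: True if it appears to be available}."""
    report = {
        label: importlib.util.find_spec(module) is not None
        for label, module in PYTHON_MODULES.items()
    }
    report["fasttree"] = any(
        shutil.which(name) is not None for name in TREE_BUILDER_NAMES
    )
    return report


report = check_dependencies()
for name, available in report.items():
    print(f"{name}: {'found' if available else 'missing'}")
```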

Configuration

The main scripts (rank_sequences.py and infer_fitness.py) use command-line arguments for configuration. Key parameters include:

  • --aln: Path to the input alignment file.
  • --outgroup: Name of the outgroup sequence.
  • --eps_branch: Minimal branch length for inference.
  • --tau: Time scale for local tree length estimation (for LBI).
  • --diffusion: Fitness diffusion coefficient (for full inference).
  • --gamma: Scale factor for time scale.
  • --omega: Approximate sampling fraction divided by fitness standard deviation.
  • --collapse: Option to collapse internal branches with identical sequences.
  • --plot: Option to plot trees.

Users can adjust these parameters to customize the analysis for their specific needs.
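
As an illustration of how these parameters fit together, the sketch below assembles a command line for one of the scripts from Python. The helper build_command is hypothetical, the file and outgroup names are placeholders, and the numeric value is arbitrary rather than a recommended setting. Only options that take a value are handled; whether --collapse and --plot expect a value is not specified above, so they are left out.

```python
import sys


def build_command(script, aln, outgroup, **options):
    """Build an argument list for rank_sequences.py or infer_fitness.py.

    Keyword options map directly onto flags, e.g. tau=0.5 becomes
    "--tau 0.5". Options set to None are skipped so that the script's
    own default applies.
    """
    cmd = [sys.executable, script, "--aln", aln, "--outgroup", outgroup]
    for name, value in options.items():
        if value is None:
            continue
        cmd += [f"--{name}", str(value)]
    return cmd


# Placeholder inputs; replace with a real alignment and outgroup name.
cmd = build_command(
    "rank_sequences.py",
    aln="aln.fasta",
    outgroup="outgroup_name",
    tau=0.5,
    eps_branch=None,
)
print(" ".join(cmd))
```

The resulting list can be passed to subprocess.run(cmd, check=True) from the directory that contains the script.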

In summary, this codebase provides tools for inferring fitness and ranking sequences from the shape of genealogical trees, with applications in predicting evolution.