High-level description
This directory contains implementations of algorithms and tools for analyzing evolutionary fitness and phylogenetic relationships in biological sequences. The main components are:
- A fitness inference module based on the shape of genealogical trees.
- A beta coalescent tree simulator and Site Frequency Spectrum (SFS) calculator.
What does it do?
The code in this directory performs several key functions:
- Infers fitness from genealogical tree shapes:
  - Ranks sequences in multiple sequence alignments based on inferred fitness.
  - Calculates fitness distributions for nodes in phylogenetic trees.
  - Reconstructs ancestral sequences.
  - Analyzes and visualizes phylogenetic trees.
- Simulates and analyzes beta coalescent trees:
  - Generates genealogical trees representing ancestral relationships in a population sample.
  - Calculates the Site Frequency Spectrum (SFS) to summarize genetic variation.
- Provides tools for evolutionary analysis:
  - Implements the Local Branching Index (LBI) for sequence ranking.
  - Performs full fitness inference on sequences.
  - Simulates adapting populations (in a toy data subdirectory).
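To make the Local Branching Index concrete, here is a minimal sketch of the usual two-pass computation: the LBI of a node is the tree length in its neighborhood, discounted by exp(-distance/tau). The Node class and the function names are hypothetical and written only for this illustration; the code in prediction_src/ works on real phylogenetic trees and may differ in its details.

```python
import math

class Node:
    """Hypothetical minimal tree node: branch length to the parent plus children."""
    def __init__(self, name, branch_length=0.0, children=None):
        self.name = name
        self.branch_length = branch_length
        self.children = children or []

def postorder(node):
    for child in node.children:
        yield from postorder(child)
    yield node

def preorder(node):
    yield node
    for child in node.children:
        yield from preorder(child)

def local_branching_index(root, tau):
    """Return {node name: LBI} for every node of the tree rooted at `root`."""
    up, down = {}, {}
    # Pass 1 (children first): discounted tree length below each node,
    # including the node's own branch, as seen from its parent.
    for node in postorder(root):
        decay = math.exp(-node.branch_length / tau)
        below = sum(up[id(c)] for c in node.children)
        up[id(node)] = below * decay + tau * (1.0 - decay)
    # Pass 2 (parents first): discounted length of the rest of the tree,
    # including the node's own branch, as seen from the node.
    down[id(root)] = 0.0
    for node in preorder(root):
        for child in node.children:
            siblings = sum(up[id(c)] for c in node.children if c is not child)
            decay = math.exp(-child.branch_length / tau)
            down[id(child)] = (down[id(node)] + siblings) * decay + tau * (1.0 - decay)
    # LBI = contribution from above plus contributions from all subtrees below.
    return {n.name: down[id(n)] + sum(up[id(c)] for c in n.children)
            for n in preorder(root)}

# A long isolated branch (A) next to a rapidly branching clade (B, C, D).
tree = Node("root", 0.0, [
    Node("A", 1.0),
    Node("X", 0.1, [Node("B", 0.1), Node("C", 0.1), Node("D", 0.1)]),
])
lbi = local_branching_index(tree, tau=0.5)
leaves = ["A", "B", "C", "D"]
ranking = sorted(leaves, key=lbi.get, reverse=True)
```

In this toy tree the leaves in the densely branching clade score higher than the isolated leaf A, which is the signal used to rank sequences.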
Entry points
The main entry points for using this codebase are:
- rank_sequences.py: Ranks sequences in a multiple sequence alignment using the Local Branching Index (LBI).
- infer_fitness.py: Performs full fitness inference on sequences in a multiple sequence alignment.
Key Files
- FitnessInference directory:
  - prediction_src/: Core implementation of fitness inference and sequence ranking algorithms.
  - rank_sequences.py: Script for ranking sequences using the Local Branching Index.
  - infer_fitness.py: Script for performing full fitness inference on sequences.
- betatree directory:
  - betatree.py: Implements the betatree class for simulating beta coalescent trees.
  - sfs.py and sfs_py3.py: Define the SFS class for calculating the Site Frequency Spectrum.
Dependencies
The codebase relies on several external libraries:
- Biopython: For handling biological sequences, alignments, and phylogenetic trees.
- NumPy: For numerical computations and array manipulations.
- SciPy: For various scientific computing tasks and special mathematical functions.
- Matplotlib: For visualization of phylogenetic trees, results, and SFS plots.
- fasttree: External program used for phylogenetic tree construction.
Configuration
The main scripts use command-line arguments for configuration. Key parameters include:
- --aln: Path to the input alignment file.
- --outgroup: Name of the outgroup sequence.
- --eps_branch: Minimal branch length for inference.
- --tau: Time scale for local tree length estimation (for LBI).
- --diffusion: Fitness diffusion coefficient (for full inference).
- --gamma: Scale factor for time scale.
- --omega: Approximate sampling fraction divided by fitness standard deviation.
- --collapse: Option to collapse internal branches with identical sequences.
- --plot: Option to plot trees.
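As a rough illustration of how these options fit together, the sketch below declares them with argparse. Only the flag names come from the list above. The default values, the choice of which flags are required, the treatment of --collapse and --plot as boolean switches, and the merging of both scripts' options into one parser are assumptions made for this example. The file and outgroup names are placeholders.

```python
import argparse

def build_parser():
    """Hypothetical parser mirroring the documented flags; defaults are placeholders."""
    parser = argparse.ArgumentParser(description="Sketch of the documented options")
    parser.add_argument("--aln", required=True, help="path to the input alignment file")
    parser.add_argument("--outgroup", required=True, help="name of the outgroup sequence")
    parser.add_argument("--eps_branch", type=float, default=1e-5,
                        help="minimal branch length for inference (assumed default)")
    parser.add_argument("--tau", type=float, default=0.0625,
                        help="time scale for local tree length estimation, LBI (assumed default)")
    parser.add_argument("--diffusion", type=float, default=0.5,
                        help="fitness diffusion coefficient, full inference (assumed default)")
    parser.add_argument("--gamma", type=float, default=1.0,
                        help="scale factor for time scale (assumed default)")
    parser.add_argument("--omega", type=float, default=0.001,
                        help="sampling fraction / fitness standard deviation (assumed default)")
    parser.add_argument("--collapse", action="store_true",
                        help="collapse internal branches with identical sequences")
    parser.add_argument("--plot", action="store_true", help="plot trees")
    return parser

# Parse an explicit argument list so the example does not depend on sys.argv.
args = build_parser().parse_args(
    ["--aln", "alignment.fasta", "--outgroup", "outgroup_name", "--tau", "0.1", "--plot"]
)
```

Check each script's --help output for the options and defaults it actually accepts.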
The beta coalescent simulation is controlled by:
- sample_size: The number of individuals in the sample.
- alpha: The alpha parameter of the beta coalescent model.
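The role of sample_size and alpha can be sketched with a small standalone simulation of a Beta(2 - alpha, alpha) coalescent. It uses the standard merger rate, in which any given set of k out of b lineages merges at rate B(k - alpha, b - k + alpha) / B(2 - alpha, alpha). The function names and the (clade size, branch length) output format are choices made for this example and are not the interface of the betatree class. The sketch assumes 0 < alpha < 2 and does not handle the Kingman limit alpha = 2.

```python
import math
import random

def log_beta(a, b):
    """Logarithm of the Beta function B(a, b) for positive a and b."""
    return math.lgamma(a) + math.lgamma(b) - math.lgamma(a + b)

def merger_rates(b, alpha):
    """Total rate of mergers involving exactly k of b lineages, for k = 2..b."""
    log_norm = log_beta(2.0 - alpha, alpha)
    rates = {}
    for k in range(2, b + 1):
        log_choose = math.lgamma(b + 1) - math.lgamma(k + 1) - math.lgamma(b - k + 1)
        log_rate = log_beta(k - alpha, b - k + alpha) - log_norm
        rates[k] = math.exp(log_choose + log_rate)
    return rates

def simulate_beta_coalescent(sample_size, alpha, rng=None):
    """Simulate one genealogy; return a list of (clade_size, branch_length) pairs."""
    if not 0.0 < alpha < 2.0:
        raise ValueError("this sketch assumes 0 < alpha < 2")
    rng = rng or random.Random()
    # Each active lineage is (number of sampled leaves below it, time it appeared).
    lineages = [(1, 0.0)] * sample_size
    branches = []
    t = 0.0
    while len(lineages) > 1:
        rates = merger_rates(len(lineages), alpha)
        t += rng.expovariate(sum(rates.values()))       # waiting time to next merger
        k = rng.choices(list(rates), weights=list(rates.values()))[0]
        rng.shuffle(lineages)                            # pick k lineages uniformly
        merged, rest = lineages[:k], lineages[k:]
        for size, born in merged:
            branches.append((size, t - born))            # close the merged branches
        lineages = rest + [(sum(size for size, _ in merged), t)]
    return branches

branches = simulate_beta_coalescent(sample_size=20, alpha=1.5, rng=random.Random(1))
```

Smaller values of alpha put more weight on mergers of many lineages at once, so the trees have fewer and larger multifurcations than Kingman-like binary trees.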
The SFS class allows configuration of the SFS calculation and binning process through method parameters such as ntrees, mode, and bins.
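One way to picture the calculation is a branch-length SFS averaged over several trees and then summed into bins. In the sketch below, the number of trees passed in plays the role that the name ntrees suggests, and the bin edges play the role of bins. How the real SFS class interprets mode and bins is not stated in this document, so that behavior is not modeled. The function names and the (clade size, branch length) tree format are assumptions made for this example.

```python
def branch_length_sfs(trees, sample_size):
    """Average over trees of the total branch length above i = 1..sample_size-1 leaves.

    Each tree is a list of (clade_size, branch_length) pairs. With mutations
    falling uniformly along branches, entry i-1 is proportional to the expected
    number of variants carried by exactly i sampled individuals.
    """
    sfs = [0.0] * (sample_size - 1)
    for tree_branches in trees:
        for clade_size, length in tree_branches:
            sfs[clade_size - 1] += length
    return [value / len(trees) for value in sfs]

def bin_sfs(sfs, edges):
    """Sum SFS entries into bins; bin j covers counts i with edges[j] <= i < edges[j+1]."""
    binned = [0.0] * (len(edges) - 1)
    for i, value in enumerate(sfs, start=1):
        for j in range(len(edges) - 1):
            if edges[j] <= i < edges[j + 1]:
                binned[j] += value
                break
    return binned

# Two hand-built genealogies for a sample of four individuals.
caterpillar = [(1, 1.0), (1, 1.0), (2, 1.0), (1, 2.0), (3, 1.0), (1, 3.0)]
star = [(1, 2.0), (1, 2.0), (1, 2.0), (1, 2.0)]

sfs = branch_length_sfs([caterpillar, star], sample_size=4)
binned = bin_sfs(sfs, edges=[1, 2, 4])   # singletons, then counts 2 and 3 together
```

Averaging over more trees reduces the noise in each entry, and coarser bins trade resolution in allele count for smoother estimates.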
Users can adjust these parameters to customize the analysis for their specific needs in evolutionary studies and population genetics research.