High-level description
This directory contains implementations of various algorithms and tools for analyzing evolutionary fitness and phylogenetic relationships in biological sequences. The main components are:
- A Forest class representing a collection of phylogenetic trees.
- A Tree class representing individual phylogenetic trees.
- A SFS (Site Frequency Spectrum) class for analyzing genetic diversity.
- A SizeMatchedModel class for statistical modeling based on data size.
What does it do?
The code in this directory performs several key functions:
- Manages collections of phylogenetic trees:
  - Loads and saves trees from/to various formats (Newick, pickle, tar.gz).
  - Generates trees through simulation.
  - Concatenates forests of trees.
- Analyzes individual phylogenetic trees:
  - Annotates nodes with features like depth, number of descendants, imbalance, and Colless index.
  - Calculates tree statistics such as total branch length and site frequency spectrum.
  - Resolves polytomies and rescales trees.
- Calculates and analyzes Site Frequency Spectra (SFS):
  - Computes SFS from tree structures.
  - Calculates various genetic diversity statistics (Tajima’s D, Fay and Wu’s H, Zeng’s E, Ferretti’s L).
  - Bins SFS data for visualization and analysis.
- Provides statistical modeling based on data size:
  - Implements a model where parameters vary based on input data size.
  - Calculates p-values and model means for given sizes.
  - Visualizes phylogenetic trees and results.
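To make the step from a tree to an SFS and on to a diversity statistic concrete, here is a minimal, self-contained sketch. It does not use the repository's Tree or SFS classes, whose method names are not documented here. The nested `(mutations, children)` tuple layout and the function names `tree_sfs` and `tajimas_d` are assumptions made for illustration; the Tajima's D constants follow the standard textbook formula.

```python
from math import sqrt


def tree_sfs(root):
    """Unfolded SFS from a tree of (mutations, children) tuples.

    `mutations` is the number of mutations on the branch above a node and
    `children` is a list of child nodes (empty for a leaf). Entry i - 1 of
    the result counts mutations carried by exactly i of the n leaves.
    Assumes the root has at least two children.
    """
    def n_leaves(node):
        _, children = node
        return 1 if not children else sum(n_leaves(c) for c in children)

    n = n_leaves(root)
    sfs = [0] * (n - 1)

    def visit(node, is_root=False):
        mutations, children = node
        k = 1 if not children else sum(visit(c) for c in children)
        if not is_root:
            sfs[k - 1] += mutations  # a mutation here is shared by k leaves
        return k

    visit(root, is_root=True)
    return sfs


def tajimas_d(sfs):
    """Tajima's D from an unfolded SFS with entries for i = 1 .. n - 1."""
    n = len(sfs) + 1
    s = sum(sfs)  # number of segregating sites
    if n < 4 or s == 0:
        raise ValueError("need at least 4 samples and one segregating site")
    a1 = sum(1 / i for i in range(1, n))
    a2 = sum(1 / i**2 for i in range(1, n))
    b1 = (n + 1) / (3 * (n - 1))
    b2 = 2 * (n * n + n + 3) / (9 * n * (n - 1))
    c1 = b1 - 1 / a1
    c2 = b2 - (n + 2) / (a1 * n) + a2 / a1**2
    e1 = c1 / a1
    e2 = c2 / (a1**2 + a2)
    # Mean pairwise differences: a variant at frequency i/n separates i * (n - i) pairs.
    pi = sum(i * (n - i) * x for i, x in enumerate(sfs, start=1)) / (n * (n - 1) / 2)
    theta_w = s / a1  # Watterson's estimator
    return (pi - theta_w) / sqrt(e1 * s + e2 * s * (s - 1))


# Hypothetical four-leaf tree ((A, B), (C, D)) with made-up mutation counts.
example = (0, [(1, [(2, []), (1, [])]), (3, [(1, []), (0, [])])])
spectrum = tree_sfs(example)   # [4, 4, 0]
d_value = tajimas_d(spectrum)  # positive: excess of intermediate-frequency variants
```

A spectrum proportional to 1/i, such as `[6, 3, 2]` for four samples, gives a D of zero, while a spectrum made only of singletons gives a negative value.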
Entry points
The main entry points for using this codebase are:
- Forest class: For managing collections of phylogenetic trees.
- Tree class: For analyzing individual phylogenetic trees.
- SFS class: For calculating and analyzing Site Frequency Spectra.
- SizeMatchedModel class: For statistical modeling based on data size.
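As an illustration of the kind of per-tree analysis the Tree entry point is described as offering, the sketch below computes the Colless imbalance index for a binary tree. It is not the repository's implementation: the nested 2-tuple tree layout and the function name `colless` are assumptions chosen to keep the example free of external dependencies.

```python
def colless(tree):
    """Return (leaf_count, colless_index) for a binary tree.

    A leaf is any non-tuple value; an internal node is a 2-tuple
    (left, right). The Colless index sums |leaves(left) - leaves(right)|
    over all internal nodes, so 0 means perfectly balanced.
    """
    if not isinstance(tree, tuple):
        return 1, 0
    left, right = tree
    n_left, c_left = colless(left)
    n_right, c_right = colless(right)
    return n_left + n_right, c_left + c_right + abs(n_left - n_right)


balanced = (("A", "B"), ("C", "D"))
caterpillar = ((("A", "B"), "C"), "D")

print(colless(balanced))     # (4, 0)
print(colless(caterpillar))  # (4, 3)
```

The caterpillar tree reaches the maximum for four leaves, (n - 1)(n - 2) / 2 = 3, which is why the index is often normalized by that value when comparing trees of different sizes.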
Key Files
- forest.py: Defines the Forest class for managing collections of trees.
- tree.py: Defines the Tree class for individual tree analysis.
- sfs.py: Defines the SFS class for Site Frequency Spectrum analysis.
- size_matched_model.py: Defines the SizeMatchedModel class for size-based statistical modeling.
Dependencies
The codebase relies on several external libraries:
- ete3: For phylogenetic tree manipulation and visualization.
- Bio.Phylo: For interfacing with Biopython’s phylogenetic tree representation.
- numpy: For numerical computations and array manipulations.
- scipy: For various scientific computing tasks and statistical functions.
- pandas: For data manipulation and analysis.
- matplotlib: For visualization of results and trees.
Configuration
The main classes (Forest, Tree, SFS, SizeMatchedModel) use constructor parameters and method arguments for configuration. Key parameters include:
- For Forest and Tree:
  - Paths to input files (Newick, pickle, tar.gz).
  - Parameters for tree generation (e.g., number of leaves, alpha for beta-tree model).
  - Options for tree manipulation (e.g., resolving polytomies, rescaling).
- For SFS:
  - SFS data as numpy array.
  - Binning parameters for SFS analysis.
- For SizeMatchedModel:
  - Bins and parameters for size-based modeling.
  - Statistical distribution for calculations.
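The idea of "bins and parameters" plus a "statistical distribution" can be sketched as follows. This is a hypothetical stand-in, not the repository's SizeMatchedModel: the class name `SizeBinnedNormal`, its constructor arguments, the choice of a normal distribution, and the upper-tail p-value convention are all assumptions made for illustration.

```python
from bisect import bisect_right
from statistics import NormalDist


class SizeBinnedNormal:
    """Hypothetical size-dependent model: one normal distribution per size bin."""

    def __init__(self, bin_edges, means, stds):
        # bin_edges must be increasing and have len(means) + 1 entries;
        # bin i covers bin_edges[i] <= size < bin_edges[i + 1].
        if len(bin_edges) != len(means) + 1 or len(means) != len(stds):
            raise ValueError("need one (mean, std) pair per bin")
        self.bin_edges = list(bin_edges)
        self.dists = [NormalDist(m, s) for m, s in zip(means, stds)]

    def _bin(self, size):
        i = bisect_right(self.bin_edges, size) - 1
        if not 0 <= i < len(self.dists):
            raise ValueError(f"size {size} is outside the modeled range")
        return i

    def mean(self, size):
        """Model mean for data of the given size."""
        return self.dists[self._bin(size)].mean

    def pvalue(self, value, size):
        """Upper-tail p-value of `value` under the bin matching `size`."""
        return 1.0 - self.dists[self._bin(size)].cdf(value)


# Two made-up bins: sizes in [10, 100) and [100, 1000).
model = SizeBinnedNormal([10, 100, 1000], means=[0.0, 1.0], stds=[1.0, 2.0])
small_mean = model.mean(50)      # 0.0, from the first bin
large_p = model.pvalue(1.0, 500)  # 0.5, the value equals the second bin's mean
```

Looking up parameters by bin rather than fitting a single global distribution lets the null expectation shift with input size, which is the behavior the configuration options above describe.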