High-level description
This directory contains implementations of algorithms and tools for analyzing evolutionary fitness and phylogenetic relationships in biological sequences. The main components are:
- A Forest class representing a collection of phylogenetic trees.
- A Tree class representing individual phylogenetic trees.
- An SFS (Site Frequency Spectrum) class for analyzing genetic diversity.
- A SizeMatchedModel class for statistical modeling based on data size.
What does it do?
The code in this directory performs several key functions:
- Manages collections of phylogenetic trees:
  - Loads and saves trees from/to various formats (Newick, pickle, tar.gz).
  - Generates trees through simulation.
  - Concatenates forests of trees.
- Analyzes individual phylogenetic trees:
  - Annotates nodes with features like depth, number of descendants, imbalance, and Colless index.
  - Calculates tree statistics such as total branch length and site frequency spectrum.
  - Resolves polytomies and rescales trees.
- Calculates and analyzes Site Frequency Spectra (SFS):
  - Computes SFS from tree structures.
  - Calculates various genetic diversity statistics (Tajima’s D, Fay and Wu’s H, Zeng’s E, Ferretti’s L).
  - Bins SFS data for visualization and analysis.
- Provides statistical modeling based on data size:
  - Implements a model where parameters vary based on input data size.
  - Calculates p-values and model means for given sizes.
  - Visualizes phylogenetic trees and results.
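To make the SFS statistics listed above concrete, here is a minimal, self-contained sketch of two of them, Tajima's D and the unnormalized Fay and Wu's H, computed from an unfolded spectrum. It uses the standard textbook formulas and does not reproduce the SFS class in sfs.py: the function names tajimas_d and fay_wu_h, and the convention that entry i - 1 of the list counts sites whose derived allele occurs in i of n samples, are assumptions made for illustration only.

```python
from math import comb, sqrt


def _pairwise_diversity(sfs):
    """Mean number of pairwise differences (pi) implied by an unfolded SFS."""
    n = len(sfs) + 1
    return sum(i * (n - i) * x for i, x in enumerate(sfs, start=1)) / comb(n, 2)


def tajimas_d(sfs):
    """Tajima's D from an unfolded SFS (hypothetical helper, not the repo API).

    sfs[i - 1] is the number of sites whose derived allele appears in
    exactly i of the n sampled sequences, for i = 1 .. n - 1.
    """
    n = len(sfs) + 1
    s = sum(sfs)  # number of segregating sites
    if n < 4 or s == 0:
        # The variance term is zero for n < 4, and D is undefined without sites.
        raise ValueError("need at least 4 samples and 1 segregating site")

    a1 = sum(1.0 / i for i in range(1, n))
    a2 = sum(1.0 / i**2 for i in range(1, n))
    b1 = (n + 1) / (3 * (n - 1))
    b2 = 2 * (n * n + n + 3) / (9 * n * (n - 1))
    c1 = b1 - 1 / a1
    c2 = b2 - (n + 2) / (a1 * n) + a2 / a1**2
    e1 = c1 / a1
    e2 = c2 / (a1**2 + a2)

    theta_w = s / a1  # Watterson's estimator
    return (_pairwise_diversity(sfs) - theta_w) / sqrt(e1 * s + e2 * s * (s - 1))


def fay_wu_h(sfs):
    """Unnormalized Fay and Wu's H: pi minus theta_H."""
    n = len(sfs) + 1
    theta_h = sum(i * i * x for i, x in enumerate(sfs, start=1)) / comb(n, 2)
    return _pairwise_diversity(sfs) - theta_h


# A spectrum proportional to 1/i is the neutral expectation, so both
# statistics should be close to zero for it.
neutral = [12, 6, 4, 3]
d_neutral = tajimas_d(neutral)
h_neutral = fay_wu_h(neutral)

# An excess of singletons pulls Tajima's D below zero.
d_singletons = tajimas_d([20, 1, 1, 1])
```

The neutral spectrum serves as a sanity check: any implementation of these statistics should return values near zero for counts proportional to 1/i.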
Entry points
The main entry points for using this codebase are:
- Forest class: For managing collections of phylogenetic trees.
- Tree class: For analyzing individual phylogenetic trees.
- SFS class: For calculating and analyzing Site Frequency Spectra.
- SizeMatchedModel class: For statistical modeling based on data size.
Key Files
- forest.py: Defines the Forest class for managing collections of trees.
- tree.py: Defines the Tree class for individual tree analysis.
- sfs.py: Defines the SFS class for Site Frequency Spectrum analysis.
- size_matched_model.py: Defines the SizeMatchedModel class for size-based statistical modeling.
Dependencies
The codebase relies on several external libraries:
- ete3: For phylogenetic tree manipulation and visualization.
- Bio.Phylo: For interfacing with Biopython’s phylogenetic tree representation.
- numpy: For numerical computations and array manipulations.
- scipy: For various scientific computing tasks and statistical functions.
- pandas: For data manipulation and analysis.
- matplotlib: For visualization of results and trees.
Configuration
The main classes (Forest, Tree, SFS, SizeMatchedModel) use constructor parameters and method arguments for configuration. Key parameters include:
- For Forest and Tree:
  - Paths to input files (Newick, pickle, tar.gz).
  - Parameters for tree generation (e.g., number of leaves, alpha for beta-tree model).
  - Options for tree manipulation (e.g., resolving polytomies, rescaling).
- For SFS:
  - SFS data as a numpy array.
  - Binning parameters for SFS analysis.
- For SizeMatchedModel:
  - Bins and parameters for size-based modeling.
  - Statistical distribution for calculations.
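The idea of a model configured by size bins, per-bin parameters, and a distribution can be sketched as follows. This is only an illustration under stated assumptions and not the SizeMatchedModel interface: the class name SizeBinnedNormal, its constructor arguments, the choice of a normal distribution, and the upper-tail p-value are all hypothetical.

```python
from bisect import bisect_right
from statistics import NormalDist


class SizeBinnedNormal:
    """Hypothetical size-binned model; NOT the SizeMatchedModel API.

    bin_edges: ascending size boundaries, one more than the number of bins.
    params:    one (mean, standard deviation) pair per bin (assumed normal).
    """

    def __init__(self, bin_edges, params):
        if len(bin_edges) != len(params) + 1:
            raise ValueError("need exactly one more bin edge than parameter pairs")
        self.bin_edges = list(bin_edges)
        self.dists = [NormalDist(mu, sigma) for mu, sigma in params]

    def _bin_index(self, size):
        # Find the bin whose half-open interval [lower, upper) holds the size;
        # sizes outside the covered range are clamped to the nearest bin.
        i = bisect_right(self.bin_edges, size) - 1
        return min(max(i, 0), len(self.dists) - 1)

    def mean(self, size):
        """Model mean for data of the given size."""
        return self.dists[self._bin_index(size)].mean

    def pvalue(self, size, value):
        """Upper-tail probability of observing at least `value` at this size."""
        return 1.0 - self.dists[self._bin_index(size)].cdf(value)


# Two bins: sizes in [10, 100) and [100, 1000), each with its own parameters.
model = SizeBinnedNormal([10, 100, 1000], [(0.0, 1.0), (5.0, 2.0)])
small_mean = model.mean(50)        # uses the first bin
large_mean = model.mean(500)       # uses the second bin
p_large = model.pvalue(500, 9.0)   # two standard deviations above the bin mean
```

Looking up parameters by bin keeps comparisons size-matched: a statistic from a tree of a given size is judged only against the distribution fitted for similarly sized data.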
