High-level description
This directory contains implementations of algorithms and tools for analyzing evolutionary fitness and phylogenetic relationships in biological sequences. The main components are a Forest class representing a collection of phylogenetic trees, a Tree class for individual phylogenetic trees, an SFS (Site Frequency Spectrum) class for analyzing genetic diversity, and a SizeMatchedModel class for statistical modeling based on data size. There is also a script for generating and annotating reference data in the form of phylogenetic trees.
What does it do?
The code in this directory performs several key functions:
- Manages collections of phylogenetic trees:
  - Loads and saves trees from/to various formats (Newick, pickle, tar.gz).
  - Generates trees through simulation.
  - Concatenates forests of trees.
- Analyzes individual phylogenetic trees:
  - Annotates nodes with features like depth, number of descendants, imbalance, and Colless index.
  - Calculates tree statistics such as total branch length and site frequency spectrum.
  - Resolves polytomies and rescales trees.
- Calculates and analyzes Site Frequency Spectra (SFS):
  - Computes SFS from tree structures.
  - Calculates various genetic diversity statistics (Tajima’s D, Fay and Wu’s H, Zeng’s E, Ferretti’s L).
  - Bins SFS data for visualization and analysis.
- Provides statistical modeling based on data size:
  - Implements a model where parameters vary based on input data size.
  - Calculates p-values and model means for given sizes.
- Generates reference data:
  - Creates a collection of phylogenetic trees with specified parameters.
  - Annotates the trees with various features.
  - Saves the annotated forest to a compressed pickle file for later use.
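As a rough illustration of the tree statistics listed above, the sketch below computes leaf counts, the Colless index, a topology-based SFS, and Tajima's D for a small tree written as nested tuples. It is not the repository's Tree or SFS implementation: the nested-tuple representation, the function names, and the convention of counting each non-root node once (rather than weighting by branch length) are assumptions made for illustration.

```python
import math


def leaf_count(node):
    """Number of leaves below a node; any non-tuple value is treated as a leaf."""
    if not isinstance(node, tuple):
        return 1
    return sum(leaf_count(child) for child in node)


def colless_index(node):
    """Sum of |left leaves - right leaves| over all internal nodes."""
    if not isinstance(node, tuple):
        return 0
    left, right = node  # assumes a fully resolved (binary) tree
    imbalance = abs(leaf_count(left) - leaf_count(right))
    return imbalance + colless_index(left) + colless_index(right)


def tree_sfs(root):
    """Entry i-1 counts the non-root nodes that have exactly i descendant leaves."""
    n = leaf_count(root)
    sfs = [0] * (n - 1)
    stack = list(root)  # start from the children of the root (root must be a tuple)
    while stack:
        node = stack.pop()
        # Recomputing leaf_count is quadratic, which is acceptable for a small sketch.
        sfs[leaf_count(node) - 1] += 1
        if isinstance(node, tuple):
            stack.extend(node)
    return sfs


def tajimas_d(sfs):
    """Tajima's D from an unfolded SFS given as counts for i = 1 .. n-1."""
    n = len(sfs) + 1
    s = sum(sfs)  # number of segregating sites
    if n < 4 or s == 0:
        return float("nan")  # the variance term vanishes for n < 4
    # Mean pairwise difference (pi) and Watterson's estimator (theta_W).
    pi = sum(i * (n - i) * x for i, x in enumerate(sfs, start=1)) / (n * (n - 1) / 2)
    a1 = sum(1 / i for i in range(1, n))
    a2 = sum(1 / i**2 for i in range(1, n))
    theta_w = s / a1
    # Standard variance constants from Tajima (1989).
    b1 = (n + 1) / (3 * (n - 1))
    b2 = 2 * (n * n + n + 3) / (9 * n * (n - 1))
    c1 = b1 - 1 / a1
    c2 = b2 - (n + 2) / (a1 * n) + a2 / a1**2
    e1 = c1 / a1
    e2 = c2 / (a1**2 + a2)
    return (pi - theta_w) / math.sqrt(e1 * s + e2 * s * (s - 1))


example = ((("a", "b"), "c"), ("d", "e"))
print(colless_index(example))  # 2
print(tree_sfs(example))       # [5, 2, 1, 0]
print(tajimas_d(tree_sfs(example)))
```

The repository's classes are described as building on ete3 and numpy, so the actual traversal and array handling will look different; the sketch only shows how the quantities relate to one another.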
Entry points
The main entry points for using this codebase are:
- Forest class: For managing collections of phylogenetic trees.
- Tree class: For analyzing individual phylogenetic trees.
- SFS class: For calculating and analyzing Site Frequency Spectra.
- SizeMatchedModel class: For statistical modeling based on data size.
- generate_annotate_forest.py: Script for generating reference data.
Key Files
- forest.py: Defines the Forest class for managing collections of trees.
- tree.py: Defines the Tree class for individual tree analysis.
- sfs.py: Defines the SFS class for Site Frequency Spectrum analysis.
- size_matched_model.py: Defines the SizeMatchedModel class for size-based statistical modeling.
- generate_annotate_forest.py: Script for generating and annotating reference data.
Dependencies
The codebase relies on several external libraries:
- ete3: For phylogenetic tree manipulation and visualization.
- Bio.Phylo: For interfacing with Biopython’s phylogenetic tree representation.
- numpy: For numerical computations and array manipulations.
- scipy: For various scientific computing tasks and statistical functions.
- pandas: For data manipulation and analysis.
- matplotlib: For visualization of results and trees.

It also uses the following Python standard-library modules:
- sys: For accessing command-line arguments.
- time: For tracking script execution time.
- uuid: For generating unique identifiers.
- pickle: For serializing and saving objects.
- gzip: For compressing output files.
Configuration
The main classes (Forest, Tree, SFS, SizeMatchedModel) use constructor parameters and method arguments for configuration. Key parameters include:
- For Forest and Tree:
  - Paths to input files (Newick, pickle, tar.gz).
  - Parameters for tree generation (e.g., number of leaves, alpha for beta-tree model).
  - Options for tree manipulation (e.g., resolving polytomies, rescaling).
- For SFS:
  - SFS data as numpy array.
  - Binning parameters for SFS analysis.
- For SizeMatchedModel:
  - Bins and parameters for size-based modeling.
  - Statistical distribution for calculations.
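To make the idea of bins, per-bin parameters, and a distribution more concrete, here is a minimal sketch of a size-binned model. The class name SizeBinnedNormal, its constructor arguments, and its method names are hypothetical and are not taken from size_matched_model.py. It also assumes a normal distribution from the standard library and an upper-tail p-value, whereas the actual class may accept a different distribution (for example one from scipy) and a different tail convention.

```python
from bisect import bisect_right
from statistics import NormalDist


class SizeBinnedNormal:
    """Hypothetical size-matched model: one normal distribution per size bin."""

    def __init__(self, bin_edges, params):
        # bin_edges: ascending lower edges of the size bins.
        # params: one (mean, standard deviation) pair per bin.
        if len(bin_edges) != len(params):
            raise ValueError("need exactly one parameter pair per bin")
        self.bin_edges = list(bin_edges)
        self.dists = [NormalDist(mu, sigma) for mu, sigma in params]

    def _dist_for(self, size):
        # Pick the last bin whose lower edge is <= size.
        idx = bisect_right(self.bin_edges, size) - 1
        if idx < 0:
            raise ValueError("size is below the smallest bin edge")
        return self.dists[idx]

    def model_mean(self, size):
        """Expected value of the statistic for data of the given size."""
        return self._dist_for(size).mean

    def pvalue(self, size, value):
        """Upper-tail probability of observing at least `value` at this size."""
        return 1.0 - self._dist_for(size).cdf(value)


model = SizeBinnedNormal([10, 100, 1000], [(0.5, 0.1), (0.3, 0.05), (0.2, 0.02)])
print(model.model_mean(50))
print(model.pvalue(150, 0.4))
```

Looking up the bin with bisect keeps each query logarithmic in the number of bins, which matters only when the model is evaluated for many sizes.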
The generate_annotate_forest.py script takes command-line arguments for configuration:
- n_leaves: Number of leaves (tips) in each generated tree.
- n_trees: Number of trees to generate for the forest.
- alpha: Shape parameter (the alpha of the beta-tree model) influencing the tree structure.
- output_dir: Path to the directory where the output file will be saved.
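The argument handling and the compressed pickle output described above might look roughly like the following sketch. The positional order of the arguments, the helper names, and the output file naming scheme are assumptions; only the use of sys, uuid, pickle, and gzip is taken from the dependency list.

```python
import gzip
import os
import pickle
import sys
import uuid


def parse_args(argv):
    """Read four positional arguments; the order shown here is assumed."""
    if len(argv) != 5:
        raise SystemExit(f"usage: {argv[0]} n_leaves n_trees alpha output_dir")
    _, n_leaves, n_trees, alpha, output_dir = argv
    return int(n_leaves), int(n_trees), float(alpha), output_dir


def save_compressed(obj, output_dir, prefix="forest"):
    """Pickle obj into a gzip file with a unique name and return its path."""
    os.makedirs(output_dir, exist_ok=True)
    # The prefix and the .pkl.gz suffix are hypothetical naming choices.
    path = os.path.join(output_dir, f"{prefix}_{uuid.uuid4().hex}.pkl.gz")
    with gzip.open(path, "wb") as handle:
        pickle.dump(obj, handle)
    return path


def load_compressed(path):
    """Inverse of save_compressed."""
    with gzip.open(path, "rb") as handle:
        return pickle.load(handle)
```

A script built on these helpers would call parse_args(sys.argv), build and annotate the forest, and then pass the result to save_compressed. The unique identifier in the file name lets repeated runs write to the same directory without overwriting earlier output.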
