# Overview

## High-level description

This directory implements tools for analyzing evolutionary fitness and phylogenetic relationships in biological sequences. Its main components are:

1. A `Forest` class representing a collection of phylogenetic trees.
2. A `Tree` class representing individual phylogenetic trees.
3. An `SFS` (Site Frequency Spectrum) class for analyzing genetic diversity.
4. A `SizeMatchedModel` class for statistical modeling based on data size.

These tools are designed to analyze genetic diversity, infer evolutionary fitness, and understand the shape of genealogical trees in populations.

## What does it do?

The code in this directory performs several key functions:

1. Manages collections of phylogenetic trees:
   * Loads trees from and saves them to several formats (Newick, pickle, tar.gz).
   * Generates trees through simulation.
   * Concatenates forests of trees.

2. Analyzes individual phylogenetic trees:
   * Annotates nodes with features like depth, number of descendants, imbalance, and Colless index.
   * Calculates tree statistics such as the total branch length and the site frequency spectrum.
   * Resolves polytomies and rescales trees.

3. Calculates and analyzes Site Frequency Spectra (SFS):
   * Computes SFS from tree structures.
   * Calculates various genetic diversity statistics (Tajima's D, Fay and Wu's H, Zeng's E, Ferretti's L).
   * Bins SFS data for visualization and analysis.

4. Provides statistical modeling based on data size:
   * Implements a model whose parameters vary with the size of the input data.
   * Calculates p-values and model means for given sizes.

5. Visualizes phylogenetic trees and results.
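
To make items 2 and 3 concrete, here is a minimal standalone sketch of how a branch-length-weighted SFS can be read off a tree, and how Tajima's D follows from an SFS. It does not use the `Tree` or `SFS` classes in this directory, whose method names are not documented here. The nested-tuple tree encoding and the function names `branch_length_sfs` and `tajimas_d` are assumptions made only for this example. The formula for D is the standard one from the population-genetics literature.

```python
import math

# Hypothetical tree encoding for this sketch (not the directory's Tree class):
# a leaf is a number (its branch length); an internal node is
# (branch_length, [children]).

def branch_length_sfs(tree, n_leaves):
    """Return [L_1, ..., L_{n-1}], where L_k is the total length of
    branches that subtend exactly k leaves."""
    sfs = [0.0] * (n_leaves + 1)  # index k = number of descendant leaves

    def visit(node):
        if isinstance(node, (int, float)):  # leaf: node is its branch length
            sfs[1] += node
            return 1
        length, children = node
        k = sum(visit(child) for child in children)
        if k < n_leaves:  # a branch above all leaves carries no polymorphism
            sfs[k] += length
        return k

    visit(tree)
    return sfs[1:n_leaves]


def tajimas_d(sfs):
    """Tajima's D from an unfolded SFS [xi_1, ..., xi_{n-1}]."""
    n = len(sfs) + 1
    s = sum(sfs)  # number of segregating sites
    # Mean pairwise diversity: a site at frequency i separates i*(n-i) pairs.
    pi = sum(i * (n - i) * x for i, x in enumerate(sfs, start=1)) / math.comb(n, 2)
    a1 = sum(1 / i for i in range(1, n))
    a2 = sum(1 / i**2 for i in range(1, n))
    b1 = (n + 1) / (3 * (n - 1))
    b2 = 2 * (n * n + n + 3) / (9 * n * (n - 1))
    c1 = b1 - 1 / a1
    c2 = b2 - (n + 2) / (a1 * n) + a2 / a1**2
    e1 = c1 / a1
    e2 = c2 / (a1**2 + a2)
    return (pi - s / a1) / math.sqrt(e1 * s + e2 * s * (s - 1))


# The tree ((A:1, B:1):2, C:3) with a zero-length root branch.
example_tree = (0.0, [(2.0, [1.0, 1.0]), 3.0])
print(branch_length_sfs(example_tree, 3))
print(tajimas_d([10, 0, 0, 0]))  # excess of singletons gives a negative D
```

Under the neutral expectation, where the SFS is proportional to 1/i, the two diversity estimators in the numerator agree and D is zero. An excess of low-frequency variants pushes D negative.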

## Entry points

The main entry points for using this codebase are:

1. `Forest` class: For managing collections of phylogenetic trees.
2. `Tree` class: For analyzing individual phylogenetic trees.
3. `SFS` class: For calculating and analyzing Site Frequency Spectra.
4. `SizeMatchedModel` class: For statistical modeling based on data size.

## Key files

1. `forest.py`: Defines the `Forest` class for managing collections of trees.
2. `tree.py`: Defines the `Tree` class for individual tree analysis.
3. `sfs.py`: Defines the `SFS` class for Site Frequency Spectrum analysis.
4. `size_matched_model.py`: Defines the `SizeMatchedModel` class for size-based statistical modeling.

## Dependencies

The codebase relies on several external libraries:

1. `ete3`: For phylogenetic tree manipulation and visualization.
2. `Bio.Phylo`: For interfacing with Biopython's phylogenetic tree representation.
3. `numpy`: For numerical computations and array manipulations.
4. `scipy`: For various scientific computing tasks and statistical functions.
5. `pandas`: For data manipulation and analysis.
6. `matplotlib`: For visualization of results and trees.

Additionally, the code uses custom modules for specific tasks like beta-tree simulation and fitness inference.

## Configuration

The main classes (`Forest`, `Tree`, `SFS`, `SizeMatchedModel`) use constructor parameters and method arguments for configuration. Key parameters include:

* For `Forest` and `Tree`:
  * Paths to input files (Newick, pickle, tar.gz).
  * Parameters for tree generation (e.g., number of leaves, alpha for beta-tree model).
  * Options for tree manipulation (e.g., resolving polytomies, rescaling).

* For `SFS`:
  * SFS data as a NumPy array.
  * Binning parameters for SFS analysis.

* For `SizeMatchedModel`:
  * Bins and parameters for size-based modeling.
  * Statistical distribution for calculations.

Adjust these parameters to tailor the analysis to a particular evolutionary or population-genetics study.
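
The idea of size-based modeling (bins, per-bin parameters, and a statistical distribution) can be sketched as follows. This is an illustration only, not the actual `SizeMatchedModel` interface, which is not documented on this page. The class name `SizeBinnedNormal`, its method names, the choice of a normal distribution, and the use of the lower tail for the p-value are all assumptions.

```python
from bisect import bisect_right
from statistics import NormalDist


class SizeBinnedNormal:
    """Hypothetical stand-in: one normal distribution per size bin."""

    def __init__(self, bin_edges, params):
        # bin_edges: ascending sizes [e0, e1, ..., em]; bin j covers [ej, ej+1).
        # params: one (mean, standard deviation) pair per bin.
        if len(params) != len(bin_edges) - 1:
            raise ValueError("need exactly one parameter pair per bin")
        self.bin_edges = list(bin_edges)
        self.dists = [NormalDist(mu, sigma) for mu, sigma in params]

    def _dist(self, size):
        # Locate the bin whose half-open interval contains `size`.
        j = bisect_right(self.bin_edges, size) - 1
        if not 0 <= j < len(self.dists):
            raise ValueError(f"size {size} is outside the modeled range")
        return self.dists[j]

    def model_mean(self, size):
        """Expected value of the statistic for data of this size."""
        return self._dist(size).mean

    def pvalue(self, size, value):
        """Lower-tail probability P(X <= value) under the bin's distribution."""
        return self._dist(size).cdf(value)


# Two bins: sizes in [10, 100) and [100, 1000), with made-up parameters.
model = SizeBinnedNormal([10, 100, 1000], [(0.5, 0.1), (0.3, 0.05)])
print(model.model_mean(50))
print(model.pvalue(250, 0.2))
```

Keeping the parameters in a per-bin lookup means an observed statistic is only ever compared against expectations for data of similar size, which is the point of a size-matched comparison.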
