Here’s a comprehensive documentation for the compare.py file in the cassiopeia/critique module:

High-level description

This module provides functions for comparing two phylogenetic trees. It includes implementations for calculating the triplets correct accuracy and the Robinson-Foulds distance between two trees.

Code Structure

The module contains two main functions: triplets_correct and robinson_foulds. These functions operate on CassiopeiaTree objects, which represent phylogenetic trees. The triplets_correct function is more complex, involving sampling and comparison of triplets at different depths of the trees, while the robinson_foulds function is a wrapper around the Ete3 library’s implementation of the Robinson-Foulds distance.

Symbols

triplets_correct

Description

This function calculates the triplets correct accuracy between two trees. It samples triplets (sets of three leaves) at each depth of the tree and compares their topology between the two input trees.

Inputs

NameTypeDescription
tree1CassiopeiaTreeThe first input tree
tree2CassiopeiaTreeThe second input tree to be compared
number_of_trialsintNumber of triplets to sample at each depth (default: 1000)
min_triplets_at_depthintMinimum number of triplets needed at a depth for inclusion (default: 1)

Outputs

NameTypeDescription
all_triplets_correctDict[int, float]The proportion of all triplets correct at each depth
resolvable_triplets_correctDict[int, float]The proportion of resolvable triplets correct at each depth
unresolved_triplets_correctDict[int, float]The proportion of unresolvable triplets correct at each depth
proportion_unresolvableDict[int, float]The proportion of unresolvable triplets at each depth

Internal Logic

  1. Initialize dictionaries to store results
  2. Create copies of input trees and collapse unifurcations
  3. Annotate tree depths and compute number of triplets at each depth
  4. For each depth:
    • Sample triplets and compare their topology between the two trees
    • Calculate various statistics (all, resolvable, unresolved triplets correct)
  5. Return the computed statistics

robinson_foulds

Description

This function computes the Robinson-Foulds distance between two trees. It’s a wrapper around the Ete3 library’s implementation.

Inputs

NameTypeDescription
tree1CassiopeiaTreeThe first input tree
tree2CassiopeiaTreeThe second input tree to be compared

Outputs

NameTypeDescription
rffloatThe Robinson-Foulds distance between the two trees
rf_maxfloatThe maximum possible Robinson-Foulds distance for these trees

Internal Logic

  1. Create copies of input trees and collapse unifurcations
  2. Convert trees to Ete3 format
  3. Compute Robinson-Foulds distance using Ete3’s implementation
  4. Return the distance and maximum possible distance

Dependencies

DependencyPurpose
collectionsUsed for defaultdict
copyUsed for deep copying trees
ete3Used for Robinson-Foulds distance calculation
networkxNot directly used in this file, but likely used in CassiopeiaTree
numpyUsed for numerical operations
typingUsed for type hinting
cassiopeia.critique.critique_utilitiesUsed for utility functions
cassiopeia.data.CassiopeiaTreeUsed as the main data structure for trees

Error Handling

The code does not implement explicit error handling. It relies on the underlying libraries and data structures to raise appropriate exceptions.

This module provides essential functionality for comparing phylogenetic trees in the Cassiopeia project. The triplets_correct function offers a detailed comparison of tree topologies at different depths, while the robinson_foulds function provides a standard distance metric for tree comparison. These functions are likely used in various analysis and validation tasks within the larger Cassiopeia framework.