Here is documentation for the file cassiopeia/tools/autocorrelation.py:

High-level description

This file provides utility functions for computing autocorrelation statistics on trees, specifically Moran’s I. It operates on CassiopeiaTree objects and supports autocorrelation analysis of either a single numerical variable or several variables at once.

Symbols

compute_morans_i

Description

Computes Moran’s I statistic for a given CassiopeiaTree. The statistic measures autocorrelation of numerical data associated with the leaves of the tree, with tree-derived (phylogenetic) weights playing the role that spatial weights play in the classical statistic.
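For reference, the classical single-variable Moran’s I over N observations x_1, …, x_N with pairwise weights w_ij is:

```latex
I \;=\; \frac{N}{\sum_{i}\sum_{j} w_{ij}}
        \cdot
        \frac{\sum_{i}\sum_{j} w_{ij}\,(x_i - \bar{x})(x_j - \bar{x})}
             {\sum_{i} (x_i - \bar{x})^2}
```

In this file the weights come from the tree: either supplied directly as the phylogenetic weight matrix W, or derived from pairwise leaf distances via inverse_weight_fn (see the Inputs table below).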

Inputs

| Name | Type | Description |
| --- | --- | --- |
| tree | CassiopeiaTree | The tree structure on which to compute the statistic |
| meta_columns | Optional[List] | Columns in the tree’s cell_meta object to use for the computation |
| X | Optional[pd.DataFrame] | Extra leaf-indexed data matrix for computing autocorrelations |
| W | Optional[pd.DataFrame] | Phylogenetic (leaf-by-leaf) weight matrix |
| inverse_weight_fn | Callable[[Union[int, float]], float] | Function used to turn phylogenetic distances into weights when W is not provided |

Outputs

| Name | Type | Description |
| --- | --- | --- |
| I | Union[float, pd.DataFrame] | Moran’s I statistic(s): a float for a single variable, or a DataFrame for multiple variables |
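A minimal usage sketch follows. The direct import from cassiopeia.tools.autocorrelation mirrors the file path above; the Newick-string constructor and the writable cell_meta attribute of CassiopeiaTree, as well as the leaf names and values, are illustrative assumptions rather than guarantees of the API.

```python
import pandas as pd

from cassiopeia.data import CassiopeiaTree
from cassiopeia.tools.autocorrelation import compute_morans_i

# Hypothetical four-leaf tree with branch lengths (Newick string assumed
# to be an accepted input to the CassiopeiaTree constructor).
tree = CassiopeiaTree(tree="((c1:1,c2:1):2,(c3:1,c4:1):2);")

# Numerical per-leaf metadata; the index must match the tree's leaves.
tree.cell_meta = pd.DataFrame(
    {"gene_a": [0.9, 1.1, 3.8, 4.2], "gene_b": [2.0, 1.8, 0.5, 0.4]},
    index=["c1", "c2", "c3", "c4"],
)

# One variable: returns a single float.
i_single = compute_morans_i(tree, meta_columns=["gene_a"])

# Several variables: returns a pd.DataFrame of statistics.
i_multi = compute_morans_i(tree, meta_columns=["gene_a", "gene_b"])

# Custom distance-to-weight transform used when W is not provided.
i_custom = compute_morans_i(
    tree, meta_columns=["gene_a"], inverse_weight_fn=lambda d: 1.0 / (d + 1.0)
)
```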

Internal Logic

  1. Validates input data and combines meta_columns and X if both are provided.
  2. Ensures all data is numerical.
  3. Computes or uses the provided weight matrix W.
  4. Normalizes W and standardizes X.
  5. Computes Moran’s I as the quadratic form I = X’ · Wn · X, where Wn is the normalized weight matrix and X is the standardized data (illustrated in the sketch below).
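To make steps 3–5 concrete, here is a plain numpy/pandas sketch of the underlying algebra using the classical Moran’s I normalization. It illustrates the math rather than the library’s exact code: the distance matrix, leaf names, and values are hypothetical, and the implementation’s normalization details may differ.

```python
import numpy as np
import pandas as pd

# Hypothetical leaf-by-leaf phylogenetic distance matrix and one variable.
leaves = ["c1", "c2", "c3", "c4"]
D = pd.DataFrame(
    [[0.0, 1.0, 3.0, 3.0],
     [1.0, 0.0, 3.0, 3.0],
     [3.0, 3.0, 0.0, 1.0],
     [3.0, 3.0, 1.0, 0.0]],
    index=leaves, columns=leaves,
)
x = pd.Series([0.9, 1.1, 3.8, 4.2], index=leaves)

# Step 3: build the weight matrix from inverse distances (off-diagonal only),
# mirroring a default inverse_weight_fn of 1 / distance.
W = pd.DataFrame(0.0, index=leaves, columns=leaves)
for i in leaves:
    for j in leaves:
        if i != j:
            W.loc[i, j] = 1.0 / D.loc[i, j]

# Steps 4-5: classical Moran's I written as a normalized quadratic form.
N = len(leaves)
z = (x - x.mean()).values
S0 = W.values.sum()
I = (N / S0) * (z @ W.values @ z) / (z @ z)
print(I)  # positive: similar values cluster on nearby leaves
```

Positive values indicate that similar values cluster on nearby leaves; values near the null expectation of −1/(N − 1) indicate no phylogenetic signal.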

Dependencies

| Dependency | Purpose |
| --- | --- |
| numpy | Numerical computations |
| pandas | Data manipulation |
| cassiopeia.data | CassiopeiaTree class and utilities |
| cassiopeia.mixins | Custom error classes (AutocorrelationError) |

Error Handling

The function raises AutocorrelationError in the following cases:

  • No data is specified for computing autocorrelations.
  • The provided X dataframe doesn’t match the tree’s leaves.
  • Non-numeric data is found in the specified columns.
  • The weight matrix doesn’t match the tree’s leaves.
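A small sketch of handling these failures, assuming AutocorrelationError is importable from cassiopeia.mixins (listed above as the source of the custom error classes) and reusing the hypothetical tree construction from the usage sketch:

```python
from cassiopeia.data import CassiopeiaTree
from cassiopeia.mixins import AutocorrelationError
from cassiopeia.tools.autocorrelation import compute_morans_i

# A tree with no cell_meta attached and no X passed in.
tree = CassiopeiaTree(tree="((c1:1,c2:1):2,(c3:1,c4:1):2);")

try:
    # Neither meta_columns nor X is given: the "no data specified" case above.
    compute_morans_i(tree)
except AutocorrelationError as exc:
    print(f"Moran's I could not be computed: {exc}")
```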

Performance Considerations

Computing Moran’s I involves building a leaf-by-leaf weight matrix (when W is not supplied) and performing matrix multiplications, so cost grows at least quadratically with the number of leaves and increases further with the number of variables. The function uses numpy and pandas for efficient vectorized computation, but performance may still degrade for very large trees or many variables.
