autocorrelation.py - Glide Demo

Here’s a documentation for the target file cassiopeia/tools/autocorrelation.py:

High-level description

This file provides utility functions for computing autocorrelation statistics on trees, specifically the Moran’s I statistic. It is designed to work with CassiopeiaTree objects and can handle both single and multiple variables for autocorrelation analysis.

Symbols

`compute_morans_i`

Description

Computes Moran’s I statistic for a given CassiopeiaTree. This statistic measures the spatial autocorrelation of numerical data associated with the leaves of the tree.

Inputs

Name	Type	Description
tree	CassiopeiaTree	The tree structure on which to compute the statistic
meta_columns	Optional[List]	Columns in the tree’s cell_meta object to use for computation
X	Optional[pd.DataFrame]	Extra data matrix for computing autocorrelations
W	Optional[pd.DataFrame]	Phylogenetic weight matrix
inverse_weight_fn	Callable[[Union[int, float]], float]	Function to apply to weights if W is not provided

Outputs

Name	Type	Description
I	Union[float, pd.DataFrame]	Moran’s I statistic(s)

Internal Logic

Validates input data and combines meta_columns and X if both are provided.
Ensures all data is numerical.
Computes or uses the provided weight matrix W.
Normalizes W and standardizes X.
Computes Moran’s I using the formula: I = X’ * Wn * X.

Dependencies

Dependency	Purpose
numpy	Numerical computations
pandas	Data manipulation
cassiopeia.data	CassiopeiaTree class and utilities
cassiopeia.mixins	Custom error classes

Error Handling

The function raises AutocorrelationError in the following cases:

No data is specified for computing autocorrelations.
The provided X dataframe doesn’t match the tree’s leaves.
Non-numeric data is found in the specified columns.
The weight matrix doesn’t match the tree’s leaves.

Performance Considerations

The computation of Moran’s I involves matrix multiplications, which can be computationally expensive for large trees or many variables. The function uses numpy and pandas for efficient calculations, but performance may degrade for very large datasets.

Your response should not exceed 3000 words or 4000 tokens. Focus on providing clear, concise information that can be directly inferred from the code. Include optional sections only when they provide significant value for understanding the code.

Project Root

​High-level description

​Symbols

​compute_morans_i

​Description

​Inputs

​Outputs

​Internal Logic

​Dependencies

​Error Handling

​Performance Considerations