autocorrelation.py
Here’s a documentation for the target file cassiopeia/tools/autocorrelation.py
:
High-level description
This file provides utility functions for computing autocorrelation statistics on trees, specifically the Moran’s I statistic. It is designed to work with CassiopeiaTree objects and can handle both single and multiple variables for autocorrelation analysis.
Symbols
compute_morans_i
Description
Computes Moran’s I statistic for a given CassiopeiaTree. This statistic measures the spatial autocorrelation of numerical data associated with the leaves of the tree.
Inputs
Name | Type | Description |
---|---|---|
tree | CassiopeiaTree | The tree structure on which to compute the statistic |
meta_columns | Optional[List] | Columns in the tree’s cell_meta object to use for computation |
X | Optional[pd.DataFrame] | Extra data matrix for computing autocorrelations |
W | Optional[pd.DataFrame] | Phylogenetic weight matrix |
inverse_weight_fn | Callable[[Union[int, float]], float] | Function to apply to weights if W is not provided |
Outputs
Name | Type | Description |
---|---|---|
I | Union[float, pd.DataFrame] | Moran’s I statistic(s) |
Internal Logic
- Validates input data and combines meta_columns and X if both are provided.
- Ensures all data is numerical.
- Computes or uses the provided weight matrix W.
- Normalizes W and standardizes X.
- Computes Moran’s I using the formula: I = X’ * Wn * X.
Dependencies
Dependency | Purpose |
---|---|
numpy | Numerical computations |
pandas | Data manipulation |
cassiopeia.data | CassiopeiaTree class and utilities |
cassiopeia.mixins | Custom error classes |
Error Handling
The function raises AutocorrelationError
in the following cases:
- No data is specified for computing autocorrelations.
- The provided X dataframe doesn’t match the tree’s leaves.
- Non-numeric data is found in the specified columns.
- The weight matrix doesn’t match the tree’s leaves.
Performance Considerations
The computation of Moran’s I involves matrix multiplications, which can be computationally expensive for large trees or many variables. The function uses numpy and pandas for efficient calculations, but performance may degrade for very large datasets.
Your response should not exceed 3000 words or 4000 tokens. Focus on providing clear, concise information that can be directly inferred from the code. Include optional sections only when they provide significant value for understanding the code.