Source code for pasted._metrics

"""
pasted._metrics
===============
Disorder-metric computations for PASTED structures.

The primary entry point is :func:`compute_all_metrics`, which accepts an
atom list and a position array and dispatches internally to the per-metric
helpers.  Lower-level helpers have varying signatures: some take ``pts``
(an (N, 3) NumPy array), others take ``dmat`` (a full N×N distance matrix),
and :func:`compute_h_atom` takes only an element-symbol list.  All
cutoff-based metrics share the same *cutoff* distance threshold (Å).

Acceleration paths
------------------
**C++ single-pass path** (``HAS_COMBINED = True``, v0.4.0+):
:func:`compute_all_metrics` calls ``all_metrics_cpp``, which builds
a single ``FlatCellList`` and accumulates **all** pair-based metrics
(RDF, graph, Steinhardt, bond-angle entropy, radial variance,
local anisotropy) in one traversal.  Speedup vs. the legacy path:
~1.9× at N=1000.

**C++ legacy path** (``HAS_GRAPH = True``, ``HAS_COMBINED = False``):
:func:`compute_all_metrics` calls ``rdf_h_cpp``, ``graph_metrics_cpp``,
``steinhardt_per_atom``, and ``bond_angle_entropy_cpp`` independently,
each building its own ``FlatCellList``.

**Pure-Python fallback** (``HAS_GRAPH = False``): :func:`compute_h_spatial`
and :func:`compute_rdf_deviation` use ``scipy.spatial.cKDTree`` (O(N·k)),
but ``_compute_graph_ring_charge`` falls back to a full O(N²)
``pdist`` / ``squareform`` matrix.  For N ≳ 500 this path becomes
significantly slower; reinstall with a C++17 compiler to enable
``HAS_GRAPH = True``.

Metrics catalog
---------------
All 17 metrics returned by :func:`compute_all_metrics` (13 original
+ 4 adversarial metrics added in v0.4.0):

=========================  =========  ==========================================
Metric                     Range      Description
=========================  =========  ==========================================
``H_atom``                 ≥ 0        Shannon entropy of element composition
``H_spatial``              ≥ 0        Shannon entropy of pair-distance histogram
``H_total``                ≥ 0        ``w_atom*H_atom + w_spatial*H_spatial``
``RDF_dev``                ≥ 0        RMS deviation of empirical g(r) from ideal gas
``shape_aniso``            [0, 1]     Relative shape anisotropy (0=sphere, 1=rod)
``Q4``, ``Q6``, ``Q8``    [0, 1]     Steinhardt bond-order parameters
``graph_lcc``              [0, 1]     Largest connected-component fraction
``graph_cc``               [0, 1]     Mean clustering coefficient
``ring_fraction``          [0, 1]     Fraction of atoms in at least one cycle
``charge_frustration``     ≥ 0        Variance of |Δχ| across cutoff-adjacent pairs
``moran_I_chi``            (-∞, 1]    Moran's I spatial autocorrelation for Pauling EN
``bond_angle_entropy``     [0, ln 36] Mean per-atom bond-angle distribution entropy
``coordination_variance``  ≥ 0        Population variance of per-atom coordination number
``radial_variance``        ≥ 0 Ų    Mean per-atom variance of neighbor distances
``local_anisotropy``       [0, 1]     Mean per-atom local covariance tensor anisotropy
=========================  =========  ==========================================

Changelog highlights
--------------------
* **v0.4.0**: ``_combined_core`` single-pass C++ kernel added; all 17 metrics
  now computed in one ``FlatCellList`` traversal when ``HAS_COMBINED=True``
  (~1.9× speedup at N=1000).  Four adversarial metrics introduced in v0.4.0-alpha
  (``bond_angle_entropy``, ``coordination_variance``, ``radial_variance``,
  ``local_anisotropy``) are now part of the stable ``ALL_METRICS`` set.
* **v0.3.10**: ``compute_shape_anisotropy`` guard tightened from ``tr == 0``
  to ``tr < 1e-30``, fixing a ``ZeroDivisionError`` on subnormal coordinate
  differences.
* **v0.3.9**: ``_steinhardt_per_atom_sparse`` masks bonds with d < 1e-10 Å,
  matching the C++ threshold and fixing a C++/Python consistency bug.
* **v0.3.8**: ``moran_I_chi`` clamped to 1.0 to fix sparse-graph artefact.
* **v0.3.6**: Tarjan bridge-finding replaces Union-Find in
  ``compute_ring_fraction``, fixing systematic undercounting of ring members.
"""

from __future__ import annotations

import math
from collections import Counter
from collections.abc import Iterator
from typing import TYPE_CHECKING

import numpy as np
from scipy.sparse import csr_matrix as _csr_matrix
from scipy.sparse.csgraph import connected_components as _connected_components
from scipy.spatial import cKDTree as _cKDTree
from scipy.spatial.distance import pdist as _pdist
from scipy.spatial.distance import squareform as _squareform

# scipy >= 1.15 renamed sph_harm -> sph_harm_y with a different argument order.
# The except branch is kept for environments that still run scipy < 1.15;
# warn_unused_ignores is suppressed for this file in pyproject.toml
# [tool.mypy.overrides].
try:
    from scipy.special import sph_harm_y as _sph_harm_raw

    def _sph_harm(
        l: int,  # noqa: E741
        m: int,
        phi_azimuth: float | np.ndarray,
        theta_polar: float | np.ndarray,
    ) -> np.ndarray:
        return np.asarray(_sph_harm_raw(l, m, theta_polar, phi_azimuth))

except ImportError:
    from scipy.special import sph_harm as _sph_harm_raw

    def _sph_harm(
        l: int,  # noqa: E741
        m: int,
        phi_azimuth: float | np.ndarray,
        theta_polar: float | np.ndarray,
    ) -> np.ndarray:
        return np.asarray(_sph_harm_raw(m, l, phi_azimuth, theta_polar))


if TYPE_CHECKING:
    from ._placement import Vec3

from ._atoms import cov_radius_ang as _cov_radius_ang
from ._atoms import pauling_electronegativity as _pauling_en

# Optional C++ acceleration
from ._ext import HAS_GRAPH as _HAS_GRAPH
from ._ext import HAS_STEINHARDT as _HAS_STEINHARDT
from ._ext import graph_metrics_cpp as _cpp_graph_metrics
from ._ext import rdf_h_cpp as _rdf_h_cpp

if _HAS_STEINHARDT:
    from ._ext import steinhardt_per_atom as _steinhardt_per_atom_cpp

from ._ext import HAS_BA_CPP as _HAS_BA_CPP

if _HAS_BA_CPP:
    from ._ext import bond_angle_entropy_cpp as _bond_angle_entropy_cpp

from ._ext import HAS_COMBINED as _HAS_COMBINED

if _HAS_COMBINED:
    from ._ext import all_metrics_cpp as _all_metrics_cpp

from .neighbor_list import NeighborList

# ---------------------------------------------------------------------------
# Low-level entropy helper
# ---------------------------------------------------------------------------


def _shannon_np(counts: np.ndarray) -> float:
    """Shannon entropy from a raw (un-normalized) count array."""
    total = counts.sum()
    if total == 0:
        return 0.0
    p = counts[counts > 0] / total
    return float(-np.sum(p * np.log(p)))


# ---------------------------------------------------------------------------
# Individual metric functions
# ---------------------------------------------------------------------------


[docs] def compute_h_atom(atoms: list[str]) -> float: """Shannon entropy of the element composition. Range: 0 (pure single element) to ln(*k*) for *k* distinct elements. """ counts_dict = Counter(atoms) return _shannon_np(np.array(list(counts_dict.values()), dtype=float))
[docs] def compute_h_spatial(pts: np.ndarray, cutoff: float, n_bins: int) -> float: """Shannon entropy of the pair-distance histogram within *cutoff*. Only pairs with ``d_ij <= cutoff`` are included, matching the locality assumption used by all other metrics. Higher values indicate a more uniform distribution of short-range distances. Parameters ---------- pts: Positions array of shape ``(n, 3)``. cutoff: Neighbor distance cutoff (Å). n_bins: Number of histogram bins over ``[0, cutoff]``. """ n = len(pts) if n < 2: return 0.0 tree = _cKDTree(pts) pairs = tree.query_pairs(cutoff, output_type="ndarray") if len(pairs) == 0: return 0.0 dists = np.linalg.norm(pts[pairs[:, 0]] - pts[pairs[:, 1]], axis=1) counts, _ = np.histogram(dists, bins=n_bins, range=(0.0, cutoff)) return _shannon_np(counts.astype(float))
[docs] def compute_rdf_deviation(pts: np.ndarray, cutoff: float, n_bins: int) -> float: """RMS deviation of the empirical *g*(*r*) from an ideal-gas baseline. A value of 0 indicates a perfectly random (ideal-gas-like) distribution of pair distances within *cutoff*. Only pairs with ``d_ij <= cutoff`` are included in the histogram, consistent with the other local metrics. Parameters ---------- pts: Positions array of shape ``(n, 3)``. cutoff: Neighbor distance cutoff (Å). The histogram range is ``[0, cutoff]``. n_bins: Number of histogram bins. """ n = len(pts) if n < 2: return 0.0 centroid = pts.mean(axis=0) r_bound = float(np.sqrt(((pts - centroid) ** 2).sum(axis=1)).max()) if r_bound == 0.0: return 0.0 tree = _cKDTree(pts) pairs = tree.query_pairs(cutoff, output_type="ndarray") if len(pairs) == 0: return 0.0 dists = np.linalg.norm(pts[pairs[:, 0]] - pts[pairs[:, 1]], axis=1) rho = n / (4.0 / 3.0 * math.pi * r_bound**3) counts, edges = np.histogram(dists, bins=n_bins, range=(0.0, cutoff)) centers = (edges[:-1] + edges[1:]) / 2.0 bw = edges[1] - edges[0] ideal = rho * 4.0 * math.pi * centers**2 * bw * n / 2.0 mask = ideal > 0 if not mask.any(): return 0.0 return float(np.sqrt(np.mean(((counts[mask] / ideal[mask]) - 1.0) ** 2)))
[docs] def compute_shape_anisotropy(pts: np.ndarray) -> float: """Relative shape anisotropy from the gyration tensor. Range: [0, 1] (0=spherical, 1=rod-like). Returns NaN when: * The input contains fewer than 2 atoms. * Any coordinate is non-finite (NaN or Inf). In earlier versions a structure with Inf coordinates would emit a NumPy ``RuntimeWarning: invalid value encountered in subtract`` from the mean-centering step; the explicit ``np.isfinite`` guard (added in v0.4.0) suppresses that warning and returns NaN cleanly. Returns 0.0 when all atoms are coincident (gyration tensor trace < 1e-30), guarding against ``ZeroDivisionError`` from subnormal or identical coordinates. Implementation note ------------------- Only the sum of eigenvalues (``trace``) and the sum of squared eigenvalues (Frobenius norm squared) of the 3×3 gyration tensor ``T`` are needed:: trace(T) = λ₁ + λ₂ + λ₃ ‖T‖²_F = λ₁² + λ₂² + λ₃² Computing these directly avoids the LAPACK ``eigvalsh`` call (~1.5× faster per call, saving ~10 ms over a 500-step optimizer run). The near-zero guard ``tr < 1e-30`` replaces the previous exact ``tr == 0`` check, which failed to catch subnormal coordinate differences (e.g. two atoms at positions differing by ≈ 6e-108 Å) and raised ``ZeroDivisionError``. """ if len(pts) < 2: return float("nan") # Guard against NaN/Inf coordinates: pts containing Inf would propagate # through the mean subtraction and trigger a NumPy RuntimeWarning # ("invalid value encountered in subtract"). Return NaN for such inputs. if not np.all(np.isfinite(pts)): return float("nan") p = pts - pts.mean(axis=0) T = (p.T @ p) / len(p) tr = float(T[0, 0] + T[1, 1] + T[2, 2]) # trace(T) = λ₁+λ₂+λ₃ if tr < 1e-30: # guard: all atoms at the same point (or subnormal coords) return 0.0 tr2 = float(np.einsum("ij,ij->", T, T)) # ‖T‖²_F = λ₁²+λ₂²+λ₃² return float(np.clip(1.5 * tr2 / tr**2 - 0.5, 0.0, 1.0))
def _steinhardt_per_atom_sparse( pts: np.ndarray, l_values: list[int], cutoff: float, ) -> dict[str, np.ndarray]: """Pure-Python O(N*k) fallback for :func:`compute_steinhardt_per_atom`. Enumerates neighbor pairs via ``scipy.spatial.cKDTree`` and evaluates spherical harmonics only for actual neighbor pairs (``d_ij <= cutoff``), giving O(N*k) complexity (k = mean neighbor count). Parameters ---------- pts: Positions array of shape ``(n, 3)``. l_values: List of *l* values (e.g. ``[4, 6, 8]``). cutoff: Neighbor distance cutoff (Å). """ n = len(pts) result: dict[str, np.ndarray] = {} tree = _cKDTree(pts) pairs = tree.query_pairs(cutoff, output_type="ndarray") if len(pairs) == 0: for lv in l_values: result[f"Q{lv}"] = np.zeros(n, dtype=float) return result # Build directed (both-way) neighbor index from undirected pairs rows = np.concatenate([pairs[:, 0], pairs[:, 1]]) cols = np.concatenate([pairs[:, 1], pairs[:, 0]]) diff_nb = pts[rows] - pts[cols] # (n_bonds, 3) r_nb = np.linalg.norm(diff_nb, axis=1) # Exclude coincident pairs (d < 1e-10): bond direction is undefined. # The C++ path skips these with the same threshold; both paths must agree. valid = r_nb >= 1e-10 if not np.all(valid): rows = rows[valid] cols = cols[valid] diff_nb = diff_nb[valid] r_nb = r_nb[valid] if len(rows) == 0: for lv in l_values: result[f"Q{lv}"] = np.zeros(n, dtype=float) return result d_hat_nb = diff_nb / r_nb[:, np.newaxis] # (n_bonds, 3) theta_nb = np.arccos(np.clip(d_hat_nb[:, 2], -1.0, 1.0)) # (n_bonds,) phi_nb = np.arctan2(d_hat_nb[:, 1], d_hat_nb[:, 0]) # (n_bonds,) deg = np.bincount(rows, minlength=n).astype(float) safe_deg = np.where(deg > 0, deg, 1.0) for lv in l_values: qlm_sq = np.zeros(n, dtype=float) for m in range(-lv, lv + 1): ylm_nb = _sph_harm(lv, m, phi_nb, theta_nb) # (n_bonds,) complex re_sum = np.bincount(rows, weights=ylm_nb.real, minlength=n) im_sum = np.bincount(rows, weights=ylm_nb.imag, minlength=n) avg_sq = (re_sum / safe_deg) ** 2 + (im_sum / safe_deg) ** 2 qlm_sq += avg_sq ql = np.sqrt(4 * math.pi / (2 * lv + 1) * qlm_sq) result[f"Q{lv}"] = np.where(deg > 0, ql, 0.0) return result
[docs] def compute_steinhardt_per_atom( pts: np.ndarray, l_values: list[int], cutoff: float, ) -> dict[str, np.ndarray]: """Per-atom Steinhardt Q_l values. When the C++ extension ``pasted._ext._steinhardt_core`` is available (``HAS_STEINHARDT = True``), the computation uses a sparse neighbor list built internally by the extension, evaluating spherical harmonics only for actual neighbor pairs. This gives O(N*k) complexity (k = mean neighbor count). The C++ accumulator uses layout ``(N, n_l, l_max+1)`` with atom index outermost (v0.3.6+), so every bond's ``(l_idx, m)`` writes are contiguous (stride 8 B). The former ``(n_l, l_max+1, N)`` layout wrote at strides of ``N × 8 bytes``, causing L2→L3 cache spill for N ≳ 1 000 and superlinear wall-time growth. Since v0.3.7 the per-bond phi-trig cost is also eliminated: ``atan2`` is replaced by a single ``sqrt + 2 divs`` to obtain ``cos_phi`` / ``sin_phi``, and all higher orders ``cos(m·phi)`` / ``sin(m·phi)`` follow from the Chebyshev two-term recurrence (2 mults + 1 sub each) instead of ``l_max`` separate libm calls. The P_lm table is now stack-allocated (``double[13][13]``) rather than heap-allocated per bond. Since v0.3.8 (optimization ④), when ``l_values = [4, 6, 8]`` a hardcoded Cartesian-polynomial fast-path is used instead of the associated-Legendre recurrence. Every real spherical harmonic ``S_lm(x,y,z)`` is a pure integer-coefficient polynomial on the unit sphere; SymPy joint CSE across l=4,6,8 yields 84 intermediates and 39 accumulation lines with no ``sqrt``, ``atan2``, or ``std::pow``. Measured speedup: **1.4–1.6×** vs. ①②③ at N = 100–1 000 (gas structures, k ≈ 0.7, ``-O3 -std=c++17``). When the extension is absent the function falls back to a sparse Python/NumPy implementation using ``scipy.spatial.cKDTree`` for neighbor enumeration and ``np.bincount`` for accumulation. Both paths have the same O(N*k) complexity. Parameters ---------- pts: Positions array of shape ``(n, 3)``. l_values: List of *l* values (e.g. ``[4, 6, 8]``). cutoff: Neighbor distance cutoff (Å). Returns ------- dict mapping ``"Q{l}"`` to a :class:`numpy.ndarray` of shape ``(n,)``. Atoms with no neighbors within *cutoff* are assigned Q_l = 0. """ if _HAS_STEINHARDT: raw: dict[str, np.ndarray] = _steinhardt_per_atom_cpp(pts, cutoff, l_values) return raw return _steinhardt_per_atom_sparse(pts, l_values, cutoff)
[docs] def compute_steinhardt( pts: np.ndarray, l_values: list[int], cutoff: float, ) -> dict[str, float]: """Steinhardt Q_l averaged over all atoms. Delegates to :func:`compute_steinhardt_per_atom` and returns the per-structure mean for each *l*. Parameters ---------- pts: Positions array of shape ``(n, 3)``. l_values: List of *l* values (e.g. ``[4, 6, 8]``). cutoff: Neighbor distance cutoff (Å). Returns ------- dict mapping ``"Q{l}"`` to its global average value. """ per_atom = compute_steinhardt_per_atom(pts, l_values, cutoff) return {k: float(v.mean()) for k, v in per_atom.items()}
[docs] def compute_graph_metrics(dmat: np.ndarray, cutoff: float) -> dict[str, float]: """Largest connected-component fraction and mean local clustering coefficient [0, 1]. Pure-Python fallback used when ``HAS_GRAPH`` is ``False``. The C++ path in ``graph_metrics_cpp`` is preferred and is invoked automatically by :func:`_compute_graph_ring_charge`. Parameters ---------- dmat: Full n x n pairwise distance matrix. cutoff: Adjacency distance cutoff (Å). Returns ------- dict with keys ``"graph_lcc"`` and ``"graph_cc"``. """ n = len(dmat) if n < 2: return {"graph_lcc": 1.0, "graph_cc": 0.0} adj = dmat <= cutoff np.fill_diagonal(adj, False) _, labels = _connected_components(_csr_matrix(adj), directed=False, return_labels=True) graph_lcc = float(np.bincount(labels).max()) / n deg = adj.sum(axis=1).astype(float) A = adj.astype(float) tri = (A @ A * A).sum(axis=1) / 2.0 max_tri = deg * (deg - 1) / 2.0 mask = max_tri > 0 graph_cc = float(np.mean(tri[mask] / max_tri[mask])) if mask.any() else 0.0 return {"graph_lcc": graph_lcc, "graph_cc": graph_cc}
# --------------------------------------------------------------------------- # MM-level structural descriptors # --------------------------------------------------------------------------- def _build_adj(n: int, dmat: np.ndarray, cutoff: float) -> list[list[int]]: """Build an undirected adjacency list from a distance matrix. Parameters ---------- n: Number of atoms. dmat: Full n x n pairwise distance matrix (Å). cutoff: Distance threshold; a pair (i, j) is bonded when ``1e-6 < dmat[i, j] <= cutoff``. Returns ------- list[list[int]] ``adj[i]`` is the list of atom indices bonded to atom *i*. """ adj: list[list[int]] = [[] for _ in range(n)] for i in range(n): for j in range(i + 1, n): if 1e-6 < dmat[i, j] <= cutoff: adj[i].append(j) adj[j].append(i) return adj def _tarjan_bridges(adj: list[list[int]], n: int) -> set[tuple[int, int]]: """Find all bridge edges using Tarjan's iterative DFS algorithm. A *bridge* is an edge whose removal disconnects the graph (i.e. it is not part of any cycle). An atom is in at least one ring if and only if at least one of its incident edges is a non-bridge. This iterative implementation avoids Python's recursion limit, which can be hit for large or deeply connected structures. Time complexity --------------- O(N + E) where E is the number of edges (O(N · k) for sparse graphs). Parameters ---------- adj: Undirected adjacency list (output of :func:`_build_adj`). n: Number of vertices. Returns ------- set of (int, int) Each bridge is stored as ``(min(u, v), max(u, v))``. """ disc: list[int] = [-1] * n low: list[int] = [0] * n bridges: set[tuple[int, int]] = set() timer = 0 for start in range(n): if disc[start] != -1: continue disc[start] = low[start] = timer timer += 1 # Stack entries: (vertex, parent, iterator over its neighbors) stack: list[tuple[int, int, Iterator[int]]] = [(start, -1, iter(adj[start]))] while stack: u, parent_v, it = stack[-1] try: v = next(it) if disc[v] == -1: # Tree edge: discover v disc[v] = low[v] = timer timer += 1 stack.append((v, u, iter(adj[v]))) elif v != parent_v: # Back edge: tighten low[u] low[u] = min(low[u], disc[v]) except StopIteration: # All neighbors of u exhausted; propagate low upward stack.pop() if stack: pu = stack[-1][0] low[pu] = min(low[pu], low[u]) if low[u] > disc[pu]: bridges.add((min(u, pu), max(u, pu))) return bridges
[docs] def compute_ring_fraction( atoms: list[str], dmat: np.ndarray, cutoff: float, ) -> float: """Fraction of atoms that belong to at least one ring. Builds a neighbor graph from the distance matrix and finds all bridge edges using Tarjan's iterative DFS algorithm (O(N + E)). An atom is considered to be *in a ring* if and only if at least one of its incident edges is a non-bridge — that is, it participates in a cycle. This correctly identifies **all** members of every cycle, including atoms reached only via tree edges that connect into a cycle. The previous Union-Find implementation only marked the two direct endpoints of each detected back-edge, which systematically undercounted ring membership (e.g. a 3-cycle was reported as 2/3 instead of 3/3, a 6-cycle as 2/6 instead of 6/6). .. note:: The C++ ``graph_metrics_cpp`` path (``HAS_GRAPH = True``) has been updated in the same release to use the same Tarjan algorithm, so both paths now return consistent values. Parameters ---------- atoms: Element symbols (unused; retained for API symmetry with other metrics). dmat: Full n x n pairwise distance matrix (Å). cutoff: Distance cutoff (Å). A pair is counted as bonded when ``d_ij <= cutoff``. Returns ------- float Fraction of atoms in at least one ring, in [0, 1]. Returns 0.0 for structures with fewer than three atoms or no cycles. """ n = len(atoms) if n < 3: return 0.0 adj = _build_adj(n, dmat, cutoff) bridges = _tarjan_bridges(adj, n) in_ring = [False] * n for i in range(n): for j in adj[i]: if (min(i, j), max(i, j)) not in bridges: in_ring[i] = True break # one non-bridge edge is enough return float(sum(in_ring) / n)
[docs] def compute_charge_frustration( atoms: list[str], dmat: np.ndarray, cutoff: float, ) -> float: """Variance of Pauling electronegativity differences across neighbor pairs. For each neighbor pair (i, j) within *cutoff*, the absolute electronegativity difference ``abs(chi_i - chi_j)`` is computed. The metric is the *variance* of these differences over all neighbor pairs. A high value indicates a structure where electronegativity differences are inconsistently distributed across bonds: some neighbors are well matched while others are highly mismatched. This is analogous to *charge frustration* in disordered materials, where local charge neutrality cannot be satisfied simultaneously at every site. Noble gases and elements without a Pauling value use the module-level fallback of 1.0 (see :func:`~pasted._atoms.pauling_electronegativity`). Parameters ---------- atoms: Element symbols. dmat: Full n x n pairwise distance matrix (Å). cutoff: Distance cutoff (Å). A pair is counted as connected when ``d_ij <= cutoff``. Returns ------- float Variance of ``abs(delta-chi)`` across all neighbor pairs. Returns 0.0 when fewer than two neighbor pairs are detected (variance is undefined for a single observation). """ n = len(atoms) en = [_pauling_en(sym) for sym in atoms] diffs: list[float] = [] for i in range(n): for j in range(i + 1, n): if 1e-6 < dmat[i, j] <= cutoff: diffs.append(abs(en[i] - en[j])) if len(diffs) < 2: return 0.0 arr = np.array(diffs, dtype=float) return float(np.var(arr))
# --------------------------------------------------------------------------- # Unified entry point # ---------------------------------------------------------------------------
[docs] def compute_moran_I_chi( atoms: list[str], dmat: np.ndarray, cutoff: float, ) -> float: r"""Moran\'s I spatial autocorrelation for Pauling electronegativity. Measures whether atoms with similar electronegativity cluster spatially. .. math:: I = \frac{N}{W} \frac{\sum_{i \neq j} w_{ij}(\chi_i - \bar{\chi}) (\chi_j - \bar{\chi})}{\sum_i (\chi_i - \bar{\chi})^2} where :math:`w_{ij} = 1` when :math:`d_{ij} \leq` *cutoff* and 0 otherwise. Parameters ---------- atoms: Element symbols. dmat: Full n x n pairwise distance matrix (Å). cutoff: Distance cutoff for the step-function weight matrix (Å). Returns ------- float Moran\'s I clamped to ``(-∞, 1]``. Returns 0.0 when all atoms share electronegativity or no pair falls within *cutoff*. * I ~= 0 : random spatial arrangement (target for disordered structures) * I > 0 : same-electronegativity atoms cluster spatially * I < 0 : alternating high/low electronegativity (ionic-crystal-like) .. note:: Binary (0/1) weights are used rather than row-standardised weights, so the raw ``(n/W) * numer/denom`` can exceed +1 when the graph is very sparse (W < n). The result is clamped to 1.0 to honour the documented upper bound. """ chi = np.array([_pauling_en(a) for a in atoms], dtype=float) chi_bar = chi.mean() dev = chi - chi_bar denom = float(np.sum(dev**2)) if denom < 1e-30: return 0.0 n = len(atoms) W = (dmat > 1e-6) & (dmat <= cutoff) W_sum = float(W.sum()) if W_sum < 1e-30: return 0.0 numer = float((W * np.outer(dev, dev)).sum()) raw = float((n / W_sum) * (numer / denom)) return min(raw, 1.0)
def _compute_graph_ring_charge( atoms: list[str], pts: np.ndarray, radii: np.ndarray, cov_scale: float, cutoff: float, en_vals: np.ndarray, ) -> dict[str, float]: """Dispatch graph_lcc/cc, ring_fraction, charge_frustration, moran_I_chi. Uses the C++ ``graph_metrics_cpp`` (``HAS_GRAPH = True``) when available; this path builds a ``FlatCellList`` once and computes all five metrics in O(N·k) with a single adjacency list (no intermediate distance matrix). When ``HAS_GRAPH = False`` (C++ extension unavailable), falls back to the pure-Python implementations in this module. .. warning:: **O(N²) Python fallback — emergency use only.** This path constructs a full N×N distance matrix via ``scipy.spatial.distance.pdist`` / ``squareform``, which is an **O(N²) memory and time operation**. At N=500 the matrix alone occupies ~2 MB and the wall time is ~100× slower than the C++ path (~100 ms vs ~1 ms). At N=2000 the matrix is ~32 MB and the call takes several seconds. This fallback exists only for environments where the C++ extension cannot be compiled. **It is not intended for production use.** Install with a C++17 compiler (``pip install -e .`` after ``pip install pybind11``) to enable ``HAS_GRAPH = True``. """ if _HAS_GRAPH: # Single C++ call: FlatCellList built once, all 5 metrics computed. return dict(_cpp_graph_metrics(pts, radii, cov_scale, en_vals, cutoff)) # Pure-Python fallback dmat = _squareform(_pdist(pts)) return { **compute_graph_metrics(dmat, cutoff), "ring_fraction": compute_ring_fraction(atoms, dmat, cutoff), "charge_frustration": compute_charge_frustration(atoms, dmat, cutoff), "moran_I_chi": compute_moran_I_chi(atoms, dmat, cutoff), }
[docs] def compute_all_metrics( atoms: list[str], positions: list[Vec3], n_bins: int = 20, w_atom: float = 0.5, w_spatial: float = 0.5, cutoff: float | None = None, cov_scale: float = 1.0, ) -> dict[str, float]: """Compute all disorder metrics for a single structure. The exact count is ``len(ALL_METRICS)`` (currently :data:`~pasted._atoms.ALL_METRICS`). **C++ path** (``HAS_GRAPH = True``): all pair-enumeration uses a single shared adjacency list built via ``FlatCellList`` (O(N·k)). The adjacency list is constructed once and reused for all five graph/ring/charge/Moran metrics. ``graph_cc`` triangle counting uses sorted adjacency lists with ``binary_search`` for O(N·k²·log k) rather than the former O(N·k³) ``std::find`` scan. ``rdf_h_cpp`` streams distances directly into the histogram without materialising an intermediate ``pair_dists`` vector. ``scipy.spatial.distance.pdist`` / ``squareform`` are **never called**. **Pure-Python fallback** (``HAS_GRAPH = False``): :func:`compute_h_spatial` and :func:`compute_rdf_deviation` use ``scipy.spatial.cKDTree`` (O(N·k)), but the five graph/ring/charge/Moran metrics are computed via :func:`_compute_graph_ring_charge`, which builds a full N×N distance matrix with ``pdist`` / ``squareform`` — an **O(N²) operation** that is ~100× slower than the C++ path at N=500. This fallback is intended only for environments where the C++ extension cannot be compiled. Parameters ---------- atoms: Element symbols. positions: Cartesian coordinates (Å). n_bins: Histogram bins for :func:`compute_h_spatial` and :func:`compute_rdf_deviation`. w_atom: Weight of ``H_atom`` in ``H_total``. w_spatial: Weight of ``H_spatial`` in ``H_total``. cutoff: Distance cutoff (Å) for all local metrics. When None (the default), auto-computed as 1.5 × median(r_i + r_j) over covalent radii. cov_scale: Retained for API compatibility; no longer used internally. Defaults to ``1.0``. Returns ------- dict with keys matching :data:`pasted._atoms.ALL_METRICS`. """ pts = np.array(positions, dtype=float) # (n, 3) radii = np.array([_cov_radius_ang(a) for a in atoms]) en_vals = np.array([_pauling_en(a) for a in atoms]) if cutoff is None: # Auto-compute: 1.5 × median(r_i + r_j). # Using the O(N) identity: median(r_i + r_j) ≈ 2 × median(r_i). # This matches the approach used in place_maxent (v0.2.6+) and # avoids the O(N² log N) pair enumeration + sort that dominated # wall time for large structures (e.g. ~27× slower at N=1000). # NOTE: this approximation is accurate for unimodal radius distributions # (typical element pools). For strongly bimodal pools (e.g. H+heavy # metals) the error is up to ~10%; pass an explicit cutoff= if needed. median_sum = float(np.median(radii)) * 2.0 cutoff = cov_scale * 1.5 * median_sum ha = compute_h_atom(atoms) if _HAS_COMBINED: raw = dict(_all_metrics_cpp(pts, radii, en_vals, cutoff, n_bins)) hs = float(raw["h_spatial"]) rdf_dev = float(raw["rdf_dev"]) graph_result = { k: float(raw[k]) for k in ( "graph_lcc", "graph_cc", "ring_fraction", "charge_frustration", "moran_I_chi", ) } stein_result = {f"Q{lv}": float(np.asarray(raw[f"Q{lv}"]).mean()) for lv in [4, 6, 8]} adv_result = { k: float(raw[k]) for k in ( "bond_angle_entropy", "coordination_variance", "radial_variance", "local_anisotropy", ) } return { "H_atom": ha, "H_spatial": hs, "H_total": w_atom * ha + w_spatial * hs, "RDF_dev": rdf_dev, "shape_aniso": compute_shape_anisotropy(pts), **stein_result, **graph_result, **adv_result, } if _HAS_GRAPH: rdf_h = dict(_rdf_h_cpp(pts, cutoff, n_bins)) hs = float(rdf_h["h_spatial"]) rdf_dev = float(rdf_h["rdf_dev"]) graph_result = dict(_cpp_graph_metrics(pts, radii, cov_scale, en_vals, cutoff)) else: hs = compute_h_spatial(pts, cutoff, n_bins) rdf_dev = compute_rdf_deviation(pts, cutoff, n_bins) graph_result = _compute_graph_ring_charge(atoms, pts, radii, cov_scale, cutoff, en_vals) return { "H_atom": ha, "H_spatial": hs, "H_total": w_atom * ha + w_spatial * hs, "RDF_dev": rdf_dev, "shape_aniso": compute_shape_anisotropy(pts), **compute_steinhardt(pts, [4, 6, 8], cutoff), **graph_result, **_compute_adversarial(NeighborList(pts, cutoff)), }
# --------------------------------------------------------------------------- # Filter helper # ---------------------------------------------------------------------------
[docs] def passes_filters( metrics: dict[str, float], filters: list[tuple[str, float, float]], ) -> bool: """Return ``True`` iff *metrics* satisfies every (metric, lo, hi) filter.""" for metric, lo, hi in filters: v = metrics.get(metric, float("nan")) if math.isnan(v) or not (lo <= v <= hi): return False return True
# --------------------------------------------------------------------------- # Angular entropy (diagnostic -- not in ALL_METRICS, not in XYZ comment) # ---------------------------------------------------------------------------
[docs] def compute_angular_entropy( positions: list[Vec3], cutoff: float, n_bins: int = 20, ) -> float: """Mean per-atom angular entropy of neighbor direction distributions. For each atom *i*, the directions to its neighbors within *cutoff* are projected onto the unit sphere. The polar angle theta distribution is histogrammed and its Shannon entropy is computed. The result is averaged over all atoms that have at least one neighbor. A value close to ln(*n_bins*) indicates a near-uniform (maximum-entropy) angular distribution -- neighbors are spread evenly over the sphere. A low value indicates clustering of neighbors in certain directions, i.e. accidental local order. This metric is intended as a diagnostic for the ``maxent`` placement mode and is **not** included in ``ALL_METRICS`` or in XYZ comment lines. Uses ``scipy.spatial.cKDTree`` for O(N*k) pair enumeration instead of a full O(N^2) distance matrix. Parameters ---------- positions: Cartesian coordinates (Å). cutoff: Neighbor distance cutoff (Å). n_bins: Number of histogram bins for the theta distribution (default: 20). Returns ------- float Mean per-atom angular Shannon entropy. Returns 0.0 for structures with fewer than two atoms or no neighbors within *cutoff*. """ pts = np.array(positions, dtype=float) n = len(pts) if n < 2: return 0.0 tree = _cKDTree(pts) pairs = tree.query_pairs(cutoff, output_type="ndarray") if len(pairs) == 0: return 0.0 # Build directed (both-way) neighbor index rows = np.concatenate([pairs[:, 0], pairs[:, 1]]) cols = np.concatenate([pairs[:, 1], pairs[:, 0]]) diff = pts[rows] - pts[cols] # (n_bonds, 3) r = np.linalg.norm(diff, axis=1) safe_r = np.where(r > 0, r, 1.0) d_hat = diff / safe_r[:, np.newaxis] # (n_bonds, 3) theta = np.arccos(np.clip(d_hat[:, 2], -1.0, 1.0)) # (n_bonds,) entropies: list[float] = [] for i in range(n): nb_mask = rows == i nb_theta = theta[nb_mask] if len(nb_theta) < 2: continue counts, _ = np.histogram(nb_theta, bins=n_bins, range=(0.0, math.pi)) entropies.append(_shannon_np(counts.astype(float))) return float(np.mean(entropies)) if entropies else 0.0
# =========================================================================== # Adversarial metrics (added in v0.4.0) # =========================================================================== # These four metrics expose structural features that common GNN/descriptor # architectures treat as implicitly fixed (uniform coordination, isotropic # environments, sharp RBF shells, specific bond-angle peaks). They share # a single NeighborList constructed once in compute_all_metrics. # =========================================================================== def _compute_bond_angle_entropy(nl: NeighborList) -> float: """Three-body bond-angle Shannon entropy, averaged over all atoms. For each atom *j* with at least 2 neighbours within the cutoff, all pairwise angles ``theta_{ajb} = arccos(u_ja · u_jb)`` are histogrammed into 36 equal bins over ``[0, pi]``. The Shannon entropy of the per-atom histogram is averaged over all contributing atoms. When the C++ extension is available (``HAS_BA_CPP = True``) the fast ``bond_angle_entropy_cpp`` path is used. Otherwise a NumPy fallback is executed; both paths use 36 bins and produce identical results. Targets the GNN / spherical-harmonic bias: "bond angles concentrate at 109.5° / 120°." Range ----- ``[0, ln(36)]`` — 0 = single angle, ln(36) ≈ 3.58 = uniform. Parameters ---------- nl: Pre-built ``NeighborList`` for the structure. Returns ------- float """ if nl.n_pairs == 0: return 0.0 if _HAS_BA_CPP: return float(_bond_angle_entropy_cpp(nl.pts, nl.cutoff)) # NumPy fallback — bin count fixed at 36 to match C++ output. n_bins = 36 n = nl.n_atoms dh = nl.unit_diff # (2P, 3) — cached # Exclude coincident pairs: C++ skips d² < 1e-20; Python uses the same # threshold via all_dists so the two paths agree numerically. valid = nl.all_dists > 1e-10 entropies: list[float] = [] for j in range(n): mask = (nl.rows == j) & valid if mask.sum() < 2: continue vecs = dh[mask] cos_t = np.clip(vecs @ vecs.T, -1.0, 1.0) angles = np.arccos(cos_t[np.triu_indices(len(vecs), k=1)]) counts, _ = np.histogram(angles, bins=n_bins, range=(0, math.pi)) p = counts[counts > 0].astype(float) if p.sum() == 0: continue p /= p.sum() entropies.append(float(-np.sum(p * np.log(p)))) return float(np.mean(entropies)) if entropies else 0.0 def _compute_coordination_variance(nl: NeighborList) -> float: """Variance of per-atom coordination numbers. Targets the GNN aggregation bias: "coordination number is nearly constant (octet rule)." Uses ``NeighborList.deg`` (cached after first call). Range ----- ``[0, ∞)`` — 0 = all atoms have identical coordination. Parameters ---------- nl: Pre-built ``NeighborList`` for the structure. Returns ------- float """ if nl.n_atoms < 2 or nl.n_pairs == 0: return 0.0 return float(np.var(nl.deg)) def _compute_radial_variance(nl: NeighborList) -> float: """Mean per-atom variance of neighbour distances. For each atom *i*, computes ``Var(d_{ij})`` over all neighbours *j* within the cutoff, then averages over atoms that have at least one neighbour. Unlike ``RDF_dev`` (which measures global g(r) shape), this captures local shell disorder: atoms whose near and far neighbours coexist within the same cutoff sphere score high. Targets the RBF bias: "first/second coordination shells are well-separated." Uses ``NeighborList.dists_sq`` (cached) to avoid recomputing ``d²``. Range ----- ``[0, ∞)`` Ų — 0 = all neighbours equidistant. Parameters ---------- nl: Pre-built ``NeighborList`` for the structure. Returns ------- float """ if nl.n_atoms < 2 or nl.n_pairs == 0: return 0.0 n = nl.n_atoms deg = nl.deg mask = deg > 0 if not mask.any(): return 0.0 sdeg = np.where(mask, deg, 1.0) d = nl.all_dists d_sq = nl.dists_sq sum_d = np.bincount(nl.rows, weights=d, minlength=n) sum_d2 = np.bincount(nl.rows, weights=d_sq, minlength=n) mean_d = sum_d / sdeg mean_d2 = sum_d2 / sdeg per_atom_var = np.where(mask, np.maximum(0.0, mean_d2 - mean_d**2), 0.0) return float(per_atom_var[mask].mean()) def _compute_local_anisotropy(nl: NeighborList) -> float: """Mean per-atom relative shape anisotropy of local coordination tensor. For each atom *i* with at least one neighbour, builds the local covariance tensor ``T_i = (1/k_i) Σ_j u_ij u_ijᵀ`` and computes the relative shape anisotropy:: κ²_i = 1.5 × ||T_i||²_F / (tr T_i)² − 0.5 using the Frobenius-norm identity (no ``eigvalsh`` needed):: ||T||²_F = Txx² + Tyy² + Tzz² + 2(Txy² + Txz² + Tyz²) tr T = Txx + Tyy + Tzz Unlike ``shape_aniso`` (global inertia tensor), this is per-atom and captures local environments such as square-planar or linear sites where the vector sum is zero but the distribution is anisotropic. Targets the bias: "local environments are isotropic or follow a known symmetry group." Uses ``NeighborList.unit_diff`` and ``NeighborList.deg`` (both cached). The ``np.divide(out=, where=)`` pattern is used instead of ``np.where(cond, a/b, 0)`` to avoid a ``RuntimeWarning`` on ``tr² = 0`` elements (the latter evaluates the division for all elements before selecting). Range ----- ``[0, 1]`` — 0 = isotropic, 1 = perfectly rod-like (linear chain). Parameters ---------- nl: Pre-built ``NeighborList`` for the structure. Returns ------- float """ n = nl.n_atoms if n < 2 or nl.n_pairs == 0: return 0.0 deg = nl.deg mask = deg > 0 if not mask.any(): return 0.0 dh = nl.unit_diff # (2P, 3) cached rows = nl.rows dx, dy, dz = dh[:, 0], dh[:, 1], dh[:, 2] sdeg = np.where(mask, deg, 1.0) Txx = np.bincount(rows, weights=dx * dx, minlength=n) / sdeg Txy = np.bincount(rows, weights=dx * dy, minlength=n) / sdeg Txz = np.bincount(rows, weights=dx * dz, minlength=n) / sdeg Tyy = np.bincount(rows, weights=dy * dy, minlength=n) / sdeg Tyz = np.bincount(rows, weights=dy * dz, minlength=n) / sdeg Tzz = np.bincount(rows, weights=dz * dz, minlength=n) / sdeg tr = Txx + Tyy + Tzz fr2 = Txx**2 + Tyy**2 + Tzz**2 + 2.0 * (Txy**2 + Txz**2 + Tyz**2) kap2 = np.zeros(n) valid = mask & (tr * tr > 1e-24) np.divide(1.5 * fr2, tr * tr, out=kap2, where=valid) kap2[valid] -= 0.5 np.clip(kap2, 0.0, 1.0, out=kap2) return float(kap2[mask].mean()) def _compute_adversarial(nl: NeighborList) -> dict[str, float]: """Compute all four adversarial metrics sharing one ``NeighborList``. Parameters ---------- nl: Pre-built ``NeighborList`` for the structure. Returns ------- dict with keys ``bond_angle_entropy``, ``coordination_variance``, ``radial_variance``, ``local_anisotropy``. """ return { "bond_angle_entropy": _compute_bond_angle_entropy(nl), "coordination_variance": _compute_coordination_variance(nl), "radial_variance": _compute_radial_variance(nl), "local_anisotropy": _compute_local_anisotropy(nl), }