Estimating the trace of large matrices efficiently is a key primitive in many machine learning tasks—e.g., computing divergence in continuous normalizing flows or approximating log-determinants in Gaussian processes. Hutchinson’s estimator uses random probes to give unbiased trace estimates but requires $O\bigl(1/\epsilon^2 \bigr)$ matrix–vector multiplications to achieve relative error $\epsilon$. Hutch++ enhances this to $O\bigl(1/\epsilon \bigr)$ by combining a one-time low-rank deflation with a few classical Hutchinson probes.
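As a baseline, here is a minimal NumPy sketch of vanilla Hutchinson's estimator (the function name and the toy test matrix are illustrative, not from any particular library). Each Rademacher probe $v$ contributes $v^\top A v$, whose expectation is exactly $\mathrm{tr}(A)$, but the variance forces many probes for high accuracy.

```python
import numpy as np

def hutchinson(matvec, n, num_probes, seed=0):
    """Vanilla Hutchinson: average of v^T A v over random +/-1 probe vectors."""
    rng = np.random.default_rng(seed)
    total = 0.0
    for _ in range(num_probes):
        v = rng.choice([-1.0, 1.0], size=n)   # Rademacher probe, E[v v^T] = I
        total += v @ matvec(v)                # one mat-vec per probe
    return total / num_probes

# Toy example: tr(A) = 1 + 2 + ... + 100 = 5050, recovered only approximately.
A = np.diag(np.arange(1.0, 101.0))
print(hutchinson(lambda x: A @ x, n=100, num_probes=500))
```

Halving the error takes roughly four times as many probes, which is the $O(1/\epsilon^{2})$ behavior Hutch++ improves on.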
1. Scoop-and-Weigh
Imagine a jar of mixed coins—pennies, nickels, dimes, and quarters—and you want its total value without counting each coin:
- Vanilla Hutchinson: scoop random handfuls and weigh them; each scoop is noisy, so many scoops are needed.
- Hutch++: first pull out and count all quarters (the “heavy hitters”) exactly, then take a few handfuls of the remaining small change. Fewer handfuls suffice because the leftovers are more uniform.
By removing the largest contributors first, Hutch++ drastically reduces the number of random probes.
2. Algorithm
Let $A \in \mathbb{R}^{n\times n}$ be symmetric positive semidefinite (PSD), with trace $\tau = \mathrm{tr}(A) = \sum_{j}\lambda_{j} \ge 0$; the relative-error guarantee below relies on the eigenvalues being nonnegative. We target relative error $\epsilon$. Hutch++ runs in two stages:
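For concreteness, the sketches below assume a synthetic PSD test matrix with a decaying spectrum; the helper name make_test_matrix and the chosen decay are arbitrary illustration choices, not part of the algorithm.

```python
import numpy as np

def make_test_matrix(n=500, decay=1.5, seed=0):
    """Symmetric PSD matrix A = U diag(j^-decay) U^T with a random orthogonal U."""
    rng = np.random.default_rng(seed)
    U, _ = np.linalg.qr(rng.standard_normal((n, n)))
    eigvals = 1.0 / np.arange(1, n + 1) ** decay   # lambda_j = j^{-decay}, all nonnegative
    return (U * eigvals) @ U.T                     # scale the columns of U, then recombine

A = make_test_matrix()
print(np.trace(A))   # the exact trace we will try to estimate
```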
2.1 Low-Rank Deflation (“Pulling Out the Quarters”)
- Sketch matrix: draw $S \in \mathbb{R}^{n\times m_{1}}$ with $m_{1} = O(1/\epsilon)$ random Gaussian columns.
- Amplify: compute \[ Y = A S, \] using $m_{1}$ mat–vecs. Each column $y_{i}=A s_{i}$ is biased toward top eigen-directions.
- Why? Write each random $s_{i}$ in the eigenbasis $\{v_{j}\}$: \[ s_{i} = \sum_{j=1}^{n} (v_{j}^\top s_{i})\,v_{j}, \] and then \[ y_{i} = A s_{i} = V\Lambda V^\top s_{i} = \sum_{j=1}^{n} \lambda_{j}(v_{j}^\top s_{i})\,v_{j}. \] Each component along $v_{j}$ is scaled by $\lambda_{j}$, so large eigenvalues dominate $y_{i}$.
- Orthonormalize: perform QR on $Y$ to get $Q \in \mathbb{R}^{n\times m_{1}}$ with orthonormal columns. Theory shows \[ \mathrm{span}(Q) \approx \text{span of top-}m_{1}\text{ eigenvectors of }A. \]
- Exact trace of low-rank part: form the small matrix \[ B = Q^\top (A Q), \] which costs another $m_{1}$ mat–vecs for $AQ$, then compute \[ \tau_{\mathrm{low}} = \mathrm{tr}(B). \]
This captures the “quarters.” The residual matrix is \[ R = P A P, \quad P = I - QQ^\top, \] where $P$ is an orthogonal projector onto the subspace left over after removing span($Q$). A projector satisfies \[ P^{2} = P, \quad P^{\top} = P. \] Intuitively, for any vector $x$, $P x = x - QQ^\top x$ subtracts out its projection onto span($Q$), leaving the component orthogonal to those top directions.
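A minimal sketch of this deflation stage, assuming $A$ is available only through a matvec callback (the function name deflate and its signature are illustrative):

```python
import numpy as np

def deflate(matvec, n, m1, seed=0):
    """Stage 2.1: return the deflation basis Q and the exact trace of Q^T A Q."""
    rng = np.random.default_rng(seed)
    S = rng.standard_normal((n, m1))                               # Gaussian sketch
    Y = np.column_stack([matvec(S[:, i]) for i in range(m1)])      # Y = A S   (m1 mat-vecs)
    Q, _ = np.linalg.qr(Y)                                         # orthonormal basis for span(Y)
    AQ = np.column_stack([matvec(Q[:, i]) for i in range(m1)])     # A Q       (m1 mat-vecs)
    tau_low = float(np.trace(Q.T @ AQ))                            # tr(B) with B = Q^T A Q
    return Q, tau_low
```

Note that the projector $P = I - QQ^\top$ is never formed explicitly; the next stage applies it as $x \mapsto x - Q(Q^\top x)$, which costs only $O(n\,m_{1})$ per vector.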
2.2 Residual Estimation (“Few Handfuls of Small Coins”)
Perform $m_{2}=O(1/\epsilon)$ Hutchinson probes on $R=PAP$ (far fewer than the $O(1/\epsilon^{2})$ plain Hutchinson would need, because deflation has shrunk the residual's variance):
- Project a random probe $v_{j}$ (e.g., with i.i.d.\ $\pm 1$ entries): \[ z_{j} = P v_{j} = v_{j} - QQ^\top v_{j}. \]
- Probe: \[ X_{j} = z_{j}^\top (A z_{j}) = z_{j}^\top R\,z_{j}, \] where the two forms agree because $P z_{j} = z_{j}$ and $P^{2}=P$.
- Average for an unbiased estimate: \[ \tau_{\mathrm{res}} = \frac{1}{m_{2}}\sum_{j=1}^{m_{2}} X_{j} \approx \mathrm{tr}(R). \]
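A matching sketch of this residual stage, again with illustrative names and with $P$ applied implicitly:

```python
import numpy as np

def residual_trace(matvec, Q, m2, seed=1):
    """Stage 2.2: unbiased Hutchinson estimate of tr(P A P), P = I - Q Q^T."""
    rng = np.random.default_rng(seed)
    n = Q.shape[0]
    total = 0.0
    for _ in range(m2):
        v = rng.choice([-1.0, 1.0], size=n)   # Rademacher probe v_j
        z = v - Q @ (Q.T @ v)                 # z_j = P v_j, without ever forming P
        total += z @ matvec(z)                # z^T A z = z^T (P A P) z since P z = z
    return total / m2
```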
2.3 Combine
The final estimate is \[ \widehat{\tau} = \tau_{\mathrm{low}} + \tau_{\mathrm{res}}, \] using $2m_{1}+m_{2}=O(1/\epsilon)$ mat–vecs in total ($m_{1}$ for $AS$, $m_{1}$ for $AQ$, $m_{2}$ for the residual probes) instead of the $O(1/\epsilon^{2})$ that plain Hutchinson would need.
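Here is an end-to-end toy run that reuses the make_test_matrix, deflate, and residual_trace sketches above (so it assumes those snippets have already been executed); the specific budgets $m_{1}=m_{2}=30$ are arbitrary.

```python
A = make_test_matrix(n=500)
matvec = lambda x: A @ x

Q, tau_low = deflate(matvec, n=500, m1=30)    # stage 2.1: 60 mat-vecs (A S and A Q)
tau_res = residual_trace(matvec, Q, m2=30)    # stage 2.2: 30 mat-vecs
tau_hat = tau_low + tau_res                   # final Hutch++ estimate

print(f"Hutch++ estimate: {tau_hat:.4f}   exact trace: {np.trace(A):.4f}")
```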
3. Why It Works: Capturing the Top Eigenspace
- Eigen-decomposition: \[ A = V\Lambda V^\top, \quad \Lambda = \mathrm{diag}(\lambda_{1} \ge \lambda_{2} \ge \dots \ge \lambda_{n}). \]
- Random sketch columns $s_{i}$ expand as above; $y_{i}=As_{i}$ amplifies large $\lambda_{j}$.
- QR → $Q$: orthonormalizing $Y$ extracts the dominant subspace.
With $m_{1}=O(1/\epsilon)$ and $A$ PSD, the tail eigenvalue $\lambda_{m_{1}+1}$ is at most $\tau/m_{1}$ (otherwise the top $m_{1}+1$ eigenvalues alone would exceed $\tau$), so the best rank-$m_{1}$ residual has small Frobenius norm: \[ \lVert A - A_{m_{1}}\rVert_F^{2} \;=\; \sum_{j>m_{1}} \lambda_{j}^{2} \;\le\; \lambda_{m_{1}+1}\sum_{j>m_{1}}\lambda_{j} \;\le\; \frac{\tau^{2}}{m_{1}} \;=\; O(\epsilon)\,\tau^{2}, \] where $A_{m_{1}}$ denotes the best rank-$m_{1}$ approximation to $A$. Randomized sketching theory guarantees that, with high probability, $\lVert PAP\rVert_F$ is within a constant factor of $\lVert A - A_{m_{1}}\rVert_F = O(\sqrt{\epsilon})\,\tau$. The residual trace $\mathrm{tr}(PAP)$ itself need not be small (think of a flat spectrum), but its Frobenius norm is, and that is what controls the variance of the Hutchinson probes in stage 2.2.
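A quick numerical check of this Frobenius-norm shrinkage (matrix size, spectrum, and $m_{1}$ are arbitrary illustration choices):

```python
import numpy as np

rng = np.random.default_rng(0)
n, m1 = 400, 40
U, _ = np.linalg.qr(rng.standard_normal((n, n)))
A = (U * (1.0 / np.arange(1, n + 1))) @ U.T   # PSD test matrix, lambda_j = 1/j

S = rng.standard_normal((n, m1))
Q, _ = np.linalg.qr(A @ S)                    # deflation basis from the sketch A S
P = np.eye(n) - Q @ Q.T                       # explicit projector is fine at this size
R = P @ A @ P

print("||A||_F        :", np.linalg.norm(A))
print("||P A P||_F    :", np.linalg.norm(R))             # much smaller than ||A||_F
print("tr(A)/sqrt(m1) :", np.trace(A) / np.sqrt(m1))     # the bound, up to a constant
```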
4. Detailed Error Analysis
- Exact low-rank split: since $PQ = 0$, the cross terms $\mathrm{tr}(QQ^\top A P)$ and $\mathrm{tr}(P A\, QQ^\top)$ vanish, and \[ \tau \;=\; \mathrm{tr}(Q^\top A Q) + \mathrm{tr}(P A P) \;=\; \tau_{\mathrm{low}} + \mathrm{tr}(R) \] holds exactly. The deflation stage therefore introduces no error of its own; all of the error comes from estimating $\mathrm{tr}(R)$ in stage 2.2.
- Residual estimation error: by the Frobenius-norm bound from Section 3, \[ \lVert R\rVert_F \;=\;\lVert P\,A\,P\rVert_F \;\le\; C\,\lVert A - A_{m_{1}}\rVert_F \;=\; O(\sqrt{\epsilon})\,\tau \] with high probability, for a modest constant $C$. Hutchinson's estimator applied to $R$ has standard deviation $O\bigl(\lVert R\rVert_F/\sqrt{m_{2}}\bigr)$, so $m_{2}=O(1/\epsilon)$ probes suffice for \[ |\tau_{\mathrm{res}} - \mathrm{tr}(R)| = O(\epsilon\tau) \] with high probability.
- Overall: since the low-rank split is exact and the residual estimate is accurate, \[ |\widehat{\tau} - \tau| = O(\epsilon\tau) \] with high probability, i.e.\ $\widehat{\tau}=(1\pm O(\epsilon))\,\tau$, using $O(1/\epsilon)$ mat–vecs in total.
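To see the scaling in practice, here is a small Monte Carlo comparison of plain Hutchinson and Hutch++ at the same mat–vec budget; the matrix, budget, and trial count are illustrative choices, not an experiment from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)
n, budget, trials = 300, 60, 50
U, _ = np.linalg.qr(rng.standard_normal((n, n)))
A = (U * (1.0 / np.arange(1, n + 1))) @ U.T        # PSD test matrix, lambda_j = 1/j
tau = np.trace(A)

def plain_hutchinson(m):
    V = rng.choice([-1.0, 1.0], size=(n, m))        # m Rademacher probes
    return np.einsum('ij,ij->', V, A @ V) / m       # (1/m) * sum_j v_j^T A v_j

def hutch_pp(m):
    m1, m2 = m // 3, m - 2 * (m // 3)               # split the budget: A S, A Q, probes
    S = rng.standard_normal((n, m1))
    Q, _ = np.linalg.qr(A @ S)
    V = rng.choice([-1.0, 1.0], size=(n, m2))
    Z = V - Q @ (Q.T @ V)                           # apply P = I - Q Q^T column-wise
    return np.trace(Q.T @ (A @ Q)) + np.einsum('ij,ij->', Z, A @ Z) / m2

err_h  = np.mean([abs(plain_hutchinson(budget) - tau) for _ in range(trials)]) / tau
err_pp = np.mean([abs(hutch_pp(budget) - tau) for _ in range(trials)]) / tau
print(f"mean relative error   Hutchinson: {err_h:.4f}   Hutch++: {err_pp:.4f}")
```

As the budget $m$ grows, Hutchinson's error shrinks like $1/\sqrt{m}$ while Hutch++'s shrinks roughly like $1/m$, which is exactly the $O(1/\epsilon^{2})$ versus $O(1/\epsilon)$ gap.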
5. Practical Considerations
- Spectral decay: the worst-case $O(1/\epsilon)$ guarantee holds for any PSD matrix, but the practical advantage over vanilla Hutchinson is largest when a few large eigenvalues carry most of the trace; for nearly flat spectra, plain Hutchinson is already accurate and deflation adds little.
- Preprocessing cost: QR on $Y$ and storing $Q$ add overhead—best when $n$ large and $\epsilon$ small.
- Tuning: select $m_{1},m_{2}$ to balance deflation vs. sampling.
- Projector recap: $P=I-QQ^\top$ satisfies $P^{2}=P$, projecting out span($Q$).
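A tiny check of those projector identities for a random orthonormal $Q$ (dimensions are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)
Q, _ = np.linalg.qr(rng.standard_normal((50, 5)))     # orthonormal columns
P = np.eye(50) - Q @ Q.T

x = rng.standard_normal(50)
print(np.allclose(P @ P, P), np.allclose(P.T, P))     # idempotent and symmetric: True True
print(np.allclose(Q.T @ (P @ x), 0))                  # P x has no component in span(Q): True
```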
6. Conclusion
Hutch++ combines low-rank deflation (“lift out the quarters”) with Hutchinson probes on the residual (“handfuls of coins”) to cut the mat–vec cost of trace estimation from $O(1/\epsilon^{2})$ to $O(1/\epsilon)$, yielding an efficient, unbiased, and scalable method for high-accuracy trace estimation.