A General Matrix Decomposition


There’s an interesting, general way to decompose any matrix. The method consists of constructing a sequence of rank one approximations and is a simple generalization of the singular value decomposition (SVD). In fact, the SVD is the optimal version of the method in the sense that it requires the fewest rank one approximations.

Consider any matrix $A \in \mathbb{R}^{m\times n}$. Then, let $\sigma u v^T$ be a rank one matrix and consider minimizing the squared Frobenius norm of the difference, $\|A - \sigma u v^T \|^2_{F}$.

First expand,

\[\begin{split} \frac{\|A - \sigma u v^T\|^2_{F}}{2} & = \frac{\|A\|_F^2}{2} + \frac{\|\sigma u v^T\|_F^2}{2} - \sigma \cdot u^T A v \\ & = \frac{\|A\|_F^2}{2} + \frac{\sigma^2 \|u\|_2^2 \|v\|_2^2}{2} - \sigma \cdot u^T A v \end{split}\]

where the cross term is just the Frobenius inner product, $\langle A, \sigma u v^T \rangle_F = \sigma \, tr(A^T u v^T) = \sigma \, u^T A v$. This is a quadratic in $\sigma$ and is easy to minimize: setting the derivative to 0 gives

\[0 = \sigma_* \|u\|_2^2 \|v\|_2^2 - u^T A v \implies \sigma_* = \frac{u^T A v}{\|u\|_2^2 \|v\|_2^2}\]
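
As a quick numerical sanity check (a minimal sketch of my own, not part of the original derivation), the closed form for $\sigma_*$ can be compared against a brute-force scan over $\sigma$:

import numpy as np
from numpy.linalg import norm

rng = np.random.default_rng(0)
A = rng.standard_normal((5, 3))
u = rng.standard_normal(5)
v = rng.standard_normal(3)

# Closed-form minimizer of ||A - sigma * u v^T||_F^2 over sigma
sigma_star = (u @ A @ v) / (norm(u)**2 * norm(v)**2)

# Brute-force scan: no nearby sigma should do better than sigma_star
f = lambda s: norm(A - s * np.outer(u, v))**2
grid = np.linspace(sigma_star - 1, sigma_star + 1, 1001)
assert all(f(sigma_star) <= f(s) + 1e-12 for s in grid)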

So long as $\sigma_* \neq 0$ we’ll have $\|A - \sigma_* u v^T\|^2_{F} < \|A\|_F^2$. This is easy to see, as

\[\begin{split} \|A - \sigma_* u v^T\|^2_{F} & = tr(A^TA) + \sigma^2_* \|u\|_2^2 \|v\|_2^2 - 2 \sigma_* \cdot u^T A v \\ & = tr(A^TA) - \left ( \frac{u^T A v}{\|u\|_2 \|v\|_2} \right)^2 \\ & = \sum_{i=1}^{rank(A)} \lambda_i^2 - \left ( \frac{u^T A v}{\|u\|_2 \|v\|_2} \right)^2 \\ & \leq \sum_{i=1}^{rank(A)} \lambda_i^2 \\ & = \|A\|_F^2 \end{split}\]

Note that no absolute values are needed, since $\left ( \frac{u^T A v}{\|u\|_2 \|v\|_2} \right)^2 \leq \lambda_1^2 \leq {\sum}_{i=1}^{rank(A)} \lambda_i^2$, which holds because $\lambda_1 = \|A\|_{op} = \sup_{v \neq 0} \frac{\|Av\|_2}{\|v\|_2}$ is the largest singular value of $A$.
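
The identity in the second line above is also easy to verify numerically (again just a sketch with made-up dimensions):

import numpy as np
from numpy.linalg import norm

rng = np.random.default_rng(1)
A = rng.standard_normal((6, 4))
u = rng.standard_normal(6)
v = rng.standard_normal(4)

sigma_star = (u @ A @ v) / (norm(u)**2 * norm(v)**2)

# ||A - sigma_* u v^T||_F^2 should equal ||A||_F^2 minus (u^T A v / (||u|| ||v||))^2
lhs = norm(A - sigma_star * np.outer(u, v))**2
rhs = norm(A)**2 - (u @ A @ v / (norm(u) * norm(v)))**2
assert abs(lhs - rhs) < 1e-9
assert lhs <= norm(A)**2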

Thus, we get the following simple algorithm for decomposing any matrix $A$.

A = Matrix(m,n)
R = A

procedure
	while frobenius_norm(R) > tol
	    # pick any u, v with transpose(u) * R * v != 0
	    u = Vector(m)
	    v = Vector(n)
	    sigma = transpose(u) * R * v / (norm(u)^2 * norm(v)^2)

	    store sigma, u, v
	    R = R - sigma * u * transpose(v)
	end while
end procedure

Say that the procedure runs for $k$ iterations. Collect the $\sigma_i$ into $\Sigma = \text{diag}(\sigma_1, \dots, \sigma_k)$, the $u_i$ into $U = [u_1, \dots, u_k]$, and the $v_i$ into $V = [v_1, \dots, v_k]$. Then, for tolerance $\epsilon$, we’ll have

\[\|A - U\Sigma V^T\|_F < \epsilon\]

In Python:

import numpy as np
from numpy.linalg import norm

m = 100
n = 20

# This is the matrix I want to approximate
A = np.random.randn(m,n)

k = 2000

R = A.copy()
for _ in range(k):
    # Draw the next pair of random vectors
    u = np.random.randn(m,1)
    v = np.random.randn(n,1)
    
    # Calculate sigma for this pair
    sigma = (u.transpose() @ R @ v).item() / (norm(u)**2 * norm(v)**2)
    
    # Update the residual
    R = R - sigma * u @ v.transpose()
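
To actually recover $U$, $\Sigma$, $V$ as described above, the loop just needs to store each $\sigma$, $u$, $v$. A sketch along the same lines as the snippet above (the list names and the final check are mine):

import numpy as np
from numpy.linalg import norm

m, n, k = 100, 20, 500
A = np.random.randn(m, n)

R = A.copy()
us, vs, sigmas = [], [], []
for _ in range(k):
    u = np.random.randn(m)
    v = np.random.randn(n)
    sigma = (u @ R @ v) / (norm(u)**2 * norm(v)**2)
    R = R - sigma * np.outer(u, v)
    us.append(u)
    vs.append(v)
    sigmas.append(sigma)

U = np.column_stack(us)    # m x k
V = np.column_stack(vs)    # n x k
Sigma = np.diag(sigmas)    # k x k

# By construction A - U Sigma V^T is exactly the final residual R
print(norm(A - U @ Sigma @ V.T), norm(R))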

In the case of the SVD, $k = \text{rank}(A) \leq \min(m,n)$. However, in the general case there really is no upper bound on $k$, as you could always construct $u,v$ that make $\sigma$ small and thus convergence slow. The fun part of this algorithm is the flexibility it gives in choosing $U,V$. They can be nearly anything you want. So, they could be sampled randomly, chosen for nice numerical properties, or, if you wanted, chosen horribly. It is also very fast as it doesn’t require any eigendecompositions. Sometimes fast and not too accurate is better than slow and optimal.
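
For a concrete (admittedly contrived, and entirely my own) worst case: pick $u$ orthogonal to $Rv$, then $u^T R v = 0$, $\sigma = 0$, and the iteration makes no progress at all.

import numpy as np
from numpy.linalg import norm

rng = np.random.default_rng(0)
R = rng.standard_normal((100, 20))   # current residual
v = rng.standard_normal(20)

# Project a random u against R v so that u^T R v = 0
w = R @ v
u = rng.standard_normal(100)
u = u - (u @ w) / (w @ w) * w

sigma = (u @ R @ v) / (norm(u)**2 * norm(v)**2)
print(sigma)   # ~0, so the update R - sigma * u v^T changes nothing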

I came across this method while looking for a way to decompose some matrix $A$ such that $A = U \Sigma V^T$ where $U, V$ are stored in low precision, while allowing $\Sigma$ to be stored in full precision. This method allows you to take the entries of $U,V$ to be in $\{-1,0,1\}$.
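
As a rough illustration of that idea (using random $\pm 1$ columns rather than whatever smarter $\{-1,0,1\}$ choice one would actually use), $\sigma$ absorbs all of the scaling, so the residual still shrinks, just slowly:

import numpy as np
from numpy.linalg import norm

rng = np.random.default_rng(0)
m, n = 100, 20
A = rng.standard_normal((m, n))

R = A.copy()
for _ in range(2000):
    # Sign-only factors: every entry of u and v is +1 or -1
    u = rng.choice([-1.0, 1.0], size=m)
    v = rng.choice([-1.0, 1.0], size=n)
    sigma = (u @ R @ v) / (norm(u)**2 * norm(v)**2)
    R = R - sigma * np.outer(u, v)

print(norm(R) / norm(A))   # noticeably below 1, but far slower than the SVD's k = rank(A) terms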

But… I never ended up using it because, as often happens, O’Leary and Peleg had already described a more refined version of this idea back in 1983 [1]. Nonetheless, it is still a pretty nifty bit of mathematics. Most people have a pretty good idea of how the SVD works, but I never realized that a version of the same idea works for almost any vectors you want, so long as they are not in $null(A)$ or $null(A^T)$.

References

[1] D. O’Leary and S. Peleg, “Digital Image Compression by Outer Product Expansion,” in IEEE Transactions on Communications, vol. 31, no. 3, pp. 441-444, March 1983, doi: 10.1109/TCOM.1983.1095823.