Jan 12, 2019       tags: | math notes | visualisation |

Power iteration algorithm — a visualization

The power method is a simple iterative algorithm that can be used to find the eigenvector of a matrix (\(\mathbf T\)) associated with the largest absolute eigenvalue. Given some vector \(\mathbf v_i\), the next best approximation of the eigenvector is given by the normalized product of \(\mathbf T\) and \(\mathbf v_i\)

\[\mathbf v_{i+1} = {\mathbf T \mathbf v_i \over \lVert \mathbf T \mathbf v_i \rVert}\]
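The iteration above can be sketched in a few lines of Python with NumPy (the matrix `T`, the iteration count, and the random starting vector are illustrative choices, not part of the algorithm itself):

```python
import numpy as np

def power_iteration(T, n_iter=50, seed=0):
    """Approximate the dominant eigenvector of T by repeated
    multiplication and renormalization."""
    rng = np.random.default_rng(seed)
    v = rng.standard_normal(T.shape[0])  # random starting vector
    for _ in range(n_iter):
        w = T @ v                      # apply the matrix
        v = w / np.linalg.norm(w)      # renormalize to unit length
    return v

T = np.array([[2.0, 1.0],
              [1.0, 3.0]])
v = power_iteration(T)
# The Rayleigh quotient of the converged vector approximates
# the dominant eigenvalue
lam = v @ T @ v
```

The normalization step only controls the vector's length; the direction is determined entirely by the repeated application of `T`.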

Why does it work? Let’s consider the simplest case of a diagonalizable matrix \(\mathbf T\). Using eigendecomposition,


\(\mathbf T = \mathbf E \mathbf \Lambda \mathbf E^{-1}\),

where \(\mathbf E\) is the matrix of eigenvectors and \(\mathbf \Lambda\) is the diagonal matrix of eigenvalues.

We can ignore normalization for now since it does not affect the direction of the resulting vector. Then power iteration simplifies to calculating \(\mathbf T^n \mathbf u\), where \(\mathbf u\) is some starting vector:

\[\mathbf T^n = \mathbf E \mathbf \Lambda \mathbf E^{-1} \mathbf E \mathbf \Lambda \mathbf E^{-1} ... \mathbf E \mathbf \Lambda \mathbf E^{-1} = \mathbf E \mathbf \Lambda^n \mathbf E^{-1}\]
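The telescoping identity \(\mathbf T^n = \mathbf E \mathbf \Lambda^n \mathbf E^{-1}\) is easy to check numerically (the matrix and exponent below are arbitrary examples):

```python
import numpy as np

T = np.array([[2.0, 1.0],
              [1.0, 3.0]])
lam, E = np.linalg.eig(T)   # T = E Λ E^{-1}
n = 5

# n-th power computed directly...
direct = np.linalg.matrix_power(T, n)
# ...versus through the eigendecomposition: E Λ^n E^{-1}
via_eig = E @ np.diag(lam**n) @ np.linalg.inv(E)
```

The inner \(\mathbf E^{-1} \mathbf E\) factors cancel, so only the diagonal matrix gets raised to the power \(n\).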

Given some vector \(\mathbf u\), \(\mathbf T^n \mathbf u = \mathbf E \mathbf \Lambda^n \mathbf E^{-1} \mathbf u\), where \(\mathbf E^{-1} \mathbf u\) can be interpreted as presenting vector \(\mathbf u\) in the eigenbasis.

Since we don’t care about the length of the eigenvector at this point, we are free to scale \(\mathbf \Lambda^n\) by any number. Let us scale it by \(\lambda_{max}^{-n}\):

\[{1 \over \lambda_{max}^n} \mathbf \Lambda^n = {1 \over \lambda_{max}^n} \begin{bmatrix} \lambda_1^n & 0 & 0 & 0 \\ 0 & \lambda_2^n & 0 & 0 \\ 0 & 0 & ... & 0 \\ 0 & 0 & 0 & \lambda_n^n \end{bmatrix} = \begin{bmatrix} {\lambda_1^n \over \lambda_{max}^n} & 0 & 0 & 0 \\ 0 & {\lambda_2^n \over \lambda_{max}^n} & 0 & 0 \\ 0 & 0 & ... & 0 \\ 0 & 0 & 0 & {\lambda_n^n\over \lambda_{max}^n} \end{bmatrix}\]

For sufficiently large \(n\), all entries but one approach zero (assuming the dominant eigenvalue is strictly larger in absolute value than the rest). The entry corresponding to the largest eigenvalue in \(\mathbf \Lambda\) stays equal to 1. This matrix has the property of stripping vector \(\mathbf u\) (presented in the eigenbasis!) of all coordinates but the one corresponding to the eigenvector with the largest eigenvalue, thus giving us a vector collinear with that eigenvector. Finally, applying matrix \(\mathbf E\) to this new vector presents it in our original basis.
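This shrinking of the scaled diagonal can be seen directly (the eigenvalues below are an arbitrary example with 4 as the dominant one):

```python
import numpy as np

lam = np.array([4.0, 2.0, -1.0, 0.5])   # example eigenvalues; 4 dominates
n = 30

# Diagonal entries of Λ^n / λ_max^n: the dominant entry stays 1,
# the rest shrink geometrically with ratio |λ_i / λ_max| < 1
scaled = (lam / np.abs(lam).max())**n
```

After 30 steps the non-dominant entries are already below \(10^{-9}\), which is why multiplying by this matrix effectively projects onto a single eigenbasis coordinate.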

What if we picked \(\mathbf u\) normal to the dominant eigenvector, though? The power iteration algorithm would fail, since vector \(\mathbf E^{-1} \mathbf u\) has 0 as the coordinate corresponding to the dominant eigenvector, and the limiting form of \(\mathbf \Lambda^n\) then maps \(\mathbf u\) to the zero vector. Fortunately, if vector \(\mathbf u\) is picked at random, it is unlikely to be normal to the dominant eigenvector.

Is it possible to use power iteration to find a non-dominant eigenvector? Yes. Substituting \(\mathbf T - \lambda_{max} \mathbf I\) for \(\mathbf T\) shifts every eigenvalue down by \(\lambda_{max}\), so the iteration converges to the eigenvector whose eigenvalue lies farthest from \(\lambda_{max}\) (for a matrix with real eigenvalues, the smallest one). It is not difficult to show that the eigendecomposition of this new matrix is \(\mathbf E (\mathbf \Lambda - \lambda_{max} \mathbf I) \mathbf E^{-1}\). Scaling \((\mathbf \Lambda - \lambda_{max} \mathbf I)^{n}\) by \((\lambda_{min} - \lambda_{max})^{-n}\) will result in a matrix that nullifies all vector coordinates except the one associated with the least significant eigenvector.

In the canvas below (powered by vtvt) you can visualize 20 iterations of the power method. Vectors \(\mathbf t_1\) and \(\mathbf t_2\) set the columns of \(\mathbf T\), and the eigenvectors are calculated analytically and updated whenever the matrix changes. Vector \(\mathbf u_0\) sets the initial iteration of the power method. Note how much slower convergence becomes when the two eigenvalues are close in absolute value.
