PCA is great because it lets you reduce a data matrix to a lower dimension without losing much information. Although it is widely used, PCA doesn’t work well when the input data is noisy, because its objective function doesn’t account for noise in the input. As the name suggests, Robust PCA is a variant of PCA that is more robust against noise. It was efficiently solved by Candès et al. in 2011 [1].
*If you want a quick review of PCA, read my previous post and then come back.
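The goal of Robust PCA is to decompose a data matrix $M$ into a low-rank matrix $L$ and a sparse matrix $S$ such that $M = L + S$. This goal can be best illustrated with the image below: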
The intuition is that a noisy image $M$ can be decomposed into $L$ (the underlying, less noisy image) and $S$ (the noise). Just like in PCA, we would like $L$ to be of low rank. Furthermore, since $S$ is noise, we want it to be sparse.
Objective Function: PCP
How can we formulate this intuition in terms of an objective function? A first thought would be to try this:
$$\min_{L, S} \ \operatorname{rank}(L) + \lambda \|S\|_0 \quad \text{subject to} \quad L + S = M$$
The first term ensures that $L$ is low rank and the second term ensures that $S$ is sparse. Note that $\|S\|_0$ is the $\ell_0$ norm, the number of non-zero elements in $S$. If you’ve learned optimization before, you will know that this is not a good objective function, because it is neither continuous nor convex.
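To make the "not continuous" part concrete, here is a small NumPy sketch. The matrices are toy values I made up for illustration, not anything from the paper. A perturbation of size 1e-6 is enough to make both the rank and the non-zero count jump.

```python
import numpy as np

# Toy matrices chosen for illustration only.
A = np.ones((3, 3))              # all rows identical, so the rank is 1
B = A + 1e-6 * np.eye(3)         # a tiny perturbation of A

rank_A = np.linalg.matrix_rank(A)
rank_B = np.linalg.matrix_rank(B)

S = np.zeros((3, 3))             # no non-zero entries
T = S.copy()
T[0, 0] = 1e-6                   # one barely non-zero entry

l0_S = np.count_nonzero(S)       # the "l0 norm" of S
l0_T = np.count_nonzero(T)

print("rank:", rank_A, "->", rank_B)
print("non-zeros:", l0_S, "->", l0_T)
```

An arbitrarily small change in the input moves the objective by a whole integer, so gradient-based methods have nothing to follow.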
The key insight Candès et al. provided in the paper is that we can proxy this with what they named Principal Component Pursuit (PCP):
$$\min_{L, S} \ \|L\|_* + \lambda \|S\|_1 \quad \text{subject to} \quad L + S = M$$
Low rank is now encouraged by the nuclear norm $\|L\|_*$, which is the sum of the singular values of $L$. Sparsity is encouraged by the $\ell_1$ norm $\|S\|_1$, which I hope you’re familiar with.
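Here is a minimal sketch of how the PCP objective can be evaluated with NumPy. The function name `pcp_objective`, the toy matrices, and the weight `lam` are my own choices for illustration.

```python
import numpy as np

def pcp_objective(L, S, lam):
    """Evaluate ||L||_* + lam * ||S||_1 for a candidate split M = L + S."""
    nuclear = np.linalg.svd(L, compute_uv=False).sum()  # sum of singular values
    l1 = np.abs(S).sum()                                # entrywise l1 norm
    return nuclear + lam * l1

# Toy data (hypothetical): a rank-1 matrix plus one large corrupted entry.
L_true = np.outer([1.0, 2.0, 3.0], [1.0, 1.0, 1.0])
S_true = np.zeros((3, 3))
S_true[0, 2] = 5.0
M = L_true + S_true
lam = 1.0 / np.sqrt(3)  # illustrative weight, assumed for this example

good_split = pcp_objective(L_true, S_true, lam)      # low-rank + sparse split
all_sparse = pcp_objective(np.zeros_like(M), M, lam) # everything pushed into S
print(good_split, all_sparse)
```

On this toy example the low-rank-plus-sparse split scores lower than pushing everything into the sparse term, which is the behavior the objective is designed to reward.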
Candès showed that this proxy optimization function has the following property:
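If $L$ is sufficiently low rank but not sparse AND $S$ is sufficiently sparse but not low rank, then $L$ and $S$ can be recovered exactly, with high probability, using $\lambda = 1/\sqrt{\max(n_1, n_2)}$, where $M$ is an $n_1 \times n_2$ matrix. (Theorem 1.1)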
This is amazing, because decomposing $M$ into $L$ and $S$ in an unsupervised fashion seems very, very hard. The theorem states that this is possible. Sections 2 and 3 of the paper prove this theorem. Section 4 shows empirically that PCP can recover the decomposition. I will not cover these sections in this post, but I’ll leave a picture of the experimental result from the original paper here:
How to Solve PCP
PCP is convex, so we can solve it using, for example, interior point methods. However, the authors point out that this is not fast enough for large matrices (the paper cites an $O(n^6)$ cost for computing each step direction). In Section 5, they introduce the augmented Lagrange multiplier (ALM) method, which I found interesting.
Before discussing how we can solve PCP, let’s briefly look at what ALM does. Say we have the following optimization problem:
$$\min_x \ f(x) \quad \text{subject to} \quad c_i(x) = 0 \ \text{ for all } i$$
There are two classic ways to solve this:
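1. Lagrange multiplier
We can reformulate the above problem using Lagrange multipliers $\lambda_i$, one per constraint $c_i(x) = 0$, and instead look for a stationary point of:
$$\mathcal{L}(x, \lambda) = f(x) + \sum_i \lambda_i c_i(x)$$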
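As a small worked example (a toy problem I picked for illustration, not one from the paper), take $f(x) = x_1^2 + x_2^2$ with the single constraint $c(x) = x_1 + x_2 - 1 = 0$. The Lagrangian and its stationarity conditions are:

$$\mathcal{L}(x, \lambda) = x_1^2 + x_2^2 + \lambda (x_1 + x_2 - 1)$$

$$\frac{\partial \mathcal{L}}{\partial x_1} = 2 x_1 + \lambda = 0, \qquad \frac{\partial \mathcal{L}}{\partial x_2} = 2 x_2 + \lambda = 0, \qquad \frac{\partial \mathcal{L}}{\partial \lambda} = x_1 + x_2 - 1 = 0$$

The first two conditions give $x_1 = x_2 = -\lambda / 2$, and substituting into the third gives $\lambda = -1$, so the solution is $x_1 = x_2 = 1/2$.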
2. Penalty method
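A lesser-known approach is the penalty method. The basic idea is that we can solve the constrained problem above by repeatedly solving an unconstrained one, indexed by the iteration $k$:
$$\min_x \ \Phi_k(x) = f(x) + \mu_k \sum_i c_i(x)^2$$
Here we are using a quadratic penalty $c_i(x)^2$ on each constraint $c_i(x) = 0$, but other penalty functions that grow with the constraint violation can be used as well. The algorithm is as follows: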
- Solve this optimization problem.
- Increase the penalty coefficient $\mu_k$ (multiplying it by a constant factor is a common choice).
- Solve the next optimization problem, starting from the solution $x_k$ of the previous step.
- Repeat until convergence.
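The steps above can be sketched in plain Python on a toy problem: minimize $x_1^2 + x_2^2$ subject to $x_1 + x_2 - 1 = 0$, whose solution is $(0.5, 0.5)$. The problem, the function name `penalty_method`, the growth factor of 10, and the gradient-descent inner solver are all assumptions I made for illustration.

```python
def penalty_method(x, mu=1.0, growth=10.0, outer_iters=6, inner_iters=200):
    """Quadratic penalty method for: min x1^2 + x2^2  s.t.  x1 + x2 - 1 = 0."""
    x = list(x)
    for _ in range(outer_iters):
        # Gradient descent on f(x) + mu * c(x)^2, warm-started from the
        # previous solution. The step is 1 / (largest Hessian eigenvalue),
        # which is 2 + 4 * mu for this particular toy problem.
        step = 1.0 / (2.0 + 4.0 * mu)
        for _ in range(inner_iters):
            c = x[0] + x[1] - 1.0            # constraint violation
            g0 = 2.0 * x[0] + 2.0 * mu * c   # d/dx1 of the penalized objective
            g1 = 2.0 * x[1] + 2.0 * mu * c   # d/dx2 of the penalized objective
            x = [x[0] - step * g0, x[1] - step * g1]
        mu *= growth                         # tighten the penalty
    return x

x_pen = penalty_method([2.0, -1.0])
print(x_pen, "violation:", x_pen[0] + x_pen[1] - 1.0)
```

Note that the constraint is only satisfied approximately: for a finite penalty coefficient the minimizer sits slightly off the constraint, and the gap shrinks as the coefficient grows. The step size also has to shrink as the coefficient grows, which is one reason the pure penalty method can get slow.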
ALM = Lagrange + Penalty Method
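ALM basically combines these two by iteratively solving the following, updating the multipliers with $\lambda_i \leftarrow \lambda_i + \mu_k c_i(x_k)$ at each iteration until convergence:
$$\min_x \ \Phi_k(x) = f(x) + \frac{\mu_k}{2} \sum_i c_i(x)^2 + \sum_i \lambda_i c_i(x)$$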
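Here is a minimal ALM sketch in plain Python on a toy problem: minimize $x_1^2 + x_2^2$ subject to $x_1 + x_2 - 1 = 0$, with solution $(0.5, 0.5)$ and multiplier $-1$. The problem, the name `augmented_lagrangian`, the fixed penalty coefficient `mu=10`, and the gradient-descent inner solver are assumptions for illustration.

```python
def augmented_lagrangian(x, lam=0.0, mu=10.0, outer_iters=10, inner_iters=200):
    """ALM for: min x1^2 + x2^2  s.t.  x1 + x2 - 1 = 0, with mu kept fixed."""
    x = list(x)
    # Step = 1 / (largest Hessian eigenvalue), which is 2 + 2 * mu here.
    step = 1.0 / (2.0 + 2.0 * mu)
    for _ in range(outer_iters):
        # Inner loop: minimize f(x) + (mu / 2) * c(x)^2 + lam * c(x) over x.
        for _ in range(inner_iters):
            c = x[0] + x[1] - 1.0
            g0 = 2.0 * x[0] + mu * c + lam
            g1 = 2.0 * x[1] + mu * c + lam
            x = [x[0] - step * g0, x[1] - step * g1]
        # Outer step: update the multiplier using the remaining violation.
        lam += mu * (x[0] + x[1] - 1.0)
    return x, lam

x_alm, lam_alm = augmented_lagrangian([2.0, -1.0])
print(x_alm, lam_alm)
```

Unlike the pure penalty method, the penalty coefficient does not need to grow without bound here: the multiplier update absorbs the remaining constraint violation, so a moderate fixed `mu` is enough on this example. Increasing `mu` across iterations is also possible.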
PCP as ALM
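Now, we can formulate PCP in ALM form using matrix notation:
$$l(L, S, Y) = \|L\|_* + \lambda \|S\|_1 + \langle Y, M - L - S \rangle + \frac{\mu}{2} \|M - L - S\|_F^2$$
where $\|\cdot\|_F$ is the Frobenius norm (the square root of the sum of squares of the matrix elements) and $Y$ is the matrix of Lagrange multipliers.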
In practice, we alternately update $L$, $S$, and $Y$. The closed-form update formulas for these are quite straightforward to derive.
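For reference, the updates I believe the paper arrives at are: $L \leftarrow \mathcal{D}_{1/\mu}(M - S + Y/\mu)$, where $\mathcal{D}_\tau$ soft-thresholds the singular values; $S \leftarrow \mathcal{S}_{\lambda/\mu}(M - L + Y/\mu)$, where $\mathcal{S}_\tau$ soft-thresholds each entry; and $Y \leftarrow Y + \mu (M - L - S)$. Below is a minimal NumPy sketch of that loop. The function names, the synthetic test matrix, and the defaults for `mu` and `tol` are my assumptions (the defaults follow what I recall of the paper's suggestions), so treat this as a sketch rather than a reference implementation.

```python
import numpy as np

def shrink(X, tau):
    """Entrywise soft-thresholding: shrink every entry toward zero by tau."""
    return np.sign(X) * np.maximum(np.abs(X) - tau, 0.0)

def svd_shrink(X, tau):
    """Soft-threshold the singular values of X by tau."""
    U, s, Vt = np.linalg.svd(X, full_matrices=False)
    return (U * shrink(s, tau)) @ Vt

def robust_pca(M, lam=None, mu=None, tol=1e-7, max_iter=1000):
    """Alternating ALM updates for PCP; returns (L, S) with M ~ L + S."""
    M = np.asarray(M, dtype=float)
    n1, n2 = M.shape
    if lam is None:
        lam = 1.0 / np.sqrt(max(n1, n2))
    if mu is None:
        mu = n1 * n2 / (4.0 * np.abs(M).sum())  # assumed default
    S = np.zeros_like(M)
    Y = np.zeros_like(M)
    for _ in range(max_iter):
        L = svd_shrink(M - S + Y / mu, 1.0 / mu)  # low-rank update
        S = shrink(M - L + Y / mu, lam / mu)      # sparse update
        R = M - L - S                             # constraint residual
        Y = Y + mu * R                            # multiplier update
        if np.linalg.norm(R) <= tol * np.linalg.norm(M):
            break
    return L, S

# Synthetic example (hypothetical data): rank-1 matrix plus 10 large spikes.
rng = np.random.default_rng(0)
L0 = np.outer(rng.standard_normal(20), rng.standard_normal(20))
S0 = np.zeros((20, 20))
S0.flat[rng.choice(400, size=10, replace=False)] = 5.0
M = L0 + S0

L, S = robust_pca(M)
print("relative residual:", np.linalg.norm(M - L - S) / np.linalg.norm(M))
print("relative error in L:", np.linalg.norm(L - L0) / np.linalg.norm(L0))
```

Each iteration is dominated by one SVD, so for large matrices a truncated SVD would be the natural next optimization.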
Robust PCA is regarded as one of the masterpieces among machine learning papers. I skipped the proof of Theorem 1.1 entirely, but I hope you at least got the gist of what it is optimizing for, and how it is done.