Boosting != Boosting Cascade

In machine learning literature, boosting and support vector machine (SVM) are the most popular algorithms out of box.

The emergence of boosting, which originated from almost an engineering-based perspective (multiple weak classifiers combine together to yield a strong one), is earlier, and less mathematically deliberate than SVM.

Upon the introduction of rapid face detection by Viola and Jones, the use of boosting, at least in the CV context, almost bears the same meaning as fast and simple. This is, however, not quite the exactly true.

First of all, the 2001 paper by Viola and Jones uses Haar feature, a feature that can be computed very fast from integral image, which is also easily obtained from a gray image.

Secondly, they employ a modified boosting algorithm that uses 1 stage at a time. As a reminder, suppose that the boosting score is expressed as a weighted voting of individual classifiers

h_m(\mathbf x) = \alpha_1 h_1(\mathbf x) + \alpha_2 h_2(\mathbf x) + \cdots + \alpha_m h_m(\mathbf x)

This expression is by no means simpler than the SVM computation. It does not even differ from any monolithic classifier.

The two reasons above bankrupt the idea that boosting == fast.

The boosting cascade introduced by VJ is trained in a way such that the classification is divided into multiple stages. Each time only one test h_i(\mathbf x) is considered. If the test passes, go to the next test; otherwise reject. Only the candidate that passes all m rounds is regarded as positive. This is effectively a degenerated decision tree, which boosts the classification speed by rejecting many candidates that fail at early stages.

Therefore, boosting cascade == fast.


My first CVPR paper, yay

Automatic Adaptation of a Generic Pedestrian Detector to a Specific Traffic Scene

Regression vs. Classification (repost)

(originally from this post)

For learning reductions we have been concentrating on reducing various complex learning problems to binary classification. This choice needs to be actively questioned, because it was not carefully considered.

Binary classification is learning a classifier c: \mathbf X \rightarrow \{0,1\} so as to minimize the probability of being wrong, \text{Pr}_{\mathbf x,y \sim D}(c(\mathbf x) \sim y).

The primary alternative candidate seems to be squared error regression. In squared error regression, you learn a regressor s: \mathbf X \rightarrow \left[0,1\right] so as to minimize squared error, \mathbb E_{\mathbf x,y \sim D} (s(\mathbf x)-y)^2.

It is difficult to judge one primitive against another. The judgement must at least partially be made on nontheoretical grounds because (essentially) we are evaluating a choice between two axioms/assumptions.

These two primitives are significantly related. Classification can be reduced to regression in the obvious way: you use the regressor to predict D(y=1|\mathbf x), then threshold at 0.5. For this simple reduction a squared error regret of r implies a classification regret of at most r^{0.5}. Regression can also be used to reduce to classification using the Probing algorithm. (This is much more obvious when you look at an updated proof.) Under this reduction, a classification regret of r implies a squared error regression regret of at most r.

Both primitives enjoy a significant amount of prior work with (perhaps) classification enjoying more work in the machine learning community and regression having more emphasis in the statistics community.

The (nonhistoric) reasons for preferring classification seem to be:

Aesthetically, what could be a more natural primitive than predicting a single bit?
Classification is (potentially) inherently more representationally concise. When translated into transistors, a classification algorithm might use fewer transistors than a regressor, simply because the natural representation is bits rather than reals (~= floats).
There are several reasons to consider regression over classification:

More uniform convergence. For a single squared error regressor, the rate of convergence of the estimate of squared error to the expected squared error goes as 1/m, assuming IID samples. For a single binary classifier, the rate of convergence of the estimate of binary loss to the expected binary loss goes as a function varying between 1/m and 1/m^{0.5}.
There is a computational inequivalence between the primitives, as far as we know. In particular, the probing algorithm must call a classifier several times in several ways to make a high precision regression prediction. On the other hand, classification via regression requires one call to the underlying regressor.
Squared error regression learning is often easier than 0/1 classification learning. This is because squared error regression is convex, but 0/1 loss is not. Note: this does not imply that squared error regression is convex (It isn’t for general regression algorithms). Instead, it just means that nonconvexity is not enforced by the loss function.
The mathematical evidence points toward squared error regression as a better primitive, although doubts still seem reasonable to entertain.