VC Dimension

While working with data and building a learning model, it is very important to know how complex the model is and whether it will perform accurately on new data. Vapnik and Chervonenkis showed that the capacity of a model can be estimated beforehand. They introduced the concept of the VC dimension, which can be used to predict an upper bound on the test error of a classification model. Vapnik showed that, with probability 1 - \eta,

R(w) \leq R_{emp}(w) + \sqrt{\frac{h\left(\ln\frac{2N}{h} + 1\right) - \ln\frac{\eta}{4}}{N}}
where R(w) is the true (test) error, R_{emp}(w) is the training error, h is the VC dimension of the classification model and N is the size of the training dataset. Now the question is: for a given model, how do we define and find h, the VC dimension? Let's see.

Let's first get an idea of an indicator function, i_F(x,w). An indicator function can take only two values, {0,1} or equivalently {-1,1}. It is defined by a decision boundary or separating function, e.g. i_F(x,w) = sign(x^Tw). For two-class classification tasks, the VC dimension of a set of indicator functions i_F(x,w) is defined as the largest number h of points that can be separated in all possible ways. With two classes, a set of l points can be labeled in 2^l possible ways. If for every one of these labelings there exists an indicator function in the set that separates the points correctly, the set is said to shatter the l points, and the VC dimension is h = l for the largest such l. For a 2-D input vector, a set of planes can be defined as u = w_1x_1 + w_2x_2 + w_0, or u = x^Tw with the bias absorbed into w. The indicator function is then i_F(x,w) = sign(u). For three points, the separations can be shown as

As we can see, the three points are separated in all 2^3 = 8 ways, so for this i_F(x,w), h = 3. However, this does not mean that every set of 3 points can be separated by this set of indicator functions. For example

The left figure shows three collinear points which cannot be separated in all labelings by an indicator function i_F(x,w) = sign(x^Tw). The right figure shows four points which cannot be separated by i_F(x,w) = sign(x^Tw). In fact, there is no arrangement of four points in a 2-D space for which all possible labelings can be separated correctly. Try it. So the VC dimension of the indicator function i_F(x,w) = sign(x^Tw) in a 2-D input space is 3. In an n-dimensional input space, the VC dimension of i_F(x,w) = sign(x^Tw) is equal to n+1.
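To make the shattering argument concrete, here is a small brute-force check. It is only a sketch, assuming NumPy and SciPy are available: for each of the 2^l labelings of a point set it uses a linear program to test whether sign(x^Tw), with the bias w_0 included, can realize that labeling.

```python
# Brute-force check of shattering by the indicator function sign(w1*x1 + w2*x2 + w0).
# For every one of the 2^l labelings of a point set, we test whether some weight
# vector realizes it, by checking feasibility of strict separation with a linear program.
import itertools
import numpy as np
from scipy.optimize import linprog

def separable(points, labels):
    """True if some w = (w1, w2, w0) satisfies y_i * (w . [x_i, 1]) >= 1 for all i."""
    X = np.hstack([points, np.ones((len(points), 1))])  # append a bias column
    A_ub = -(labels[:, None] * X)                        # -y_i * [x_i, 1] . w <= -1
    b_ub = -np.ones(len(points))
    res = linprog(c=np.zeros(3), A_ub=A_ub, b_ub=b_ub,
                  bounds=[(None, None)] * 3, method="highs")
    return res.status == 0                               # status 0 means feasible

def shattered(points):
    """True if every +/-1 labeling of the points is linearly separable."""
    return all(separable(points, np.array(lab))
               for lab in itertools.product([-1, 1], repeat=len(points)))

triangle  = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]])              # general position
collinear = np.array([[0.0, 0.0], [1.0, 0.0], [2.0, 0.0]])              # on one line
square    = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])  # four points

print(shattered(triangle))    # True  -> h = 3 for sign(x^T w) in 2-D
print(shattered(collinear))   # False -> this particular triple is not shattered
print(shattered(square))      # False -> no four points in 2-D are shattered
```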

From the above relation we can see that the VC dimension increases as the number of weight parameters increases. But this does not mean that a learning machine with many parameters will necessarily have a high VC dimension, nor that one with few parameters will have a low VC dimension. For example, i_F(w,x) = sign(sin(wx)) has infinite VC dimension despite having just one parameter. The practical consequence is that i_F(w,x) = sign(sin(wx)) is not a good candidate for a dichotomization task, because this particular indicator function is able to shatter (to separate, and hence to overfit) arbitrarily large training data sets.
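Here is a small numerical check of this claim, following the classic construction with points x_i = 10^{-i}; it is a sketch assuming NumPy is available.

```python
# The points x_i = 10^{-i} can be given ANY labeling by sign(sin(w*x)):
# for a labeling y_1,...,y_l in {-1,+1}, choosing
#   w = pi * (1 + sum_i ((1 - y_i)/2) * 10^i)
# makes sign(sin(w * x_i)) = y_i for every i. We verify this for all labelings.
import itertools
import numpy as np

l = 6
i = np.arange(1, l + 1)
x = 10.0 ** (-i)                                   # x_i = 10^{-i}

for labeling in itertools.product([-1, 1], repeat=l):
    y = np.array(labeling)
    w = np.pi * (1 + np.sum((1 - y) / 2 * 10.0 ** i))
    assert np.all(np.sign(np.sin(w * x)) == y)

print(f"all {2**l} labelings of {l} points realized by sign(sin(w*x))")
```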

Why VC dimension?

Suppose you want to classify some data. What do you do? One way is to try every possible classifier, a line, a plane or a hyperplane, until you get the best one. The other way is to already know a set of functions that can separate the data (maybe not perfectly, as discussed above) and to use that set as your hypothesis, chosen so that the test error bound is minimized. The VC dimension is what lets you reason about that bound before seeing new data.

Structural Risk Minimization

SRM is basically choosing a hypothesis of the right complexity, out of a large number of possible hypotheses, for the given training data. This is done by restricting the hypothesis space of approximating functions. The resulting learning machines form a nested structure as follows:
H_1 \subset H_2 \subset H_3 \subset ... \subset H_{n-1} \subset H_n \subset ... \subset H
Here H_n may be, for example, the set of polynomials in one variable of degree n, or a neural network with n hidden-layer neurons. The goal of learning is to choose an optimal polynomial degree, an optimal number of hidden-layer neurons, and so on; the optimal choice of model capacity ensures the minimization of the generalization error. There are generalization bounds, involving the training error, the VC dimension h, the number of training samples N and a probability 1 - \eta, that hold for all approximating functions, for both regression and classification. For binary classification this generalization bound, which holds with probability 1 - \eta, is given by

R(w) \leq R_{emp}(w) + \sqrt{\frac{h\left(\ln\frac{2N}{h} + 1\right) - \ln\frac{\eta}{4}}{N}}
The second, square-root term on the right-hand side is called the VC confidence or confidence interval. The equation shows that as the number of training samples increases, N \rightarrow \infty, the test error gets very close to the training error, while as the probability 1 - \eta approaches 1 the generalization bound grows large. This means that any learning machine obtained from a finite number of training data cannot have an arbitrarily high confidence level or an arbitrarily high VC dimension without the bound becoming loose. Also, the VC confidence interval increases with an increase in the VC dimension h for a fixed number of training data pairs N.
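To get a feel for how the VC confidence behaves, here is a small numerical sketch; it simply evaluates the formula above, assuming \eta = 0.05.

```python
# Evaluate the VC confidence term from the bound above for a few (N, h) pairs:
# it shrinks as the number of samples N grows and grows with the VC dimension h.
import numpy as np

def vc_confidence(N, h, eta=0.05):
    """sqrt( (h * (ln(2N/h) + 1) - ln(eta/4)) / N )"""
    return np.sqrt((h * (np.log(2 * N / h) + 1) - np.log(eta / 4)) / N)

for N in (100, 1000, 10000, 100000):
    row = " | ".join(f"h={h}: {vc_confidence(N, h):.3f}" for h in (3, 10, 50))
    print(f"N={N:>6} | {row}")
# Each column decreases as N grows (more data -> tighter bound) and each row
# increases with h (more capacity -> looser bound).
```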

At this point, it should be clear how SRM works – it uses the VC dimension as a controlling parameter for minimizing the generalization bound (the bound on the test error). There are basically two approaches to designing statistical learning-from-data models, that is, two ways to minimize the right-hand side of the bound:

  • Choose an appropriate structure and, keeping the confidence interval fixed, minimize the training error.
  • Keeping the value of the training error fixed, minimize the confidence interval.

Neural networks use the first approach, while support vector machines implement the second.
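As a toy illustration of SRM over a nested structure of polynomial classifiers, here is a sketch. It assumes synthetic 1-D data, takes h_n = n + 1 for sign-of-degree-n-polynomial classifiers on the real line, and uses a least-squares fit as a crude stand-in for minimizing the training error inside each H_n.

```python
# Toy SRM: H_n = { sign(p(x)) : p a polynomial of degree n }. For each n we
# (approximately) minimize the training error inside H_n, then pick the n that
# minimizes training error + VC confidence, i.e. the right-hand side of the bound.
import numpy as np

def vc_confidence(N, h, eta=0.05):
    return np.sqrt((h * (np.log(2 * N / h) + 1) - np.log(eta / 4)) / N)

rng = np.random.default_rng(0)

# Synthetic 1-D two-class data: +1 inside (-1, 1), -1 outside, with 10% label noise.
N = 200
x = rng.uniform(-3, 3, N)
y = np.where(np.abs(x) < 1, 1.0, -1.0)
y[rng.random(N) < 0.1] *= -1

best = None
for n in range(1, 10):                              # candidate capacities
    coeffs = np.polyfit(x, y, n)                    # crude surrogate for ERM in H_n
    train_err = np.mean(np.sign(np.polyval(coeffs, x)) != y)
    bound = train_err + vc_confidence(N, h=n + 1)   # assumed h_n = n + 1
    print(f"degree {n}: training error {train_err:.3f}, bound {bound:.3f}")
    if best is None or bound < best[1]:
        best = (n, bound)

print(f"SRM picks degree {best[0]} (smallest bound {best[1]:.3f})")
```

In this sketch the low-degree classes underfit (high training error) while the high-degree classes pay a larger VC confidence; SRM picks the capacity that balances the two terms.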

The last post, about data, and now this one cover the basic ideas needed in statistical learning, which are worth knowing when dealing with machine learning. They will surely help in understanding other approaches and techniques. Keep Learning.
