Feature Selection

Sometimes a dataset has a large number of features, and it may contain relevant as well as irrelevant ones. Irrelevant features increase training time and lead to an inefficient model after training. To avoid this problem, we should use feature selection. One approach is dimensionality reduction using Principal Component Analysis (PCA).

PCA is based on a linear transformation of the original feature space X of d dimensions into another feature space Z of q dimensions. This transformation is carried out by:

z = A^T x, where x ∈ X, z ∈ Z

Here A is a d × q transformation matrix.
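
As a quick sketch of this transformation in Octave/MATLAB (with a made-up 3 × 2 matrix A and a 3-dimensional point x, just to show the shapes):

    % A hypothetical example: map a d = 3 dimensional point to q = 2 dimensions.
    A = [1 0;
         0 1;
         0 0];         % d x q transformation matrix (columns define the new axes)
    x = [2; 3; 5];     % a point in the original d-dimensional space X
    z = A' * x;        % its q-dimensional representation in Z, here [2; 3]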

Let's say we have 2-dimensional data as shown and want to convert it to 1-dimensional data. We need to find a line onto which we can project the data points to reduce the dimension, chosen so that the sum of the squared projection errors (the distances from each point to its projection) is minimum. Basically, we need to find a vector u as shown in the figure below. Similarly, if a 3-D dataset is given and we want to convert it to 2-D data, we need two vectors defining a plane onto which the 3-D points are projected.
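
For intuition, here is a small Octave/MATLAB sketch (the data points and the direction u are made up) that projects 2-D points onto a single direction to get 1-D coordinates:

    % Made-up 2-D points, projected onto a single direction u to get 1-D coordinates.
    X = [2.5 2.4;
         0.5 0.7;
         2.2 2.9;
         1.9 2.2];          % each row is one 2-D data point
    u = [1; 1] / sqrt(2);   % a candidate direction (unit length)
    z = X * u;              % 1-D coordinate of each point along u
    Xproj = z * u';         % the projections back in 2-D (they lie on the line)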

To do this, we make use of the covariance matrix. Suppose X is the data. The covariance matrix is given by:

{\Sigma = \frac{1}{m} \sum_{i=1}^{m} (x^{(i)})(x^{(i)})^T}

For a dataset with n features, the covariance matrix is the n × n matrix that contains the covariance for every pair of dimensions, where the covariance between two dimensions tells us whether they grow together or not. From the covariance matrix we compute the eigenvectors and choose the first k columns of the eigenvector matrix. This n × k matrix is multiplied with the original data to get the transformed data with reduced dimensions.

  • Eigenvectors: A matrix transforms one vector into another, i.e. it changes both the size and the direction of the vector. But there are a few special vectors whose direction the matrix never changes; it can only stretch or shrink them. In this sense, the eigenvectors of a matrix define the axes of the matrix itself, so those vectors and their corresponding eigenvalues are enough to describe some characteristics of the whole matrix. Roots are to a polynomial equation as eigenvectors/eigenvalues are to a square matrix. The eigenvectors of a covariance matrix are the directions in which the data varies the most. More precisely, the first eigenvector is the direction in which the data varies the most, the second eigenvector is the direction of greatest variance among those that are orthogonal (perpendicular) to the first eigenvector, the third eigenvector is the direction of greatest variance among those orthogonal to the first two, and so on (a quick numerical check follows the figure below). [coursera] [quora]

(Figure: eigenvectors in 3 dimensions)
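
As a quick numerical check of the defining property Σv = λv (in Octave/MATLAB, using the built-in eig on a small made-up symmetric matrix), an eigenvector keeps its direction and is only scaled by its eigenvalue:

    % Quick check of the defining property Sigma*v = lambda*v on a small matrix.
    Sigma = [3 1;
             1 2];                   % a small symmetric (covariance-like) matrix
    [V, D] = eig(Sigma);             % columns of V are eigenvectors, D holds eigenvalues
    v = V(:, 2);   lambda = D(2, 2); % pick one eigenvector/eigenvalue pair
    disp(Sigma * v)                  % same direction as v ...
    disp(lambda * v)                 % ... only scaled by lambda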

After calculating the covariance matrix (sigma), we can use the svd command in MATLAB/Octave to generate the eigenvectors:

    [U,S,V] = svd(sigma)

The columns of the matrix U are the eigenvectors, i.e. U = [u^{(1)} u^{(2)} ... u^{(n)}], an n × n matrix.

Just choose the first k vectors, i.e. however many dimensions you need. Here is the complete procedure:


    % After mean normalization and feature scaling (X is m x n, rows are examples)
    sigma = (1/m) * X' * X; % covariance matrix (n x n)
    [U,S,V] = svd(sigma);   % columns of U are the eigenvectors
    Ureduce = U(:,1:k);     % keep the first k eigenvectors (n x k)
    Z = X * Ureduce;        % projected dataset with reduced dimensions (m x k)
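
As a usage note, you can also map back approximately from the reduced representation to the original space; this reconstruction is the x_approx used in the error formula further down. A minimal sketch, continuing from the variables above:

    Xapprox = Z * Ureduce';  % m x n approximate reconstruction of the original data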
 

How to choose the value of k?

While transforming the data into a lower-dimensional representation, care should be taken that not too much detail is lost. We should try to retain 99% of the variance of the data. How do we do that?

Average squared projection error: {\frac{1}{m} \sum_{i=1}^{m} \lVert x^{(i)} - x_{approx}^{(i)} \rVert^2}
Total variation in the data: {\frac{1}{m} \sum_{i=1}^{m} \lVert x^{(i)} \rVert^2}

Choose the smallest k such that {\frac{\text{Average squared projection error}}{\text{Total variation in data}} \leq 0.01}, which means 99% of the variance is retained. You can use 0.05 or 0.1 instead of 0.01, depending on how much variance you want to keep.
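
One direct (though inefficient) way to apply this rule is to recompute both quantities for increasing k until the ratio falls below the threshold; a minimal sketch, reusing U, X and m from the procedure above:

    % Explicit search: try k = 1, 2, ... until 99% of the variance is retained.
    total_variation = sum(sum(X .^ 2)) / m;
    for k = 1:size(X, 2)
        Ureduce = U(:, 1:k);
        Z       = X * Ureduce;
        Xapprox = Z * Ureduce';                        % reconstruction
        proj_error = sum(sum((X - Xapprox) .^ 2)) / m; % average squared projection error
        if proj_error / total_variation <= 0.01
            break;   % smallest k retaining at least 99% of the variance
        end
    end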

There is an easier way to check this in MATLAB. When you calculate the eigenvectors using the svd command, U contains the eigenvectors, whereas S carries the information about the variance retained: it is an n × n diagonal matrix with entries S_{11}, S_{22}, ..., S_{nn} on the diagonal.

And this is how you check: for a given k, {1 - \frac{\sum_{i=1}^{k} S_{ii}}{\sum_{i=1}^{n} S_{ii}} \leq 0.01}, i.e. {\frac{\sum_{i=1}^{k} S_{ii}}{\sum_{i=1}^{n} S_{ii}} \geq 0.99}.
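
In code (Octave/MATLAB, reusing the S returned by svd above), that check reduces to a cumulative sum over the diagonal of S:

    s = diag(S);                    % variances along each eigenvector direction
    retained = cumsum(s) / sum(s);  % fraction of variance retained for k = 1, 2, ...
    k = find(retained >= 0.99, 1);  % smallest k that retains at least 99% variance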

Applications

  • Compression – If the data is too large, it can be compressed using PCA, which also speeds up the learning algorithm.
  • Visualization – If a dataset has more than 3 dimensions, it is difficult to visualize the distribution of the data. In that case, PCA can be a useful technique (see the sketch below).
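
For instance, here is a rough sketch of the visualization use case (assuming X is an m × n, mean-normalized dataset): reduce it to 2 dimensions and scatter-plot the result.

    % Reduce an m x n dataset to 2-D for plotting (X assumed mean-normalized).
    sigma = (1/m) * X' * X;
    [U, S, V] = svd(sigma);
    Z2 = X * U(:, 1:2);             % keep only the first two principal directions
    plot(Z2(:, 1), Z2(:, 2), 'o');  % scatter plot of the reduced data
    xlabel('z_1'); ylabel('z_2');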

A very useful technique was covered today. I hope to use it soon in challenges. HAVE FUN 🙂
