Principal Components Analysis (PCA) tutorial - Part 1

Mohamad

May 09, 2013 15:23

In this video, we'll introduce you to principal component analysis and how to conduct it in Excel with the help of NumXL software. This is the first tutorial out of several on the subject.

For more information, please visit Principal Component 101

Video script

Scene 1:

Hello and welcome to our first tutorial of the ongoing series on Principle Component Analysis, or PCA. In this video we'll start with a general definition, motivation, and applications of a PCA and then use the NumXL software to carry out such analysis. Afterward we will closely examine the different output elements in an attempt to develop a solid understanding of a PCA. This will pave the way to a more advanced treatment in future videos.

In this tutorial we will use the socio economic data provided in a 1976 book by Harry H Hartmann. The five variables represent total population, median school years, total employment, miscellaneous professional services, and median house value. Each observation represents one of the twelve census tracts in the Los Angeles standard metropolitan statistical area.

Now before we start, let's try to answer a few questions. What is principal component analysis? PCA is a technique that takes a set of correlated variables and linearly transforms those variables into a set of uncorrelated factors. You can think of PCA as an axis system transformation.

Scene 2:

Let's examine this plot of two correlated variables. Simply put from the x and y cartesian system, the data points are highly correlated. By transforming or rotating the axis into Z and W the data points are no longer correlated. In theory, the PCA finds that those transformations of data points will look uncorrelated.

Scene 3:

Now let's start!

First we will organize our input data, we'll place the values of each variable in a separate column, and each observation in a separate row. Note that the magnitude of the variables vary significantly, so any analysis of the raw data will be biased toward the variables with a larger scale.

Next select an empty cell in your worksheet where you wish the output to be generated, then locate and click on the PCA icon in the NumXL tab.

Scene 4:

The principal component analysis wizard pops up. Select the cells range for the five input variables. Note that first the input cells range includes the heading or the label cell which will be used in the output tables, and secondly the input variables are already grouped by columns. Each column represents a variable, so we don't need to change anything there. Let's leave the variable mask field blank for now, we will revisit this field in later tutorial videos. Also by default the output cells range is set to the current selected cell in your worksheet.

Once we select the x and y cells range, the options and missing values tabs become available. Click on the options tab.

Scene 5:

Initially the tab is set to the following values. The standardized input is checked. This option replaces the values of each variable with its standardized version. This option overcomes the bias issue when the values of the input variables have different magnitude scales. Leave this option checked. The principle component output is also checked, this option instructs the wizard to generate PCA related tables. Leave it checked as well. The significance level is set to 5%. Lastly, the input variables is unchecked, we'll leave this unchecked for now.

Under principal component, check the values option to display the values for each principal component. Now you can click on the missing values tab.

Scene 6:

In this tab you can select an approach to handle the missing values in your dataset. By default any missing value found in any of the input variables would exclude the observation from the analysis. This treatment is a good approach to what we're doing, so let's leave it unchanged. Now click OK to generate the output tables.

Scene 7:

Let's look at the PCA statistics table. The principal components are ordered and named according to their variances in a descending order. In the second row the proportion statistics explain the percentage of variation in the original dataset that each principal component captures or accounts for. The cumulative proportion is a measure of total variation explained by the principal components up to this current component. In our example the first 3 PC account for 94.3% of the variation in the five variables. Note that the sum of the variances of the PC should yield a number of input variables which in this case is five.

Scene 8:

Now over at the loadings table we outlined the weights of a linear transformation from the input variable coordinate system to the principal components. For example the linear transformation is expressed as follows, note that the squared loadings add up to 1.

Scene 9:

Now in this graph we've plotted the loadings for our input variables in the first three components. The median school years, the miscellaneous professional services, and median house value variables have comparable loadings in PC 1. After that comes total employment loading and finally the population. One may propose this is a proxy for the wealth income factor.

Scene 10:

Interpreting the loadings for the input variables in their remaining components proved to be more difficult and require a deeper level of domain expertise. In the PC values table we calculated the transformation output value for each dimension, so the first row corresponds to the first data point and so on. By definition the values in the PCs are uncorrelated. You can verify it by computing the correlation matrix.

Scene 11:

That is it for now, thank you for watching!

Related articles

Comments