This is the second entry in our principal components analysis (PCA) series. In this tutorial, we will resume our discussion on dimension reduction using a subset of the principal components with a minimal loss of information. We will use NumXL and Excel to carry out our analysis, closely examining the different output elements in an attempt to develop a solid understanding of PCA, which will pave the way to a more advanced treatment in future issues
For more information, visit our PCA 101 article.
Hello and welcome to our Principal Component Analysis series. In this second tutorial we'll continue discussing principal component analysis or PCA. To watch the first video which explains in detail what PCA is you can click on the annotation or on the link in the description box.
This time we'll continue looking at dimension reduction using a subset of the principal components with a minimal loss of information. We'll use NumXL and Excel to carry out our analysis of the output elements. This will help us comprehend PCA better and help us understand the elements of the process in the future tutorials.
As in our previous video we will use the socio-economic data provided in the 1976 book by Harry H Hartman. The five variables represent total population, median school years, total employment, miscellaneous professional services, and median house value. Each observation represents one of twelve census tracts in the Los Angeles standing Metropolitan Statistical Area.
Let's start! Select an empty cell in your worksheets where you wish the output to be generated then click on the PCA icon in the NumXL toolbar.
The PCA wizard pops up. Select the cells range for the five input variables, note that the input cells range includes the heading or the label cell which will be used in the output tables. The input variables are already grouped by columns so we don't need to change that. Let's leave the variable mask field blank for now, we will revisit this field in later tutorials. Also remember that by default the output cells range is set to the current selected cell in your worksheet. Let's look at the options tab.
Initially the tab is set to the following values, we'll want to change some of them.
The standardized input is checked. This option replaces the values of each variable with a standardized version, it overcomes the bias issue when the values of the input variables have different magnitude scales. Leave this option checked.
The principle component output is also checked. This option instructs the wizard to generate PCA related tables, we'll want this one unchecked.
The input variables is unchecked, we want to check this option. We want to set the number of PCs included to 3, this action can be done now or later in the output tables because our formulas are dynamic.
Also in the same area, check the values option so that the generate output tables include a fitted value for the input variables using a reduced set of components.
Lastly, notice that the significance level or alpha is set to 5%.
Now let's briefly check out the missing values tab before we click OK. Here you can select an approach to handle missing values in the dataset. By default any missing value found in any of the input variables would exclude the observation from the analysis. This treatment is still a good approach to what we're doing in this tutorial so let's leave it unchanged. You may now click OK.
Now let's check out our output tables. In the principal component analysis table we show the percentage of variance of each input variable accounted for using the first three factors. Unlike the cumulative proportion, this statistic is related to one input variable at a time. Using this table we can detect which input variables are poorly presented by our dimension reduction. In our case the median school years has the lowest value yet the final communality is still around 92%.
Now in a loadings table we outline the weights of principal component in each input variable.
To compute the values of an input variable using PC values we use the weights to linearly transform them back. For example, the population factor is expressed this way. Note that this table is basically the transpose table of the loadings table we saw in our first tutorial.
The rows are now columns, also note that by design the squared loadings for each PC add up to one and the different PCs are uncorrelated.
Finally, the values table shows the fitted value for each input variable, note that although the PCA uses the standardized version of the input variables it computes the fitted values in the original non standardized format.
Here are the plots for population and for medium school years for the original data and for the fitted one.
That is it for now, thank you for watching!