Influential Data Regression Analysis - Part 3

Mohamad

July 13, 2013 00:12

In this video, we will continue the discussion of regression analysis we started in our previous video. We'll also introduce influential data analysis using cook's distance and leverage factor statistics to help us improve the model and the forecast's reliability.

For more information and/or the spreadsheet example file, visit us at http://bitly.com/12C21jr

Video script

Scene 1:

Hello and welcome to the third video of our regression analysis tutorials series. To watch the previous video click the annotation or the link in the description box.

In this tutorial we'll continue the analysis discussion we started earlier and use an advanced technique called influential data analysis to help us improve the model and the forecast reliability.

We'll use the same sample data as in our previous videos. It has been gathered from 20 different salespeople. Weekly sales is a dependent variable and the two explanatory variables are intelligence and extraversion. Like before, take time to note how we organize our input data. Values for each variable are placed in separate columns and each observation or salesman is represented by a separate row.

Next, we again introduce the mask. The mask is a boolean array that chooses which variable is included or excluded from the analysis. At the top of the table let's insert the mask cells array, each with a value of 1.

Now we're ready to begin! Select an empty cell in your worksheet where you wish for the output to be generated, then click on the regression icon on the NumXL tab.

Scene 2:

The regression wizard pops up. Remember that by default the output cells range is set to the current selected cell in your worksheet.

Like we've said before you can include the column headings when selecting the cells ranges. The labels will be used in the output tables. Because explanatory variables are already grouped by columns, we don't need to change the checked option. Now select the cells range for the response dependent variable values or the weekly sales, then select the cells reach for the explanatory or independent variables. For the variable mask option select the cells of the boolean array at the top of the data table.

Once all of the data is selected notice that the other tabs on table become enabled. Let's take a look at the options tab.

Scene 3:

At first the tab is set to the default values you see here.

The regression intercept constant which like we've said in our previous videos indicates that the regression intercept will be estimated by the regression is left blank. If you want to set the regression to a fixed value enter it there.

The significance level is set to 5%. In the output section the most common regression analysis is selected. Lastly, unlike in previous tutorials we'll want to check the auto modeling option.

Now let's look at the missing values.

Scene 4:

Like we've discussed before, here you can select how you want to handle missing values in your data. By default any missing value found in X or in Y observations will be excluded from the analysis. This treatment is still a good approach for our purposes so let's leave it unchanged. Click OK to generate the output tables.

Scene 5:

To assess the influence that each observation exerts on our model, we need to calculate a couple of statistical measures, leverage or H and cooks distance or D. In order to do that select the cell next to the response variable and in the formula bar type MLR_FITTED. Then click the FX button.

Scene 6:

Once the function wizard pops up select the input cells range, the mask, and a return type of four for the leverage statistics, then click OK.

Scene 7:

The function returns an array of values but you will initially only see the first value. To display the full array select all the cells below to the end of the sample. Press F2 together with Ctrl + Shift + Enter to copy the formula down. Now in order to calculate the Cook's distance let's repeat what we've just done to get the numbers for leverage, but with the return type of five instead of four.

Scene 8:

Now that we have the leverage and cooks distance statistics let's interpret their findings. Let's first check out the leverage table. Leverage statistics measure the distance of an observation from the center of the data. In our example the intelligence and extraversion values for salesman number 11 are farthest away from the average.

Does this mean that salesman number 11 is an outlier? Does it mean that he exerts the most influence on the calculation of the regression coefficient? To answer these questions let's remove salesman number 11 from our input data and examine the resulting regression.

Scene 9:

To do so just insert an N/A value in any input variable associated with salesman number 11.

Scene 10:

We observed that dropping the observation from our analysis left things the same as before, so we chose to insert salesman number 11 back into our sample. As you can see, leverage statistics do not necessarily point out an outlier, but merely show a distant observation with few neighbors.

Scene 11:

Now let's look at the Cook's distance table. This statistical measure corrects for the weakness in the leverage statistics and is thus more indicative of influential data. It uses a few methods as threshold values in order to detect an influential data point. For our analysis we often use 4 over N as a threshold. By examining the plots we can see that salesman number 16 exerts the highest influence on our regression, so let's avoid this observation like we did before by setting N/A in one of the input variables.

Scene 12:

Let's now examine the regression statistics before and after we dropped the sixteenth observation. As you may have already noticed the regression improved significantly on every dimension. Salesman number 16 seems to be the influential outlier, so we'll drop it.

Scene 13:

Now let's look at how Cook's distance table looks after salesman number 16 is out of the picture. We can see that the different plots are distributed somewhat uniformly, but bear in mind that our threshold rule of four over N is merely a guideline and should not be taken rigidly.

Scene 14:

To help explain what makes an observation influential let's look at the extraversion versus weekly sales graph. We draw the linear trend as a proxy for our regression model. The black circle represents salesman number 16. This data points location is pulling the regression or the dashed line toward it, affecting the value of the regression slope and intercept. Dropping this observation releases the regression line, adjusting it to better fit the remaining points.

Scene 15:

That is it for now, thank you for watching!

Related articles

Comments