Tutorial de regresión de Stepwise - Parte 2

Mohamad

13 de julio de 2013 00:10

En este video, continuaremos discutiendo el análisis de regresión que empezamos en el video anterior. También presentaremos una regresión gradual que nos ayude a encontrar un grupo óptimo de variables explicativas para este modelo.

Para mayor información del documento tutorial actual y la hoja de cálculo visitenos en http://bitly.com/13hru9r

Guion de Video

Scene 1:

Hello and welcome to the second video of a regression analysis tutorial series. For the first video, click the annotation or the link in the description box.

In this tutorial we'll continue the analysis discussion we started earlier and use an advanced technique called stepwise regression to help us find an optimal set of explanatory variables for this model.

We'll use the same sample data we use in the first video, it has been gathered from 20 different salespeople. Weekly sales is the dependent variable and two explanatory variables are intelligence and extraversion. Like before take time to note how we organize our input data. Values for each variable are placed in separate columns and each observation or salesman is represented by a separate row.

Next we introduce the mask. The mask is a boolean array that chooses which variable is included or excluded from the analysis. At the top of the table let's insert the mask cells array each with value of 1. In this example we have 20 observations and two independent or explanatory variables. The response or dependent variable is the weekly sales.

Let's begin! Select an empty cell in your worksheet where the output is going to be generated and click on the regression icon on the NumXL tab.

Scene 2:

The regression wizard pops up. By default the output cells range is set to the current selected cell in your worksheet. Like we've said before, the selected cells range can include the column headings which will be used in the output tables and because explanatory variables are already grouped by columns we don't need to change anything.

Now select the cells range for the response dependent variable values or the weekly sales, then select the cells range for the explanatory or independent variables. For the mask variables select the cells of the boolean array at the top of the data table.

Once all of the data is selected the other tabs in the wizard become enabled. Let's check out the options tab.

Scene 3:

The tab is set to the default values you see here. Similar to the first tutorial we won't change anything. The regression intercept constant which like we've said in our first video indicates that the regression intercept will be estimated by the regression is left blank. If you want to set the regression to a fixed value enter it there.

The significance level is set to 5%. In the output section the most common regression analysis is selected. Lastly the auto modeling options should be left unchecked, we'll discuss it in a later video. Now let's see the missing values tab.

Scene 4:

Here you can select how you want to handle missing values in your data. By default any missing value found in X or in Y observations will be excluded from the analysis. This treatment is still a good approach for our purposes so let's leave it unchanged. Now click OK to generate the output tables.

Scene 5:

Let's examine the results. Aside from the mask variable setting everything else that we've done up to this point is the same as in our first tutorial.

Scene 6:

The mask setting determines which variable is included in the regression analysis so let's take another look at the coefficients table. First let's exclude the intelligence input variable from the analysis. This is done simply by flipping the mask value of the cell to zero.

Scene 7:

Then, if you have the calculation option set to manual you can force recalculation otherwise the spreadsheet recalculates automatically.

As we look at the output tables again we can see that there has been some changes. The r-squared dropped by six percent while the adjusted r-square had dropped by 1.5. The standard error increased by three dollars and AIC dropped by one. ANOVA table shows the regression is significant. Residual diagnosis checks out for all tests.

Lastly in the regression coefficients table the intercept and the coefficient of the extraversion variable are both statistically significant. You can see that despite having only one parameter the recalculated model explains the variation in the values of the response variable just as well as the one that had access to two variables.

Now let's plot the estimated values against the actual.

Scene 8:

The shaded area represents the 95% confidence interval for the estimates of the regression model. Let's recap! So far we've shown that dropping a variable from the analysis is as easy with NumXL as flipping a switch. This feature allows you to avoid cluttering your spreadsheets with tons of output tables.

But you might be wondering what if I have more than two explanatory variables then what's the optimal number to use? NumXL can help. The software supports stepwise regression function that helps you select the optimal set of variables.

Let's see how it works! First in the mask cells range turn the variables on or off that you wish the stepwise regression to consider. For this demonstration we'll will turn them all on. Now click on the regression icon in the NumXL tab.

Scene 9:

The regression wizard pops up. In the general tab select the input cells range and the mask cells range.

Scene 10:

Under the options tab check the stepwise regression box. Leave the other three different methods checked. Now click OK.

Scene 11:

The stepwise regression generates one additional table next to the coefficients. Let's take a closer look.

The table shows the results of a series of tests that determine whether to include or drop a certain variable from the regression model. Forward selection is a method that starts with considering one intercept and examines adding an additional variable. Backward elimination starts from the full model and considers dropping one repressor at a time. By directional elimination is a hybrid of the two methods. The table displays the mask for the optimal model found in each column, 1 stands for inclusion and 0 for exclusion. At the bottom we compare the regression statistics for each model. In this case the three models came back with the same set of variables.

So far we have created a regression model, examined its significance, verify that it satisfies underlying assumptions, and found the optimal subset of variables of the model.

Join us in our third video on regression where we'll discuss the model even further.

Scene 12:

That is it for now, thank you for watching!

Artículos relacionados

Comentarios