Data Science me!

Analysis of Variance (ANOVA)

R Squared for the model below shows a value of .974616686 and an R Squared value close to 1 indicates a model that is highly accurate in predicting the target value. The formula for Adjusted R squared is also shown. Adjusted R Squares is a way of penalizing a model for having more variables. If another model were produced with a similar R Squared value, but it only used 3 predictor variables, it's Adjusted R Squared value would be higher and it would be preferable as equally accurate but less complex.  

Evaluating a Linear Model

For the time being forget about how Standard Error is calculated, I'll get to that below. The t Stat value is the Coefficient Value divided by the Standard Error. I'll still looking for a good formula to calculate the P-value, however the formula has been elusive. However, it can be easily calculated in Excel with the following. [= TDIST(ABS(t-stat),5,2)] Again, this merely converts the t Stat into a probability for easier interpretation. 

Calculating the Lower and Upper ranges is simple with large samples. You may recall that in a standard distribution, 1 standard deviation will represent 68% of the values, 2 standard deviations will represent 95% (the value is actually 1.96 standard deviations for 95% but is generally rounded to 2) and 3 standard deviations will represent 99.7% of the values. With a large sample to calculate the upper and lower bound at 95%, you would multiple the Standard error times 2 standard deviations and add and subtract that value from the coefficient. 

With only 10 observations in this sample an adjustment needs to made to account for a small sample.  Excel also has a formula for this, in this case, .05 represents 95% confidence and 5 is the degrees of freedom as noted earlier. =TINV(0.05,5)

The first set of values to be calculated are the Sum of Squared Error (the total of the Squared Error column). This is a value representing the total difference between the predicted values and the actual values. 

The Variance of the mean column is the difference between the actual value and the mean of all of the actual values. In effect we are going to compare the model performance against just using the average of the values we are trying to predict. Taking the first observation of 40 for the amount minus 72.5, the average of the amount column gives a value of -32.5. This really shows that if our model needs to be better than if we were to just use the average as the prediction. This should be the maximum amount of error in the model. The SSR or Sum or Squares from Regression is the Total Sum of Squares minus the Sum of Square Error. SSR = SST - SSE or (4062.50 - 3959.5 = 103.1).  Note, there's some rounding in the values.

The final step in this post is to look at the individual coefficients to determine if they are significant. The table to the right shows the Coefficients from the model, a Standard Error which is calculated below. The t Stat is a calculation of how large the error is relative to the Coefficient value. If the Standard Error is high it indicates that the variable doesn't add a lot to the final prediction. As that value is hard to read directly, it's converted into a P-value (probability value) with lower numbers indicating that there's a low probability that the variables significance is random. In other words, a low P-value indicates a high likelihood that the variable is significant.

In a future blog I'll discuss variable selection, but for this post, look at the Age-Years variable. It's Standard Error is 3 times as large as the Coefficient valuable leading to very large Lower and Upper bounds. this is probably not a good variable for this model. However as I noted earlier, this blog isn't about judging variables, just understanding where the numbers come from.   

Sum of Squared Values

At some point, you've probably used the following output to judge the quality of a linear model, but, I bet you've never gone to the trouble to figure out how it's actually calculated and there are probably two reasons for that: 1-thankfully technology does it for us and 2-it's actually pretty complicated. This installment will go through the math behind all of these numbers.

This output is from Excel but other applications produce similar output.

R Squared and Adjusted R Squared

Standard Error of the Coefficient

Regression Applied

In a prior post, a regression model was developed on the standard data set of 10 observations and 4 variables. This post will make more sense if you review that post first and it can be found here:  

Once a regression model has been developed, the first step is to apply the results against the data used to create it to determine how well it predicts the expected values. Below are the coefficients that were calculated in our regression model and they've been applied against the variables to calculate the "prediction". The column in red below, is the amount we were trying to predict. As you can see to the right, the values in the prediction column are fairly close. As in the prior models on this site, see the bottom for the Excel file for all of the calculations again, using simple math where possible. 

The variance column is calculated by subtracting the Prediction Value from the original amount. The variance is then squared to get the squared error. If you've forgotten why the variance is squared, try summing the values in the variance column. You'll find they will sum to zero. Squaring the column is a way of converting the negative values to positive values so the total will not sum to zero.

The final item to calculate is the Standard Error of the Coefficient. For this, a matrix is needed from the original regression file. This table originally showed up as Step 5 and was multiplied by the X'Y matrix in order to calculate the Coefficients. The only values that will be needed now though, are the diagonal cells highlighted in yellow. Notice that these are where each Coefficient in the table intersects with itself.

The remaining number needed is the 20.623942 which is the Mean Sum of Squares for the Residual. See the ANOVA table above. Each value from the yellow highlighted diagonal is multiplied by the Mean Sum of Squares to calculate the Standard Error that is related to each variable including the intercept. 

Evaluation of Coefficients

An ANOVA table is also common as output for evaluating models. R Squared, and Adjusted R Squared have already been calculated. "Multiple R" is the square root of R Squared (or R not Squared).

Note the Sum of Square values calculated above also appear and will be used again. df indicates Degrees of Freedom, something that will be discussed in a future post but, these values are pretty simple to calculate. df for Regression is the number of variables -1 and the df for the Residual is the number of observations - the number of variables (10 observations - 5 variables = 5).

The MS column is the Mean Sum of Squares for each and it's calculated by dividing the value in the SS column by the value in the df (degrees of freedom column). The F value is then the MS values divided, 989.8450721/20.6239423 or 47.99494964. That F value would then be used against a significance table in order to determine if the model was accurate enough for its output to be significant. The easiest way to think about this is that MS value for the Regression, i.e. 989.845... is much larger than 20.623... so must of the variance in the observations is being captured by the regression. This is also why the R Squared value is so large.