One of the techniques used by a Six Sigma Practitioner is the Multiple Regression Model.
This model involves the analysis of two or more independent variables (X's) for a dependent variable (Y).
Many Six Sigma Practitioners avoid using this model due to the complexities involved in designing it.
The purpose of sharing this blog is to benefit both those who already know what multiple regression is and those who do not yet know how it works.
If you want to build a Multiple Regression model, follow the simple steps I have laid out below to reach an inference and make appropriate recommendations:
Step 1: Identification of Dependent (Y) and Independent Variables (X's)
For example, if we wish to analyze the factors influencing overall customer loyalty, then our Y (dependent variable) will be Overall Loyalty, and our X's (independent variables influencing our Y) are as listed below, followed by a small data sketch:
Y = Overall Loyalty
X1 = Customer Satisfaction
X2 = Value for Money
X3 = Product Range
X4 = Service Quality
X5 = Car Parking
X6 = Staff Friendliness
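To make the later steps concrete, here is a minimal sketch in Python (pandas) of how such survey data might be laid out. The scores are made-up illustrative values on a 1-10 scale, and the column names are my own choice, not a fixed convention:

```python
import pandas as pd

# Made-up illustrative survey scores (1-10 scale); ten respondents
data = pd.DataFrame({
    "OverallLoyalty":       [8, 6, 9, 4, 7, 5, 8, 3, 9, 6],   # Y
    "CustomerSatisfaction": [9, 5, 9, 5, 7, 4, 8, 4, 9, 6],   # X1
    "ValueForMoney":        [7, 6, 8, 4, 6, 5, 7, 3, 8, 5],   # X2
    "ProductRange":         [6, 6, 7, 5, 6, 4, 7, 4, 8, 5],   # X3
    "ServiceQuality":       [8, 5, 9, 4, 6, 5, 8, 3, 9, 7],   # X4
    "CarParking":           [5, 4, 6, 3, 5, 4, 6, 2, 7, 4],   # X5
    "StaffFriendliness":    [7, 5, 8, 4, 6, 5, 7, 3, 8, 5],   # X6
})
y = data["OverallLoyalty"]               # dependent variable
X = data.drop(columns="OverallLoyalty")  # independent variables
```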
Step 2: Shortlisting the important X's which impact the Big 'Y', i.e. Overall Customer Loyalty
We will follow stepwise regression in selecting the most important variables for our analysis. One can use Minitab to do this analysis; a rough code equivalent is sketched below.
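Minitab automates stepwise regression through its menus; as a rough illustration of the idea, here is a sketch of forward stepwise selection in Python using statsmodels, continuing the data sketch above. The 0.05 entry threshold is an assumption on my part, not a fixed rule:

```python
import statsmodels.api as sm

def forward_stepwise(X, y, alpha=0.05):
    """Add, one at a time, the predictor with the lowest p-value below alpha."""
    selected, remaining = [], list(X.columns)
    while remaining:
        # p-value of each candidate when added to the current model
        pvals = {}
        for col in remaining:
            model = sm.OLS(y, sm.add_constant(X[selected + [col]])).fit()
            pvals[col] = model.pvalues[col]
        best = min(pvals, key=pvals.get)
        if pvals[best] >= alpha:   # no remaining candidate is significant
            break
        selected.append(best)
        remaining.remove(best)
    return selected

important_xs = forward_stepwise(X, y)
print("Shortlisted X's:", important_xs)
```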
Step 3: Checking Pearson's Correlation Coefficients for high correlations among the X's
For example, if we see a high correlation between the Service Quality and Customer Satisfaction variables, then we know that multicollinearity exists in our model. Multicollinearity can have a significant impact on the quality and stability of the fitted regression model.
A common approach to the multicollinearity problem is to omit explanatory variables. For example, if X1 and X2 are highly correlated (say the correlation is greater than 0.9), then the simplest approach would be to use only one of them, since one variable conveys essentially all the information in the other.
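Continuing the sketch, this is one way to surface such pairs in Python; the 0.9 cutoff simply follows the rule of thumb above:

```python
# Pairwise Pearson correlations among the X's
corr = X.corr(method="pearson")
print(corr.round(2))

# Flag pairs with |r| > 0.9 as multicollinearity suspects
for i, a in enumerate(X.columns):
    for b in X.columns[i + 1:]:
        if abs(corr.loc[a, b]) > 0.9:
            print(f"High correlation: {a} vs {b} (r = {corr.loc[a, b]:.2f})"
                  " -> consider dropping one of them")
```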
Step 4: Determining Standard Error of Estimate (S)
The standard error of estimate is the measure of dispersion around the multiple regression plane. In multiple regression, the estimation becomes more accurate as the degree of dispersion around the regression plane gets smaller. This 'S' value indicates the extent of error in our estimation of the value of the dependent variable (Y). If the multiple regression equation fits the entire data perfectly, then our prediction of Y is most accurate and there is no error.
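As a sketch, S can be computed from the residuals of a model fitted on the shortlisted X's from Step 2, using the usual formula S = sqrt(SSE / (n - k - 1)), where SSE is the sum of squared residuals, n the number of observations, and k the number of predictors:

```python
import numpy as np

fitted = sm.OLS(y, sm.add_constant(X[important_xs])).fit()
n, k = len(y), len(important_xs)
sse = np.sum(fitted.resid ** 2)          # unexplained (error) sum of squares
S = np.sqrt(sse / (n - k - 1))
print(f"Standard error of estimate S = {S:.3f}")
# statsmodels reports the same quantity as np.sqrt(fitted.mse_resid)
```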
Step 5: Determining Coefficient of Determination (R2)
In regression, the R2 coefficient of determination is a statistical measure of how well the regression line approximates the real data points. Adjusted R2 is a modification of R2 that adjusts for the number of explanatory terms in a model. Unlike R2, the adjusted R2 increases only if a new term improves the model more than would be expected by chance. The adjusted R2 can be negative, and will always be less than or equal to R2. Adjusted R2 does not have the same interpretation as R2, so care must be taken in interpreting and reporting this statistic. It is particularly useful in the feature selection stage of model building.
Adjusted R2 is not always better than R2: adjusted R2 will be more useful only if the R2 is calculated based on a sample, not the entire population. For example, if our unit of analysis is a state and we have data for all counties, then adjusted R2 will not yield any more useful information than R2. The use of an adjusted R2 is an attempt to take account of the phenomenon of statistical shrinkage.
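Continuing the sketch, both statistics can be read off the fitted model; the standard adjustment formula is Adjusted R2 = 1 - (1 - R2)(n - 1) / (n - k - 1):

```python
r2 = fitted.rsquared
adj_r2 = 1 - (1 - r2) * (n - 1) / (n - k - 1)
print(f"R2 = {r2:.3f}, adjusted R2 = {adj_r2:.3f}")
# statsmodels also exposes the adjusted value directly as fitted.rsquared_adj
```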
Step 6: Determining Coefficient of Correlation (R)
R = Square root of R2
An R = 1 would mean perfect correlation and R = 0 means no correlation at all; a value of R between 0 and 1 measures the degree of correlation.
The coefficient of determination is a much more precise measure of the strength of the relationship between the variables and is subject to more precise interpretation, because it can be presented as a proportion or as a percentage. It can be defined as the proportion of variation in Y that is explained by the X's.
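Continuing the sketch, R follows directly from R2:

```python
R = np.sqrt(fitted.rsquared)   # coefficient of correlation
print(f"R = {R:.3f}")
```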
Step 7: Test of significance for Regression Model
We will now test if the given regression model is significant for our analysis.
To do so, we conduct a hypothesis test called the F-test.
The F-ratio is given by the explained variance divided by the unexplained variance.
Our Null and Alternative Hypotheses are as stated below:
Ho (Null Hypothesis): The regression is not significant
Ha (Alternative Hypothesis): The regression is significant
If the calculated F value is greater than the critical value at the 0.05 level of significance, we cannot accept the Null Hypothesis; if it is smaller, we accept the Null Hypothesis.
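Continuing the sketch, the F-ratio can be built from the explained and unexplained sums of squares, F = (SSR / k) / (SSE / (n - k - 1)); statsmodels also reports the same statistic and its p-value in the model summary:

```python
ssr = fitted.ess                          # explained (regression) sum of squares
f_stat = (ssr / k) / (sse / (n - k - 1))  # explained / unexplained variance
print(f"F = {f_stat:.2f}, p-value = {fitted.f_pvalue:.4f}")
# If the p-value < 0.05, we cannot accept Ho: the regression is significant.
```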
How to include a variable which cannot be quantified
Multiple regression allows the use of a technique called dummy variables, or binary variables. With these in place, we can even include qualitative factors like gender. Binary variables are coded 0 or 1.
For example, a variable x1 (Gender) is coded male = 0, female = 1. Then in the regression equation:
Y = β0 + β1x1 + β2x2
When x1 = 1, the value of Y indicates what is obtained for females; when x1 = 0, the value of Y indicates what is obtained for males.
If we have a nominal variable with more than two categories, we have to create a number of new dummy (also called indicator) binary variables.
Dummy Variables also help us in increasing the accuracy of our estimating equations.
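As a sketch of dummy coding in Python, assuming a hypothetical Gender column added to our illustrative data (the values below are made up):

```python
gender = ["M", "F", "F", "M", "F", "M", "F", "M", "F", "M"]   # hypothetical
data["Female"] = [1 if g == "F" else 0 for g in gender]       # female = 1, male = 0

model = sm.OLS(y, sm.add_constant(data[important_xs + ["Female"]])).fit()
print(model.params)   # the Female coefficient is the shift in Y for females

# For a nominal variable with c categories, pd.get_dummies(..., drop_first=True)
# produces the c - 1 indicator columns mentioned above.
```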
I'd be more than happy to help anyone build a multiple regression model for their processes.
Please feel free to write to me at tinaarora12@rediffmail.com
Until then,
Ciao