Multiple Regression as One of the Most Widely Used Tools in Statistical Analysis

1.0 General Purpose

Multiple regression is an extension of simple linear regression, a statistical procedure that seeks the linear relationship between one independent variable and a dependent variable. In multiple regression, the purpose is to find the linear relationship between the dependent variable and several independent variables (Yan & Su, 2009). Multiple regression is one of the most widely used tools in statistical analysis because it reflects real-world situations well: an outcome is almost always affected by a number of predictors rather than by a single one (Dekking, 2005). A model so simple that it contains only one independent variable is often of limited value, because its predictions are too inaccurate to be useful in a real-world setting. When one wants to predict an outcome more precisely, it is advantageous to use the information provided by two or more variables in an explanatory framework (Burt, Barber, & Rigby, 2009). Multiple regression analysis should therefore allow an analyst to arrive at better predictions.

For example, a student may want to find the best model for getting high grades in school. Using the results of his individual exams as the dependent variable, he may hypothesize that the amount of time spent studying, the amount of sleep taken the night before the exam, the amount of beer drunk the night before the exam, his caloric intake (a fancy phrase for how heavy his meal was) prior to taking the exam, and even the presence of his lucky rabbit’s foot are possible factors in scoring well. Using multiple regression analysis, the student may find that the amount of time spent studying, the amount of sleep taken the night before the exam, and the amount of beer drunk the night before the exam are significant predictors of his exam scores.
He may even be able to come up with a regression model that allows him to forecast his next exam score, given the amount of time spent studying, the amount of sleep taken the night before the exam, and the amount of beer drunk the night before the exam.

On a more professional level, multiple regression is used by researchers in the social and natural sciences to find the best predictors of different outcomes (Yan & Su, 2009). For example, oncologists may be interested in the best predictors of lung cancer, educators may want to know the best predictors of SAT scores, and psychologists may want to find out which factors best predict depression in a particular age group. These questions may all be addressed with the help of multiple regression.

2.0 Computational approach

The main goal of linear regression, and in this case of multiple regression, is to fit a regression line through a number of given points (Wang & Jain, 2003). This regression line is sometimes called the line of best fit, and it represents the regression model of a given problem. (Strictly speaking, once there is more than one predictor the fitted object is a plane or hyperplane rather than a line.) The points are usually best represented graphically in a scatterplot. While it is easy to produce a scatterplot when there is only one independent and one dependent variable, multiple regression involves more than one independent variable, which makes a single scatterplot impractical (Dekking, 2005).

2.1 Least Squares

In regression modeling, the basic estimation procedure is the least squares method (Black, 2010). Since the main goal of linear regression is to fit a line through the points, least squares estimation computes this line so that the sum of the squared deviations of the observed points from the line is minimized (Wang & Jain, 2003).

2.2 The Regression Equation

The bivariate form of simple linear regression produces a line in a two-dimensional space.
This equation is defined by:

    Y = a + bX

where Y is the dependent variable being forecasted by the regression model, X is the independent variable that serves as the predictor, a is the intercept of the regression line, and b is the slope of the regression line, also known as the regression coefficient (Yan & Su, 2009). Since multiple regression is multivariate in form, the regression equation with n independent variables is estimated by:

    Y = a + b1X1 + b2X2 + … + bnXn

where Y is the dependent variable being forecasted by the regression model, a is the intercept, b1, …, bn are the regression coefficients, and X1, …, Xn are the predictors of the model (Black, 2010).

3.0 Assumptions, Limitations and Practical Considerations

Before multiple regression analysis is applied to a data set, certain assumptions must be met. These assumptions include linearity and normality.

3.1 Linearity assumption

Multiple linear regression assumes that the relationship between the dependent variable and the independent variables is linear. The most convenient way to assess linearity is to look at the bivariate scatterplot of each predictor against the dependent variable (Bender, 2000). If a scatterplot follows a curved pattern, one may consider either transforming the variables or explicitly allowing for nonlinear components in the model. Minor nonlinearity is not a major problem, because multiple linear regression procedures are not greatly affected by small deviations from this assumption (Black, 2010).

3.2 Normality Assumption

Multiple regression procedures require that the residuals, that is, the differences between the observed and predicted values, be normally distributed (Cohen, 1988). This may be checked by producing histograms and normal probability plots of the residuals.
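The least squares estimation of section 2.1 and the residuals examined in section 3.2 can be illustrated with a minimal sketch. It assumes nothing beyond the Python standard library; the function names `fit_least_squares` and `residuals` and the small data set are made up for illustration and do not come from the paper. The coefficients are obtained by solving the normal equations, which is one common way (not the only one) to compute a least squares fit.

```python
def fit_least_squares(rows, y):
    """Return [a, b1, ..., bn] minimizing the sum of squared residuals."""
    # Design matrix: a leading 1 in each row carries the intercept a.
    X = [[1.0] + [float(v) for v in row] for row in rows]
    p = len(X[0])
    # Normal equations (X'X) beta = X'y, stored as an augmented matrix.
    A = [[sum(r[i] * r[j] for r in X) for j in range(p)]
         + [sum(r[i] * yi for r, yi in zip(X, y))]
         for i in range(p)]
    # Gaussian elimination with partial pivoting.
    for col in range(p):
        pivot = max(range(col, p), key=lambda r: abs(A[r][col]))
        if abs(A[pivot][col]) < 1e-12:
            raise ValueError("predictors are perfectly collinear")
        A[col], A[pivot] = A[pivot], A[col]
        for r in range(col + 1, p):
            factor = A[r][col] / A[col][col]
            for c in range(col, p + 1):
                A[r][c] -= factor * A[col][c]
    # Back substitution.
    beta = [0.0] * p
    for i in reversed(range(p)):
        tail = sum(A[i][j] * beta[j] for j in range(i + 1, p))
        beta[i] = (A[i][p] - tail) / A[i][i]
    return beta


def residuals(rows, y, beta):
    """Observed minus predicted value for every observation."""
    return [yi - (beta[0] + sum(b * v for b, v in zip(beta[1:], row)))
            for row, yi in zip(rows, y)]


# Hypothetical data generated from Y = 1 + 2*X1 + 3*X2 with no noise,
# so the fit should recover a = 1, b1 = 2, b2 = 3.
rows = [(1, 2), (2, 1), (3, 4), (4, 3), (5, 6)]
y = [1 + 2 * x1 + 3 * x2 for x1, x2 in rows]
beta = fit_least_squares(rows, y)
res = residuals(rows, y, beta)
print("coefficients:", [round(b, 6) for b in beta])
```

With real data the residuals returned by `residuals` would not be zero; they are the values one would plot in a histogram or normal probability plot to check the normality assumption.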
3.3 Limitations

As with bivariate linear regression, the main limitation of multiple linear regression is that, while one can establish relationships between the dependent and independent variables, the underlying causal mechanism between these variables cannot be ascertained from the regression alone (Wang & Jain, 2003). Multicollinearity is also a problem in multiple regression. It arises when two or more independent variables are highly correlated (Yan & Su, 2009). For example, in predicting a student’s SAT score, using both the raw score and the percentage score on the same test as predictors would be redundant, because one is merely a different representation of the other. Thus, it is usually prudent to check for collinearity among the independent variables and to decide which predictor to remove before solving for the regression model (Burt, Barber, & Rigby, 2009).

3.4 Using technology

Software packages such as the Data Analysis add-in of Microsoft Excel and the Statistical Package for the Social Sciences (SPSS) are very helpful in multiple regression analysis because they carry out the computations in very little time, even with a large number of data points or independent variables (Cohen, 1988).

4.0 Steps in Performing Multiple Linear Regression

While the steps outlined below may not apply to every multiple regression problem, they are the basic steps to follow in performing the analysis.

4.1 State the research hypothesis

This is the hypothesis that the research study sets out to investigate.

4.2 State the null hypothesis

The null hypothesis assumes that there is no relationship between the dependent and independent variables in the study.

4.3 Gather the data

4.4 Assess each dependent and independent variable separately

This is where descriptive statistics come in, with measures of central tendency and measures of variation being determined.
In addition, it is best to check the variables for normality.

4.5 Assess the relationship of each independent variable, one at a time, with the dependent variable

This is where the correlation coefficient is calculated. Scatterplots may also be produced at this stage in order to check for linearity.

4.6 Assess the relationships between all of the independent variables with each other

It is best to obtain a correlation matrix for the independent variables in order to address possible issues of multicollinearity.

4.7 Calculate the regression equation from the data

This part is usually taken care of by statistical software such as the Data Analysis add-in of Microsoft Excel or SPSS. Measures of association and tests of statistical significance for each coefficient and for the equation as a whole are also generated in the computer output. In interpreting the summary output, Multiple R in the Regression Statistics table is the multiple correlation coefficient, which indicates the strength of the linear relationship between the dependent variable and the set of independent variables. In the ANOVA table, one should look for F and df, which stands for degrees of freedom. Significance F is the p-value of the overall F test: the model is statistically significant if Significance F is less than or equal to the chosen alpha level. Finally, the Coefficients table gives the constant a (Intercept) and the regression coefficients bi (one for each of the other variables). The corresponding p-value determines whether each coefficient is significant (it should be less than or equal to alpha to be considered significant).

4.8 Accept or reject the null hypothesis

A statistically significant result leads to the rejection of the null hypothesis in favor of the research hypothesis. Correspondingly, if the result is not statistically significant, the null hypothesis is not rejected and the research hypothesis is not supported.
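Two of the calculations in the steps above lend themselves to a short sketch: the pairwise correlation used in steps 4.5 and 4.6, and the overall F statistic reported in the ANOVA table of step 4.7. The sketch assumes only the Python standard library; `pearson_r`, `overall_f` and the raw-score data are hypothetical names and values chosen for illustration. The F statistic is computed from R Square with the standard identity F = (R²/k) / ((1 − R²)/(n − k − 1)), where k is the number of predictors and n the number of observations. The R Square value plugged in is the one reported for the two-predictor model in section 5.1.

```python
from math import sqrt


def pearson_r(x, y):
    """Pearson correlation coefficient between two equal-length lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


def overall_f(r_squared, n_obs, n_predictors):
    """Overall F statistic and its degrees of freedom from R Square."""
    df_regression = n_predictors
    df_residual = n_obs - n_predictors - 1
    f_stat = (r_squared / df_regression) / ((1 - r_squared) / df_residual)
    return f_stat, df_regression, df_residual


# Hypothetical illustration of multicollinearity: a raw score out of 80
# and the same score expressed as a percentage are perfectly correlated.
raw_score = [40, 45, 50, 55, 60]
pct_score = [100 * s / 80 for s in raw_score]
r_redundant = pearson_r(raw_score, pct_score)

# R Square, observations and predictor count as reported in section 5.1
# for the regression of House Price on Mean Gross Pay and Crime Index.
f_stat, df1, df2 = overall_f(0.750285, n_obs=12, n_predictors=2)
print(f"r = {r_redundant:.3f}, F({df1},{df2}) = {f_stat:.2f}")
```

A correlation near 1 or −1 between two predictors is the signal that one of them should probably be dropped. The p-value attached to F (Significance F) requires the F distribution, which the standard library does not provide, so it is left to the statistical package.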
4.9 Explain the practical implications of the findings

This is where the analysis of the results, the conclusion, and the recommendations are presented.

5.0 Examples with Solutions

5.1 Factors affecting property value

Background: In acquiring property, homeowners are especially concerned about the perceived social conditions of a particular location. Safety and affluent neighbors are regarded by many as major considerations in deciding how much to invest in a particular property.

The Problem: Using the available data in Table 1 on House Price, Crime Index, and Mean Gross Pay for twelve Scottish Local Authority areas, determine whether Crime Index and Mean Gross Pay have a combined effect on House Price in these areas.

Table 1. Social, economic and housing market data for Scottish Local Authority areas.

Local Authority       House Price (£)   Crime Index   Mean gross pay (£ per week)
Aberdeen City         177,003.00        161.00        495.10
Aberdeenshire         198,837.00        54.00         509.50
Angus                 148,491.00        68.00         422.40
Argyll & Bute         146,402.00        58.00         409.70
Clackmannanshire      124,495.00        78.00         409.20
Dumfries & Galloway   136,304.00        64.00         401.40
Dundee City           124,660.00        137.00        418.10
East Ayrshire         113,139.00        95.00         419.20
East Dunbartonshire   184,210.00        56.00         551.20
East Lothian          201,359.00        53.00         473.10
East Renfrewshire     199,602.00        53.00         540.60
Edinburgh City        213,915.00        139.00        519.90

Solution: The data in Table 1 were entered in Microsoft Excel and, using the Data Analysis add-in, regression analysis was performed with House Price as the dependent variable and Crime Index and Mean Gross Pay as independent variables. Simple linear regression reveals that, when the two are taken as individual independent variables, Crime Index is not a significant predictor of House Price, F(1,10) = 0.07, n.s., while Mean Gross Pay is a significant predictor, F(1,10) = 29.19, p < .01.
However, when multiple regression is used to test the combined effect of Crime Index and Mean Gross Pay on House Price, the analysis reveals that the two variables have a significant combined effect on House Price, F(2,9) = 13.52, p < .01.

Conclusion: When both Crime Index and Mean Gross Pay are considered as predictors of House Price, the result is significant and House Price may be forecasted using the model:

    Y = 541.2X1 - 67.2X2 - 81454.2

where Y = House Price, X1 = Mean Gross Pay, and X2 = Crime Index. Note, however, that Crime Index is not individually significant in this model (p = 0.67), so the fit is driven almost entirely by Mean Gross Pay.

Summary Output for Regression of House Price on Mean Gross Pay

Regression Statistics
Multiple R          0.863038
R Square            0.744834
Adjusted R Square   0.719318
Standard Error      18872.37
Observations        12

ANOVA
             df   SS               MS               F       Significance F
Regression   1    10396573011.08   10396573011.08   29.19   0.00
Residual     10   3561664733.17    356166473.32
Total        11   13958237744.25

Coefficients
                 Coefficient   Standard Error   t Stat   P-value
Intercept        -87374.04     46850.92         -1.86    0.09
Mean Gross Pay   541.69        100.26           5.40     0.00

Summary Output for Regression of House Price on Crime Index Using Excel

Regression Statistics
Multiple R          0.083088
R Square            0.006904
Adjusted R Square   -0.09241
Standard Error      37231.54
Observations        12

ANOVA
             df   SS               MS              F      Significance F
Regression   1    96362970.44      96362970.44     0.07   0.80
Residual     10   13861874773.81   1386187477.38
Total        11   13958237744.25

Coefficients
              Coefficient   Standard Error   t Stat   P-value
Intercept     170437        26554.54         6.42     0.00
Crime Index   -75.6172      286.80           -0.26    0.80

Summary Output for Regression on Mean Gross Pay and Crime Index

Regression Statistics
Multiple R          0.86619
R Square            0.750285
Adjusted R Square   0.694792
Standard Error      19679.62
Observations        12

ANOVA
             df   SS               MS              F       Significance F
Regression   2    10472651947.43   5236325973.71   13.52   0.00
Residual     9    3485585796.82    387287310.76
Total        11   13958237744.25

Coefficients
                 Coefficient   Standard Error   t Stat   P-value
Intercept        -81454.15     50647.83         -1.61    0.14
Mean Gross Pay   541.20        104.56           5.18     0.00
Crime Index      -67.19        151.60           -0.44    0.67

5.2 Factors affecting Greenhouse Gas Emission

Background: During the past decade, countries all over the world have experienced torrential rains during the dry season and prolonged droughts during the wet season. Summers are becoming hotter than usual while winters are becoming more and more intolerable. Sea levels are rising, and freak hurricanes that last for an unusually long time have brought tremendous damage to many parts of the world. The world is, undoubtedly, at the mercy of climate change.

The Problem: Using the available data in Table 2 on the Gross Domestic Product (GDP), Population Density (PD), and Greenhouse Gas Emissions (GHG) of a sample of twelve countries, determine whether GDP and PD have a significant combined effect on the GHG levels in these countries.

Table 2. Economic and environmental performance indicators of sampled countries.

Country        GHG (in tons)   GDP (per capita in USD)   PD (per sq km)
Australia      26              48,707                    3
Canada         22              45,051                    3
China          4               3,404                     12
Germany        12              44,525                    231
India          2               1,066                     126
Italy          9               38,887                    190
Japan          11              38,271                    337
Mexico         5               10,216                    50
Poland         10              13,887                    124
South Africa   10              5,685                     63
UK             11              43,652                    244
USA            24              47,155                    30

Solution: The data in Table 2 were entered in Microsoft Excel and, using the Data Analysis add-in, regression analysis was performed with GHG as the dependent variable and GDP and PD as independent variables. Simple linear regression reveals that, when the two are taken as individual independent variables, GDP is a significant predictor of GHG, F(1,10) = 14.88, p < .01, while PD is not a significant predictor, F(1,10) = 1.13, n.s. However, when multiple regression is used to test the combined effect of GDP and PD on GHG, the analysis reveals that the two variables have a significant combined effect on GHG, F(2,9) = 28.53, p < .01.

Conclusion: When both GDP and PD are considered as predictors of GHG, the result is significant and the GHG level may be forecasted using the model:

    Y = 0.00036X1 - 0.03741X2 + 6.3427

where Y = GHG, X1 = GDP, and X2 = PD.
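For readers without Excel, a two-predictor model of this kind can also be fitted by hand. The following is a minimal sketch, assuming only the Python standard library, that applies the textbook closed-form solution for two predictors (built from sums of squares and cross-products of deviations from the means) to the Table 2 data. The function name `fit_two_predictors` is made up for illustration, and the sketch makes no claim to reproduce the Excel output digit for digit.

```python
# Table 2 data as (GHG, GDP per capita in USD, PD per sq km).
table2 = [
    (26, 48707, 3),    # Australia
    (22, 45051, 3),    # Canada
    (4, 3404, 12),     # China
    (12, 44525, 231),  # Germany
    (2, 1066, 126),    # India
    (9, 38887, 190),   # Italy
    (11, 38271, 337),  # Japan
    (5, 10216, 50),    # Mexico
    (10, 13887, 124),  # Poland
    (10, 5685, 63),    # South Africa
    (11, 43652, 244),  # UK
    (24, 47155, 30),   # USA
]


def fit_two_predictors(y, x1, x2):
    """Least squares estimates (a, b1, b2) for Y = a + b1*X1 + b2*X2."""
    n = len(y)
    my, m1, m2 = sum(y) / n, sum(x1) / n, sum(x2) / n
    d1 = [v - m1 for v in x1]
    d2 = [v - m2 for v in x2]
    dy = [v - my for v in y]
    # Sums of squares and cross-products of the deviations.
    s11 = sum(a * a for a in d1)
    s22 = sum(b * b for b in d2)
    s12 = sum(a * b for a, b in zip(d1, d2))
    s1y = sum(a * c for a, c in zip(d1, dy))
    s2y = sum(b * c for b, c in zip(d2, dy))
    det = s11 * s22 - s12 ** 2
    if det == 0:
        raise ValueError("the two predictors are perfectly collinear")
    b1 = (s22 * s1y - s12 * s2y) / det
    b2 = (s11 * s2y - s12 * s1y) / det
    a = my - b1 * m1 - b2 * m2  # the fitted plane passes through the means
    return a, b1, b2


ghg = [row[0] for row in table2]
gdp = [row[1] for row in table2]
pd_ = [row[2] for row in table2]

a, b1, b2 = fit_two_predictors(ghg, gdp, pd_)
fitted = [a + b1 * g + b2 * p for g, p in zip(gdp, pd_)]
resid = [obs - fit for obs, fit in zip(ghg, fitted)]

# R Square = 1 - SS(residual) / SS(total).
mean_ghg = sum(ghg) / len(ghg)
ss_total = sum((v - mean_ghg) ** 2 for v in ghg)
ss_resid = sum(r ** 2 for r in resid)
r_squared = 1 - ss_resid / ss_total

print(f"GHG = {a:.4f} + {b1:.5f}*GDP + {b2:.5f}*PD, R Square = {r_squared:.2f}")
```

The printed coefficients can be compared with the Intercept, GDP and PD rows of the summary output for the two-predictor model. For more than two predictors the closed form becomes unwieldy, and a general solver or a statistical package is the more practical choice.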
Summary Output for Regression of GHG on GDP

Regression Statistics
Multiple R          0.77
R Square            0.60
Adjusted R Square   0.56
Standard Error      5.20
Observations        12

ANOVA
             df   SS       MS       F       Significance F
Regression   1    401.72   401.72   14.88   0.00
Residual     10   269.94   26.99
Total        11   671.67

Coefficients
            Coefficient   Standard Error   t Stat   P-value
Intercept   3.36          2.73             1.23     0.25
GDP         0.00          0.00             3.86     0.00

Summary Output for Regression of GHG on PD

Regression Statistics
Multiple R          0.32
R Square            0.10
Adjusted R Square   0.01
Standard Error      7.77
Observations        12

ANOVA
             df   SS       MS      F      Significance F
Regression   1    68.22    68.22   1.13   0.31
Residual     10   603.44   60.34
Total        11   671.67

Coefficients
            Coefficient   Standard Error   t Stat   P-value
Intercept   14.81         3.35             4.42     0.00
PD          -0.02         0.02             -1.06    0.31

Summary Output of Regression of GHG on GDP and PD

Regression Statistics
Multiple R          0.93
R Square            0.86
Adjusted R Square   0.83
Standard Error      3.19
Observations        12

ANOVA
             df   SS       MS       F       Significance F
Regression   2    580.16   290.08   28.53   0.00
Residual     9    91.51    10.17
Total        11   671.67

Coefficients
            Coefficient   Standard Error   t Stat   P-value
Intercept   6.342689      1.82             3.48     0.01
GDP         0.00036       0.00             7.10     0.00
PD          -0.03741      0.01             -4.19    0.00

6.0 References

Bender, E. (2000). An introduction to mathematical modelling. Mineola, NY: Dover Publications, Inc.
Black, K. (2010). Business statistics: Contemporary decision making. Hoboken, NJ: John Wiley & Sons.
Burt, J., Barber, G., & Rigby, D. (2009). Elementary statistics for geographers. Guilford Press.
Cohen, J. (1988). Statistical power analysis for the behavioral sciences. Hillsdale, NJ: Routledge.
Dekking, M. (2005). A modern introduction to probability and statistics: Understanding why and how. Springer.
Wang, G., & Jain, C. (2003). Regression analysis: Modeling and forecasting. Institute of Business Forec.
Yan, X., & Su, S. (2009). Linear regression analysis: Theory and computing. World Scientific.

For Problem 1:

Birch, E., & Wichter, S. (2008). Growing greener cities: Urban sustainability in the 21st century. Philadelphia: University of Pennsylvania Press.
Chorafas, D. (2007). Risk accounting and risk management for accountants. MA: CIMA Publishing.
Chu, W. (2009, March 2). Home prices and median household income. Retrieved December 13, 2010, from The Online Community for Bond Market Investors and Professionals: http://www.fixedincomecolor.com
Kazmier, L. (2004). Schaum's outlines: Theory and problems on business statistics (4th ed.). USA: McGraw-Hill.
Woolridge, J. (2009). Introductory econometrics. Mason, OH: South Western Cengage Learning.

For Problem 2:

Claussen, E. (2001). Climate change: Science, strategies and solutions. Arlington, VA: Pew Center on Global Climate Change.
Independent Statistics and Analysis. (2008). Retrieved December 11, 2010, from US Energy Information Administration: http://www.eia.doc.gov
Marber, P. (2009). Seeing the elephant: Understanding globalization from trunk to tail. USA: John Wiley and Sons.
Sawin, J. L. (2004). Mainstreaming renewable energy in the 21st century. Danvers, MA: Worldwatch Institute.
Victor, D. (2004). Climate change: Debating America's policy options. USA: Council on Foreign Relations, Inc.
