Chi-Square Test
Prof Alan Fahey and Prof Trudee Fair
What is the Chi-Square Test?
The chi-square test is a statistical test to determine whether the observed values in your dataset differ from the values that you would expect by chance alone. Whether a difference can be put down to chance is judged from the p-value generated. If the p-value is less than 0.05, then a difference at least as large as the one observed would arise less than 5% of the time if only random chance were at work, therefore we deem the difference to be statistically significant.
What is the Chi-Square Test formula?
$$\chi^2 = \sum \frac{(O - E)^2}{E}$$
where χ² = chi-square, O = observed and E = expected, and the sum is taken over all categories (cells).
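A minimal Python sketch of the statistic, assuming the usual Pearson form (the sum of (O − E)²/E over all categories). The function name and the counts are hypothetical and are used only for illustration.

```python
def chi_square_statistic(observed, expected):
    """Return the sum of (O - E)^2 / E over paired observed and expected counts."""
    if len(observed) != len(expected):
        raise ValueError("observed and expected must have the same length")
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))


# Hypothetical counts for two categories (not taken from the handout).
observed = [30, 20]
expected = [25, 25]
statistic = chi_square_statistic(observed, expected)
print(statistic)  # (30-25)^2/25 + (20-25)^2/25 = 2.0
```

For a two-way table, the same function can be applied to the cells flattened into a single list, as long as observed and expected cells are listed in the same order.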
When do you use a Chi-Square Test?
A chi-square test is used when the data you collect are categorical (qualitative) and recorded as counts, for example the number of males and females, or the number of bouts of aggression.
How to calculate the observed value?
The observed value is the data that you collect as part of your project. You might want to consider how to represent this data, and this should be consistent with your research objective. For example, if you are collecting focal behaviour data every 5 minutes, it may be more useful to tabulate hourly behaviours and look at changes on an hour-by-hour basis.
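One possible way to tabulate 5-minute focal records into hourly counts is sketched below in Python. The record layout (minutes since the start of observation, behaviour label), the behaviour names and the function name are all assumptions made for illustration.

```python
from collections import Counter

# Hypothetical focal records: (minutes since start of observation, behaviour).
records = [
    (0, "feeding"), (5, "feeding"), (10, "resting"),
    (60, "resting"), (65, "aggression"), (70, "resting"),
]


def tabulate_hourly(records):
    """Count how often each behaviour was recorded within each hour."""
    counts = Counter()
    for minute, behaviour in records:
        hour = minute // 60  # hour 0 covers minutes 0-59, hour 1 covers 60-119, ...
        counts[(hour, behaviour)] += 1
    return counts


hourly = tabulate_hourly(records)
for (hour, behaviour), count in sorted(hourly.items()):
    print(hour, behaviour, count)
```

The resulting counts per hour and behaviour are the observed values that would go into the chi-square table.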
How to calculate the expected value?
Example
A survey is conducted of 175 young adults whose parents are classified either as wealthy, middle class or poor to determine their highest level of schooling (graduated from university, graduated from high school or neither). The results are summarized on the left side of Figure 1. Based on the data collected, is a person’s level of schooling independent of their parents’ wealth?
Hypothesis
H0: Highest level of schooling attained is independent of parents’ wealth
We now show how to construct the table of expected values. We know that 45 of the 175 people in the sample are from wealthy families, and so the probability that someone in the sample is from a wealthy family is 45/175 = 25.7%. Similarly, the probability that someone in the sample graduated from university is 68/175 = 38.9%. Under the null hypothesis, being from a wealthy family is independent of graduating from university, and so the expected probability of both events occurring together is simply the product of the two probabilities, or 25.7% ∙ 38.9% = 10.0%. Thus, based on the null hypothesis, we expect that 10.0% of 175 = 17.5 people are from a wealthy family and have graduated from university. The other cells of the expected table are calculated in the same way from their row and column totals.
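The expected-value calculation can be sketched in Python as follows. Multiplying the two probabilities and then by the sample size is the same as (row total × column total) / grand total. The function names are hypothetical, and only the totals quoted in the text (45, 68 and 175) are used as data.

```python
def expected_count(row_total, column_total, grand_total):
    """Expected cell count under independence: row total * column total / grand total."""
    return row_total * column_total / grand_total


def expected_table(observed):
    """Build the full table of expected counts from a table of observed counts."""
    row_totals = [sum(row) for row in observed]
    column_totals = [sum(column) for column in zip(*observed)]
    grand_total = sum(row_totals)
    return [
        [expected_count(r, c, grand_total) for c in column_totals]
        for r in row_totals
    ]


# Wealthy family (45 of 175) and graduated from university (68 of 175).
wealthy_and_university = expected_count(45, 68, 175)
print(round(wealthy_and_university, 1))  # 17.5
```

Passing the full table of observed counts to `expected_table` would return every expected cell at once.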
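Conduct the chi-square test in Excel with the following formula, where the first range holds the observed values and the second range holds the expected values: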
CHITEST(B6:D8,H6:J8) = 0.003273 < .05 = α
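Because the p-value is below α = 0.05, we reject the null hypothesis and conclude that the highest level of schooling attained is not independent of parents’ wealth.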
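CHITEST returns the p-value directly. For readers working outside Excel, the Python sketch below converts a chi-square statistic into a p-value using the closed form of the upper tail that holds when the degrees of freedom are even. That restriction is an assumption of this sketch, not a general method, and the function name is hypothetical. A 3 × 3 table such as the one in this example has (3 − 1) × (3 − 1) = 4 degrees of freedom.

```python
import math


def chi_square_p_value_even_df(statistic, df):
    """Upper-tail p-value of the chi-square distribution, for even df only.

    For df = 2k the tail is exp(-x/2) * sum_{i=0}^{k-1} (x/2)^i / i!.
    """
    if df <= 0 or df % 2 != 0:
        raise ValueError("this sketch only handles positive, even degrees of freedom")
    half = statistic / 2.0
    term = 1.0   # (x/2)^0 / 0!
    total = 1.0
    for i in range(1, df // 2):
        term *= half / i  # builds (x/2)^i / i! from the previous term
        total += term
    return math.exp(-half) * total


# 9.488 is the usual tabulated 5% critical value for 4 degrees of freedom.
p_value = chi_square_p_value_even_df(9.488, 4)
print(p_value < 0.051)  # True: the p-value is approximately 0.05
```

The null hypothesis is rejected when the p-value obtained this way is less than the chosen α.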
QMB 3200
Homework #9
Instructions:
1. Solve all the problems. All the problems carry 20 points each. Maximum score for this Homework is 120 points.
2. Presenting only the final answer is not sufficient to get complete credit. Show the steps in your solution approach so that partial credit can be earned for the various steps in the final solution. It is your responsibility to demonstrate mastery of the subject matter through your answers.
3. Use EXCEL. Doing so will save you plenty of time. Submit your report as a single Excel file in . Solve each problem on a separate tab (worksheet). No Exceptions. DO NOT, I repeat, DO NOT try to solve the problems using a calculator. Organize your solutions on the Excel worksheet properly. Show where your answers are for each problem and the sections of the problem. Use proper formatting. Name Your File to show your Full Name and the HW Number.
4. Upload your report file on Canvas and verify that everything is fine by opening up the uploaded file. It is your responsibility to ensure your report is uploaded properly.
5. Do not wait until the last minute. The deadline is strictly enforced by Canvas. No hardcopy submissions are accepted. No e-mail submissions are accepted. If your file does not appear on Canvas by the deadline, zero points will be recorded for you for that HW. No exceptions are entertained for any reason under any circumstance in this regard.
1. PC World rated four component characteristics for 10 ultraportable laptop computers: features; performance; design; and price. Each characteristic was rated using a 0–100 point scale. An overall rating, referred to as the PCW World Rating, was then developed for each laptop. The following table shows the performance rating, features rating, and the PCW World Rating for the 10 laptop computers.
Model | Performance | Features | PCW Rating
Thinkpad X200 | 77 | 87 | 83
VGN-Z598U | 97 | 85 | 82
U6V | 83 | 80 | 81
Elitebook 2530P | 77 | 75 | 78
X360 | 64 | 80 | 78
Thinkpad X300 | 56 | 76 | 78
Ideapad U110 | 55 | 81 | 77
Micro Express JFT2500 | 76 | 73 | 75
Toughbook W7 | 46 | 79 | 73
HP Voodoo Envy133 | 54 | 68 | 72
a) Perform Multiple Regression Analysis treating PCW World Rating as the dependent variable.
b) Write down the estimated regression equation.
c) Interpret the slope coefficients for each of the independent variables.
d) Conduct Hypothesis tests on Regression and Individual coefficients at 0.05 level of significance.
e) What are the values of Coefficient of Multiple Determination and Adjusted Coefficient of Multiple Determination?
f) Comment on Goodness of Fit between the dependent variable and the two independent variables.
g) What is the expected PCW Rating when Performance Rating is 65 and Features Rating is 82?
2. The owner of Showtime Movie Theaters, Inc., would like to estimate weekly gross revenue as a function of advertising expenditures. Historical data for a sample of eight weeks follow.
Television Advertising ($1000s) | Newspaper Advertising ($1000s) | Weekly Revenue ($1000s)
3 | 3.3 | 98
3.5 | 2.3 | 97
2.5 | 4.2 | 97
5 | 1.5 | 99
2 | 2 | 93
4 | 1.5 | 98
2.5 | 2.5 | 95
3 | 2.5 | 97
Predictor | Coef | SE Coef | T | P
Constant | 86.230 | 1.574 | 54.79 | 0.000
Television Advertising ($1000s) | 2.2902 | 0.3041 | 7.53 | 0.001
Newspaper Advertising ($1000s) | 1.3010 | 0.3207 | 4.06 | 0.010
S = 0.642587 R-Sq = 91.9% R-Sq(adj) = 88.7%
Analysis of Variance
Source | DF | SS | MS | F | P
Regression | ? | 23.435 | ? | ? | 0.002
Residual Error | ? | ? | ? | |
Total | ? | 25.500 | | |
Source DF Seq SS
Television Advertising ($1000s) 1 16.640
Newspaper Advertising ($1000s) 1 6.795
a) Write down the estimated regression equation that relates weekly revenue to both television advertising and newspaper advertising as the independent variables.
b) Interpret the slope coefficients for each of the independent variables.
c) Complete the ANOVA Table
d) Conduct Hypothesis tests on Regression and Individual coefficients at 0.05 level of significance.
e) What are the values of Coefficient of Multiple Determination and Adjusted Coefficient of Multiple Determination?
f) Comment on Goodness of Fit between the dependent variable and the two independent variables.
g) How are R-Sq and R-Sq (adj) calculated?
h) What is the gross revenue expected for a week when $3500 is spent on television advertising and $1800 is spent on newspaper advertising?
3. Refer to the Johnson Filtration problem introduced in this section. Suppose that in addition to information on the number of months since the machine was serviced and whether a mechanical or an electrical repair was necessary, the managers obtained a list showing which repairperson performed the service. The revised data follow.
Repair Time in Hours | Months Since Last Service | Type of Repair | Repairperson
2.9 | 2 | Electrical | Dave Newton
3 | 6 | Mechanical | Dave Newton
4.8 | 8 | Electrical | Bob Jones
1.8 | 3 | Mechanical | Dave Newton
2.9 | 2 | Electrical | Dave Newton
4.9 | 7 | Electrical | Bob Jones
4.2 | 9 | Mechanical | Bob Jones
4.8 | 8 | Mechanical | Bob Jones
4.4 | 4 | Electrical | Bob Jones
4.5 | 6 | Electrical | Dave Newton
a) Ignore for now the months since the last maintenance service (x1) and the repairperson who performed the service. Develop the estimated simple linear regression equation to predict the repair time (y) given the type of repair (x2). Recall that x2 = 0 if the type of repair is mechanical and 1 if the type of repair is electrical.
b) Does the equation that you developed in part (a) provide a good fit for the observed data? Explain.
c) Ignore for now the months since the last maintenance service and the type of repair associated with the machine. Develop the estimated simple linear regression equation to predict the repair time given the repairperson who performed the service. Let x3 = 0 if Bob Jones performed the service and x3 = 1 if Dave Newton performed the service.
d) Does the equation that you developed in part (c) provide a good fit for the observed data? Explain.
e) Develop the estimated regression equation to predict the repair time given the number of months since the last maintenance service, the type of repair, and the repairperson who performed the service.
f) At the .05 level of significance, test whether the estimated regression equation developed in part (e) represents a significant relationship between the independent variables and the dependent variable.
g) Is the addition of the independent variable x3, the repairperson who performed the service, statistically significant? Use α = .05. What explanation can you give for the results observed?
4. Copy the first sheet in QMB3200-Homework#9Data.xlsx called “HomePrices” to your file. This sheet has some data on some homes’ appraised values and selling prices and some other fields.
a) Perform Multiple Regression Analysis by treating Selling Price as the dependent variable and all the other variables as independent variables.
b) Conduct Hypothesis tests on Regression and Individual coefficients at 0.05 level of significance. Is the multiple regression between the variables statistically significant? Which of the independent variables need to appear in the model?
c) Revise your model based on your findings.
d) Write down the estimated regression equation.
e) What are the values of Coefficient of Multiple Determination and Adjusted Coefficient of Multiple Determination?
f) Comment on Goodness of Fit between the dependent variable and the independent variables.
5. Copy the second sheet in QMB3200-Homework#9Data.xlsx called “Top 50 MBA Programs” to your file. This data is according to US News and World Report, 2009 survey.
a) Perform Multiple Regression Analysis by treating Overall Rating as the dependent variable and all the other variables as independent variables.
b) Conduct Hypothesis tests on Regression and Individual coefficients at 0.05 level of significance. Is the multiple regression between the variables statistically significant? Which of the independent variables need to appear in the model?
c) Revise your model based on your findings.
d) Write down the estimated regression equation.
e) What are the values of Coefficient of Multiple Determination and Adjusted Coefficient of Multiple Determination?
f) Comment on Goodness of Fit between the dependent variable and the independent variables.
6. The U.S. Department of Energy’s Fuel Economy Guide provides fuel efficiency data for cars and trucks. A portion of the sample data for 311 compact, midsize, and large cars follows. The column labeled Class identifies the size of the car; Compact, Midsize, or Large. The column labeled Displacement shows the engine’s displacement in liters. The column labeled Fuel Type shows whether the car uses premium (P) or regular (R) fuel, and the column labeled Hwy MPG shows the fuel efficiency rating for highway driving in terms of miles per gallon. A partial report of regression analysis is provided below. Answer the questions based on the report.
Regression Analysis: Hwy MPG versus Displacement, ClassMidsize, …
Predictor | Coef | SE Coef | T | P
Constant | 29.7624 | 0.5521 | 53.91 | 0.000
Displacement | -1.6347 | 0.1169 | -13.98 | 0.000
ClassMidsize | 3.9634 | 0.3193 | | 0.000
ClassLarge | 1.6450 | 0.2940 | | 0.000
FuelPremium | -1.1210 | 0.2090 | | 0.000
S = 1.64596 R-Sq = 83.4% R-Sq(adj) = 83.2%
Analysis of Variance
Source | DF | SS | MS | F | P
Regression | | | | | 0.000
Residual Error | | 829.0 | | |
Total | | 4989.3 | | |
Predicted Values for New Observations
New Obs Fit SE Fit 99% CI 99% PI
1 25.3822 0.2233 (24.8033, 25.9610) (21.0767, 29.6876)
Values of Predictors for New Observations
New Obs Displacement ClassMidsize ClassLarge FuelPremium
1 3.00 0.000000 1.00 1.00
a) What is the sample size used for the analysis?
b)
c) Identify the Dependent Variable and the Independent Variables in the model.
d) Calculate ‘t’ test statistic values. Complete the ANOVA Table.
e) Write down Hypothesis Statements. Conduct both p-value and critical-value based hypothesis tests at 0.01 level of significance. Is the Multiple Regression between “Hwy MPG”, Car “Class”, Engine “Displacement”, and “Fuel Type” statistically significant?
f) Write down Hypothesis Statements. Which ones among the independent variables are statistically significant at 0.01 level of significance? How are you able to determine the same?
g) Write down the estimated regression equation in terms of the dependent and independent variables for the given problem (Use variable names – Do not use y, x … etc.).
h) Interpret the coefficient for the variable “Displacement”.
i) What are the values of Multiple Coefficient of Determination and Adjusted Multiple Coefficient of Determination?
j) Verify why the R-Sq and R-Sq (adj) values are equal to 83.4% and 83.2% respectively.
k) What is the interpretation of R-Sq = 83.4%?
l) Would you recommend using the estimated regression equation? What is your basis?
m) What is the expected Hwy MPG for “Compact” cars with “Displacement” = 1.6 Liters when “Regular Fuel” is used?
n) What is the expected Hwy MPG for: “Large” cars with “Engine Displacement” = 3.0 Liters when “Premium Fuel” is used (which is the new observation in the report above)?
o) What is the 99% interval estimate on “mean” Hwy MPG for “Large” cars with “Engine Displacement” = 3.0 Liters when “Premium Fuel” is used (which is the new observation in the report above)?