Data Statistics Assignment 2
(a) By constructing an appropriate QQ plot, determine if the sample of fish lengths appears to be normally distributed [3 marks].
(b) Construct a 95% two-sided confidence interval for mean fish length (i.e. the expected value of fish length).
(c) Using significance level α = 0.05, perform a test to determine if the median length of fish with very high mercury concentration is less than 54 cm. Write down the null and alternative hypotheses [1 mark], the test statistic and associated p-value [1 mark], the test decision (providing a reason for this) [1 mark] and a conclusion using a minimum of mathematical language [1 mark].
(d) Using significance level α = 0.05, perform a test to determine if mean length of fish from the Waccamaw River is higher than mean length of fish from the Lumber River. Write down the null and alternative hypotheses [1 mark], the test statistic and associated p-value [1 mark], the test decision (providing a reason for this) [1 mark] and a conclusion using a minimum of mathematical language [1 mark].
(a) Construct a single chart that displays eight boxplots (one for each combination of factor levels) [2 marks].
(b) Write down the statistical model for a completely randomised block design consistent with the sample data, excluding interaction between the factors [2 marks].
Identify the treatments and measurement units [2 marks].
(c) Using significance level α = 0.05, perform two-way ANOVA (without interaction) and document the F-test for the experimental factor. Write down the null and alternative hypotheses [1 mark], the test statistic and associated p-value [1 mark], the test decision (providing a reason for this) [1 mark] and a conclusion using a minimum of mathematical language [1 mark].
(d) Using significance level α = 0.05, perform Tukey post-hoc analysis on the experimental factor for each level of the blocking factor and determine which pairs of concentrations of mercury are associated with statistically different average fish lengths [2 marks].
(e) Using diagnostic plots of the residuals, assess whether the assumptions of normality and constant variance have been met [2 marks].
In this question, we build a simple linear regression to model the relationship between engine power (pow) and engine displacement (disp). We consider the population model pow = β0 + β1 × disp + ε, where ε is the random error term.
(a) Fit the model described above, write down the regression equation [1 mark] and use the model to calculate the difference in predicted average engine power for vehicles with a 25 cubic inch difference in engine displacement [2 marks].
(b) Construct a scatter plot of engine power on the vertical axis and engine displacement on the horizontal axis and superimpose the fitted regression line over the top [3 marks].
(c) Using 0.05 significance level, test whether average engine power increases by more than 0.33 units for each additional cubic inch in engine displacement. Write down the null and alternative hypotheses [1 mark], the test statistic and p-value [1 mark], the test decision with reason [1 mark] and a conclusion using a minimum of mathematical language [1 mark].
(d) Is there any statistical evidence against the assumption of independent errors [2 marks]?
(e) Provide an estimate of [2 marks].
In this question we extend the model from Q3 into a multiple linear regression. We now consider the population model in which engine power depends on engine displacement, gearbox type and vehicle weight. Note that R will create the dummy variable automatically.
(a) Fit the model described above, write down the regression equation that applies for vehicles with a manual gearbox [1 mark] and provide interpretations of the estimated coefficients and [2 marks].
(b) Using 0.05 significance level, determine if the regression is significant. Write down the null and alternative hypotheses [1 mark], the test statistic and p-value [1 mark], the test decision with reason [1 mark] and a conclusion using a minimum of mathematical language [1 mark].
(c) Compute the 95% mean confidence interval for engine power of a vehicle with a manual gearbox, with engine displacement 295 in³ and weight 2875 lbs [2 marks].
(d) Using 0.05 significance level perform a normality test on the residuals of the fitted model. Write down the null and alternative hypotheses [1 mark], the test statistic and p-value [1 mark], the test decision with reason [1 mark] and a conclusion using a minimum of mathematical language [1 mark].
(e) Is there any indication of multicollinearity in the fitted model [2 marks]?
(f) Provide the Cook’s D of the most influential point [1 mark], refit the regression model on sample data excluding this point and write down the fitted equation [2 marks].
a)
R Code
In a QQ plot, when the data follow a normal distribution, the points tend to fall on the 45° reference line. When the data violate the normality assumption, the points deviate from this line. In the QQ plot above, the points clearly move away from the 45° line, indicating that fish length does not follow a normal distribution.
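A minimal sketch of how such a QQ plot could be produced in R. The vector `fish_length` is simulated stand-in data (an assumption for illustration), not the assignment sample, so the resulting plot will not match the one discussed above.

```r
# Simulated stand-in for the fish length sample (hypothetical data, in cm)
set.seed(1)
fish_length <- rnorm(50, mean = 40, sd = 5)

# Normal QQ plot: sample quantiles against theoretical normal quantiles
qq <- qqnorm(fish_length, main = "Normal QQ plot of fish length")

# Reference line through the first and third quartiles
qqline(fish_length, col = "red")
```

With real data, systematic curvature or heavy tails away from the reference line would suggest non-normality.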
b)
R Code
The 95% confidence interval for the true mean fish length is (38.7 cm, 41.3 cm).
This means that if repeated samples were taken from the same population and an interval were constructed from each, about 95% of those intervals would contain the true mean fish length.
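The interval above is a t-based interval for a mean. The sketch below shows the calculation both by hand and with `t.test`, using simulated stand-in data (`fish_length` is hypothetical, so the numbers will differ from those reported).

```r
# Simulated stand-in for the fish length sample (hypothetical data, in cm)
set.seed(1)
fish_length <- rnorm(50, mean = 40, sd = 5)

n      <- length(fish_length)
xbar   <- mean(fish_length)
se     <- sd(fish_length) / sqrt(n)   # standard error of the mean
t_crit <- qt(0.975, df = n - 1)       # two-sided 95% critical value

# Manual interval: xbar +/- t_crit * se
ci_manual <- c(xbar - t_crit * se, xbar + t_crit * se)

# Same interval from the built-in one-sample t procedure
ci_ttest <- t.test(fish_length, conf.level = 0.95)$conf.int

ci_manual
ci_ttest
```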
c)
R Code
Null Hypothesis: H0: Md ≥ 54
That is, median length of fish with very high mercury concentration is not less than 54 cm
Alternative Hypothesis: Ha: Md < 54
That is, median length of fish with very high mercury concentration is less than 54 cm
The workings of the Wilcoxon signed rank test are given below
p-value = 0.02
From the above output, the value of the test statistic is 25 and its p-value falls below 0.05, so there is sufficient statistical evidence to reject the null hypothesis at the 5% level of significance. Therefore, we conclude that the median length of fish with very high mercury concentration is less than 54 cm.
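A minimal sketch of a one-sided Wilcoxon signed rank test in R. The vector `very_high` is simulated stand-in data for the lengths of fish with very high mercury concentration (an assumption), so the statistic and p-value will not match those reported above.

```r
# Simulated stand-in for lengths of fish in the "very high" mercury group
set.seed(2)
very_high <- rnorm(15, mean = 50, sd = 6)

# H0: median >= 54 versus Ha: median < 54
res <- wilcox.test(very_high, mu = 54, alternative = "less")

res$statistic   # signed rank statistic V
res$p.value     # one-sided p-value
```

The null hypothesis is rejected at the 5% level when `res$p.value` is below 0.05.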
d)
R Code
Null Hypothesis: H0: µLumber = µWaccamaw
That is, mean length of fish from the Waccamaw River is not higher than mean length of fish from the Lumber River
Alternative Hypothesis: Ha: µLumber < µWaccamaw
That is, mean length of fish from the Waccamaw River is higher than mean length of fish from the Lumber River.
The workings of the t test are given below:
The value of the t test statistic is -0.7
The p-value is 0.2
Here, the p-value of the t test falls above 0.05, indicating that there is insufficient evidence to reject the null hypothesis at the 5% level. Therefore, there is no statistical evidence to conclude that the mean length of fish from the Waccamaw River is higher than the mean length of fish from the Lumber River.
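A minimal sketch of the one-sided two-sample comparison, written in the same direction as the hypotheses above (Lumber minus Waccamaw, alternative "less"). The data frame `fish` and its columns are simulated stand-ins, and the Welch version of the test (R's default, which does not assume equal variances) is an assumption, since the variant used for the reported result is not shown.

```r
# Simulated stand-in data: fish lengths (cm) from two rivers
set.seed(3)
fish <- data.frame(
  river  = rep(c("Lumber", "Waccamaw"), each = 20),
  length = c(rnorm(20, mean = 39, sd = 5), rnorm(20, mean = 41, sd = 5))
)

lumber   <- fish$length[fish$river == "Lumber"]
waccamaw <- fish$length[fish$river == "Waccamaw"]

# Ha: mean(Lumber) < mean(Waccamaw), i.e. Waccamaw fish are longer on average
res <- t.test(lumber, waccamaw, alternative = "less")

res$statistic   # t statistic (negative when the Lumber mean is smaller)
res$p.value     # one-sided p-value
```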
a)
R Code
b)
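The statistical model for the completely randomised block design, excluding interaction between the factors, is
Lengthij = µ + Mercuryi + Riverj + εij
Here µ is the overall mean, Mercuryi is the effect of the i-th mercury concentration, Riverj is the effect of the j-th river and εij is the random error. The treatments are the four mercury concentration levels (low, medium, high and very high), and River (two levels, Lumber and Waccamaw) is the blocking factor. The measurement units are the individual fish.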
c)
R code
Main Effect River
Null Hypothesis: H0: µ1 = µ2
That is, there is no difference in mean fish length between the two rivers
Alternative Hypothesis: Ha: µ1 ≠ µ2
That is, there is a difference in mean fish length between the two rivers
Main Effect Mercury
Null Hypothesis: H0: µ1 = µ2 = µ3 = µ4
That is, there is no difference in mean fish length among the four mercury concentration groups
Alternative Hypothesis: Ha: µi ≠ µj for at least one pair i ≠ j
That is, mean fish length differs for at least one pair of mercury concentration groups
The two-way ANOVA output is given below
From the above output, the value of the F test statistic for the main effect River is 0.91 and its p-value is 0.34 > 0.05, indicating that there is no significant difference in mean fish length between the two rivers.
The value of the F test statistic for the main effect Mercury is 37.33 and its p-value is below 0.001 < 0.05, so the null hypothesis is rejected: mean fish length differs for at least one pair of mercury concentration groups.
That is, mean fish length differs significantly across mercury concentration levels.
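A minimal sketch of a two-way ANOVA without interaction in R. The data frame `fish` is simulated stand-in data with hypothetical column names `river`, `mercury` and `length`, so the F statistics will differ from those reported above.

```r
# Simulated stand-in data: 2 rivers x 4 mercury levels x 5 fish per cell
set.seed(4)
levels_hg <- c("low", "medium", "high", "very high")
fish <- data.frame(
  river   = factor(rep(c("Lumber", "Waccamaw"), each = 20)),
  mercury = factor(rep(rep(levels_hg, each = 5), times = 2), levels = levels_hg)
)
fish$length <- 35 + 3 * as.integer(fish$mercury) + rnorm(nrow(fish), sd = 4)

# Additive model: blocking factor (river) plus experimental factor (mercury)
fit <- aov(length ~ river + mercury, data = fish)

# ANOVA table with one F test per factor
anova_table <- summary(fit)[[1]]
anova_table
```

The row for `mercury` carries the F test for the experimental factor, on 3 numerator degrees of freedom.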
d)
R Code
From the above Tukey post hoc test, we see that
Mean fish length is higher for medium mercury concentration than for low mercury concentration.
Mean fish length is higher for high mercury concentration than for low mercury concentration.
Mean fish length is higher for very high mercury concentration than for low mercury concentration.
Mean fish length is higher for high mercury concentration than for medium mercury concentration.
Mean fish length is higher for very high mercury concentration than for medium mercury concentration.
Mean fish length is higher for very high mercury concentration than for high mercury concentration.
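A minimal sketch of Tukey pairwise comparisons on the mercury factor. It uses simulated stand-in data (`fish` and its columns are hypothetical) and pools the two rivers through the additive model, which is a simplification of the per-river analysis the question asks for.

```r
# Simulated stand-in data: 2 rivers x 4 mercury levels x 5 fish per cell
set.seed(5)
levels_hg <- c("low", "medium", "high", "very high")
fish <- data.frame(
  river   = factor(rep(c("Lumber", "Waccamaw"), each = 20)),
  mercury = factor(rep(rep(levels_hg, each = 5), times = 2), levels = levels_hg)
)
fish$length <- 35 + 3 * as.integer(fish$mercury) + rnorm(nrow(fish), sd = 4)

fit <- aov(length ~ river + mercury, data = fish)

# All pairwise differences between mercury levels with 95% family-wise intervals
tk <- TukeyHSD(fit, which = "mercury", conf.level = 0.95)
pairs_tab <- tk$mercury

# Keep only the pairs whose adjusted p-value is below 0.05
significant <- pairs_tab[pairs_tab[, "p adj"] < 0.05, , drop = FALSE]
significant
```

A positive `diff` for a row such as `medium-low` means the first-named level has the larger mean length.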
e) Residual Plots
The normal probability plot of the residuals supports the normality assumption.
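A minimal sketch of the two standard residual diagnostics for this kind of model: residuals against fitted values (constant variance) and a normal QQ plot of the residuals (normality). The data are simulated stand-ins, and `fish` and its columns are hypothetical.

```r
# Simulated stand-in data: 2 rivers x 4 mercury levels x 5 fish per cell
set.seed(6)
levels_hg <- c("low", "medium", "high", "very high")
fish <- data.frame(
  river   = factor(rep(c("Lumber", "Waccamaw"), each = 20)),
  mercury = factor(rep(rep(levels_hg, each = 5), times = 2), levels = levels_hg)
)
fish$length <- 35 + 3 * as.integer(fish$mercury) + rnorm(nrow(fish), sd = 4)

fit <- aov(length ~ river + mercury, data = fish)

# Residuals vs fitted: a roughly even vertical spread suggests constant variance
plot(fit, which = 1)

# Normal QQ plot of residuals: points near the line suggest normal errors
plot(fit, which = 2)

res <- residuals(fit)
```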
a)
R Codes
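# Read the data: three numeric columns followed by one factor column
q3q4.data <- read.csv("F:/q3q4data.csv", header = TRUE, colClasses = c(rep("numeric", times = 3), "factor"))
# Fit the simple linear regression of engine power on engine displacement
fit <- lm(pow ~ disp, data = q3q4.data)
# Show coefficient estimates, standard errors and R-squared
summary(fit)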
The regression equation is
Engine Power = 45.7345 + 0.4376 * Engine Displacement
For a 25 cubic inch difference in engine displacement, the difference in predicted average engine power depends only on the slope:
Difference in Engine Power = 0.4376 * 25 = 10.94
b)
R Code
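A minimal sketch of a scatter plot with the fitted line superimposed. R's built-in `mtcars` data set is used as a stand-in for the assignment's data file, so the column name `hp` (rather than `pow`) is an assumption of the example.

```r
# Stand-in data: mtcars, with hp as engine power and disp as displacement
fit <- lm(hp ~ disp, data = mtcars)

# Engine power on the vertical axis, displacement on the horizontal axis
plot(hp ~ disp, data = mtcars,
     xlab = "Engine displacement (cubic inches)",
     ylab = "Engine power",
     main = "Engine power against engine displacement")

# Superimpose the fitted regression line
abline(fit, col = "red", lwd = 2)
```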
c)
Null Hypothesis: H0: β1 = 0.33
That is, average engine power does not increase by more than 0.33 units for each additional cubic inch in engine displacement
Alternative Hypothesis: Ha: β1 > 0.33
That is, average engine power increases by more than 0.33 units for each additional cubic inch in engine displacement.
The workings for the t test statistic are given below
Thus, the value of the t test statistic is 1.7411 and its p-value is 0.046
Since the p-value falls below 0.05, we reject the null hypothesis and conclude that average engine power increases by more than 0.33 units for each additional cubic inch in engine displacement.
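The test statistic compares the estimated slope with 0.33 in units of its standard error, t = (b1 - 0.33) / SE(b1), and the p-value is the upper tail of a t distribution on the residual degrees of freedom. A minimal sketch, using R's built-in `mtcars` data as a stand-in for the assignment's data file (the column names `hp` and `disp` belong to `mtcars`):

```r
# Stand-in data: mtcars, with hp as engine power and disp as displacement
fit <- lm(hp ~ disp, data = mtcars)

coefs <- coef(summary(fit))
b1    <- coefs["disp", "Estimate"]
se_b1 <- coefs["disp", "Std. Error"]

# H0: beta1 = 0.33 versus Ha: beta1 > 0.33
t_stat <- (b1 - 0.33) / se_b1
p_val  <- pt(t_stat, df = df.residual(fit), lower.tail = FALSE)

t_stat
p_val
```

Note that `summary(fit)` on its own tests each coefficient against zero, so the shifted statistic has to be computed separately as above.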
d)
From the above residual plot, it is clearly seen that the assumption of independent errors is satisfied.
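A visual check can be complemented with a numerical one. The sketch below computes the Durbin-Watson statistic by hand; this is one common check for serial correlation and is offered as an assumption, not as the method behind the statement above. It uses `mtcars` as a stand-in data set and is only meaningful when the rows are in a natural order (for example, time of collection).

```r
# Stand-in data: mtcars, with hp as engine power and disp as displacement
fit <- lm(hp ~ disp, data = mtcars)
e   <- residuals(fit)

# Durbin-Watson statistic: sum of squared successive differences
# divided by the residual sum of squares
dw <- sum(diff(e)^2) / sum(e^2)
dw
```

Values near 2 are consistent with uncorrelated errors; values well below 2 point to positive serial correlation and values well above 2 to negative serial correlation.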
e)
Engine Power = 45.7345 + 0.4376 * Engine Displacement
a)
R Code
Engine Power = - 11.86 + 0.516 * Engine Displacement + 50.39 * Manual + 5.907 * Weight
The coefficient of manual gearbox is 50.39, indicating that a car with a manual gearbox has a predicted average engine power 50.39 hp higher than a car without one, with the other independent variables held constant.
b)
Null Hypothesis: H0: β1 = β2 = β3 = 0
That is, none of the regression coefficients differs from zero
Alternative Hypothesis: Ha: βi ≠ 0 for at least one i
That is, at least one regression coefficient differs from zero
The value of the F test statistic is 34.3 and its p-value at (3, 28) degrees of freedom is 0.000000145
Since the p-value falls well below 0.05, we reject the null hypothesis and conclude that the regression is significant: at least one of the predictors helps to explain engine power
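A minimal sketch of how the overall F statistic and its p-value can be extracted from a fitted model. R's built-in `mtcars` data set is a stand-in for the assignment's data file, and the `gearbox` factor is constructed from the `am` column for illustration.

```r
# Stand-in data: mtcars with am recoded as a gearbox factor
vehicles <- transform(mtcars,
                      gearbox = factor(am, levels = c(0, 1),
                                       labels = c("automatic", "manual")))

fit <- lm(hp ~ disp + gearbox + wt, data = vehicles)

# Overall F statistic with its numerator and denominator degrees of freedom
fs <- summary(fit)$fstatistic

# Upper-tail probability of the F distribution gives the p-value
p_val <- pf(fs["value"], fs["numdf"], fs["dendf"], lower.tail = FALSE)

fs
unname(p_val)
```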
c)
Thus, the 95% mean confidence interval for engine power of a vehicle with a manual gearbox, with engine displacement 295 in³ and weight 2875 lbs, is (173, 242).
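A minimal sketch of a confidence interval for the mean response using `predict`. It uses `mtcars` as a stand-in data set, where weight is recorded in thousands of pounds, so 2875 lbs is entered as 2.875; the `gearbox` factor and the `new_car` data frame are constructed for illustration.

```r
# Stand-in data: mtcars with am recoded as a gearbox factor
vehicles <- transform(mtcars,
                      gearbox = factor(am, levels = c(0, 1),
                                       labels = c("automatic", "manual")))

fit <- lm(hp ~ disp + gearbox + wt, data = vehicles)

# The vehicle of interest: manual gearbox, 295 cubic inches, 2875 lbs
new_car <- data.frame(
  disp    = 295,
  gearbox = factor("manual", levels = levels(vehicles$gearbox)),
  wt      = 2.875
)

# interval = "confidence" gives the interval for the mean response
ci <- predict(fit, newdata = new_car, interval = "confidence", level = 0.95)
ci
```

Using `interval = "prediction"` instead would give the wider interval for a single new vehicle, which is not what part (c) asks for.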
d)
R Code
Null Hypothesis: H0: the residuals are normally distributed
Alternative Hypothesis: Ha: the residuals are not normally distributed
The value of the W test statistic is 0.9 and its p-value is 0.004
Since the p-value falls below 0.05, we reject the null hypothesis and conclude that the residuals of the fitted model violate the normality assumption.
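A W statistic is what the Shapiro-Wilk test reports, so a minimal sketch of that test on the model residuals follows. The data set (`mtcars`) and the constructed `gearbox` factor are stand-ins for the assignment's data file.

```r
# Stand-in data: mtcars with am recoded as a gearbox factor
vehicles <- transform(mtcars,
                      gearbox = factor(am, levels = c(0, 1),
                                       labels = c("automatic", "manual")))

fit <- lm(hp ~ disp + gearbox + wt, data = vehicles)

# Shapiro-Wilk test: H0 is that the residuals come from a normal distribution
sw <- shapiro.test(residuals(fit))

sw$statistic   # W
sw$p.value
```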
e)
R code
Here, the VIF for weight is greater than 5, indicating that weight is strongly correlated with the other predictors (displacement and gearbox type).
The VIF for displacement falls between 1 and 5, indicating moderate correlation between displacement and the other predictors (weight and gearbox type).
Thus, there is an indication of multicollinearity in the fitted model.
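The VIF of a predictor equals 1 / (1 - R²), where R² comes from regressing that predictor on the remaining predictors. A minimal base-R sketch of that definition follows; the helper `vif_manual` is hypothetical, and `mtcars` with a constructed `gearbox` factor stands in for the assignment's data file.

```r
# Stand-in data: mtcars with am recoded as a gearbox factor
vehicles <- transform(mtcars,
                      gearbox = factor(am, levels = c(0, 1),
                                       labels = c("automatic", "manual")))

# Hypothetical helper: regress one predictor on the others and convert R^2 to a VIF
vif_manual <- function(response, others, data) {
  aux_fit <- lm(reformulate(others, response = response), data = data)
  1 / (1 - summary(aux_fit)$r.squared)
}

vif_disp <- vif_manual("disp", c("wt", "gearbox"), vehicles)
vif_wt   <- vif_manual("wt", c("disp", "gearbox"), vehicles)

vif_disp
vif_wt
```

A VIF of 1 means the predictor is uncorrelated with the others; larger values mean its coefficient's variance is inflated by that factor.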
f)
From the Cook's distances, the most influential points are observations 15, 29 and 31
Engine Power = - 11.86 + 0.516 * Engine Displacement + 50.39 * Manual + 5.907 * Weight
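A minimal sketch of how the most influential point can be identified by Cook's distance and the model refitted without it. The data set (`mtcars`) and the constructed `gearbox` factor are stand-ins for the assignment's data file, so the row identified here need not correspond to the observations listed above.

```r
# Stand-in data: mtcars with am recoded as a gearbox factor
vehicles <- transform(mtcars,
                      gearbox = factor(am, levels = c(0, 1),
                                       labels = c("automatic", "manual")))

fit <- lm(hp ~ disp + gearbox + wt, data = vehicles)

# Cook's distance for every observation; the largest marks the most influential point
d   <- cooks.distance(fit)
idx <- which.max(d)
d[idx]

# Refit on the sample with that single observation removed
refit <- lm(hp ~ disp + gearbox + wt, data = vehicles[-idx, ])
coef(refit)
```

Comparing `coef(fit)` with `coef(refit)` shows how much the single point moves the fitted equation.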