Final Project Part 3: Final Data Analysis Report

Due: December 20, 2022 by 11:59PM ET on Quercus

No late submissions will be accepted

Goal of the Assessment:

Part 3 of the Final Project is your opportunity to demonstrate all that you have learned

throughout the course. This will be done by showing the teaching team that you can use the

methods and techniques learned in the course appropriately. You can use the feedback that

you have received in Parts 1 and 2, as well as in the video project, to write a report that is in a

common research paper format (IMRD: Introduction, Methods, Results, Discussion). Writing

these kinds of reports is likely something that, as a graduate student or a statistician working in

industry, you will find yourself doing occasionally.

Since this assignment is used to assess how familiar you are with the use of the tools and

methods from this course only, you should NOT use materials that were not covered in this

course. Instead, focus on showing us how much you know about everything we have discussed

throughout the term.

Your report can also be used as part of a dossier when applying for jobs to showcase your abilities as a

statistician and data analyst.

General Instructions:

Using only methods and techniques presented in the lecture slides throughout the term, you

are tasked with answering your proposed research question by creating the ‘best’ linear

regression model that meets the requirements of your research question. You will then need to

write a report (details below) that (i) introduces your research question and presents some

background, (ii) outlines the steps in your analysis that you followed to reach the ‘best’ model,

(iii) presents the results of your analysis and describes and justifies the decisions you made, and

finally (iv) discusses the final model, its interpretation and its limitations in terms of its ability to

meet your research goals. It should be made clear whether you are aiming for a model that

makes good predictions, or a model that is more descriptive and easier to interpret, or some

combination of both.

The feedback and work you have put into Part 1 of the final project should help you structure

your report in a professional and easy-to-read fashion, as well as provide you with a good

beginning to your introduction section. You may want to consider adding some additional

background research or more discussion about how your research question is important and

different from the background you present. The EDA portion of Part 1 should be helpful in

writing the beginning of the results section, where you display the characteristics of the data

you will use to answer your question.

The feedback and work you have put into Part 2 of the final project should help you structure

the methods section of your report, where you will outline the process you followed/tools and

methods you used to answer your research question. The feedback should also help you with

how you approach your data analysis itself.

How to present your final report:

Once you have decided upon the ‘best’ model to fulfill the goal of the project, you must write

up a short scientific report. There should be 4 main sections of your report:

• Introduction section: where you introduce the purpose and relevance/importance of the project and provide some relevant background information on the topic (no results or data should be presented here).

• Methods section: where you describe and explain the methods, tools and techniques used to arrive at your final model (no results or data should be presented here, but you can tell us where you found your data and what variables it contains).

• Results section: where you present a numerical/graphical description of your study sample and important results that led you to make crucial decisions in building your model (following the methods you outline in the earlier section), followed by the final model and any other important results.

• Discussion section: where you interpret your final model and describe why it answers the research question and why it is important, as well as discuss any limitations that still exist based on your results.

You may use tables and plots to help present your results, but they must be relevant and well-thought-out to convey as much information as possible without being too overwhelming or

confusing. When explaining your methods, try to avoid just stating that you used a specific

method, but add an explanation for how it is used to achieve a specific task. When presenting

your results, avoid repeating exactly what you wrote in your methods section. Instead, focus on

the results of the process you described earlier, and use numerical values/graphical results to

support the decisions you made in arriving at your final model. See the rubric for more

information regarding the various report components.

If you want more information about how to structure your report and what should be

contained in each section, see this cheat sheet and this outline for reports (you may ignore the

abstract portion since you do not need one). Note that not all the elements in these resources

need to be included in your report. But you can use these to better understand how to

structure your submission.

Finally, if you use any external resources outside of the lecture slides, e.g. to provide

background on your topic, you should include a reference section at the end of your report. You

may follow APA citation styles to help format your references. For some resources on how to

cite, see the library page on citations.

What to do if you want to change your dataset or research question:

If you wish to change your dataset or research question from what was originally proposed in

Part 1, you are allowed to do so. However, you will need to provide a written statement that

proposes the change you wish to make. In order to change your dataset or research question,

you will need to submit a 1-page document (to be submitted by December 4 at 11:59PM ET on

Quercus) that answers the following two questions:

1. Why are you changing your topic or dataset? Elaborate on what made your original

dataset or topic not appropriate for the final project.

2. What makes your new topic and/or dataset more appropriate than the previous one?

Be sure to clearly state your new research question and provide a short, written

description of where you located your dataset and what information it contains.

The instructor will then approve or provide suggestions to improve your new dataset/research

question.

Technical Requirements of the Final Report:

Your report should be typed using whatever software you prefer but must be saved and

submitted as a PDF or .docx file on Quercus. Your report must meet the following

requirements:

• Font: 12-point font in a style similar to Times New Roman (this is the default in R Markdown)

• Spacing: single-spaced

• Word count: up to a maximum of 1500 words in total (this does not include captions on figures and tables; however, captions should not be excessively long or contain information that isn’t mentioned in the main text). We will still accept a report that exceeds the word limit by no more than 150 words.

• Number of tables/figures in the main report: 5 in total, but you may use any combination of tables and figures

• Figure and table captions: all figures and tables included should include a caption that describes what is being presented (caption not included in the word count).
    o Captions should not contain information that is not also discussed in the main report

• Figure properties:
    o All plots should have an appropriate title and axis labels, avoiding the use of variable names as they appear in the dataset
    o A figure may include multiple individual plots, but they should be related to each other and it should be clear why they are being presented together
        § Avoid having too many plots in the same figure to ensure that they are legible and clear.

• Reference list or bibliography at the end of the report (will not count towards word count), using appropriate citation style

• Appendix: you may add an appendix at the end of your report to include some additional tables or figures that were not important enough to be part of the main report, but still relevant to your analysis:
    o up to 3 additional tables/figures, but they should only be included if they are relevant to the analysis and are referred to in the main text.

• R code: In a separate file (i.e. RMD file), you should upload your cleaned and complete version of the R code that was used to conduct your analysis. The R code should be well-organized and commented appropriately to indicate what each line/section of code is doing.
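The figure properties listed above (descriptive titles and axis labels, and related plots grouped in one figure) can be illustrated with a short base R sketch. It is only an example: the built-in mtcars data and the label wording are assumptions chosen for illustration, not part of the assignment.

```r
# Illustrative only: mtcars stands in for your own dataset.

# Two related plots side by side in a single figure.
old_par <- par(mfrow = c(1, 2))

# Use descriptive titles and axis labels rather than raw
# variable names such as "wt" or "mpg".
plot(mtcars$wt, mtcars$mpg,
     main = "Fuel efficiency vs. weight",
     xlab = "Weight (1000 lbs)",
     ylab = "Fuel efficiency (miles per gallon)")
hist(mtcars$mpg,
     main = "Distribution of fuel efficiency",
     xlab = "Fuel efficiency (miles per gallon)")

# Restore the previous layout so later plots are not affected.
par(old_par)
```

Keeping the number of panels small, as here, helps each plot stay legible once the figure is scaled to fit the page.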

Checklist for submitting final project part 3:

1. Your final written report which follows the requirements above.

2. Your R code that shows your complete analysis (this will be used to verify the results

displayed in your written report and will not be assessed for content).

Things to keep in mind while writing your final report:

• You do not need to write out the results of every step you took in your analysis as this will make your report too long.
    o Instead, focus on summarizing the most important results, especially where a big decision was made. You need to justify any big decisions.
    o For the rest of your results, very short mentions of the process with a brief piece of evidence provided are enough to allow your reader to follow your analysis and understand how you arrived at the final model.

• Rather than presenting the results of each step separately (e.g. creating separate tables for each), consider putting together one larger table that you can refer to in your discussion of many steps in your analysis so that you don’t use too much space.
    o For example, if you are selecting between a few different models, you could consider presenting a table that includes many different summaries of the fit of each model and refer to each part as needed in the text, instead of making individual tables for each component.

• Avoid using R output taken directly from R/RStudio. Instead, create your own tables where you select only the relevant pieces of the output to display.

• Generally, the methods and results sections tend to be the longest sections, while the introduction and discussion tend to be shorter.
    o Keep this in mind when deciding how much background to provide in your introduction. Often just a paragraph or two is plenty, given the word limits in this project.
    o However, make sure you leave yourself enough space for a solid discussion where you can discuss the impact of the limitations that may exist in your model.
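The advice above about combining results into one larger table, and about selecting only the relevant pieces of R output, can be sketched as follows. This is an illustration under stated assumptions: the built-in mtcars data and the two candidate models are made up for the example, and the summaries shown (adjusted R-squared, AIC, BIC) are just one possible choice of columns.

```r
# Illustrative only: mtcars and these two models are stand-ins.
full    <- lm(mpg ~ wt + hp + disp + qsec, data = mtcars)
reduced <- lm(mpg ~ wt + hp, data = mtcars)

# One row per candidate model, one column per summary of fit.
models <- list(Full = full, Reduced = reduced)
comparison <- data.frame(
  Model         = names(models),
  Predictors    = sapply(models, function(m) length(coef(m)) - 1),
  Adj_R_squared = sapply(models, function(m) summary(m)$adj.r.squared),
  AIC           = sapply(models, AIC),
  BIC           = sapply(models, BIC)
)
rownames(comparison) <- NULL
comparison[, 3:5] <- round(comparison[, 3:5], 2)
print(comparison)

# Instead of pasting summary(reduced), keep only the columns
# of the coefficient table that the reader needs.
coef_table <- round(
  summary(reduced)$coefficients[, c("Estimate", "Std. Error", "Pr(>|t|)")],
  4
)
print(coef_table)
```

A single table like `comparison` can then be referred to at several points in the text, which uses less of the table/figure allowance than one table per step.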

Rubric:

Introduction Section: Introduction of the study
• Excellent (3 points): The goal of the study is clear AND an explicit explanation of how this study differs or agrees with existing literature is provided.
• Satisfactory (2 points): The goal of the study is not quite clear OR an explicit explanation of how this study differs or agrees with existing literature is not provided.
• Needs Improvement or Meets Completion Requirement (1 point): The goal of the study is not clear AND an explicit explanation of how this study differs or agrees with existing literature is not provided.
• Missing or Does not Meet Completion Requirement (0 points): The introduction section is not included.

Methods Section: Variable Selection
• Excellent (3 points): The statistical tools proposed to find a final model are described correctly AND when they will be used in the analysis is explained AND how conclusions are made from these tools is correctly mentioned.
• Satisfactory (2 points): The statistical tools proposed to find a final model are not described correctly AND/OR when they will be used in the analysis is not explained clearly AND/OR how conclusions are made from these tools is not correctly mentioned.
• Needs Improvement or Meets Completion Requirement (1 point): The statistical tools proposed to find a final model are not described correctly AND when they will be used in the analysis is not explained clearly AND how conclusions are made from these tools is not correctly mentioned.
• Missing or Does not Meet Completion Requirement (0 points): The variable selection section is not included.

Methods Section: Model Validation
• Excellent (3 points): How the model will be validated is clearly explained with sufficient details AND the method proposed is appropriate.
• Satisfactory (2 points): How the model will be validated is mentioned AND the method proposed is appropriate but needs more details.
• Needs Improvement or Meets Completion Requirement (1 point): How the model will be validated is very unclear OR has many details missing OR the method proposed is not appropriate.
• Missing or Does not Meet Completion Requirement (0 points): The model validation section is not included.

Methods Section: Model Violations and Diagnostics
• Excellent (3 points): How and when model violations and all diagnostics will be performed is clearly and correctly stated AND how each will be handled is explained clearly and correctly.
• Satisfactory (2 points): How and when model violations and all diagnostics will be performed is not either clearly or correctly stated OR how each will be handled is either not explained clearly or not correct.
• Needs Improvement or Meets Completion Requirement (1 point): How and when model violations and all diagnostics will be performed is either not clearly or correctly stated AND how each will be handled is either not clearly explained or not correct.
• Missing or Does not Meet Completion Requirement (0 points): The model violation and diagnostic section is not included.

Results Section: Description of Data
• Excellent (3 points): Numerical/Visual summaries of each variable are presented AND important features of the data are discussed correctly.
• Satisfactory (2 points): Numerical/Visual summaries of each variable are not presented OR important features of the data are not discussed or are incorrect.
• Needs Improvement or Meets Completion Requirement (1 point): Numerical/Visual summaries of each variable are not presented AND important features of the data are not discussed or are incorrect.
• Missing or Does not Meet Completion Requirement (0 points): The description of the data section is not included.

Results Section: Presenting the Analysis Process and the Results
• Excellent (3 points): Sufficient detail is provided to clearly understand the process taken to arrive at final model AND the process is correct AND the evidence presented supports decisions made.
• Satisfactory (2 points): Insufficient detail is provided to clearly understand the process taken to arrive at final model OR the process is not entirely correct OR the evidence presented does not always support decisions made or evidence is lacking.
• Needs Improvement or Meets Completion Requirement (1 point): Insufficient detail is provided to clearly understand the process taken to arrive at final model AND/OR the process is not entirely correct AND/OR the evidence presented often does not support decisions made or evidence is lacking.
• Missing or Does not Meet Completion Requirement (0 points): The presentation of the analysis process and results section is not included.

Results Section: Goodness of the Final Model
• Excellent (3 points): The final model has been validated correctly AND has had model assumptions verified (and appropriately corrected if applicable) AND all appropriate model diagnostics have been performed.
• Satisfactory (2 points): The final model has not been validated (or has been incorrectly validated) OR has not had model assumptions verified (or not appropriately corrected if applicable) OR all appropriate model diagnostics have not been performed.
• Needs Improvement or Meets Completion Requirement (1 point): The final model has not been validated (or has been incorrectly validated) AND/OR has not had model assumptions verified (OR not appropriately corrected if applicable) AND/OR all appropriate model diagnostics have not been performed.
• Missing or Does not Meet Completion Requirement (0 points): The goodness of the final model section is not included.

Discussion Section: Final Model Interpretation and Importance
• Excellent (3 points): An interpretation (in context and correct) is provided for at least one coefficient in the final model AND a general summary of what the model tells us about the relationship between predictors and response is provided AND it is emphasized how the final model answers the research question.
• Satisfactory (2 points): No coefficient in the final model has been correctly interpreted in context OR a general summary of what the model tells us about the relationship between predictors and response is not provided OR it is not emphasized how the final model answers the research question.
• Needs Improvement or Meets Completion Requirement (1 point): No coefficient in the final model has been correctly interpreted in context AND/OR a general summary of what the model tells us about the relationship between predictors and response is not provided AND/OR it is not emphasized how the final model answers the research question.
• Missing or Does not Meet Completion Requirement (0 points): The final model interpretation and importance section is not included.

Discussion Section: Limitations of the Analysis
• Excellent (3 points): All lingering problems with the final model are correctly mentioned AND their potential impact on usefulness of final model correctly discussed AND a correct justification is provided for why/how they could not be corrected.
• Satisfactory (2 points): Some lingering problems with final model are correctly mentioned OR their potential impact on usefulness of final model not correctly discussed OR a correct justification is not provided for why/how they could not be corrected.
• Needs Improvement or Meets Completion Requirement (1 point): Few of the lingering problems with final model are correctly mentioned AND/OR their potential impact on usefulness of final model not correctly discussed AND/OR a correct justification is not provided for why/how they could not be corrected.
• Missing or Does not Meet Completion Requirement (0 points): The limitations of the analysis section is not included.

General Report Quality: Clarity and Length
• Excellent (3 points): The report meets word count AND is written with very few grammatical or spelling mistakes AND the report is well structured with appropriate sections AND meets all technical requirements for the report.
• Satisfactory (2 points): The report does not satisfy at most 1 of the following: meets the word count AND/OR is written with few grammatical or spelling mistakes AND/OR the report is well-structured AND/OR the report meets all technical requirements.
• Needs Improvement or Meets Completion Requirement (1 point): The report does not satisfy at most 2 of the following: meets the word count AND/OR is written with few grammatical or spelling mistakes AND/OR the report is well-structured AND/OR the report meets all technical requirements.
• Missing or Does not Meet Completion Requirement (0 points): The report does not satisfy 3 or more of the following: meets the word count AND/OR is written with few grammatical or spelling mistakes AND/OR the report is well-structured AND/OR the report meets all technical requirements.

General Report Quality: Use of Plots and Tables
• Excellent (3 points): Plots/tables in the main text are clear and relevant for the analysis AND plots/tables in the appendix are referred to in the main text and are useful to the report AND all plots/tables are correctly labelled and captioned and have meaningful titles and axis labels.
• Satisfactory (2 points): Plots/tables in the main text are a bit unclear or are not very relevant for the analysis OR some plots/tables in the appendix are not referred to in the main text or are not useful to the report OR not all plots/tables are labelled and captioned correctly or lack meaningful titles and axis labels.
• Needs Improvement or Meets Completion Requirement (1 point): Plots/tables in the main text are not clear and/or are not very relevant for the analysis AND/OR some plots/tables in the appendix are not referred to in the main text and/or are not useful to the report AND/OR all plots/tables are not labelled and captioned correctly or lack meaningful titles and axis labels.
• Missing or Does not Meet Completion Requirement (0 points): There are no plots and tables used.

Meets Submission Requirements
• Excellent (3 points): –
• Satisfactory (2 points): –
• Needs Improvement or Meets Completion Requirement (1 point): R code is provided AND final report is submitted in the correct format.
• Missing or Does not Meet Completion Requirement (0 points): R code is not provided OR final report is not submitted in the correct format.

IMRD Cheat Sheet

Abstract

Abstracts can vary in length from one paragraph to several pages, but they follow the IMRaD format and

typically spend:

• 25% of their space on importance of research (Introduction)

• 25% of their space on what you did (Methods)

• 35% of their space on what you found: this is the most important part of the abstract (Results)

• 15% of their space on the implications of the research (Discussion)

Introduction & Importance (Make a case for your new research)

Begin by explaining to your readers what problem you researched and why the research is necessary.

Convince readers that it is important that they continue to read.

Discuss the current state of research in your field, expose a “gap” or problem in the field, and then explain why your present research is a timely and necessary solution to that gap. See Novelty Handout.

Methods (What did you do?)

Methods are usually written in past tense and passive voice with lots of headings and subheadings.

This is the least-read section of an IMRaD report.

Results (What did you find?)

Results are where the findings and outcomes of the research go. When talking about this data, we

can think of the results as having two parts: report and comment. The reporting function always appears in the results section while the comment function can go in the discussion section. Make sure all

tables and figures are labeled and numbered separately. Captions go above tables and beneath figures.

(See the Results examples below.)

Report

1. Refer to your table or figure and state the main trend
   Table 3 shows that Spam Filter A correctly filtered more junk emails than Filter B
2. Support this trend with data
   Filter A correctly filtered…
   The average difference is…
3. (If needed) Note any additional, secondary trends and support them with data
   In addition… Figure 1 also shows…
4. (If needed) Note any exceptions to your main trends or unexpected outcomes
   However…

Comment

5. (If needed) Provide an explanation
   A feasible explanation is….
   This trend can be explained by…
6. (If needed) Compare to other research
   X is consistent with X’s finding…
   In contrast, Y found…
7. (If needed) Evaluate whether the findings support or contradict a hypothesis
8. State the bottom line: what does the data mean?
   These findings overall suggest…
   These data indicate…

Discussion (What does it mean?)

Discussion sections contain the following moves:

1. They summarize the main findings of the study. This allows readers to skip to the beginning of the

discussion section and understand the main “news” in the report.

2. They connect these findings to other research

3. They discuss flaws in the current study.

4. They use these flaws as reasons to suggest additional, future research.

5. (If needed) They state the implications of their findings for future policy or practice.

Examples

Abstract

• 25% (Introduction)
• 25% (Methods)
• 35% (Results)
• 15% (Discussion)

This experiment tests the effect of choke type and gun selection on target accuracy in order to

determine the best gun specifications. Three competent shooters of approximately equivalent

marksmanship abilities tested three different choke types (full, modified, and improved) and two

different guns (a Remington 11-87 semi-automatic and a Beretta 682 Gold E). With a confidence

level of 95%, the gun selection ended up to be the only significant factor. The Beretta was

found more accurate than the Remington possibly because the Beretta’s weight is centered in

the middle of the gun while the Remington is a little barrel-heavy. However, if the confidence

level is lowered to 90%, choke type is also significant, with the improved choke more accurate

than the modified or full. Thus, for target shooting, the most accurate combination would be the

Beretta with an improved choke.


Introduction

Bioplastics are manufactured from renewable biomass sources rather than petroleum and other fossil fuels. [1] Bioplastics may be a sustainable alternative to petroleum plastics because they use fewer fossil fuels in production and reduce greenhouse gas emissions as they biodegrade. [1a] Most bioplastics are currently made from starch-based plastics or starch-polyester blends. [1b] However, polylactic acid (PLA), a thermoplastic aliphatic polyester typically derived from corn starch, tapioca or sugarcane, may become a more commercially viable option. [3] PLA resembles traditional plastic, making it acceptable to consumers, and is able to be processed on equipment already used for petroleum plastics. PLA has been used for biodegradable medical implants, packing materials, diapers and 3D printers. However, although PLA biodegrades under carefully controlled conditions, it is not yet compostable except in industrial composting facilities and cannot be mixed with other recyclable materials. This limits the commercial viability of PLA because the infrastructure to transport bioplastic waste to appropriate composting facilities has not yet been developed. [2] A device that composts PLA and other bioplastics within a home composting environment would make PLA a more viable commercial option. [3]

Methods1

Sb-Doped SnS Thin Film.

Pure, stoichiometric, single-phase SnS thin films can be obtained by atomic layer deposition (ALD) from the reaction of bis(N,N’-diisopropylacetamidinato)tin(II) [Sn(MeC(NiPr)2)2, referred here as Sn(amd)2] and hydrogen sulfide

(H2S).3 Rather than using ALD as previously reported,3 SnS thin films were deposited using a modified chemical

vapor deposition (CVD) process, referred here as a pulsed-CVD, to speed up the deposit rate to ~15 times higher

than that of ALD…

Material Characterization.

Film morphology was characterized using field-emission scanning electron microscopy (FESEM, Zeiss, Ultra-55).

The film thickness was determined from cross-sectional SEM. The elemental composition of the films was determined by Rutherford backscattering spectroscopy (RBS, Ionex 1.7 MV Tandetron) and time-of-flight secondary ion

mass spectroscopy (ToF-SIMS)…

1 Sinsermsuksakul, Prasert, Rupak Chakraborty, Sang Bok Kim, Steven M. Heald, Tonio Buonassisi, and Roy G. Gordon. “Antimony-Doped Tin(II) Sulfide Thin Films.” Chemistry of Materials. 2012 (24). 4556-4562. Web. ACS Publications. 21 Oct., 2013.

Results

A. Table 3 shows that Spam Filter A correctly filtered more junk emails than Filter B. [1] Filter A correctly filtered 88% of junk emails whereas Filter B only filtered 63% correctly. [2] However, Filter A takes longer to run than Filter B. [4] This increased run time is due to the type of programming language used in Filter A. [5] These findings overall suggest that Spam Filter A is a better filter than Filter B even though it takes longer to run. [8]

B. Fig. 3 shows that the electrical conductivity of the Cu-doped ZnO is much lower than that of the undoped ZnO. [1] The electrical conductivity of even the 100 ppm Cu-doped ZnO specimen was about 3 orders of magnitude lower than that of the undoped ZnO. [2] As the doped Cu content increased, the electrical conductivity gradually decreased. [3] As a result, the 1000 ppm Cu-doped ZnO had the electrical conductivity 5 orders of magnitude lower than that of the undoped ZnO. [8]

Discussion

[Summarize results; explain results] The data collected from this small study suggests that verbal instructions are not needed to complete a simple assembly task and may even interfere with the task. The participants who received words plus pictures made more errors, took longer to complete the task, and were less confident that they had completed the task correctly than participants who received pictures alone. One reason for this finding may be the simplicity of the task since none of the guidelines we examined suggest that textual information would interfere with visual instructions.

[Flaws; future research] Our study is hampered by the small number and homogeneity of our participants. All of our participants were college students and this may have affected our results. Additional research might examine whether older participants would benefit from verbal instructions accompanying pictures. More research is also needed examining different tasks. Our study involved a highly physical task (constructing a Lego vehicle). Future research should examine how pictures and verbal instructions might interact on a more conceptual task, such as installing and using a software program.

[Implications] Based on this limited analysis, we recommend that instruction writers consider excluding verbal instructions on a simple assembly task. Our results indicate that verbal instructions may in some cases interfere with users’ abilities to follow pictorial directions.

Lab Reports – IMRAD

The purpose of a lab report is to describe the results of an experiment or research study.

University lab reports follow the style and format of professional journal articles, which

research scientists use to share and evaluate each other’s work.

Lab report formats vary slightly among scientific disciplines, but all are based on the

IMRAD outline: introduction, materials and methods, results, and discussion. The purpose

of each section dictates what information to include, regardless of the specialty being

written for.

Helpful Tip: It is usually easiest to write the methods and results sections first, followed by

the discussion and introduction. Title and abstract (if required) should be written last.

IMRAD format:

Section

Purpose

Content and Characteristics

Title

• Describes the content of

• Clear, specific, and accurate

the report

• Loaded with keywords drawn from

• Allows scientists to

the body of the report

locate research of

interest when searching

databases

Abstract

• Summarizes the report

• One paragraph (200-‐250 words)

• Helps researchers decide • 2-‐3 sentences for each section,

whether to read the

summarizing key data and ideas

entire paper

• A complete synopsis, not a teaser

(results and discussion must be

included)

Introduction

• Gives background

• Reviews relevant literature,

information needed to

including properly formatted

understand the current

citations

research, tracing the

• Explains why the study was

development of existing

conducted, and what question it was

knowledge

designed to answer

• Places the new

• Briefly describes approach to the

experiments within the

problem

context of the field

• Outlines hypothesis(es) to be tested,

• Identifies gaps in

and predicted results

existing knowledge and

• Written in a mixture of present tense

shows how the present

(for generally accepted truths) and

research will fill them

past tense (when referencing specific

• States the specific

research

objectives of the work

© The Writing Centre, Saint Mary’s University, 2014

This handout is for personal use only. Reproduction prohibited without permission.

Lab Reports – IMRAD

2

IMRAD format continued:

Section: Materials and Methods
Purpose:
• Explains how the experiments were conducted
• Provides enough detail that another scientist could repeat the experiment
• Gives readers the information they need to evaluate the validity of results and conclusions
Content and Characteristics:
• written in paragraph format
• materials are mentioned while describing methods, never listed separately
• describes the purpose of each procedure, as well as necessary steps
• omits details that are common knowledge or would not impact the results
• written in past tense (recounts what was done, rather than giving instructions)

Section: Results
Purpose:
• Describes the outcomes of the experiments
• Draws attention to key findings and relationships
• Allows readers to form their own conclusions based on the data
Content and Characteristics:
• straightforward reporting of observations and calculations
• does not include commentary or interpretation
• detailed data is presented in tables and figures, which are referenced in the text
• written portion should summarize and emphasize, not repeat details shown in the visuals
• written in past tense

Section: Discussion
Purpose:
• Interprets the results and explains their significance
• Places the new data in the context of the field
• Identifies limitations of the study and suggests next steps
Content and Characteristics:
• references key data, describing its implications
• identifies any errors made during the experiment and their impacts
• discusses any shortcomings of the protocols or experimental designs
• draws conclusions
• identifies questions that could not be answered
• cites relevant literature
• written in past, present, and future tense, as appropriate

Section: References
Purpose:
• Provides full bibliographic information, directing the reader to relevant literature
Content and Characteristics:
• includes only literature that’s cited in the text
• follows a consistent scientific citation style, such as APA


Legend:
• Starting or end points
• Action / Process to apply
• Decision to make
• Links to two halves of the chart
• The arrows connect steps
• Week labels: Week 2, Week 3, Week 4, Week 5, Week 6, Week 7, Week 8

Flowchart steps:
1. Import data that contains all possible predictors into R.
2. Decision: Check if the variables are incorrect type or there is any missing data.
   Yes: Recode and fix the variables or remove any missing data.
3. Randomly split data into two sets: training set (70%) and test set (30%).
4. Use the training set to draw scatter plots of the data.
5. Fit linear model based on observations of scatter plots. Start with the full linear model that consists of all possible predictors.
6. Check additional 2 conditions and build residual plots.
7. Decision: Is there any violation of condition 1, 2 or linear model assumption?
   Yes: Interpret violation (Check Linearity, Check constant variance, Check Normality), then apply (Box-Cox) transformation to both response and predictors, and build plots for additional conditions.
   Decision: If all violations fixed after transformations.
   No: Identify the limitations caused by violated assumptions.
8. Perform hypothesis test for coefficients of predictors in the model, and build a new model with all predictors that have significant F-values in tests.
9. Fit reduced model and perform partial F test between reduced model and full model.
10. Decision: Does the testing result prefer the reduced model?
    No: Add back a removed predictor that would increase the R-squared most / decrease BIC/AIC most.
11. Identify all problematic observations, including leverage points, outliers, and influential points.
12. Decision: Is there any valid reason to remove some problematic observations?
    Yes: Remove the problematic observations and refit the full model.
13. Again, check additional 2 conditions and build residual plot for final model.
14. Decision: Are there any violations in condition 1, 2 or linear model assumptions?
    Yes: Decision: Do the violations also appear in full model as we already identified?
    No: Identify the limitations caused by the newly appeared violations.
15. Interpret the parameters in the final linear regression model.
16. Build up confidence interval for average response and prediction interval for actual response, then use the data in test set to fit the model.
17. Decision: Compare the result of training model and test model, and see if they are similar.
    Yes: Make conclusion of the research question based on previous findings, and state the limitations.
    No: The model is overfitting, discuss limitations.

STA302/1001: Methods of Data Analysis 1

Instructor: Katherine Daignault

Department of Statistical Sciences

University of Toronto

Week 3 (Sept. 26-30)


Outline

The Linear Regression Model

Modelling Conditional Means

Least Squares Estimation

Interpreting the Parameters

Introducing the Assumptions


Week 3 Learning Goals

In this module, we will be introduced to the linear regression model. We will learn how we use data to estimate our linear relationship, how to interpret the values we get, and how different predictors yield different interpretations. We will also introduce the assumptions needed and see how to create linear models in R. To that end, the learning goals are:
• to explain why regression models conditional relationships
• to apply the least squares procedure to different settings
• to estimate the parameters of a regression relationship
• to interpret the components of a regression model in the context of a dataset
• to recognize that regression has assumptions and to preliminarily inspect them through EDA


The Functional Component of the Relationship

• A linear regression model, as we saw, is a statistical relationship that defines a functional relationship between the predictor(s) and the response, along with some random deviations.
• But in our Code-Along demo, we also discovered that while it may not be possible to define a functional relationship for all data points, it may be possible to do so for E(Y | X = x).
• The functional part of our statistical relationship does exactly this!
  - We actually have that E(Y | X = xi) = β0 + β1 xi
  - This says that as the value of xi increases by one unit, the average response will change by β1.
• So why is this the case? Let’s think about this in terms of distributions and random variables.

Conditional Distributions of Responses

• In regression, we consider our predictor(s) to be fixed values, i.e. not random variables.
• But the response value we might observe for a certain value of the predictor is random, and thus $Y_i \mid X = x_i \sim f(y \mid x_i)$ with a mean $E(Y \mid X = x_i)$ and some variance $Var(Y \mid X = x_i)$.
• The distribution tells us that the possible y values lie some distance from these means.
• So all responses that correspond to the predictor value $x_i$ will sit a random distance from the mean of the distribution, which we can label $\epsilon_i$.
• Therefore, we can write $Y_i = E(Y \mid X = x_i) + \epsilon_i$, and if the means change systematically as X changes, then
  $$Y_i = \beta_0 + \beta_1 x_i + \epsilon_i = E(Y \mid X = x_i) + \epsilon_i$$

The Population Relationship

• When dealing with one predictor, the relationship can be viewed nicely.
• Even with many predictors, the result holds:
  $$Y = E(Y \mid X) + \epsilon = X\beta + \epsilon$$
• This is the relationship that occurs in the population, and we cannot know what E(Y | X) or β actually are.
• So we will be required to use a sample from this population to estimate these quantities.

The Sample Relationship

• In our sample, we’re going to want to impose the same statistical relationship that we think is present in the population.
• Using our sample data, we can write out a similar relationship between the response and predictors, Y = Xb + ê, where
  - Y and X are our observed response and predictor data
  - b is some vector of coefficients representing possible slopes and intercept to be estimated with the data
  - ê is the observed error in the data, called residuals.
• Note that b is just an arbitrary set of coefficients and does not yet correspond to estimates of β.
  - The reason is that we don’t yet know how to estimate β.
• But once we get our $\hat{\beta}$, then the linear relationship estimated from the data will be $\hat{Y} = \widehat{E(Y \mid X)} = X\hat{\beta}$
  - i.e. we can estimate the conditional means in our population.


Poll Question 1

Go to PollEv.com/katherinedai702 or open your app (if using) and

sign in.

How much familiarity do you have with estimation

procedures?

• I know what estimation is
• I know how maximum likelihood estimation works
• I know how maximum likelihood and least squares estimation work

Residuals: a measure of distance

• Residuals, the observed errors ê, will play an important role in finding estimates for β.
• In the population, the errors represent the distance
  $$\epsilon = Y - E(Y \mid X) = Y - X\beta.$$
• So the residuals would be estimates of that same distance based on the data.
• The issue is we don’t know where the regression line of best fit should be, so how do we use the residuals to estimate this exact relationship?

Line of Best Fit and Residuals

• To estimate these unknown β parameters that define the population-level relationship between X and E(Y | X), we will need to find a line of best fit in our data.
• ‘Best fit’ in this case will mean a line that sits as close as possible to all observed responses.
• So that means we will need to find values for the elements of b that minimize the distance of all observations to this line.
  - In simple regression, we want to find values b0 and b1 that will ensure the estimated line ŷi = b0 + b1 xi lies as close as possible to all observed yi
  - We call ŷi the predicted/fitted value of yi, i.e. Ŷ is an estimate of E(Y | X)
• The residuals naturally give us a measure of closeness to the line, since êi = yi − ŷi (equivalently in vector form: ê = Y − Ŷ)

Minimizing the Residual Sum of Squares

• So we want to find the values b0 and b1 that fit the line as close as possible to all points.
• This can be seen as ultimately wanting to make all residuals as small as possible.
• But it’s not practical to minimize each individual êi; rather, it makes more sense to find a single equation to minimize that incorporates the idea of the total distance of all points from the line.
• To do this, we define the residual sum of squares (RSS) to be this function we will minimize:
  - Residuals can be either positive or negative, so we square them so they don’t cancel each other out.
  - Then we can sum them all up to give us the total squared amount of variation between the points and the line:
  $$RSS = \sum_{i=1}^{n} \hat{e}_i^2 = \hat{e}'\hat{e}$$
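As a minimal sketch of the RSS idea (written in Python rather than the R used in the Code-Alongs, with made-up toy data and a hypothetical helper named `rss`), the following compares two candidate lines by their residual sum of squares. The candidate with the smaller RSS sits closer to the points overall.

```python
def rss(b0, b1, xs, ys):
    """Residual sum of squares for the candidate line y = b0 + b1 * x."""
    return sum((y - (b0 + b1 * x)) ** 2 for x, y in zip(xs, ys))


# Hypothetical toy data, invented purely for illustration.
xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.1, 3.9, 6.2, 7.8]

rss_a = rss(0.0, 2.0, xs, ys)  # candidate line y = 2x
rss_b = rss(1.0, 1.0, xs, ys)  # candidate line y = 1 + x

# The better-fitting candidate is the one with the smaller RSS.
better = "a" if rss_a < rss_b else "b"
print(rss_a, rss_b, better)
```

Least squares goes one step further than comparing two candidates: it finds the pair (b0, b1) with the smallest RSS among all possible lines.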

Poll Question 2

Go to PollEv.com/katherinedai702 or open your app (if using) and

sign in.

Suppose we have a point in a three-dimensional space and

we want to project this point to a two-dimensional plane.

What will be the angle of the vector connecting the point to

its projection on the plane?

• 90 degrees
• less than 90 degrees
• more than 90 degrees

Geometry of Least Squares

• But why do we square the residuals instead of e.g. taking the absolute value?
• It has to do with the geometry of the vectors and spaces we are working with.
[Figure: the response vector, the error vector, and a vector representing the regression line, which lies in the model space M. M is called the model space and has dimension equal to the number of linearly independent columns in X.]
• The way to minimize the error vector (i.e. make it as small as possible) is to make it perpendicular to the model space.
• Once perpendicular, we have a right-angle triangle and we can find the lengths of the vectors using Pythagoras (or Euclidean distances)
• This requires working with squares of the vectors.

How does the Least Squares Process Work?

• When dealing with one predictor or multiple predictors, the process behind finding the values of b that minimize the residual sum of squares is the same.
  1. Take partial derivatives of the RSS (your estimating equation) with respect to each element of b.
  2. Set your result (the score equation) to 0.
  3. Solve for the unknown parameters by re-arranging your expressions.
• Once we have the actual estimates, we use the more familiar way to denote an estimate of β, which is $\hat{\beta}$, instead of our placeholder values b.
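For simple regression, the three steps can be worked through as follows. This is a sketch using the placeholder coefficients b0 and b1 from the earlier slides; the sample means $\bar{x}$ and $\bar{y}$ are the only additional notation.

```latex
\begin{align*}
RSS(b_0, b_1) &= \sum_{i=1}^{n} (y_i - b_0 - b_1 x_i)^2 \\
\text{Step 1:}\quad
\frac{\partial RSS}{\partial b_0} &= -2\sum_{i=1}^{n} (y_i - b_0 - b_1 x_i), \qquad
\frac{\partial RSS}{\partial b_1} = -2\sum_{i=1}^{n} x_i (y_i - b_0 - b_1 x_i) \\
\text{Step 2:}\quad
\sum_{i=1}^{n} (y_i - b_0 - b_1 x_i) &= 0, \qquad
\sum_{i=1}^{n} x_i (y_i - b_0 - b_1 x_i) = 0 \\
\text{Step 3:}\quad
b_0 &= \bar{y} - b_1 \bar{x}, \qquad
b_1 \Big( \sum_{i=1}^{n} x_i^2 - n\bar{x}^2 \Big) = \sum_{i=1}^{n} x_i y_i - n\bar{x}\bar{y}
\end{align*}
```

The first score equation gives the intercept in terms of the slope; substituting it into the second leaves a single equation in b1.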

Least Squares Estimators (Simple and Matrix-based)

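The formulas displayed on this slide are not reproduced in the text. As a hedged reconstruction of the matrix-based case only, applying the same three least squares steps to the vector b, and assuming X'X is invertible, gives an estimator that agrees with the expression Ŷ = X(X'X)⁻¹X'Y used in the notes on the LS estimators:

```latex
\begin{align*}
RSS(b) &= (Y - Xb)'(Y - Xb) \\
\frac{\partial RSS}{\partial b} &= -2X'(Y - Xb) = 0 \\
X'X\,b &= X'Y \\
\hat{\beta} &= (X'X)^{-1}X'Y
\end{align*}
```

With a single predictor, this expression reduces to the algebraic (simple) estimators of the intercept and slope.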

Notes on the LS Estimators

• The algebraic (simple) version of the estimators can only be used when estimating the relationship between the response and one predictor.
• Since the estimator for the intercept β̂0 contains the estimate of the slope β̂1, you’ll need to compute the slope first.
• The LS estimator for the slope also has an alternative form:
  $$\hat{\beta}_1 = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{\sum_{i=1}^{n} (x_i - \bar{x})^2}$$
  - the denominator of this form is the sum of squared deviations between xi and its sample mean x̄, or SXX
    - it’s related to the sample variance of the predictor.
  - The numerator is similar to the idea of covariance: it looks at the product of deviations of each variable from its mean (sometimes labelled SXY).
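The alternative form of the slope can be sketched in a few lines. This is an illustrative Python version (the course itself uses R's lm()); the function name `simple_ls` and the toy data are hypothetical. The intercept is computed as ȳ − β̂1 x̄, which is the standard form in which the intercept depends on the slope.

```python
def simple_ls(xs, ys):
    """Least squares intercept and slope for one predictor, via SXY / SXX."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxx = sum((x - x_bar) ** 2 for x in xs)                       # SXX
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))  # SXY
    beta1_hat = sxy / sxx                   # slope is computed first
    beta0_hat = y_bar - beta1_hat * x_bar   # intercept uses the slope
    return beta0_hat, beta1_hat


# Hypothetical toy data lying exactly on the line y = 1 + 2x.
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [3.0, 5.0, 7.0, 9.0, 11.0]
beta0_hat, beta1_hat = simple_ls(xs, ys)
print(beta0_hat, beta1_hat)
```

Because the toy points fall exactly on a line, the estimates recover that line's intercept and slope.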

Notes on the LS Estimators

• For multiple predictors, you’ll likely not have to work with the individual data matrices as they are large and cumbersome.
• However, some of the component matrices in the LS estimator can be calculated easily:
  $$X'X = \begin{pmatrix}
  n & \sum x_{i1} & \sum x_{i2} & \cdots & \sum x_{ip} \\
  \sum x_{i1} & \sum x_{i1}^2 & \sum x_{i1}x_{i2} & \cdots & \sum x_{i1}x_{ip} \\
  \sum x_{i2} & \sum x_{i1}x_{i2} & \sum x_{i2}^2 & \cdots & \sum x_{i2}x_{ip} \\
  \vdots & \vdots & \vdots & \ddots & \vdots \\
  \sum x_{ip} & \sum x_{i1}x_{ip} & \sum x_{i2}x_{ip} & \cdots & \sum x_{ip}^2
  \end{pmatrix}, \qquad
  X'Y = \begin{pmatrix}
  \sum y_i \\ \sum x_{i1}y_i \\ \sum x_{i2}y_i \\ \vdots \\ \sum x_{ip}y_i
  \end{pmatrix}$$
• To invert the X'X matrix, you’ll likely need the aid of software or would be given the inverse directly.
• Expanding the regression relationship $\hat{Y} = X\hat{\beta} = X(X'X)^{-1}X'Y$, we can see that $H = X(X'X)^{-1}X'$ (the hat matrix) projects Y onto Ŷ and thus has all the properties of a projection matrix (exercise: check this)
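One way to approach the exercise is to check the two defining properties of an orthogonal projection matrix, symmetry and idempotence. The sketch below assumes X'X is invertible and uses the fact that X'X (and hence its inverse) is symmetric.

```latex
\begin{align*}
H' &= \big(X(X'X)^{-1}X'\big)' = X\big((X'X)^{-1}\big)'X' = X(X'X)^{-1}X' = H \\
HH &= X(X'X)^{-1}X'X(X'X)^{-1}X' = X(X'X)^{-1}X' = H \\
X'\hat{e} &= X'(I - H)Y = (X' - X'X(X'X)^{-1}X')Y = 0
\end{align*}
```

The last line says the residual vector is perpendicular to every column of X, which matches the geometric picture of least squares.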

Exercise – Give it a try!

Suppose you have the following numerical summaries for 2 predictors and a response variable on 21 individuals. Estimate the coefficients for this regression surface.

$$\sum_{i=1}^{21} x_{i1} = 1302.4, \qquad \sum_{i=1}^{21} x_{i2} = 360, \qquad \sum_{i=1}^{21} y_i = 3820$$
$$\sum_{i=1}^{21} x_{i1}^2 = 87707.94, \qquad \sum_{i=1}^{21} x_{i2}^2 = 6190.26, \qquad \sum_{i=1}^{21} x_{i1}x_{i2} = 22609.19$$
$$\sum_{i=1}^{21} x_{i1}y_i = 249643.35, \qquad \sum_{i=1}^{21} x_{i2}y_i = 66072.75$$

Use
$$(X'X)^{-1} = \begin{pmatrix} 29.729 & 0.072 & -1.993 \\ 0.072 & 0.0004 & -0.0056 \\ -1.993 & -0.0056 & 0.136 \end{pmatrix}$$
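After trying the exercise by hand, the arithmetic can be checked with a short script. This is a sketch in plain Python (not the course's R) that multiplies the given inverse by X'Y, where X'Y is assembled from the summaries Σyi, Σxi1yi and Σxi2yi. Because the inverse is given in rounded form, the resulting coefficients should be treated as approximate.

```python
# (X'X)^{-1} as given in the exercise (rounded values).
xtx_inv = [
    [29.729, 0.072, -1.993],
    [0.072, 0.0004, -0.0056],
    [-1.993, -0.0056, 0.136],
]

# X'Y = (sum of y_i, sum of x_i1 * y_i, sum of x_i2 * y_i).
xty = [3820.0, 249643.35, 66072.75]

# beta_hat = (X'X)^{-1} X'Y: one dot product per row of the inverse.
beta_hat = [sum(a * b for a, b in zip(row, xty)) for row in xtx_inv]

for name, value in zip(["intercept", "x1", "x2"], beta_hat):
    print(name, round(value, 3))
```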

Code-Along Session

• We will now jump into JupyterHub (jupyter.utoronto.ca) and look into how we can estimate a linear regression relationship on our NYC dataset. We will be doing the following:
  - creating bivariate plots to visualize pairwise relationships
  - using the lm() function to estimate a simple and multiple linear regression relationship
  - viewing and extracting model estimates.
• Add the materials to JupyterHub either by downloading them from Quercus and then uploading them to Jupyter, or by clicking the GitHub link provided on Quercus.


What do the parameters mean?

• Now that we can estimate the statistical relationship in the population by using our sample, what does it mean?
• Let’s consider the estimated line $\hat{Y} = X\hat{\beta}$ and the corresponding population mean relationship $E(Y \mid X) = X\beta$.
• When we estimate a linear relationship using our sample, we are getting an estimate for the corresponding relationship in the population.
  - therefore $\hat{\beta}$ is an estimate for the vector of parameters β
  - Ŷ is therefore an estimate for the vector of conditional means E(Y | X)
• if we then took an individual with predictor values $x_i' = (1, x_{i1}, x_{i2}, \dots, x_{ip})$, then $\hat{y}_i = x_i'\hat{\beta}$ gives us the predicted value of the response for that individual
  - which is the same as an estimate for the mean response conditional on those predictor values.

Simple Linear Regression Parameter Interpretation

• The estimated simple linear regression model is ŷi = β̂0 + β̂1 xi
• We just discussed that ŷi is the estimated mean given a value of x.
• Keeping this in mind, we can also interpret the slope and intercept as:
  - β̂0 is the mean/average response given the predictor is 0.
    - it’s important to also consider whether the intercept has a meaningful interpretation at all.
  - β̂1 is the change in the mean/average response for a one unit change in the value of the predictor.
    - it is NOT how much each response will change for a unit increase in X, because it is not true that all responses will change by an equal amount
    - instead it is the expected change for a unit increase in X.

Parameter Interpretation for a Multiple Linear Regression

• To interpret the parameters when working with many predictors, things get a little trickier.
• Even though we worked with a vector of parameters, we still interpret each element of that vector individually.
• The intercept is similar to before
  - β̂0 is the average/mean response when ALL predictors have a value of 0 (assuming it’s meaningful to have a 0 value).
• However, interpreting each slope β̂j, j = 1, . . . , p individually means we have to ensure that the only change occurring in the predictor values is the one-unit increase in the predictor whose parameter we are interpreting.
  - i.e. in order for us to interpret one β̂j correctly, all other predictor values must be fixed.
  - Then, β̂j is the average/mean change in the response when Xj increases by one unit, when all other predictors are held fixed.

Conditional Nature of Multiple Regression

• Another feature of working with multiple predictors in a regression model is that we need to carefully understand the conditional nature of regression.
• As an example, suppose we collect data on a response and two predictors.
• We can fit three different models with these variables:
  - A simple model with only X1, estimated to be ŷi = 1.86 + 1.30xi1
  - A simple model with only X2, estimated to be ŷi = 0.86 + 0.78xi2
  - A two-predictor model, estimated to be ŷi = 5.37 + 3.01xi1 − 1.29xi2
• If we compare the coefficient for X2 in the simple model with its coefficient in the two-predictor model, why did it suddenly change direction/sign?

Poll Question

Go to PollEv.com/katherinedai702 or open your app (if using) and

sign in.

Why did the sign change?

• The two-predictor model was estimated incorrectly.
• The one-predictor models were estimated incorrectly.
• The two-predictor model conditions on the values of X1

Conditional Nature of Multiple Regression

• If we ignore X1, we see the increasing trend from the simple model.
• But what if we highlight all points with the same value of X1?
• By conditioning on the value of X1, we can now see the decreasing trend appear.
• When a model contains more than one predictor, we must always remember that it conditions on values of the other predictors to estimate each βj
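The sign change can be reproduced on a tiny constructed example. The sketch below is in plain Python rather than R; the data are invented so that y = 5 + 3·x1 − 1·x2 holds exactly while x2 rises with x1, and the helpers `solve` and `least_squares` are hypothetical names for a small normal-equations solver, not course functions.

```python
def solve(a, b):
    """Solve the small linear system a x = b by Gauss-Jordan elimination."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]  # augmented matrix
    for col in range(n):
        # Partial pivoting: bring the largest entry in this column to the diagonal.
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(n):
            if r != col:
                factor = m[r][col] / m[col][col]
                m[r] = [v - factor * p for v, p in zip(m[r], m[col])]
    return [m[i][n] / m[i][i] for i in range(n)]


def least_squares(columns, ys):
    """LS coefficients (intercept first) from the normal equations X'X b = X'Y."""
    design = [[1.0] * len(ys)] + columns  # columns of X, intercept column first
    xtx = [[sum(u * v for u, v in zip(ci, cj)) for cj in design] for ci in design]
    xty = [sum(u * y for u, y in zip(ci, ys)) for ci in design]
    return solve(xtx, xty)


# Invented data: x2 increases with x1, and y = 5 + 3*x1 - 1*x2 exactly.
x1 = [1.0, 1.0, 2.0, 2.0, 3.0, 3.0]
x2 = [2.0, 3.0, 4.0, 5.0, 6.0, 7.0]
y = [5.0 + 3.0 * a - 1.0 * b for a, b in zip(x1, x2)]

simple_fit = least_squares([x2], y)    # ignores x1
full_fit = least_squares([x1, x2], y)  # conditions on x1

print("slope on x2, simple model:", simple_fit[1])
print("slope on x2, two-predictor model:", full_fit[2])
```

Ignoring x1, larger x2 goes with larger y, so the simple slope is positive. Within each fixed value of x1, y falls as x2 rises, which is what the two-predictor coefficient reports.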

Code-Along Session

In our second Code-Along, we will look at how to use different types of predictors and how that changes interpretation of coefficients. We will look at:
• creating informative bivariate plots
• fitting models to subsets of a dataset
• incorporating indicator variables and the change in interpretation
• incorporating interaction terms and the change in interpretation

Summary of interpretations

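• We saw that indicator variables, depending on how they are included in a model, change the interpretation of the coefficients.
• Suppose X1 = height and X2 = 1{Male}.
• Indicator included on its own (the intercept shifts):
  $$\hat{y}_i = \hat{\beta}_0 + \hat{\beta}_1 x_{i1} + \hat{\beta}_2 x_{i2} =
  \begin{cases}
  \hat{\beta}_0 + \hat{\beta}_1 x_{i1}, & x_{i2} = 0 \\
  (\hat{\beta}_0 + \hat{\beta}_2) + \hat{\beta}_1 x_{i1}, & x_{i2} = 1
  \end{cases}$$
• Indicator included through an interaction (the slope shifts):
  $$\hat{y}_i = \hat{\beta}_0 + \hat{\beta}_1 x_{i1} + \hat{\beta}_2 x_{i2} x_{i1} =
  \begin{cases}
  \hat{\beta}_0 + \hat{\beta}_1 x_{i1}, & x_{i2} = 0 \\
  \hat{\beta}_0 + (\hat{\beta}_1 + \hat{\beta}_2) x_{i1}, & x_{i2} = 1
  \end{cases}$$
• When a categorical predictor takes more than 2 levels, we create dummy variables for all but 1 of the levels and interpret similarly
  - If X2 takes values A, B, and C, then including X2 in our model would effectively yield
  $$\hat{y}_i = \hat{\beta}_0 + \hat{\beta}_1 x_{i1} + \hat{\beta}_2 1\{X_{i2} = A\} + \hat{\beta}_3 1\{X_{i2} = B\}$$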
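The two piecewise forms can be checked numerically. The following Python sketch uses hypothetical coefficient values (b0 = 10, b1 = 0.5, b2 = 4) chosen only for illustration, not estimated from any data, and two hypothetical helper functions.

```python
def predict_additive(b0, b1, b2, height, male):
    """Fitted value when the indicator enters on its own."""
    return b0 + b1 * height + b2 * male


def predict_interaction(b0, b1, b2, height, male):
    """Fitted value when the indicator enters only through male * height."""
    return b0 + b1 * height + b2 * male * height


# Hypothetical coefficients, for illustration only.
b0, b1, b2 = 10.0, 0.5, 4.0

# Additive model: the gap between the two groups is b2 at every height.
gap_at_160 = predict_additive(b0, b1, b2, 160, 1) - predict_additive(b0, b1, b2, 160, 0)
gap_at_180 = predict_additive(b0, b1, b2, 180, 1) - predict_additive(b0, b1, b2, 180, 0)

# Interaction model: the slope in height is b1 when male = 0 and b1 + b2 when male = 1.
slope_other = predict_interaction(b0, b1, b2, 181, 0) - predict_interaction(b0, b1, b2, 180, 0)
slope_male = predict_interaction(b0, b1, b2, 181, 1) - predict_interaction(b0, b1, b2, 180, 1)

print(gap_at_160, gap_at_180, slope_other, slope_male)
```

In the additive model the indicator coefficient is read as a constant difference in mean response between groups; in the interaction model it is read as a difference in slopes.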

Activity – ∼ 5-10 minutes

Team up in groups of 2-3 people and come up with the correct interpretation of β1 in the linear relationship below:

$$\widehat{\text{Price}} = -24.5 + 1.65\,\text{Food} + 1.88\,\text{Decor}$$

BUT, you can only use simple words as accepted by this XKCD Simple Word Checker (https://xkcd.com/simplewriter/).

Once you have your best answer, go to PollEv.com/katherinedai702 or open your app (if using) and sign in and add your answer. If you weren’t able to come up with an answer, you can also upvote existing answers.


Role of Assumptions in Regression

• As with many statistical procedures and methods, assumptions are required in order for our regression line to have important uses.
• These assumptions are necessary in order for us to be able to make inference about the unknown model parameters.
• This includes:
  - creating confidence intervals about the unknown model parameters, the elements of β
  - building statistical tests for testing possible values of the unknown model parameters, the elements of β
• In the case of linear regression, the assumptions we make are regarding the random error terms, ε.

Assumption 1: Linearity/Mean zero errors

A1. Linearity of the Relationship
Y is related to X by the linear regression model
$$Y = X\beta + \epsilon \quad \text{or} \quad E(Y \mid X) = X\beta \quad \text{or} \quad E(\epsilon \mid X) = 0$$
• It’s important to realize that when we fit a linear model, we are implicitly assuming that a linear relationship exists in the population.
• But there’s more to this assumption than simply assuming that it is appropriate to use a linear model.
• This assumption also relates to the correctness of your model.
  - It also says that we are assuming only the predictors we are including in X are actually related to the response
  - all remaining variation in the response should not be explainable by any other predictors, but only due to random variation.

Assumption 2: Uncorrelated Errors

A2. Covariance of the Errors
The errors are uncorrelated, namely $Cov(\epsilon_i, \epsilon_j) = 0$ for $i \neq j$, or equivalently $Cov(y_i, y_j) = 0$
• This just says that we require that none of the deviations from the conditional mean be related to one another.
  - analogous to wanting random variables to be independent of one another, or observations to be sampled independently
• We don’t want the errors to be related to each other; rather, they should appear to be independent and identically distributed variables.
  - if they are dependent/correlated, then we are working with less information than we thought we had
• Having correlated error terms means that the predictive ability of the model will be worse in some areas than in others.

Assumption 3: Common Error Variance

A3. Common Error Variance
The errors $\epsilon_i$, $i = 1, \dots, n$, have a common variance $\sigma^2$.
• This assumption says that we assume that the population of responses at any value of the predictors has the same spread.
• Constant error variance is sometimes also called homoskedasticity.
• If it is violated, then our line will become less accurate as the residuals become more variable.
  - so our line will accurately estimate conditional means in some areas, but not in others.
• We want our regression to be equally good at predictions for all values of X.

Assumption 4: Normality of Errors

A4. Normality of Errors
The errors are Normally distributed, such that $\epsilon \mid X \sim N_n(0, \sigma^2 I)$, or equivalently $Y \mid X \sim N_n(X\beta, \sigma^2 I)$.
• This assumption is particularly important for inference (CIs and tests).
• If the errors are Normal, then it means that we can use all of the handy properties of Normal distributions, such as linear combinations of Normal random variables.
• In particular, this will allow us to determine the distribution of the estimators of the model parameters so that we may make inference about them.

Notes on the Assumptions

• None of these assumptions were explicitly used to find the least squares estimators for the model parameters β.
• We didn’t use maximum likelihood, we aren’t using variance/covariance at all, but the equation we minimize requires the linear equation we are estimating to be correct.
• It is very possible (and quite easy) to fit a linear regression model that will not satisfy these assumptions.
  - e.g. nothing will stop you from fitting a straight line to a curved relationship… it just won’t be particularly useful.
• However, when assumptions 1-3 are satisfied, the least squares estimator of β will be unbiased and have minimum variance among all other linear unbiased estimators (i.e. it’s the best one).
• We will show later how to determine unbiasedness and how to find the variance of $\hat{\beta}$.

Code-Along Session

Our last short Code-Along will look into techniques to very informally check whether we might anticipate any problems with model assumptions. These are not formal checks, but can warn you about potential issues down the road. We will use
• Scatterplots to inspect linearity and constant variance
• Histograms to inspect linearity and normality
• Critical thinking to inspect uncorrelated errors.

Wrapping up

• Linear regression models attempt to describe a statistical relationship that is occurring in a population.
  - we can use all sorts of different predictors, but it changes how we interpret the coefficients.
• The notion of conditioning and conditional distributions/relationships is an important one.
  - We interpret each parameter by holding other predictors fixed.
  - We will get predicted values by conditioning on values of the predictors.
  - Estimated coefficients will change when adding more predictors to the model because all the predictors are conditionally related to the response.
• We also found that we will need assumptions in order to ensure our estimators have good properties and yield the results we expect.

STA302/1001: Methods of Data Analysis 1

Instructor: Katherine Daignault

Department of Statistical Sciences

University of Toronto

Week 4 (Oct. 3 – 7)


Outline

Assumptions and Properties of Estimators

Assumptions for Linear Regression

Properties of Residuals

Sampling Distributions of the LS Estimators

Intervals and Hypothesis Tests

For the estimated coefficients and mean response

For an actual individual response


Week 4 Learning Goals

This week we will learn about the assumptions that are required in

linear regression and how these yield really nice inferential

properties in our estimators of the coefficients. We will use these

to derive sampling distributions, confidence/prediction intervals,

and hypothesis tests. To that end, the learning goals are:

• use assumptions to derive properties of estimators
• compute appropriate confidence/prediction intervals and hypothesis tests.
• conclude and interpret the results of a confidence/prediction interval and test.
• differentiate between using a regression model to estimate a parameter versus a future observation


Role of Assumptions in Regression

• As with many statistical procedures and methods, assumptions are required in order for our regression line to have important uses.
• These assumptions are necessary in order for us to be able to make inference about the unknown model parameters.
• This includes:
  - creating confidence intervals about the unknown model parameters, the elements of β
  - building statistical tests for testing possible values of the unknown model parameters, the elements of β
• In the case of linear regression, the assumptions we make are regarding the random error terms, ε.

Assumption 1: Linearity/Mean zero errors

A1. Linearity of the Relationship
Y is related to X by the linear regression model
$$Y = X\beta + \epsilon \quad \text{or} \quad E(Y \mid X) = X\beta \quad \text{or} \quad E(\epsilon \mid X) = 0$$
• It’s important to realize that when we fit a linear model, we are implicitly assuming that a linear relationship exists in the population.
• But there’s more to this assumption than simply assuming that it is appropriate to use a linear model.
• This assumption also relates to the correctness of your model.
  - It also says that we are assuming only the predictors we are including in X are actually related to the response
  - all remaining variation in the response should not be explainable by any other predictors, but only due to random variation.

Assumption 2: Uncorrelated Errors

A2. Covariance of the Errors
The errors are uncorrelated, namely $Cov(\epsilon_i, \epsilon_j) = 0$ for $i \neq j$, or equivalently $Cov(y_i, y_j) = 0$
• This just says that we require that none of the deviations from the conditional mean be related to one another.
  - analogous to wanting random variables to be independent of one another, or observations to be sampled independently
• We don’t want the errors to be related to each other; rather, they should appear to be independent and identically distributed variables.
  - if they are dependent/correlated, then we are working with less information than we thought we had
• Having correlated error terms means that the predictive ability of the model will be worse in some areas than in others.

Assumption 3: Common Error Variance

A3. Common Error Variance
The errors $\epsilon_i$, $i = 1, \dots, n$, have a common variance $\sigma^2$.
• This assumption says that we assume that the population of responses at any value of the predictors has the same spread.
• Constant error variance is sometimes also called homoskedasticity.
• If it is violated, then our line will become less accurate as the residuals become more variable.
  - so our line will accurately estimate conditional means in some areas, but not in others.
• We want our regression to be equally good at predictions for all values of X.

Assumption 4: Normality of Errors

A4. Normality of Errors
The errors are Normally distributed, such that $\epsilon \mid X \sim N_n(0, \sigma^2 I)$, or equivalently $Y \mid X \sim N_n(X\beta, \sigma^2 I)$.
• This assumption is particularly important for inference (CIs and tests).
• If the errors are Normal, then it means that we can use all of the handy properties of Normal distributions, such as linear combinations of Normal random variables.
• In particular, this will allow us to determine the distribution of the estimators of the model parameters so that we may make inference about them.

Notes on the Assumptions

• None of these assumptions were actually needed in order to find the least squares estimators for the model parameters β.
• This is because the least squares process is a distribution-free estimation method.
• It therefore means that it is possible (and quite easy) to fit a linear regression model that will not satisfy these assumptions.
  - e.g. nothing will stop you from fitting a straight line to a curved relationship… it just won’t be particularly useful.
• However, when assumptions 1-3 are satisfied, the least squares estimator of β will be unbiased and have minimum variance among all other linear unbiased estimators (i.e. it’s the best one).
• We will show later how to determine unbiasedness and how to find the variance of $\hat{\beta}$.

Code-Along Session

Our first short Code-Along will look into techniques to very informally check whether we might anticipate any problems with model assumptions. These are not formal checks, but can warn you about potential issues down the road. We will use
• Scatterplots to inspect linearity and constant variance
• Histograms to inspect linearity and normality
• Critical thinking to inspect uncorrelated errors.

Poll Question 1

Go to PollEv.com/katherinedai702 or open your app (if using) and

sign in.

Are these preliminary checks on assumptions enough to know

for certain whether the assumptions on the errors hold?

• Yes
• No


Estimator of the Error Variance

• In many of the assumptions listed in the previous section, we are working with errors and an error variance, all elements of the population that need to be estimated.
• We’ve already seen that the residuals of our least squares regression model are observations of the population errors.
  - namely ê is an observation for ε
• So how do we find an estimate of the error variance, σ²?
• If $Var(\epsilon_i) = \sigma^2 = E[(\epsilon_i - E(\epsilon_i))^2] = E(\epsilon_i^2)$ by the assumptions, then a reasonable estimator would involve averaging the square of the observed residuals.
• We actually get that the estimate of the error variance is
  $$s^2 = \frac{RSS}{n-p-1} = \frac{\hat{e}'\hat{e}}{n-p-1} = \frac{\sum_{i=1}^{n} \hat{e}_i^2}{n-p-1} = \frac{\sum_{i=1}^{n} (y_i - \hat{y}_i)^2}{n-p-1}$$
  where p is the number of predictors in the model.

Estimator of the Error Variance
- You may be asking, why are we not dividing by n if we are taking an average with a sample?
- We could do that, and it would be an estimate of the error variance too.
- However, if we use $s^2$, we would get a better estimate (i.e. unbiased).
- Intuitively, we use n − p − 1 as a divisor because we have estimated p + 1 parameters in the regression model and have to account for these new values by taking information away from the sample.
  - This is the same reason why, when we compute a sample variance, we divide by n − 1 instead of n.
  - We need to account for having used the data once before to estimate the sample mean, and so we take away one data value for this newly introduced information.
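The choice of divisor can be made concrete with a small sketch. The data below are hypothetical (they do not come from the slides), and the fit uses the usual simple-regression formulas with one predictor, so p = 1. It computes the residual sum of squares and then compares the estimator $s^2 = RSS/(n-p-1)$ with the plain average $RSS/n$.

```python
# Hypothetical toy data (not from the slides): one predictor, n = 5.
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.0, 4.0, 5.0, 4.0, 5.0]
n, p = len(x), 1

x_bar = sum(x) / n
y_bar = sum(y) / n
sxx = sum((xi - x_bar) ** 2 for xi in x)
sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))

# Least squares estimates for simple linear regression.
b1 = sxy / sxx
b0 = y_bar - b1 * x_bar

# Residuals and residual sum of squares.
residuals = [yi - (b0 + b1 * xi) for xi, yi in zip(x, y)]
rss = sum(e ** 2 for e in residuals)

s2 = rss / (n - p - 1)   # divisor n - p - 1, as on the slide
naive = rss / n          # plain average of the squared residuals

print(f"RSS = {rss:.3f}, s^2 = {s2:.3f}, RSS/n = {naive:.3f}")
```

Because n − p − 1 is smaller than n, $s^2$ is always the larger of the two values; the gap is most visible in small samples such as this one.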

Notes on the LS Estimator of the Error Variance
- The estimate of the error variance $s^2$ is an unbiased estimate of $\sigma^2$
  - For details on how to prove unbiasedness of this estimate, see Rencher Chapters 2 and 5.
  - We won’t go into these details here because they utilize properties of quadratic forms which are not something that everyone may be familiar with.
- We will soon see that the LS estimator for $\boldsymbol{\beta}$ is unbiased.
- It turns out that the LS estimators for both $\boldsymbol{\beta}$ and the error variance $\sigma^2$ are also the ‘best’ ones.
  - They will also have minimum variance among all other unbiased estimators of a particular type.
- However, where the LS estimator of $\boldsymbol{\beta}$ is best among all unbiased linear estimators, $s^2$ is best among all unbiased quadratic estimators.
  - This is because it is expressed as a quadratic equation or quadratic form.


Properties of LS Estimators
- It’s always important to learn about the properties of your estimators.
- Specifically, we want to know whether the LS estimator $\hat{\boldsymbol{\beta}}$ is unbiased, how variable it is, and whether we can determine its sampling distribution.
- This will involve working with the errors/residuals as well as the assumptions.
- As a reminder, our assumptions essentially can be combined to be
  $$\mathbf{Y} \mid \mathbf{X} \sim N_n(\mathbf{X}\boldsymbol{\beta}, \sigma^2 \mathbf{I})$$
- We will use these assumptions to determine the sampling distribution of $\hat{\boldsymbol{\beta}}$
- We will also use results from Review Slides of Week 0.

Covariance Matrices
- Before jumping into our derivation, let’s remind ourselves of how a covariance matrix works.
- Everything we do will come down to working with the distribution of the errors
  $$\boldsymbol{\epsilon} \mid \mathbf{X} \sim N_n(\mathbf{0}, \sigma^2 \mathbf{I})$$
- This says the vector of random errors has a mean vector of 0 and a covariance matrix which is a diagonal matrix with $\sigma^2$ along the main diagonal and 0’s elsewhere.
  - The main diagonal elements represent $Var(\epsilon_i) = \sigma^2$ for all i, and the off-diagonal elements are $Cov(\epsilon_i, \epsilon_j) = 0$ for all $i \neq j$.
  - So when working with a vector of random variables, you work with a covariance matrix so that you have information about the individual variances but also how the elements of the vector co-vary with each other.

Expectation and Covariance of $\hat{\boldsymbol{\beta}}$
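The slide body did not survive extraction, so the following is a sketch of the standard derivation rather than a transcription of what was shown in lecture. It uses only $\hat{\boldsymbol{\beta}} = (\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}'\mathbf{Y}$, the mean $E(\mathbf{Y} \mid \mathbf{X}) = \mathbf{X}\boldsymbol{\beta}$, and the covariance $Cov(\mathbf{Y} \mid \mathbf{X}) = \sigma^2\mathbf{I}$ from the combined assumptions stated earlier, together with the rule $Cov(\mathbf{A}\mathbf{Y}) = \mathbf{A}\,Cov(\mathbf{Y})\,\mathbf{A}'$ for a fixed matrix $\mathbf{A}$.

```latex
\begin{aligned}
E(\hat{\boldsymbol{\beta}} \mid \mathbf{X})
  &= (\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}'\,E(\mathbf{Y} \mid \mathbf{X})
   = (\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}'\mathbf{X}\boldsymbol{\beta}
   = \boldsymbol{\beta} \\
Cov(\hat{\boldsymbol{\beta}} \mid \mathbf{X})
  &= (\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}'\,Cov(\mathbf{Y} \mid \mathbf{X})\,\mathbf{X}(\mathbf{X}'\mathbf{X})^{-1} \\
  &= (\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}'(\sigma^2\mathbf{I})\mathbf{X}(\mathbf{X}'\mathbf{X})^{-1}
   = \sigma^2(\mathbf{X}'\mathbf{X})^{-1}
\end{aligned}
```

The first line needs only the form of the conditional mean, while the second needs the form of the error covariance matrix.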

Poll Question 2
Go to PollEv.com/katherinedai702 or open your app (if using) and sign in.
How many of the assumptions did we use in deriving these properties?
- None
- One
- Two
- Three
- Four

Coefficients are not uncorrelated
- Consider the covariance matrix of $\hat{\boldsymbol{\beta}}$ in simple linear regression:
  $$Cov(\hat{\boldsymbol{\beta}}) = \sigma^2 (\mathbf{X}'\mathbf{X})^{-1} = \frac{\sigma^2}{SXX} \begin{pmatrix} \sum_{i=1}^n x_i^2 / n & -\bar{x} \\ -\bar{x} & 1 \end{pmatrix}$$
  (exercise: check that you can derive this).
- Knowing that the covariance matrix will look similar (but much larger) for multiple regression, we can see that the off-diagonal elements would not necessarily equal 0.
- This tells us that the estimated coefficients of any two predictors in a multiple linear model may be correlated.
- Even in simple linear regression, the slope and intercept may have a non-zero covariance.
- This again demonstrates the conditional nature of regression and how we must always consider how the components we work with co-vary/are related.
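As a numerical check on the closed form above, the sketch below builds $\mathbf{X}'\mathbf{X}$ for a simple regression on hypothetical predictor values (not from the slides), inverts it with the explicit 2×2 inverse formula, and compares the result with the closed form with $\sigma^2$ factored out.

```python
# Hypothetical predictor values (not from the slides).
x = [1.0, 2.0, 3.0, 4.0, 5.0]
n = len(x)
x_bar = sum(x) / n
sxx = sum((xi - x_bar) ** 2 for xi in x)
sum_x = sum(x)
sum_x2 = sum(xi ** 2 for xi in x)

# X'X for a design matrix with a column of ones and the column x.
xtx = [[n, sum_x],
       [sum_x, sum_x2]]

# Explicit inverse of a 2x2 matrix [[a, b], [c, d]]: (1/det) [[d, -b], [-c, a]].
det = xtx[0][0] * xtx[1][1] - xtx[0][1] * xtx[1][0]
xtx_inv = [[xtx[1][1] / det, -xtx[0][1] / det],
           [-xtx[1][0] / det, xtx[0][0] / det]]

# Closed form from the slide, with sigma^2 factored out.
closed_form = [[sum_x2 / n / sxx, -x_bar / sxx],
               [-x_bar / sxx, 1 / sxx]]

for row_inv, row_cf in zip(xtx_inv, closed_form):
    print(row_inv, row_cf)
```

The off-diagonal entry is $-\bar{x}/SXX$, so in simple regression the slope and intercept estimates are uncorrelated only when the predictor has sample mean zero.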

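Sampling Distribution of $\hat{\boldsymbol{\beta}}$
- Based on the assumptions, we have that $\mathbf{Y} \mid \mathbf{X} \sim N_n(\mathbf{X}\boldsymbol{\beta}, \sigma^2 \mathbf{I})$
- Even though we are working with a multivariate Normal here, it still follows the same rules regarding linearity of Normal random variables (see Week 0 Review Slides).
- We have that $\hat{\boldsymbol{\beta}}$ is a linear combination of Normal random variables because
  $$\hat{\boldsymbol{\beta}} = (\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}'\mathbf{Y} = \mathbf{A}\mathbf{Y}$$
- Linearity of Normals says $\mathbf{A}\mathbf{Y} \sim N(\mathbf{A}\boldsymbol{\mu}_y, \mathbf{A}\boldsymbol{\Sigma}\mathbf{A}')$, where the dimension of this Normal is the number of rows of $\mathbf{A}$ (here p + 1).
- The mean and covariance matrix for our Normal distribution were found to be $\boldsymbol{\beta}$ and $\sigma^2 (\mathbf{X}'\mathbf{X})^{-1}$ respectively (and were found doing exactly this process).
- Therefore the sampling distribution of $\hat{\boldsymbol{\beta}}$ is
  $$N_{p+1}(\boldsymbol{\beta}, \sigma^2 (\mathbf{X}'\mathbf{X})^{-1})$$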

Poll Question 3
Go to PollEv.com/katherinedai702 or open your app (if using) and sign in.
Given a covariance matrix for $\hat{\boldsymbol{\beta}}$ from a model that fit 3 predictors, where would we find the variance of $\hat{\beta}_2$ in this covariance matrix?
- position (1, 1)
- position (2, 2)
- position (3, 3)
- position (4, 4)

Estimating the Variance in the Sampling Distribution
- The sampling distribution of the estimated regression coefficients will become quite useful.
- However, we can only work with the sampling distribution if we have a way to estimate the mean and variance of the Normal.
  - The mean is easy… it’s simply our estimated regression coefficients.
  - For the variance, the inverse matrix is easily calculated using our data.
  - And we’ve already found an estimate of the population error variance, $s^2 = \frac{RSS}{n-p-1}$, and we can simply use this in place of $\sigma^2$
- We do need to be careful though because using $s^2$ gives us an estimated covariance of $\hat{\boldsymbol{\beta}}$, which means that $\hat{\boldsymbol{\beta}}$, once standardized by its estimated standard error, will no longer be Normally distributed.
  - Instead, to account for added uncertainty from the estimate, we will use a $T_{n-p-1}$ distribution (like in Week 0 Review Slides).

Code-Along Session
In this Code-Along, we will see how to compute and extract the assorted variance terms we have discussed. We will focus on:
- extracting estimated error variance,
- extracting standard errors of each beta estimate
- extracting full covariance matrix for beta estimates
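The three quantities listed above can also be computed by hand. The sketch below is not the course's Code-Along material; it is a minimal plain-Python illustration on hypothetical data (not from the slides), using the simple-regression closed form for $(\mathbf{X}'\mathbf{X})^{-1}$ given earlier.

```python
import math

# Hypothetical toy data (not from the slides): one predictor, n = 5.
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.0, 4.0, 5.0, 4.0, 5.0]
n, p = len(x), 1

x_bar, y_bar = sum(x) / n, sum(y) / n
sxx = sum((xi - x_bar) ** 2 for xi in x)
b1 = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / sxx
b0 = y_bar - b1 * x_bar

# 1. Estimated error variance s^2 = RSS / (n - p - 1).
rss = sum((yi - (b0 + b1 * xi)) ** 2 for xi, yi in zip(x, y))
s2 = rss / (n - p - 1)

# 2. Full estimated covariance matrix s^2 (X'X)^{-1}, using the closed form
#    (1/SXX) [[sum(x^2)/n, -x_bar], [-x_bar, 1]] for simple regression.
sum_x2 = sum(xi ** 2 for xi in x)
cov_beta = [[s2 * sum_x2 / n / sxx, -s2 * x_bar / sxx],
            [-s2 * x_bar / sxx, s2 / sxx]]

# 3. Standard errors are the square roots of the diagonal entries.
se_b0 = math.sqrt(cov_beta[0][0])
se_b1 = math.sqrt(cov_beta[1][1])

print(f"s^2 = {s2:.3f}, se(b0) = {se_b0:.4f}, se(b1) = {se_b1:.4f}")
```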


Confidence in Estimates
- Now that we can find the least squares estimates of the model parameters, we need to determine how confident we are that we have captured the true parameters.
- Recall that a confidence interval reflects how drawing a different sample from the population will give different estimates of the parameters.
  - It is a statement about the confidence we have in our sample.
  - e.g. the 95% in a 95% CI represents the percentage of confidence intervals created from other samples of the same size as ours that will capture the true parameter value.
- Since we use a sample to estimate the parameters in the regression line, we must have corresponding CIs to reflect the margin of error of our estimates.

Creating Confidence Intervals (CIs) and Hypothesis Tests
- CIs and Hypothesis tests are constructed from the same quantity, called a pivotal quantity:
  $$\text{pivotal} = \frac{\text{estimator} - \text{truth}}{\text{standard error}}$$
- Both CIs and tests compare this pivotal quantity to the sampling distribution of the estimator, namely
  $$\text{estimator} \sim N(\text{truth}, \text{standard error}^2)$$
  - CIs create a probabilistic statement that references the likelihood of obtaining an estimate a specific distance from the truth.
  - Hypothesis tests instead use the distribution to comment on the likelihood that an estimated value could have arisen from this distribution.
- For linear regression, since the standard error we work with is an estimated value, the Normal distribution is not variable enough to capture the estimation error of both $\hat{\boldsymbol{\beta}}$ and $s^2$, so we use the T distribution instead of a Normal.

CI and Test for individual $\beta_j$
CI: estimate ± (critical value)(standard error)
Test statistic: (point estimate − possible truth) / (standard error)

| Quantity | 100(1 − α)% interval | Test Statistic | Distribution |
|---|---|---|---|
| $\beta_j$ | $\hat{\boldsymbol{\beta}}_{j+1} \pm t_{\alpha/2,\,n-p-1}\, s \sqrt{(\mathbf{X}'\mathbf{X})^{-1}_{(j+1,j+1)}}$ | $\dfrac{\hat{\boldsymbol{\beta}}_{j+1} - \beta_j^0}{s \sqrt{(\mathbf{X}'\mathbf{X})^{-1}_{(j+1,j+1)}}}$ | $T_{n-p-1}$ |

- α is the chosen significance level (often 0.05), while 1 − α is the chosen confidence level (often 0.95).
- The degrees of freedom of the T distribution are the same as the denominator of our estimate $s^2$.
- Matrices begin their indexing at 1, not 0, so to extract the right element corresponding to $\hat{\beta}_j$, you increase the index by 1.
- The same test statistic is used regardless of whether testing $H_a: \beta_j \neq \beta_j^0$ or $H_a: \beta_j > \beta_j^0$ (or also $< \beta_j^0$).

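To see the table in use, the sketch below builds a 95% interval and a two-sided test of $\beta_1 = 0$ for the slope of a simple regression. The data are hypothetical (not from the slides), and the critical value $t_{0.025,3} = 3.182$ is hard-coded from a standard t table because Python's standard library has no t distribution; both are assumptions of this illustration. In simple regression the standard error of the slope reduces to $s/\sqrt{SXX}$.

```python
import math

# Hypothetical toy data (not from the slides): one predictor, n = 5.
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.0, 4.0, 5.0, 4.0, 5.0]
n, p = len(x), 1

x_bar, y_bar = sum(x) / n, sum(y) / n
sxx = sum((xi - x_bar) ** 2 for xi in x)
b1 = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / sxx
b0 = y_bar - b1 * x_bar

rss = sum((yi - (b0 + b1 * xi)) ** 2 for xi, yi in zip(x, y))
s2 = rss / (n - p - 1)

# Standard error of the slope: s * sqrt of the (2, 2) entry of (X'X)^{-1} = 1/SXX.
se_b1 = math.sqrt(s2 / sxx)

# Assumed critical value t_{0.025, n-p-1} with n - p - 1 = 3 (from a t table).
t_crit = 3.182

# estimate +/- (critical value)(standard error)
ci = (b1 - t_crit * se_b1, b1 + t_crit * se_b1)

# (point estimate - hypothesized value) / standard error, testing beta_1 = 0.
t_stat = (b1 - 0.0) / se_b1
reject = abs(t_stat) >= t_crit

print(f"b1 = {b1:.3f}, CI = ({ci[0]:.3f}, {ci[1]:.3f}), t* = {t_stat:.3f}, reject = {reject}")
```

With so few observations the interval contains 0 and the test does not reject, which illustrates that the CI and the two-sided test always agree because they are built from the same pivotal quantity.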
Inference on individual $\beta_j$
- When conducting a hypothesis test on $\beta_j$, we can test any hypothesized value for this parameter.
  - However, the default is to test whether $\beta_j = 0$.
  - This reflects testing whether there is no linear relationship between $X_j$ and Y while holding other predictors fixed.
- We can also opt for one or two-sided tests, but the default is two-sided because the alternative hypothesis to no relationship is that a relationship exists.
  - Rejection of the null hypothesis can be determined using a p-value (e.g. $P(|T_{n-p-1}| \geq |t^*|) < \alpha$ if two-sided, where $t^*$ is the observed test statistic) or by comparison to a critical value (e.g. $|t^*| \geq t_{\alpha/2,\,n-p-1}$)
- As with the hypothesis test, when interpreting our CI, we must also incorporate the notion that we are 95% confident that this interval captures the true linear relationship between $X_j$ and Y in the presence of other fixed predictors.

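Sampling distribution of mean response
- We can also perform inference on the mean response $E(\mathbf{Y} \mid \mathbf{X})$, estimated by $\hat{\mathbf{Y}} = \mathbf{X}\hat{\boldsymbol{\beta}} = \widehat{E(\mathbf{Y} \mid \mathbf{X})}$, where $\hat{y}_i = \mathbf{x}_i'\hat{\boldsymbol{\beta}}$.
- Similar to $\boldsymbol{\beta}$, we would make inference on a single mean response $y_0 = E(Y \mid X = \mathbf{x}_0') = \mathbf{x}_0'\boldsymbol{\beta}$, rather than the entire vector of all mean responses.
  - Here, $\mathbf{x}_0' = (1, x_1, x_2, \ldots, x_p)$ has a specific value for each predictor, and we estimate $E(Y \mid X = \mathbf{x}_0)$ by $\hat{y}_0 = \mathbf{x}_0'\hat{\boldsymbol{\beta}}$.
- The sampling distribution of $\hat{y}_0 = \mathbf{x}_0'\hat{\boldsymbol{\beta}}$ is
  $$\hat{y}_0 \mid \mathbf{X}, \mathbf{x}_0 \sim N(\mathbf{x}_0'\boldsymbol{\beta},\ \sigma^2 \mathbf{x}_0'(\mathbf{X}'\mathbf{X})^{-1}\mathbf{x}_0)$$
  - the estimator is unbiased: $E(\hat{y}_0 \mid \mathbf{X}, \mathbf{x}_0) = E(\mathbf{x}_0'\hat{\boldsymbol{\beta}} \mid \mathbf{X}, \mathbf{x}_0) = \mathbf{x}_0' E(\hat{\boldsymbol{\beta}} \mid \mathbf{X}, \mathbf{x}_0) = \mathbf{x}_0'\boldsymbol{\beta}$
  - and has variance $Var(\hat{y}_0 \mid \mathbf{X}, \mathbf{x}_0) = Var(\mathbf{x}_0'\hat{\boldsymbol{\beta}} \mid \mathbf{X}, \mathbf{x}_0) = \mathbf{x}_0' Var(\hat{\boldsymbol{\beta}} \mid \mathbf{X}, \mathbf{x}_0)\mathbf{x}_0 = \sigma^2 \mathbf{x}_0'(\mathbf{X}'\mathbf{X})^{-1}\mathbf{x}_0$
  - $\hat{y}_0$ is a linear combination of $\mathbf{Y}$ which gives Normality.
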
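CI and Test for mean response
CI: estimate ± (critical value)(standard error)
Test statistic: (point estimate − possible truth) / (standard error)

| Quantity | 100(1 − α)% interval | Test Statistic | Distribution |
|---|---|---|---|
| $\beta_j$ | $\hat{\boldsymbol{\beta}}_{j+1} \pm t_{\alpha/2,\,n-p-1}\, s \sqrt{(\mathbf{X}'\mathbf{X})^{-1}_{(j+1,j+1)}}$ | $\dfrac{\hat{\boldsymbol{\beta}}_{j+1} - \beta_j^0}{s \sqrt{(\mathbf{X}'\mathbf{X})^{-1}_{(j+1,j+1)}}}$ | $T_{n-p-1}$ |
| $y_0 = \mathbf{x}_0'\boldsymbol{\beta}$ | $\mathbf{x}_0'\hat{\boldsymbol{\beta}} \pm t_{\alpha/2,\,n-p-1}\, s \sqrt{\mathbf{x}_0'(\mathbf{X}'\mathbf{X})^{-1}\mathbf{x}_0}$ | $\dfrac{\mathbf{x}_0'\hat{\boldsymbol{\beta}} - y_0^0}{s \sqrt{\mathbf{x}_0'(\mathbf{X}'\mathbf{X})^{-1}\mathbf{x}_0}}$ | $T_{n-p-1}$ |

- Once again, we use T distribution for critical values as Normal only works if $\sigma^2$ is known.
- Hypothesis tests for mean response are not very common, but can be used for testing a specific value $y_0^0$.
- The simple regression version, $Var(\hat{y}_0 \mid x_0, X) = \sigma^2 \left( \frac{1}{n} + \frac{(x_0 - \bar{x})^2}{SXX} \right)$, tells us that the variance will be larger at $x_0$ that is far from $\bar{x}$ (i.e. the mean response is estimated more precisely in the middle of the data).
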
Simple Linear Regression versions
- The matrix-based formulae will work for simple linear models too, but sometimes it’s easier to compute these CIs and test statistics in an algebra-based framework.
- The fundamental change is the expression of the variance.
  - We’ve just seen that the variance of the estimator of the mean response is $Var(\hat{y}_0 \mid x_0, X) = \sigma^2 \left( \frac{1}{n} + \frac{(x_0 - \bar{x})^2}{SXX} \right)$
  - The covariance matrix of $\hat{\boldsymbol{\beta}}$ in simple regression can be written as
    $$Cov(\hat{\boldsymbol{\beta}}) = \frac{\sigma^2}{SXX} \begin{pmatrix} \sum_{i=1}^n x_i^2 / n & -\bar{x} \\ -\bar{x} & 1 \end{pmatrix}$$
  - So we have
    $$Var(\hat{\beta}_0) = \frac{\sigma^2}{SXX} \cdot \frac{\sum_{i=1}^n x_i^2}{n} = \sigma^2 \left( \frac{1}{n} + \frac{\bar{x}^2}{SXX} \right), \qquad Var(\hat{\beta}_1) = \frac{\sigma^2}{SXX}$$
- These can give us better insight into how the variance of our estimators depends on a multitude of factors.

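The two forms of $Var(\hat{\beta}_0)$ above are linked by one algebraic step that the slide leaves implicit. A short derivation, assuming the usual definition $SXX = \sum_{i=1}^n (x_i - \bar{x})^2$ (so that $\sum_{i=1}^n x_i^2 = SXX + n\bar{x}^2$):

```latex
\begin{aligned}
Var(\hat{\beta}_0)
  &= \frac{\sigma^2}{SXX} \cdot \frac{\sum_{i=1}^n x_i^2}{n}
   = \frac{\sigma^2}{SXX} \cdot \frac{SXX + n\bar{x}^2}{n} \\
  &= \sigma^2 \left( \frac{SXX}{n \, SXX} + \frac{n\bar{x}^2}{n \, SXX} \right)
   = \sigma^2 \left( \frac{1}{n} + \frac{\bar{x}^2}{SXX} \right)
\end{aligned}
```

This is the mean-response variance $\sigma^2 \left( \frac{1}{n} + \frac{(x_0 - \bar{x})^2}{SXX} \right)$ evaluated at $x_0 = 0$, which is consistent with the intercept being the mean response at zero.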
Poll Question 4
Go to PollEv.com/katherinedai702 or open your app (if using) and sign in.
True or False: if my predictor is highly variable in simple regression, the variance of my LS estimators will increase.

Exercise - Give it a try!
We have the following summary measures for the earlier data:
$\sum_{i=1}^{20} x_i = 4035$, $\sum_{i=1}^{20} y_i = 4041$, $\sum_{i=1}^{20} e_i^2 = 4753.125$,
$\sum_{i=1}^{20} x_i^2 = 1005535$, $\sum_{i=1}^{20} x_i y_i = 864910$
If $\hat{\beta}_1 = 0.259$, find a 95% confidence interval for the mean response at X = 200 (use $t_{\alpha/2,\,18} = 2.10$).

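One way to check your answer is to carry out the arithmetic in code. The sketch below uses only the summary measures, slope estimate, and critical value stated in the exercise, together with the simple-regression formulas from the preceding slides; the variable names are illustrative. It is one possible route to the answer, not an official solution.

```python
import math

# Summary measures given in the exercise (n = 20 observations).
n = 20
sum_x, sum_y = 4035.0, 4041.0
sum_x2 = 1005535.0
rss = 4753.125          # sum of squared residuals
b1 = 0.259              # slope estimate given in the exercise
t_crit = 2.10           # critical value given in the exercise
x0 = 200.0

x_bar = sum_x / n
y_bar = sum_y / n
sxx = sum_x2 - n * x_bar ** 2          # SXX = sum(x^2) - n * x_bar^2
b0 = y_bar - b1 * x_bar                # intercept from the fitted line
s = math.sqrt(rss / (n - 2))           # p = 1, so the divisor is n - 2

# Estimated mean response and its standard error at x0.
y0_hat = b0 + b1 * x0
se_mean = s * math.sqrt(1 / n + (x0 - x_bar) ** 2 / sxx)

ci = (y0_hat - t_crit * se_mean, y0_hat + t_crit * se_mean)
print(f"fitted mean = {y0_hat:.2f}, 95% CI = ({ci[0]:.2f}, {ci[1]:.2f})")
```

Because X = 200 is very close to the sample mean of the predictor (201.75), the $(x_0 - \bar{x})^2/SXX$ term contributes almost nothing and the interval is close to its narrowest possible width.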
Predicting a Future Observation
- There is a distinction between predicting the mean response in the population and predicting the actual response of an individual member of the population.
- The mean response at a specific value of the predictor is a parameter, $y_0 = E(Y \mid X = \mathbf{x}_0')$
  - the expected value is what we would expect a response Y to be in the long run when $X = \mathbf{x}_0'$
  - so $E(Y \mid X = \mathbf{x}_0')$ is a fixed but unknown quantity.
- The actual response for an individual with a specific value of the predictor is a realization of the random variable, $y_0$
  - Because $y_0$ is a random variable, it can take any number of values when $X = \mathbf{x}_0'$
  - It also may not lie on the population regression line.

Predicting a Future Observation
- If we want an actual response $y_0$ but we can only get an estimate from the regression line $\hat{y}_0$, then our inference will need to account for the distance that $y_0$ is from the regression line.
- Thus, we build a prediction interval to provide a range of possible values for the future observation.
- The error in this prediction from using a regression line for prediction is
  $$y_0 - \hat{y}_0 = \mathbf{x}_0'\boldsymbol{\beta} + \epsilon_0 - \hat{y}_0 = (\mathbf{x}_0'\boldsymbol{\beta} - \hat{y}_0) + \epsilon_0$$
- This says that the difference between the actual response and the one we predict with our regression line is based on how well we estimate the conditional mean $(\mathbf{x}_0'\boldsymbol{\beta} - \hat{y}_0)$ plus the natural variation in the conditional distribution $(\epsilon_0)$

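Mean and Variance of Prediction Error
- In the population $y_0 = \mathbf{x}_0'\boldsymbol{\beta} + \epsilon_0$, so we can predict $y_0$ by $\hat{y}_0 = \mathbf{x}_0'\hat{\boldsymbol{\beta}}$ because this is all our regression line can provide.
- Using the prediction error as written previously, we can determine a distribution for the prediction error which we will use to get our prediction interval.
  - We find $E(y_0 - \hat{y}_0 \mid \mathbf{X}, \mathbf{x}_0) = 0$
  - We also know that $y_0$ and $\hat{y}_0$ are independent because the observations that go into finding $\hat{y}_0$ are sampled randomly from the same population as $y_0$.
  - This lets us find the variance in the prediction error:
    $$\begin{aligned} Var(y_0 - \hat{y}_0 \mid \mathbf{X}, \mathbf{x}_0) &= Var(\mathbf{x}_0'\boldsymbol{\beta} + \epsilon_0 - \mathbf{x}_0'\hat{\boldsymbol{\beta}}) \\ &= Var(\epsilon_0) + Var(\mathbf{x}_0'\hat{\boldsymbol{\beta}}) \\ &= \sigma^2 + \sigma^2 \mathbf{x}_0'(\mathbf{X}'\mathbf{X})^{-1}\mathbf{x}_0 \\ &= \sigma^2 [1 + \mathbf{x}_0'(\mathbf{X}'\mathbf{X})^{-1}\mathbf{x}_0] \end{aligned}$$
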
Distribution of Prediction Error
- Now, we know the average prediction error and how variable that prediction error will be.
- Thus, using the same arguments as before, we can obtain a Normality result which says the prediction error is
  $$y_0 - \hat{y}_0 \mid \mathbf{X}, \mathbf{x}_0 \sim N(0,\ \sigma^2 [1 + \mathbf{x}_0'(\mathbf{X}'\mathbf{X})^{-1}\mathbf{x}_0])$$
- Once again, in practice we do not know the value of $\sigma^2$ and so would need to estimate it using $s^2$.
  - Then the distribution of prediction errors is better described by a $T_{n-p-1}$ distribution than a Normal.
- We create an interval similarly to confidence intervals (by using the distribution to measure a certain number of standard errors away from a centre).
  - But because we are not trying to estimate a parameter (we are seeking an observed value), we cannot call this a confidence interval as confidence specifically refers to parameters.

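Prediction Interval for an actual response
CI: estimate ± (critical value)(standard error)
Test statistic: (point estimate − possible truth) / (standard error)

| Quantity | 100(1 − α)% interval | Test Statistic | Distribution |
|---|---|---|---|
| $\beta_j$ | $\hat{\boldsymbol{\beta}}_{j+1} \pm t_{\alpha/2,\,n-p-1}\, s \sqrt{(\mathbf{X}'\mathbf{X})^{-1}_{(j+1,j+1)}}$ | $\dfrac{\hat{\boldsymbol{\beta}}_{j+1} - \beta_j^0}{s \sqrt{(\mathbf{X}'\mathbf{X})^{-1}_{(j+1,j+1)}}}$ | $T_{n-p-1}$ |
| $y_0 = \mathbf{x}_0'\boldsymbol{\beta}$ | $\mathbf{x}_0'\hat{\boldsymbol{\beta}} \pm t_{\alpha/2,\,n-p-1}\, s \sqrt{\mathbf{x}_0'(\mathbf{X}'\mathbf{X})^{-1}\mathbf{x}_0}$ | $\dfrac{\mathbf{x}_0'\hat{\boldsymbol{\beta}} - y_0^0}{s \sqrt{\mathbf{x}_0'(\mathbf{X}'\mathbf{X})^{-1}\mathbf{x}_0}}$ | $T_{n-p-1}$ |
| $y_0^p = \mathbf{x}_0'\boldsymbol{\beta}$ | $\mathbf{x}_0'\hat{\boldsymbol{\beta}} \pm t_{\alpha/2,\,n-p-1}\, s \sqrt{1 + \mathbf{x}_0'(\mathbf{X}'\mathbf{X})^{-1}\mathbf{x}_0}$ | NA | $T_{n-p-1}$ |

- Note that $y_0^p$ is used simply as a way to distinguish the interval for the mean response and for an actual response.
  - they are equivalent values since the regression line can only estimate a point on itself.
- The algebraic version for simple linear regression is
  $$\hat{\beta}_0 + \hat{\beta}_1 x_0 \pm t_{\alpha/2,\,n-2}\, s \sqrt{1 + \frac{1}{n} + \frac{(x_0 - \bar{x})^2}{SXX}}$$
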
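To compare the two kinds of interval numerically, the sketch below reuses the summary measures from the earlier exercise slide (n = 20, the given slope 0.259, and the critical value 2.10) and computes both the confidence interval for the mean response and the prediction interval at X = 200 using the simple-regression formulas. The variable names are illustrative.

```python
import math

# Summary measures from the earlier exercise slide (n = 20).
n = 20
sum_x, sum_y = 4035.0, 4041.0
sum_x2 = 1005535.0
rss = 4753.125
b1 = 0.259      # slope estimate given in the exercise
t_crit = 2.10   # t_{alpha/2, 18} given in the exercise
x0 = 200.0

x_bar, y_bar = sum_x / n, sum_y / n
sxx = sum_x2 - n * x_bar ** 2
b0 = y_bar - b1 * x_bar
s = math.sqrt(rss / (n - 2))
y0_hat = b0 + b1 * x0

# Shared term: 1/n + (x0 - x_bar)^2 / SXX.
leverage = 1 / n + (x0 - x_bar) ** 2 / sxx

# Confidence interval for the mean response: s * sqrt(leverage).
half_ci = t_crit * s * math.sqrt(leverage)
ci = (y0_hat - half_ci, y0_hat + half_ci)

# Prediction interval for an actual response: s * sqrt(1 + leverage).
half_pi = t_crit * s * math.sqrt(1 + leverage)
pi = (y0_hat - half_pi, y0_hat + half_pi)

print(f"CI = ({ci[0]:.2f}, {ci[1]:.2f}), PI = ({pi[0]:.2f}, {pi[1]:.2f})")
```

Both intervals share the centre $\hat{y}_0$, but the extra 1 under the square root makes the prediction interval several times wider here, since the error variance dominates the estimation variance near $\bar{x}$.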
Poll Question 5
Go to PollEv.com/katherinedai702 or open your app (if using) and sign in.
Which interval is the confidence interval for the mean response?
[Figure: Production Time plotted against Order Size, with two sets of interval lines drawn in blue and in red]
- Blue lines
- Red lines

Notes on the Prediction Interval
- You may have noticed that the prediction interval results in a wider interval than a confidence interval for the conditional mean response, even though they are centred at the same value.
- We can see why this is the case looking at the formulae for the variance.
  - We have variation due to estimating the conditional mean response (i.e. $\sigma^2 \mathbf{x}_0'(\mathbf{X}'\mathbf{X})^{-1}\mathbf{x}_0$)
  - But because we are predicting an actual observation, we also have variation in the response distribution, because the random variable could take any value from the Normal with variance $\sigma^2$.
- Therefore prediction intervals are wider than confidence intervals because they must capture 100(1 − α)% of the response distribution, reflecting the most likely response values the random variable could take (e.g. the most likely 95% when α = 0.05).

Code-Along Session
In this Code-Along, we will work through how to conduct these inferential techniques in R. We will see how to:
- conduct hypothesis tests on each regression coefficient
- build confidence intervals on each regression coefficient
- build confidence intervals on the mean response given a set of predictor values
- build prediction intervals for an actual observed response.

Wrapping Up
- We have derived a number of important inferential tools that we will continue to use throughout the course:
  - We have hypothesis tests/CIs for determining whether a single predictor is significantly linearly related to the response in the presence of the other predictors.
  - We have hypothesis tests/CIs for determining whether a certain conditional mean response parameter value is plausible.
  - We have a prediction interval that allows us to provide a range of possible future values of an observed response.
- All of our results however rely heavily on the assumptions of linear regression being satisfied.
- Next week, we will see how to use these and other inferential tools to refine a regression model.

STA302/1001: Methods of Data Analysis 1
Instructor: Katherine Daignault
Department of Statistical Sciences
University of Toronto
Week 5 (Oct. 10-14)
Outline
Intervals and Inference
For an actual individual response (last week)
Decomposing the Variation in the Response
Sum of Squares Decomposition
Coefficient of Determination
ANOVA F Test
Partial F Test
Week 5 Learning Goals
In this week, we will see that regression models break down variation in the response into two components: that which is explained by the predictors and that which is not. We will develop two tests for determining the significance of the linear relationship, as well as how to quantify how much variation is explained by your model. To that end, the learning goals are:
- apply the appropriate test and define appropriate hypotheses for each test.
- correctly conclude tests for significance of the linear relationship.
- describe how the tests compare sources of variation and how this leads to our conclusions.
- explain the coefficient of determination and use it appropriately.

Regression Explains Variation in Response
- We have seen that a linear model is often fit because we are trying to estimate a relationship between a response and some number of predictors in the population.
- When working with a single predictor, we can talk about this as trying to use X to explain the pattern that we observe in our response Y.
  - This can also be thought of as using X to explain the variation we observe in Y.
- Last module, we talked about testing individual coefficients to determine if they are significantly linearly related to the response in the presence of other predictors.
  - In simple regression, this is actually the same as testing whether our single X significantly explains the variation/pattern in Y.
- We can use this idea of explaining variation to create new tests and summaries for our models.

Poll Question 2
Go to PollEv.com/katherinedai702 or open your app (if using) and sign in.
Which graph displays data that would be more likely to yield a rejection of the null hypothesis of no linear relationship?
- Graph A
- Graph B

Variation and the Regression Line
- Looking at these, one might intuitively think that the regression line would be better at representing the linear relationship when the relationship is more visually obvious.
  - In fact, the clearer relationship would indeed be more likely to yield a significant t-test on the slope than the less clear relationship.
  - This is because having more variation in the response means there is more variability for the predictor to try to explain.
  - Therefore we may have more variation that is unexplained by the predictor, i.e. larger residual sum of squares.

A Decomposition of Variation
- Let’s consider the linear regression model $\mathbf{Y} = \mathbf{X}\boldsymbol{\beta} + \boldsymbol{\epsilon}$.
- This inherently is saying that the value of the response is composed of two parts: the part explained by the values of the predictors, and the random variability in the distribution.
- We can think of our sample and its variability in the same way:
  - We have a certain amount of variation in our sampled responses (we can determine this with a sample variance).
  - The regression line fit through our data can be used to say that a certain amount of the variation in the responses is due to this relationship.
- Lastly, we have the residuals that talk about how different each data point is from the model (or the pattern described by the model)
  - if we take the estimated error variance, this represents the leftover variation in the response not explained by the model.

A Decomposition of Variation
- We can write out this relationship between various sources of variation with equations.
- Variation will be expressed as sums of squares (like the RSS) - sums of the squared deviations between two quantities.
- The original amount of variation we start with is our total sum of squares (SST).
  - We express it as $SST = \sum_{i=1}^n (y_i - \bar{y})^2$, or $(n-1)s_y^2$, where $s_y^2$ is the sample variance of the response.
- The residual amount of variation leftover after fitting a regression model is the residual sum of squares (RSS).
  - Recall this is $RSS = \sum_{i=1}^n \hat{e}_i^2 = \sum_{i=1}^n (y_i - \hat{y}_i)^2$
- Lastly the variation explained by the model is the regression sum of squares (SSreg).
  - Since no relationship between Y and X would be a horizontal line at $\bar{y}$, we express $SSreg = \sum_{i=1}^n (\hat{y}_i - \bar{y})^2$.

Decomposition of the Sum Of Squares
- Putting these pieces together, we get the sum of squares decomposition:
  $$\sum_{i=1}^n (y_i - \bar{y})^2 = \sum_{i=1}^n (\hat{y}_i - \bar{y})^2 + \sum_{i=1}^n (y_i - \hat{y}_i)^2$$
  or SST = SSreg + RSS.
- In the matrix framework, this can be written as
  $$\mathbf{Y}'\left(\mathbf{I} - \frac{1}{n}\mathbf{J}\right)\mathbf{Y} = \left(\hat{\boldsymbol{\beta}}'\mathbf{X}'\mathbf{Y} - \frac{1}{n}\mathbf{Y}'\mathbf{J}\mathbf{Y}\right) + \left(\mathbf{Y}'\mathbf{Y} - \hat{\boldsymbol{\beta}}'\mathbf{X}'\mathbf{Y}\right)$$
  where $\mathbf{J}$ is a square matrix of ones (see Rencher Chapter 5.1 to see why SST is written like this).
- In the next section, we will see how to use this decomposition to talk about how much response variation the regression model explains.

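The decomposition can be verified numerically. The sketch below fits a simple regression to hypothetical data (not from the slides) and computes each sum of squares directly from its definition, then compares SST with SSreg + RSS.

```python
# Hypothetical toy data (not from the slides): one predictor, n = 5.
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.0, 4.0, 5.0, 4.0, 5.0]
n = len(x)

x_bar, y_bar = sum(x) / n, sum(y) / n
sxx = sum((xi - x_bar) ** 2 for xi in x)
b1 = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / sxx
b0 = y_bar - b1 * x_bar
fitted = [b0 + b1 * xi for xi in x]

# Each sum of squares, computed from its definition.
sst = sum((yi - y_bar) ** 2 for yi in y)                  # total
ssreg = sum((fi - y_bar) ** 2 for fi in fitted)           # explained by the line
rss = sum((yi - fi) ** 2 for yi, fi in zip(y, fitted))    # left over

print(f"SST = {sst:.3f}, SSreg = {ssreg:.3f}, RSS = {rss:.3f}")
print(f"SSreg + RSS = {ssreg + rss:.3f}")
```

The identity holds exactly for a least squares fit that includes an intercept; for other fitted lines the cross-product term does not vanish and the two sides generally differ.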
A Numerical Example
Suppose we collect a sample of 20 observations on both a response
(Y) and a single predictor (X). We find that the mean response in
the sample is 202.05 while the sample variance in the response is
927.5237. A simple linear model is fit and the estimated error
variance is 264.1431. Find the components of the sum of squares
decomposition.
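One way to work this example is to undo the divisors: $SST = (n-1)s_y^2$ and $RSS = (n-p-1)s^2$ with p = 1, after which SSreg follows by subtraction. The sketch below carries out that arithmetic using only the numbers stated in the example; the variable names are illustrative.

```python
# Quantities stated in the example.
n = 20                 # sample size
p = 1                  # simple linear model: one predictor
var_y = 927.5237       # sample variance of the response
s2 = 264.1431          # estimated error variance

# Undo the divisors to recover the sums of squares.
sst = (n - 1) * var_y          # SST = (n - 1) * s_y^2
rss = (n - p - 1) * s2         # RSS = (n - p - 1) * s^2
ssreg = sst - rss              # from SST = SSreg + RSS

print(f"SST = {sst:.4f}, RSS = {rss:.4f}, SSreg = {ssreg:.4f}")
```

The sample mean of 202.05 is not needed for the decomposition itself, since each sum of squares is already centred.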
Visualizing with Venn/Euler Diagrams
Poll Question 3
Go to PollEv.com/katherinedai702 or open your app (if using) and sign in.
If we wanted to measure the “goodness of a model”, i.e. how well the model explains the initial variation in the response, what could we use?
- a hypothesis test on the slope
- the correlation between X and Y
- the estimated variance in the errors
- a ratio of regression sum of squares and total sum of squares

Quantifying Amount of Variation Explained
- We saw that fitting a regression model can also be interpreted
  as explaining some of the variation observed in the response.
- We found that we can take the total variation (given by SST)
  and partition/decompose it into two pieces:
  - The portion that the model/predictors explain (SSreg)
  - The portion that is leftover/unexplained (RSS)
- When fitting different models on the same sample, the SST
  will be the same.
- However, consider two statisticians working on a similar
  problem but on two different samples of data.
  - They both fit a model using two predictors, but they happen
    to pick two different predictors.
  - While we could look at each model’s SSreg to see which model
    explains more variation, it will be difficult to know who had the
    better model because the SST will be different.
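The point that SST stays fixed across models fit to the same sample can be checked with a small simulation. The sketch below is illustrative only: the data are simulated, the helper names (`simple_ols`, `decompose`) are hypothetical, and each model uses a single predictor rather than two to keep the arithmetic short.

```python
import random

def simple_ols(x, y):
    """Least squares intercept and slope for a simple linear model."""
    n = len(y)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    b1 = sxy / sxx
    b0 = y_bar - b1 * x_bar
    return b0, b1

def decompose(x, y):
    """Return (SST, SSreg, RSS) for the simple linear model of y on x."""
    b0, b1 = simple_ols(x, y)
    y_bar = sum(y) / len(y)
    fitted = [b0 + b1 * xi for xi in x]
    sst = sum((yi - y_bar) ** 2 for yi in y)
    ssreg = sum((fi - y_bar) ** 2 for fi in fitted)
    rss = sum((yi - fi) ** 2 for yi, fi in zip(y, fitted))
    return sst, ssreg, rss

# Simulated sample (an assumption for illustration, not course data).
random.seed(302)
n = 50
x1 = [random.gauss(0, 1) for _ in range(n)]
x2 = [random.gauss(0, 1) for _ in range(n)]
y = [2 + 3 * a + 0.5 * b + random.gauss(0, 1) for a, b in zip(x1, x2)]

# Two different models fit to the same response.
sst1, ssreg1, rss1 = decompose(x1, y)
sst2, ssreg2, rss2 = decompose(x2, y)

print("Model with x1:", sst1, ssreg1, rss1)
print("Model with x2:", sst2, ssreg2, rss2)
print("Same SST for both models:", sst1 == sst2)
```

Because SST depends only on the response values, both models share it, so their SSreg values can be compared directly. That comparison stops being fair once the two models are fit to different samples.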
Coefficient of Determination, R^2
- The issue with strictly comparing the SSreg values is that the
  data is changing.
- So we can “standardize” the SSreg by the SST so that the
  value no longer depends on the original variation in the
  responses.
- This gives us what is called the coefficient of determination
  (R^2), given by
  \[ R^2 = \frac{\text{SSreg}}{\text{SST}} = 1 - \frac{\text{RSS}}{\text{SST}} \]
- The coefficient of determination has some nice characteristics:
  - It can also be computed by squaring the sample correlation
    when working with a simple linear model
  - It actually measures the proportion of the variation in the
    response that is explained by the model.
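Both forms of the formula, and the link to the squared sample correlation, can be verified numerically. The sketch below uses simulated data (an assumption for illustration) and a hand-computed simple linear fit. Its last lines apply the same formula to the summary statistics from the numerical example earlier in these slides, assuming the usual n - 1 and n - 2 divisors.

```python
import random

# Simulated sample for a simple linear model (illustrative values only).
random.seed(1)
n = 30
x = [random.uniform(0, 10) for _ in range(n)]
y = [1.5 + 0.8 * xi + random.gauss(0, 2) for xi in x]

x_bar = sum(x) / n
y_bar = sum(y) / n
sxx = sum((xi - x_bar) ** 2 for xi in x)
syy = sum((yi - y_bar) ** 2 for yi in y)   # this is SST
sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))

# Least squares estimates and fitted values.
b1 = sxy / sxx
b0 = y_bar - b1 * x_bar
fitted = [b0 + b1 * xi for xi in x]

sst = syy
ssreg = sum((fi - y_bar) ** 2 for fi in fitted)
rss = sum((yi - fi) ** 2 for yi, fi in zip(y, fitted))

r2_from_ssreg = ssreg / sst        # R^2 = SSreg / SST
r2_from_rss = 1 - rss / sst        # R^2 = 1 - RSS / SST
r = sxy / (sxx * syy) ** 0.5       # sample correlation between X and Y

print(r2_from_ssreg, r2_from_rss, r ** 2)

# The numerical example from the slides: n = 20, sample variance 927.5237,
# estimated error variance 264.1431.
r2_example = 1 - (18 * 264.1431) / (19 * 927.5237)
print(round(r2_example, 4))  # 0.7302
```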
Notes on Using the Coefficient of Determination
- The coefficient of determination is really just a description or
  summary measure that can be used to help discuss the
  performance of your model.
  - It is not a formal test so...
