STA302 Final Project

Final Project Part 3: Final Data Analysis Report
Due: December 20, 2022 by 11:59PM ET on Quercus
No late submissions will be accepted
Goal of the Assessment:
Part 3 of the Final Project is your opportunity to demonstrate all that you have learned
throughout the course. This will be done by showing the teaching team that you can use the
methods and techniques learned in the course appropriately. You can use the feedback that
you have received in Parts 1 and 2, as well as in the video project, to write a report that is in a
common research paper format (IMRD: Introduction, Methods, Results, Discussion). Writing
these kinds of reports is likely something that, as a graduate student or a statistician working in
industry, you will find yourself doing occasionally.
Since this assignment is used to assess how familiar you are with the use of the tools and
methods from this course only, you should NOT use materials that were not covered in this
course. Instead, focus on showing us how much you know about everything we have discussed
throughout the term.
Your report can also be used as part of a dossier when applying for jobs to showcase your abilities as a
statistician and data analyst.
General Instructions:
Using only methods and techniques presented in the lecture slides throughout the term, you
are tasked with answering your proposed research question by creating the ‘best’ linear
regression model that meets the requirements of your research question. You will then need to
write a report (details below) that (i) introduces your research question and presents some
background, (ii) outlines the steps in your analysis that you followed to reach the ‘best’ model,
(iii) presents the results of your analysis and describes and justifies the decisions you made, and
finally (iv) discusses the final model, its interpretation and its limitations in terms of its ability to
meet your research goals. It should be made clear whether you are aiming for a model that
makes good predictions, or a model that is more descriptive and easier to interpret, or some
combination of both.
The feedback and work you have put into Part 1 of the final project should help you structure
your report in a professional and easy-to-read fashion, as well as provide you with a good
beginning to your introduction section. You may want to consider adding some additional
background research or more discussion about how your research question is important and
different from the background you present. The EDA portion of Part 1 should be helpful in
writing the beginning of the results section, where you display the characteristics of the data
you will use to answer your question.
The feedback and work you have put into Part 2 of the final project should help you structure
the methods section of your report, where you will outline the process you followed/tools and
methods you used to answer your research question. The feedback should also help you with
how you approach your data analysis itself.
How to present your final report:
Once you have decided upon the ‘best’ model to fulfill the goal of the project, you must write
up a short scientific report. There should be 4 main sections of your report:
• Introduction section: where you introduce the purpose and relevance/importance of
  the project and provide some relevant background information on the topic (no results
  or data should be presented here).
• Methods section: where you describe and explain the methods, tools and techniques
  used to arrive at your final model (no results or data should be presented here, but you
  can tell us where you found your data and what variables it contains).
• Results section: where you present a numerical/graphical description of your study
  sample and important results that led you to make crucial decisions in building your
  model (following the methods you outline in the earlier section), followed by the final
  model and any other important results.
• Discussion section: where you interpret your final model and describe why it answers
  the research question and why it is important, as well as discuss any limitations that still
  exist based on your results.
You may use tables and plots to help present your results, but they must be relevant and
well-thought-out to convey as much information as possible without being too overwhelming or
confusing. When explaining your methods, try to avoid just stating that you used a specific
method, but add an explanation for how it is used to achieve a specific task. When presenting
your results, avoid repeating exactly what you wrote in your methods section. Instead, focus on
the results of the process you described earlier, and use numerical values/graphical results to
support the decisions you made in arriving at your final model. See the rubric for more
information regarding the various report components.
If you want more information about how to structure your report and what should be
contained in each section, see this cheat sheet and this outline for reports (you may ignore the
abstract portion since you do not need one). Note that not all the elements in these resources
need to be included in your report. But you can use these to better understand how to
structure your submission.
Finally, if you use any external resources outside of the lecture slides, e.g. to provide
background on your topic, you should include a reference section at the end of your report. You
may follow APA citation styles to help format your references. For some resources on how to
cite, see the library page on citations.
What to do if you want to change your dataset or research question:
If you wish to change your dataset or research question from what was originally proposed in
Part 1, you are allowed to do so. However, you will need to provide a written statement that
proposes the change you wish to make. In order to change your dataset or research question,
you will need to submit a 1-page document (to be submitted by December 4 at 11:59PM ET on
Quercus) that answers the following two questions:
1. Why are you changing your topic or dataset? Elaborate on what made your original
dataset or topic not appropriate for the final project.
2. What makes your new topic and/or dataset more appropriate than the previous one?
Be sure to clearly state your new research question and provide a short, written
description of where you located your dataset and what information it contains.
The instructor will then approve or provide suggestions to improve your new dataset/research
question.
Technical Requirements of the Final Report:
Your report should be typed using whatever software you prefer but must be saved and
submitted as a PDF or .docx file on Quercus. Your report must meet the following
requirements:
• Font: 12-point font in a style similar to Times New Roman (this is the default in R
  Markdown)
• Spacing: single-spaced
• Word count: up to a maximum of 1500 words in total (this does not include captions on
  figures and tables; however, captions should not be excessively long or contain
  information that isn’t mentioned in the main text). We will still accept a report
  that exceeds the word limit by no more than 150 words.
• Number of tables/figures in the main report: 5 in total, but you may use any
  combination of tables and figures
• Figure and table captions: all figures and tables included should include a caption that
  describes what is being presented (caption not included in the word count).
  o Captions should not contain information that is not also discussed in the main
    report
• Figure properties:
  o All plots should have an appropriate title and axis labels, avoiding the use of
    variable names as they appear in the dataset
  o A figure may include multiple individual plots but they should be related to each
    other and make sense as to why they are being presented together
    § Avoid having too many plots in the same figure to ensure that they are
      legible and clear.
• Reference list or bibliography at the end of the report (will not count towards word
  count), using appropriate citation style
• Appendix: you may add an appendix at the end of your report to include some
  additional tables or figures that were not important enough to be part of the main
  report, but still relevant to your analysis:
  o up to 3 additional tables/figures but they should only be included if they are
    relevant to the analysis and are referred to in the main text.
• R code: In a separate file (i.e. RMD file), you should upload your cleaned and complete
  version of the R code that was used to conduct your analysis. The R code should be
  well-organized and commented appropriately to indicate what each line/section of code is
  doing.
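The word limit above can be checked mechanically before submission. The following is a minimal sketch in Python, assuming the main text of the report has already been exported to plain text with captions and references removed. The function name `check_word_count` and the whitespace-based definition of a "word" are illustrative assumptions, not an official counting rule, so the count may differ slightly from your word processor's.

```python
WORD_LIMIT = 1500  # maximum stated in the requirements above
ALLOWANCE = 150    # excess that will still be accepted


def check_word_count(text):
    """Count whitespace-separated words and classify the total against the limit."""
    n_words = len(text.split())
    if n_words <= WORD_LIMIT:
        status = "within limit"
    elif n_words <= WORD_LIMIT + ALLOWANCE:
        status = "over limit but within the 150-word allowance"
    else:
        status = "too long"
    return n_words, status


# Hypothetical usage with a short piece of text
print(check_word_count("This report studies the relationship between price and size."))
```

A report classified as "over limit but within the 150-word allowance" would still be accepted under the rule above, but aiming for 1500 words or fewer leaves a margin for counting differences.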
Checklist for submitting Final Project Part 3:
1. Your final written report which follows the requirements above.
2. Your R code that shows your complete analysis (this will be used to verify the results
displayed in your written report and will not be assessed for content).
Things to keep in mind while writing your final report:
• You do not need to write out the results of every step you took in your analysis as this
  will make your report too long.
  o Instead, focus on summarizing the most important results, especially where a big
    decision was made. You need to justify any big decisions.
  o For the rest of your results, very short mentions of the process with a brief piece
    of evidence provided are enough to allow your reader to follow your analysis and
    understand how you arrived at the final model.
• Rather than presenting the results of each step separately (e.g. creating separate tables
  for each), consider putting together one larger table that you can refer to in your
  discussion of many steps in your analysis so that you don’t use too much space.
  o For example, if you are selecting between a few different models, you could
    consider presenting a table that includes many different summaries of the fit of
    each model and refer to each part as needed in the text, instead of making
    individual tables for each component.
• Avoid using R output taken directly from R/RStudio. Instead create your own tables
  where you select only the relevant pieces of the output to display.
• Generally, the methods and results sections tend to be the longest sections, while the
  introduction and discussion tend to be shorter.
  o Keep this in mind when deciding how much background to provide in your
    introduction. Often just a paragraph or two is plenty, given the word limits in this
    project.
  o However, make sure you leave yourself enough space for a solid discussion
    where you can discuss the impact of the limitations that may exist in your model.
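The suggestion above of one combined table of fit summaries can be sketched as follows. This is an illustration in Python only (your submitted analysis code must still be in R), and every number in it is hypothetical. It assumes each candidate model is summarized by its residual sum of squares (RSS), the total sum of squares (SST), the sample size n and the number of predictors p. The AIC and BIC lines use one common form for linear regression, n·log(RSS/n) plus a penalty; values from R's `AIC()` and `BIC()` differ from these by a constant that is the same for every model fitted to the same data, so the ranking of models is unchanged.

```python
import math


def fit_summary(name, rss, sst, n, p):
    """Summarise one fitted linear model with p predictors plus an intercept."""
    k = p + 1  # number of regression coefficients, including the intercept
    return {
        "model": name,
        "p": p,
        "R2": 1 - rss / sst,
        "adj_R2": 1 - (rss / (n - p - 1)) / (sst / (n - 1)),
        "AIC": n * math.log(rss / n) + 2 * k,
        "BIC": n * math.log(rss / n) + k * math.log(n),
    }


# Hypothetical values for illustration only (not from any real dataset)
n, sst = 100, 500.0
candidates = [("Full", 200.0, 5), ("Reduced", 210.0, 3)]
rows = [fit_summary(name, rss, sst, n, p) for name, rss, p in candidates]

# Print one compact table instead of separate output for each model
print(f"{'Model':<10}{'p':>3}{'R2':>8}{'adj R2':>8}{'AIC':>9}{'BIC':>9}")
for r in rows:
    print(f"{r['model']:<10}{r['p']:>3}{r['R2']:>8.3f}{r['adj_R2']:>8.3f}"
          f"{r['AIC']:>9.2f}{r['BIC']:>9.2f}")
```

A single table like this can then be cited at several points in the Results section, for example once when discussing adjusted R-squared and again when comparing BIC, without spending more of the five-table limit.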
Rubric levels for each report characteristic: Excellent (3 points); Satisfactory (2 points); Needs Improvement or Meets Completion Requirement (1 point); Missing or Does not Meet Completion Requirement (0 points).

Introduction Section

Introduction of the study
• Excellent (3 points): The goal of the study is clear AND an explicit explanation of how this study differs or agrees with existing literature is provided.
• Satisfactory (2 points): The goal of the study is not quite clear OR an explicit explanation of how this study differs or agrees with existing literature is not provided.
• Needs Improvement or Meets Completion Requirement (1 point): The goal of the study is not clear AND an explicit explanation of how this study differs or agrees with existing literature is not provided.
• Missing or Does not Meet Completion Requirement (0 points): The introduction section is not included.

Methods Section

Variable Selection
• Excellent (3 points): The statistical tools proposed to find a final model are described correctly AND when they will be used in the analysis is explained AND how conclusions are made from these tools is correctly mentioned.
• Satisfactory (2 points): The statistical tools proposed to find a final model are not described correctly AND/OR when they will be used in the analysis is not explained clearly AND/OR how conclusions are made from these tools is not correctly mentioned.
• Needs Improvement or Meets Completion Requirement (1 point): The statistical tools proposed to find a final model are not described correctly AND when they will be used in the analysis is not explained clearly AND how conclusions are made from these tools is not correctly mentioned.
• Missing or Does not Meet Completion Requirement (0 points): The variable selection section is not included.

Model Validation
• Excellent (3 points): How the model will be validated is clearly explained with sufficient details AND the method proposed is appropriate.
• Satisfactory (2 points): How the model will be validated is mentioned AND the method proposed is appropriate but needs more details.
• Needs Improvement or Meets Completion Requirement (1 point): How the model will be validated is very unclear OR has many details missing OR the method proposed is not appropriate.
• Missing or Does not Meet Completion Requirement (0 points): The model validation section is not included.

Model Violations and Diagnostics
• Excellent (3 points): How and when model violations and all diagnostics will be performed is clearly and correctly stated AND how each will be handled is explained clearly and correctly.
• Satisfactory (2 points): How and when model violations and all diagnostics will be performed is either not clearly or not correctly stated OR how each will be handled is either not explained clearly or not correct.
• Needs Improvement or Meets Completion Requirement (1 point): How and when model violations and all diagnostics will be performed is either not clearly or not correctly stated AND how each will be handled is either not clearly explained or not correct.
• Missing or Does not Meet Completion Requirement (0 points): The model violation and diagnostic section is not included.

Results Section

Description of Data
• Excellent (3 points): Numerical/Visual summaries of each variable are presented AND important features of the data are discussed correctly.
• Satisfactory (2 points): Numerical/Visual summaries of each variable are not presented OR important features of the data are not discussed or are incorrect.
• Needs Improvement or Meets Completion Requirement (1 point): Numerical/Visual summaries of each variable are not presented AND important features of the data are not discussed or are incorrect.
• Missing or Does not Meet Completion Requirement (0 points): The description of the data section is not included.

Presenting the Analysis Process and the Results
• Excellent (3 points): Sufficient detail is provided to clearly understand the process taken to arrive at final model AND the process is correct AND the evidence presented supports decisions made.
• Satisfactory (2 points): Insufficient detail is provided to clearly understand the process taken to arrive at final model OR the process is not entirely correct OR the evidence presented does not always support decisions made or evidence is lacking.
• Needs Improvement or Meets Completion Requirement (1 point): Insufficient detail is provided to clearly understand the process taken to arrive at final model AND/OR the process is not entirely correct AND/OR the evidence presented often does not support decisions made or evidence is lacking.
• Missing or Does not Meet Completion Requirement (0 points): The presentation of the analysis process and results section is not included.

Goodness of the Final Model
• Excellent (3 points): The final model has been validated correctly AND has had model assumptions verified (and appropriately corrected if applicable) AND all appropriate model diagnostics have been performed.
• Satisfactory (2 points): The final model has not been validated (or has been incorrectly validated) OR has not had model assumptions verified (or not appropriately corrected if applicable) OR all appropriate model diagnostics have not been performed.
• Needs Improvement or Meets Completion Requirement (1 point): The final model has not been validated (or has been incorrectly validated) AND/OR has not had model assumptions verified (or not appropriately corrected if applicable) AND/OR all appropriate model diagnostics have not been performed.
• Missing or Does not Meet Completion Requirement (0 points): The goodness of the final model section is not included.

Discussion Section

Final Model Interpretation and Importance
• Excellent (3 points): An interpretation (in context and correct) is provided for at least one coefficient in the final model AND a general summary of what the model tells us about the relationship between predictors and response is provided AND it is emphasized how the final model answers the research question.
• Satisfactory (2 points): No coefficient in the final model has been correctly interpreted in context OR a general summary of what the model tells us about the relationship between predictors and response is not provided OR it is not emphasized how the final model answers the research question.
• Needs Improvement or Meets Completion Requirement (1 point): No coefficient in the final model has been correctly interpreted in context AND/OR a general summary of what the model tells us about the relationship between predictors and response is not provided AND/OR it is not emphasized how the final model answers the research question.
• Missing or Does not Meet Completion Requirement (0 points): The final model interpretation and importance section is not included.

Limitations of the Analysis
• Excellent (3 points): All lingering problems with the final model are correctly mentioned AND their potential impact on usefulness of final model is correctly discussed AND a correct justification is provided for why/how they could not be corrected.
• Satisfactory (2 points): Some lingering problems with final model are correctly mentioned OR their potential impact on usefulness of final model is not correctly discussed OR a correct justification is not provided for why/how they could not be corrected.
• Needs Improvement or Meets Completion Requirement (1 point): Few of the lingering problems with final model are correctly mentioned AND/OR their potential impact on usefulness of final model is not correctly discussed AND/OR a correct justification is not provided for why/how they could not be corrected.
• Missing or Does not Meet Completion Requirement (0 points): The limitations of the analysis section is not included.

General Report Quality

Clarity and Length
• Excellent (3 points): The report meets word count AND is written with very few grammatical or spelling mistakes AND the report is well structured with appropriate sections AND meets all technical requirements for the report.
• Satisfactory (2 points): The report does not satisfy at most 1 of the following: meets the word count AND/OR is written with few grammatical or spelling mistakes AND/OR the report is well-structured AND/OR the report meets all technical requirements.
• Needs Improvement or Meets Completion Requirement (1 point): The report does not satisfy at most 2 of the following: meets the word count AND/OR is written with few grammatical or spelling mistakes AND/OR the report is well-structured AND/OR the report meets all technical requirements.
• Missing or Does not Meet Completion Requirement (0 points): The report does not satisfy 3 or more of the following: meets the word count AND/OR is written with few grammatical or spelling mistakes AND/OR the report is well-structured AND/OR the report meets all technical requirements.

Use of Plots and Tables
• Excellent (3 points): Plots/tables in the main text are clear and relevant for the analysis AND plots/tables in the appendix are referred to in the main text and are useful to the report AND all plots/tables are correctly labelled and captioned and have meaningful titles and axis labels.
• Satisfactory (2 points): Plots/tables in the main text are a bit unclear or are not very relevant for the analysis OR some plots/tables in the appendix are not referred to in the main text or are not useful to the report OR not all plots/tables are labelled and captioned correctly or lack meaningful titles and axis labels.
• Needs Improvement or Meets Completion Requirement (1 point): Plots/tables in the main text are not clear and/or are not very relevant for the analysis AND/OR some plots/tables in the appendix are not referred to in the main text and/or are not useful to the report AND/OR all plots/tables are not labelled and captioned correctly or lack meaningful titles and axis labels.
• Missing or Does not Meet Completion Requirement (0 points): There are no plots and tables used.

Meets Submission Requirements
• R code is provided AND final report is submitted in the correct format.
• R code is not provided OR final report is not submitted in the correct format.
IMRD Cheat Sheet
Abstract
Abstracts can vary in length from one paragraph to several pages, but they follow the IMRaD format and
typically spend:
• 25% of their space on importance of research (Introduction)
• 25% of their space on what you did (Methods)
• 35% of their space on what you found: this is the most important part of the abstract (Results)
• 15% of their space on the implications of the research (Discussion)
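As a worked example of the proportions above, the following minimal Python sketch converts a total abstract length into an approximate word budget per section. The function name `abstract_budget` and the 200-word total are illustrative assumptions; the percentages are the ones listed above.

```python
# Percentage of the abstract given to each IMRaD part, as listed above
SHARES = {"Introduction": 25, "Methods": 25, "Results": 35, "Discussion": 15}


def abstract_budget(total_words):
    """Return an approximate number of words for each part of the abstract."""
    return {part: round(total_words * pct / 100) for part, pct in SHARES.items()}


# Hypothetical 200-word abstract
for part, words in abstract_budget(200).items():
    print(f"{part}: about {words} words")
```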
Introduction & Importance (Make a case for your new research)
Begin by explaining to your readers what problem you researched and why the research is necessary.
Convince readers that it is important that they continue to read.
Discuss the current state of research in your field, expose a “gap” or problem in the field, and then explain why your present research is a timely and necessary solution to that gap. See Novelty Handout.
Methods (What did you do?)
Methods are usually written in past tense and passive voice with lots of headings and subheadings.
This is the least-read section of an IMRaD report.
Results (What did you find?)
Results are where the findings and outcomes of the research go. When talking about this data, we
can think of the results as having two parts: report and comment. The reporting function always appears in the results section while the comment function can go in the discussion section. Make sure all
tables and figures are labeled and numbered separately. Captions go above tables and beneath figures.
(See Example on Page 3)
Report
1. Refer to your table or figure and state the main trend
   Table 3 shows that Spam Filter A correctly filtered more junk emails than Filter B
2. Support this trend with data
   Filter A correctly filtered…
   The average difference is…
3. (If needed) Note any additional, secondary trends and support them with data
   In addition… Figure 1 also shows…
4. (If needed) Note any exceptions to your main trends or unexpected outcomes
   However…
Comment
5. (If needed) Provide an explanation
   A feasible explanation is…
   This trend can be explained by…
6. (If needed) Compare to other research
   X is consistent with X’s finding…
   In contrast, Y found…
7. (If needed) Evaluate whether the findings support or contradict a hypothesis
8. State the bottom line: what does the data mean?
   These findings overall suggest…
   These data indicate…
Discussion (What does it mean?)
Discussion sections contain the following moves:
1. They summarize the main findings of the study. This allows readers to skip to the beginning of the
discussion section and understand the main “news” in the report.
2. They connect these findings to other research
3. They discuss flaws in the current study.
4. They use these flaws as reasons to suggest additional, future research.
5. (If needed) They state the implications of their findings for future policy or practice.
Examples
Abstract
25% (Introduction), 25% (Methods), 35% (Results), 15% (Discussion)
This experiment tests the effect of choke type and gun selection on target accuracy in order to
determine the best gun specifications. Three competent shooters of approximately equivalent
marksmanship abilities tested three different choke types (full, modified, and improved) and two
different guns (a Remington 11-87 semi-automatic and a Beretta 682 Gold E). With a confidence
level of 95%, the gun selection ended up to be the only significant factor. The Beretta was
found more accurate than the Remington possibly because the Beretta’s weight is centered in
the middle of the gun while the Remington is a little barrel-heavy. However, if the confidence
level is lowered to 90%, choke type is also significant, with the improved choke more accurate
than the modified or full. Thus, for target shooting, the most accurate combination would be the
Beretta with an improved choke.
Introduction
Bioplastics are manufactured from renewable biomass sources rather than petroleum and other fossil fuels.1 Bioplastics may be a sustainable alternative to petroleum plastics because they use fewer fossil fuels in production and
reduce greenhouse gas emissions as they biodegrade1a. Most bioplastics are currently made from starch-based
plastics or starch-polyester blends.1b However, polylactic acid (PLA), a thermoplastic aliphatic polyester typically
derived from corn starch, tapioca or sugarcane, may become a more commercially viable option.3 PLA resembles
traditional plastic, making it acceptable to consumers, and is able to be processed on equipment already used for
petroleum plastics. PLA has been used for biodegradable medical implants, packing materials, diapers and 3D
printers. However, although PLA biodegrades under carefully controlled conditions, it is not yet compostable except
in industrial composting facilities and cannot be mixed with other recyclable materials. This limits the commercial
viability of PLA because the infrastructure to transport bioplastic waste to appropriate composting facilities has not
yet been developed.2 A device that composts PLA and other bioplastics within a home composting environment
would make PLA a more viable commercial option.3
Methods1
Sb-Doped SnS Thin Film.
Pure, stoichiometric, single-phase SnS thin films can be obtained by atomic layer deposition (ALD) from the reaction of bis(N,N’-diisopropylacetamidinato)tin(II) [Sn(MeC(NiPr)2)2, referred here as Sn(amd)2] and hydrogen sulfide
(H2S).3 Rather than using ALD as previously reported,3 SnS thin films were deposited using a modified chemical
vapor deposition (CVD) process, referred here as a pulsed-CVD, to speed up the deposit rate to ~15 times higher
than that of ALD…
Material Characterization.
Film morphology was characterized using field-emission scanning electron microscopy (FESEM, Zeiss, Ultra-55).
The film thickness was determined from cross-sectional SEM. The elemental composition of the films was determined by Rutherford backscattering spectroscopy (RBS, Ionex 1.7 MV Tandetron) and time-of-flight secondary ion
mass spectroscopy (ToF-SIMS)…
1 Sinsermsuksakul, Prasert, Rupak Chakraborty, Sang Bok Kim, Steven M. Heald, Tonio Buonassisi, and Roy G. Gordon.
“Antimony-Doped Tin(II) Sulfide Thin Films.” Chemistry of Materials. 2012 (24). 4556-4562. Web. ACS Publications. 21
Oct., 2013.
Results
A.
Table 3 shows that Spam Filter A correctly filtered more junk emails than Filter B. [1] Filter A correctly filtered 88% of
junk emails whereas Filter B only filtered 63% correctly. [2] However, Filter A takes longer to run than Filter B. [4] This
increased run time is due to the type of programming language used in Filter A. [5] These findings overall suggest that
Spam Filter A is a better filter than Filter B even though it takes longer to run. [8]
B.
Fig. 3 shows that the electrical conductivity of the Cu-doped ZnO is much lower than that of the undoped ZnO. [1] The
electrical conductivity of even the 100 ppm Cu-doped ZnO specimen was about 3 orders of magnitude lower than
that of the undoped ZnO. [2] As the doped Cu content increased, the electrical conductivity gradually decreased. [3] As
a result, the 1000 ppm Cu-doped ZnO had the electrical conductivity 5 orders of magnitude lower than that of the
undoped ZnO. [8]
Discussion
[Summarize results; Explain results]
The data collected from this small study suggests that verbal instructions are not needed to
complete a simple assembly task and may even interfere with the task. The participants who
received words plus pictures made more errors, took longer to complete the task, and were less
confident that they had completed the task correctly than participants who received pictures
alone. One reason for this finding may be the simplicity of the task since none of the guidelines
we examined suggest that textual information would interfere with visual instructions.
[Flaws; Future research]
Our study is hampered by the small number and homogeneity of our participants. All of our
participants were college students and this may have affected our results. Additional research
might examine whether older participants would benefit from verbal instructions accompanying
pictures. More research is also needed examining different tasks. Our study involved a highly
physical task (constructing a lego vehicle). Future research should examine how pictures and
verbal instructions might interact on a more conceptual task, such as installing and using a
software program.
[Implications]
Based on this limited analysis, we recommend that instruction writers consider excluding verbal
instructions on a simple assembly task. Our results indicate that verbal instructions may in
some cases interfere with users’ abilities to follow pictorial directions.
Lab Reports – IMRAD

The purpose of a lab report is to describe the results of an experiment or research study.
University lab reports follow the style and format of professional journal articles, which
research scientists use to share and evaluate each other’s work.

Lab report formats vary slightly among scientific disciplines, but all are based on the
IMRAD outline: introduction, materials and methods, results, and discussion. The purpose
of each section dictates what information to include, regardless of the specialty being
written for.

Helpful Tip: It is usually easiest to write the methods and results sections first, followed by
the discussion and introduction. Title and abstract (if required) should be written last.

IMRAD format:

Section  
Purpose  
Content  and  Characteristics  
Title  
• Describes  the  content  of  
• Clear,  specific,  and  accurate  
the  report  
• Loaded  with  keywords  drawn  from  
• Allows  scientists  to  
the  body  of  the  report  
locate  research  of  
interest  when  searching  
databases  
Abstract  
• Summarizes  the  report    
• One  paragraph  (200-­‐250  words)  
• Helps  researchers  decide   • 2-­‐3  sentences  for  each  section,  
whether  to  read  the  
summarizing  key  data  and  ideas  
entire  paper  
• A  complete  synopsis,  not  a  teaser  
(results  and  discussion  must  be  
included)  
Introduction  
• Gives  background  
• Reviews  relevant  literature,  
information  needed  to  
including  properly  formatted  
understand  the  current  
citations  
research,  tracing  the  
• Explains  why  the  study  was  
development  of  existing  
conducted,  and  what  question  it  was  
knowledge  
designed  to  answer  
• Places  the  new  
• Briefly  describes  approach  to  the  
experiments  within  the  
problem  
context  of  the  field  
• Outlines  hypothesis(es)  to  be  tested,  
• Identifies  gaps  in  
and  predicted  results  
existing  knowledge  and  
• Written  in  a  mixture  of  present  tense  
shows  how  the  present  
(for  generally  accepted  truths)  and  
research  will  fill  them  
past  tense  (when  referencing  specific  
• States  the  specific  
research  
objectives  of  the  work  
 
 
 
 
©  The  Writing  Centre,  Saint  Mary’s  University,  2014  
This  handout  is  for  personal  use  only.  Reproduction  prohibited  without  permission.  
 
Lab Reports – IMRAD  
 
IMRAD format continued:

Section: Materials and Methods
Purpose:
• Explains how the experiments were conducted
• Provides enough detail that another scientist could repeat the experiment
• Gives readers the information they need to evaluate the validity of results and conclusions
Content and Characteristics:
• written in paragraph format
• materials are mentioned while describing methods, never listed separately
• describes the purpose of each procedure, as well as necessary steps
• omits details that are common knowledge or would not impact the results
• written in past tense (recounts what was done, rather than giving instructions)

Section: Results
Purpose:
• Describes the outcomes of the experiments
• Draws attention to key findings and relationships
• Allows readers to form their own conclusions based on the data
Content and Characteristics:
• straightforward reporting of observations and calculations
• does not include commentary or interpretation
• detailed data is presented in tables and figures, which are referenced in the text
• written portion should summarize and emphasize, not repeat details shown in the visuals
• written in past tense

Section: Discussion
Purpose:
• Interprets the results and explains their significance
• Places the new data in the context of the field
• Identifies limitations of the study and suggests next steps
Content and Characteristics:
• references key data, describing its implications
• identifies any errors made during the experiment and their impacts
• discusses any shortcomings of the protocols or experimental designs
• draws conclusions
• identifies questions that could not be answered
• cites relevant literature
• written in past, present, and future tense, as appropriate

Section: References
Purpose:
• Provides full bibliographic information, directing the reader to relevant literature
Content and Characteristics:
• includes only literature that’s cited in the text
• follows a consistent scientific citation style, such as APA

©  The  Writing  Centre,  Saint  Mary’s  University,  2014  
[Flowchart: model-building process. Steps are listed in extraction order; the Yes/No branch labels are omitted.]
Legend:
• Starting or end points
• Action / Process to apply
• Decision to make
• Links to two halves of the chart
• The arrows connect steps
• Week labels: Week 2, Week 3, Week 4, Week 5, Week 6, Week 7, Week 8
Steps:
• Import data that contains all possible predictors into R
• Decision: Check if the variables are of an incorrect type or there is any missing data
• Recode and fix the variables or remove any missing data
• Start with the full linear model that consists of all possible predictors
• Randomly split data into two sets: training set (70%) and test set (30%)
• Use the training set to draw scatter plots of the data
• Fit linear model based on observations of scatter plots
• Interpret violation
• Check additional 2 conditions and build residual plots
• Perform hypothesis tests for the coefficients of predictors in the model, and build a new model with all predictors that have significant F-values in tests
• Decision: Is there any violation of condition 1, 2 or the linear model assumptions?
• Check constant variance
• Check Normality
• Fit reduced model and perform partial F test between reduced model and full model
• Apply (Box-Cox) transformation to both response and predictors, and build plots for additional conditions
• Decision: Are all violations fixed after transformations?
• Decision: Does the testing result prefer the reduced model?
• Add back a removed predictor that would increase the R-squared most / decrease BIC/AIC most
• Check Linearity
• Identify the limitations caused by violated assumptions
• Identify all problematic observations, including leverage points, outliers, and influential points
• Decision: Is there any valid reason to remove some problematic observations?
• Remove the problematic observations and refit the full model
• Again, check additional 2 conditions and build residual plot for final model
• Decision: Are there any violations in condition 1, 2 or the linear model assumptions?
• Decision: Do the violations also appear in the full model as we already identified?
• Interpret the parameters in the final linear regression model
• Build a confidence interval for the average response and a prediction interval for an actual response, then use the data in the test set to fit the model
• Decision: Compare the results of the training model and the test model, and see if they are similar.
• Make a conclusion about the research question based on previous findings, and state the limitations.
• The model is overfitting, discuss limitations
• Identify the limitations caused by the newly appeared violations
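The flowchart's random 70%/30% split into training and test sets can be sketched as follows. This is a minimal illustration in Python (the course itself works in R); the function name `train_test_split`, the seed, and the stand-in data are assumptions made for the example, not part of the original chart.

```python
import random

def train_test_split(rows, train_fraction=0.7, seed=302):
    """Randomly split rows into a training set and a test set."""
    shuffled = list(rows)                    # copy so the caller's data is untouched
    random.Random(seed).shuffle(shuffled)    # seeded shuffle for reproducibility
    n_train = round(train_fraction * len(shuffled))
    return shuffled[:n_train], shuffled[n_train:]

rows = list(range(100))                      # stand-in for 100 observations
train, test = train_test_split(rows)
print(len(train), len(test))
```

Fixing the seed means the same split is obtained each time the analysis is rerun, which keeps the training and test results comparable.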
STA302/1001: Methods of Data Analysis 1
Instructor: Katherine Daignault
Department of Statistical Sciences
University of Toronto
Week 3 (Sept. 26-30)
1 / 40
Outline
The Linear Regression Model
Modelling Conditional Means
Least Squares Estimation
Interpreting the Parameters
Introducing the Assumptions
2 / 40
Week 3 Learning Goals
In this module, we will be introduced to the linear regression
model. We will learn about how we use data to estimate our linear
relationship, and how to interpret the values we get, as well as how
different predictors yield different interpretations. We will also
introduce the assumptions needed as well as see how to create
linear models in R. To that end, the learning goals are
I to explain why regression models conditional relationships
I to apply the least squares procedure to different settings
I to estimate the parameters of a regression relationship
I to interpret the components of a regression model in the
context of a dataset
I to recognize that regression has assumptions and to
preliminarily inspect them through EDA
3 / 40
Outline
The Linear Regression Model
Modelling Conditional Means
Least Squares Estimation
Interpreting the Parameters
Introducing the Assumptions
4 / 40
The Functional Component of the Relationship
I A linear regression model, as we saw, is a statistical relationship
that defines a functional relationship between the predictor(s)
and the response, along with some random deviations.
I But in our Code-Along demo, we also discovered that while it
may not be possible to define a functional relationship for all
data points, it may be possible to do so for E (Y | X = x).
I The functional part of our statistical relationship does exactly
this!
I We actually have that E (Y | X = xi ) = β0 + β1 xi
I This says that as the value of xi increases by one unit, the
average response will change by β1 .
I So why is this the case? Let’s think about this in terms of
distributions and random variables.
5 / 40
Conditional Distributions of Responses
I In regression, we consider our predictor(s) to be fixed values,
i.e. not random variables.
I But the response value we might observe for a certain value of
the predictor is random, and thus Yi | X = xi ∼ f (y | xi ) with
a mean E (Y | X = xi ) and some variance Var (Y | X = xi ).
I The distribution tells us that the possible y values lie some
distance from these means.
I So all responses that correspond to the predictor value xi
will sit a random distance from the mean of the
distribution, which we can label εi .
I Therefore, we can write Yi = E (Y | X = xi ) + εi , and if the
means change systematically as X changes, then
Yi = β0 + β1 xi + εi = E (Y | X = xi ) + εi
6 / 40
The Population Relationship
I When dealing with one predictor, the relationship can be
viewed nicely.
I Even with many predictors, the result holds:
Y = E (Y | X) + ε = Xβ + ε
I This is the relationship that occurs in the population, and we
cannot know what E (Y | X) or β actually are.
I So we will be required to use a sample from this population to
estimate these quantities.
7 / 40
The Sample Relationship
I In our sample, we’re going to want to impose the same
statistical relationship that we think is present in the
population.
I Using our sample data, we can write out a similar relationship
between the response and predictors, Y = Xb + ê, where
I Y and X are our observed response and predictor data
I b is some vector of coefficients representing possible slopes
and intercept to be estimated with the data
I ê is the observed error in the data, called residuals.
I Note that b is just an arbitrary set of coefficients and does
not yet correspond to estimates of β .
I The reason is that we don’t yet know how to estimate β .
I But once we get our β̂ , then the linear relationship estimated
from the data will be Ŷ = Ê (Y | X) = Xβ̂
I i.e. we can estimate the conditional means in our population.
8 / 40
Outline
The Linear Regression Model
Modelling Conditional Means
Least Squares Estimation
Interpreting the Parameters
Introducing the Assumptions
9 / 40
Poll Question 1
Go to PollEv.com/katherinedai702 or open your app (if using) and
sign in.
How much familiarity do you have with estimation
procedures?
I I know what estimation is
I I know how maximum likelihood estimation works
I I know how maximum likelihood and least squares estimation
work
10 / 40
Residuals: a measure of distance
I Residuals, the observed errors ê, will play an important role in
finding estimates for β .
I In the population, the errors ε represent the distance
ε = Y − E (Y | X) = Y − Xβ .
I So the residuals would be estimates of that same distance
based on the data.
I The issue is we don’t know where the regression line of best
fit should be, so how do we use the residuals to estimate this
exact relationship?
11 / 40
Line of Best Fit and Residuals
I To estimate these unknown β parameters that define the
population-level relationship between X and E (Y | X), we will
need to find a line of best fit in our data.
I ‘Best fit’ in this case will mean a line that sits as close as
possible to all observed responses.
I So that means we will need to find values for the elements of
b that minimize the distance of all observations to this line.
I
In simple regression, we want to find values b0 and b1 that will
ensure the estimated line ŷi = b0 + b1 xi lies as close as
possible to all observed yi
I
We call ŷi the predicted/fitted value of yi , i.e. Ŷ is an
estimate of E (Y | X)
I The residuals naturally give us a measure of closeness to the
line, since êi = yi − ŷi (equivalently in vector form:
ê = Y − Ŷ)
12 / 40
Minimizing the Residual Sum of Squares
I So we want to find the values b0 and b1 that fit the line as
close as possible to all points.
I This can be seen as ultimately wanting to make all residuals
as small as possible.
I But it’s not practical to minimize each individual êi – rather it
makes more sense to find a single equation to minimize that
incorporates the idea of the total distance of all points from
the line.
I To do this, we define the residual sum of squares (RSS) to be
this function we will minimize:
I Residuals can be either positive or negative, so we square
them so they don’t cancel each other out.
I Then we sum them all up to give us the total squared
amount of variation between the points and the line:
\[ RSS = \sum_{i=1}^{n} \hat{e}_i^2 = \hat{\mathbf{e}}'\hat{\mathbf{e}} \]
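To make the RSS concrete, the sketch below evaluates it for two candidate lines on a small made-up dataset. It is written in Python for illustration (the course uses R); the data values and the helper name `rss` are assumptions, not taken from the lecture.

```python
def rss(b0, b1, x, y):
    """Residual sum of squares for the candidate line y-hat = b0 + b1 * x."""
    return sum((yi - (b0 + b1 * xi)) ** 2 for xi, yi in zip(x, y))

x = [1.0, 2.0, 3.0, 4.0]      # made-up predictor values
y = [2.0, 4.1, 5.9, 8.0]      # made-up responses

rss_a = rss(0.0, 2.0, x, y)   # candidate line y-hat = 2x
rss_b = rss(1.0, 1.5, x, y)   # candidate line y-hat = 1 + 1.5x
print(rss_a < rss_b)          # the line with smaller RSS sits closer to the points
```

Least squares simply searches over all (b0, b1) for the pair that makes this quantity as small as possible.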
13 / 40
Poll Question 2
Go to PollEv.com/katherinedai702 or open your app (if using) and
sign in.
Suppose we have a point in a three-dimensional space and
we want to project this point to a two-dimensional plane.
What will be the angle of the vector connecting the point to
its projection on the plane?
I 90 degrees
I less than 90 degrees
I more than 90 degrees
14 / 40
Geometry of Least Squares
I But why do we square the residuals instead of e.g. taking the
absolute value?
I It has to do with the geometry of the vectors and spaces we
are working with.
[Figure: geometry of least squares. Y is our response vector; ê is our error vector; Ŷ is a vector representing the regression line; M is called the model space and has dimension equal to the number of linearly independent columns in X.]
I The way to minimize the error vector (i.e. make it as small as
possible) is to make it perpendicular to the model space.
I Once perpendicular, we have a right angle triangle and we can
find the lengths of the vectors using Pythagoras (or Euclidean
distances)
I This requires working with squares of the vectors.
15 / 40
How does the Least Squares Process Work?
I When dealing with one predictor or multiple predictors, the
process behind finding the values of b that minimize the
residual sum of squares is the same.
1. Take partial derivatives of the RSS (your estimating equation)
with respect to each term in β .
2. Set your result (the score equation) to 0.
3. Solve for the unknown parameters by re-arranging your
expressions.
I Once we have the actual estimates, we use the more familiar
way to denote an estimate of β , which is β̂ , instead of our
placeholder values b.
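As a worked illustration of the three steps, here is the standard derivation for simple linear regression, followed by the matrix version. It is a sketch using the notation of these slides (a prime denotes a transpose), not a reproduction of the lecture's own derivation.

```latex
% Step 1: partial derivatives of RSS(b_0, b_1) = \sum_{i=1}^{n} (y_i - b_0 - b_1 x_i)^2
\frac{\partial RSS}{\partial b_0} = -2 \sum_{i=1}^{n} (y_i - b_0 - b_1 x_i), \qquad
\frac{\partial RSS}{\partial b_1} = -2 \sum_{i=1}^{n} x_i (y_i - b_0 - b_1 x_i)

% Step 2: set both to zero (the normal equations)
\sum_{i=1}^{n} (y_i - b_0 - b_1 x_i) = 0, \qquad
\sum_{i=1}^{n} x_i (y_i - b_0 - b_1 x_i) = 0

% Step 3: solve for the unknowns
\hat{\beta}_0 = \bar{y} - \hat{\beta}_1 \bar{x}, \qquad
\hat{\beta}_1 = \frac{\sum_{i=1}^{n} x_i y_i - n \bar{x} \bar{y}}{\sum_{i=1}^{n} x_i^2 - n \bar{x}^2}

% Matrix version: RSS(b) = (Y - Xb)'(Y - Xb)
\frac{\partial RSS}{\partial \mathbf{b}} = -2 X'(Y - X\mathbf{b}) = 0
\;\Longrightarrow\; X'X \hat{\boldsymbol{\beta}} = X'Y
\;\Longrightarrow\; \hat{\boldsymbol{\beta}} = (X'X)^{-1} X'Y
```

The last step assumes X′X is invertible, which holds when the columns of X are linearly independent.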
16 / 40
Least Squares Estimators (Simple and Matrix-based)
17 / 40
Notes on the LS Estimators
I The algebraic (simple) version of the estimators can only be
used when estimating the relationship between the response
and one predictor.
I Since the estimator for the intercept β̂0 contains the estimate
of the slope β̂1 , you’ll need to compute the slope first.
I The LS estimator for the slope also has an alternative form:
\[ \hat{\beta}_1 = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{\sum_{i=1}^{n} (x_i - \bar{x})^2} \]
I the denominator of this form is the sum of squared deviations
between xi and its sample mean x̄, or SXX
I it’s related to the sample variance of the predictor.
I The numerator is similar to the idea of covariance – looking at
the product of deviations of each variable from its mean
(sometimes labelled SXY).
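The SXY/SXX form of the slope, together with the intercept that depends on it, can be sketched in a few lines. This is an illustrative Python version (in R the same estimates come from `lm()`); the function name `least_squares_simple` and the data are made up for the example.

```python
def least_squares_simple(x, y):
    """Least squares intercept and slope for one predictor, via SXY / SXX."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    b1 = sxy / sxx              # slope must be computed first
    b0 = y_bar - b1 * x_bar     # intercept uses the slope estimate
    return b0, b1

x = [1.0, 2.0, 3.0, 4.0]
y = [3.0, 5.0, 7.0, 9.0]        # made-up data lying exactly on y = 1 + 2x
b0, b1 = least_squares_simple(x, y)
print(b0, b1)
```

Because the made-up points lie exactly on a line, the estimates recover that line's intercept and slope.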
18 / 40
Notes on the LS Estimators
I For multiple predictors, you’ll likely not have to work with the
individual data matrices as they are large and cumbersome.
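I However, some of the component matrices in the LS estimator
can be calculated easily (all sums run over i = 1, . . . , n):
\[
X'X = \begin{pmatrix}
n & \sum x_{i1} & \sum x_{i2} & \cdots & \sum x_{ip} \\
\sum x_{i1} & \sum x_{i1}^2 & \sum x_{i1} x_{i2} & \cdots & \sum x_{i1} x_{ip} \\
\sum x_{i2} & \sum x_{i1} x_{i2} & \sum x_{i2}^2 & \cdots & \sum x_{i2} x_{ip} \\
\vdots & \vdots & \vdots & \ddots & \vdots \\
\sum x_{ip} & \sum x_{i1} x_{ip} & \sum x_{i2} x_{ip} & \cdots & \sum x_{ip}^2
\end{pmatrix},
\qquad
X'Y = \begin{pmatrix}
\sum y_i \\ \sum x_{i1} y_i \\ \sum x_{i2} y_i \\ \vdots \\ \sum x_{ip} y_i
\end{pmatrix}
\]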
I To invert the X′X matrix, you’ll likely need the aid of software
or would be given the inverse directly.
I Expanding the regression relationship
Ŷ = Xβ̂ = X(X′X)⁻¹X′Y, we can see that H = X(X′X)⁻¹X′
(the hat matrix) projects Y onto Ŷ and thus has all the
properties of a projection matrix (exercise: check this)
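One way to carry out the suggested check is sketched below, assuming X′X is invertible. It verifies the two defining properties of an orthogonal projection (symmetry and idempotence) and the resulting orthogonality of the residuals to the columns of X.

```latex
% Symmetry: (X'X)^{-1} is symmetric because X'X is
H' = \left( X (X'X)^{-1} X' \right)' = X \left( (X'X)^{-1} \right)' X' = X (X'X)^{-1} X' = H

% Idempotence: projecting twice changes nothing
HH = X (X'X)^{-1} \underbrace{X'X (X'X)^{-1}}_{=\, I} X' = X (X'X)^{-1} X' = H

% Residuals are orthogonal to the model space
\hat{\mathbf{e}} = Y - \hat{Y} = (I - H) Y, \qquad
X' \hat{\mathbf{e}} = X'Y - X'X (X'X)^{-1} X'Y = 0
```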
19 / 40
Exercise – Give it a try!
Suppose you have the following numerical summaries for 2
predictors and a response variable on 21 individuals. Estimate the
coefficients for this regression surface.
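\[
\sum_{i=1}^{21} x_{i1} = 1302.4, \qquad
\sum_{i=1}^{21} x_{i2} = 360, \qquad
\sum_{i=1}^{21} y_i = 3820
\]
\[
\sum_{i=1}^{21} x_{i1}^2 = 87707.94, \qquad
\sum_{i=1}^{21} x_{i2}^2 = 6190.26, \qquad
\sum_{i=1}^{21} x_{i1} x_{i2} = 22609.19
\]
\[
\sum_{i=1}^{21} x_{i1} y_i = 249643.35, \qquad
\sum_{i=1}^{21} x_{i2} y_i = 66072.75
\]
\[
\text{Use } (X'X)^{-1} = \begin{pmatrix}
29.729 & 0.072 & -1.993 \\
0.072 & 0.0004 & -0.0056 \\
-1.993 & -0.0056 & 0.136
\end{pmatrix}
\]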
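A sketch of how the exercise could be set up numerically: X′Y is assembled from the given sums in the order (intercept, predictor 1, predictor 2) and multiplied by the given inverse. The code is Python for illustration (the course uses R), the variable names are assumptions, and because the inverse is heavily rounded the resulting coefficients should be read as approximate.

```python
# (X'X)^{-1} as given in the exercise (rounded values)
xtx_inv = [
    [29.729, 0.072, -1.993],
    [0.072, 0.0004, -0.0056],
    [-1.993, -0.0056, 0.136],
]

# X'Y assembled from the summaries: sum of y, sum of x1*y, sum of x2*y
xty = [3820.0, 249643.35, 66072.75]

# beta-hat = (X'X)^{-1} X'Y, one dot product per coefficient
beta_hat = [sum(a * b for a, b in zip(row, xty)) for row in xtx_inv]
print([round(b, 3) for b in beta_hat])
```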
20 / 40
Code-Along Session
I We will now jump into JupyterHub (jupyter.utoronto.ca)
and look into how we can estimate a linear regression
relationship on our NYC dataset. We will be doing the
following:
I
creating bivariate plots to visualize pairwise relationships
I
use the lm() function to estimate a simple and multiple linear
regression relationship
I
view and extract model estimates.
I Add the materials to JupyterHub either by downloading from
Quercus followed by uploading them to Jupyter, or clicking
the GitHub link provided on Quercus.
21 / 40
Outline
The Linear Regression Model
Modelling Conditional Means
Least Squares Estimation
Interpreting the Parameters
Introducing the Assumptions
22 / 40
What do the parameters mean?
I Now that we can estimate the statistical relationship in the
population by using our sample, what does it mean?
I Let’s consider the estimated line Ŷ = Xβ̂ and the
corresponding population mean relationship E (Y | X) = Xβ .
I When we estimate a linear relationship using our sample, we
are getting an estimate for the corresponding relationship in
the population.
I therefore β̂ is an estimate for the vector of parameters β
I Ŷ is therefore an estimate for the vector of conditional means
E (Y | X)
I if we then took an individual with predictor values
x′i = (1, xi1 , xi2 , . . . , xip ), then ŷi = x′i β̂ gives us the
predicted value of the response for that individual
I which is the same as an estimate for the mean response
conditional on those predictor values.
23 / 40
Simple Linear Regression Parameter Interpretation
I The estimated simple linear regression model is ŷi = β̂0 + β̂1 xi
I We just discussed that ŷi is the estimated mean given a value
of x.
I Keeping this in mind, we can also interpret the slope and
intercept as:
I β̂0 is the mean/average response given the predictor is 0.
I it’s important to also consider whether the intercept has a
meaningful interpretation at all.
I β̂1 is the change in the mean/average response for a one unit
change in the value of the predictor.
I it is NOT how much each response will change for a unit
increase in X, because it is not true that all responses will
change by an equal amount
I instead it is the expected change for a unit increase in X.
24 / 40
Parameter Interpretation for a Multiple Linear Regression
I To interpret the parameters when working with many
predictors, things get a little trickier.
I Even though we worked with a vector of parameters, we still
interpret each element of that vector individually.
I The intercept is similar to before
I
β̂0 is the average/mean response when ALL predictors have a
value of 0 (assuming it’s meaningful to have a 0 value).
I However, interpreting each slope β̂j , j = 1, . . . , p individually
means we have to ensure that the only change occurring in
the predictor values is the one-unit increase in the predictor
whose parameter we are interpreting.
I
i.e. in order for us to interpret one β̂j correctly, all other
predictor values must be fixed.
I
Then, β̂j is the average/mean change in the response when Xj
increases by one unit, when all other predictors are held fixed.
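The "held fixed" wording follows directly from the model for the conditional mean. A short worked step, using the population form of the relationship (the same algebra applies to the fitted values with each β replaced by its estimate):

```latex
E(Y \mid x_1, \ldots, x_j + 1, \ldots, x_p) - E(Y \mid x_1, \ldots, x_j, \ldots, x_p)
  = \beta_j (x_j + 1) - \beta_j x_j
  = \beta_j
```

Every other term cancels only because the remaining predictors take the same values in both conditional means.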
25 / 40
Conditional Nature of Multiple Regression
I Another feature of working with multiple predictors in a
regression model is that we need to carefully understand the
conditional nature of regression.
I As an example, suppose we collect data on a response and
two predictors.
I We can fit three different models with these variables:
I
A simple model with only X1 , estimated to be
ŷi = 1.86 + 1.30xi1
I
A simple model with only X2 , estimated to be
ŷi = 0.86 + 0.78xi2
I
A two-predictor model, estimated to be
ŷi = 5.37 + 3.01xi1 − 1.29xi2
I If we compare the coefficient for X2 in the simple model with the
one in the two-predictor model, why did it suddenly change direction/sign?
26 / 40
Poll Question
Go to PollEv.com/katherinedai702 or open your app (if using) and
sign in.
Why did the sign change?
I The two-predictor model was estimated incorrectly.
I The one-predictor models were estimated incorrectly.
I The two-predictor model conditions on the values of X1
27 / 40
Conditional Nature of Multiple Regression
I If we ignore X1 , we see the increasing trend from the
simple model.
I But what if we highlight all points with the same value
of X1 ?
I By conditioning on the value of X1 , we can now see
the decreasing trend appear.
I When a model contains more than one predictor, we
must always remember that it conditions on values of
the other predictors to estimate each βj
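The sign change can be reproduced on a tiny made-up dataset (not the lecture's data). The sketch below is Python for illustration; the helpers `solve` and `least_squares` are hypothetical names that solve the normal equations X′Xb = X′Y directly, which is adequate for a toy example but not how `lm()` in R computes its estimates.

```python
def solve(a, b):
    """Solve the linear system a z = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]    # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    z = [0.0] * n
    for i in reversed(range(n)):                        # back substitution
        z[i] = (m[i][n] - sum(m[i][c] * z[c] for c in range(i + 1, n))) / m[i][i]
    return z

def least_squares(columns, y):
    """Least squares fit of y on an intercept plus the given predictor columns."""
    design = [[1.0] + list(row) for row in zip(*columns)]
    p = len(design[0])
    xtx = [[sum(r[i] * r[j] for r in design) for j in range(p)] for i in range(p)]
    xty = [sum(r[i] * yi for r, yi in zip(design, y)) for i in range(p)]
    return solve(xtx, xty)

x1 = [0, 0, 1, 1, 2, 2]
x2 = [0, 1, 2, 3, 4, 5]
y = [5 + 3 * a - b for a, b in zip(x1, x2)]   # made-up data: y = 5 + 3*x1 - x2 exactly

simple = least_squares([x2], y)               # ignores x1
both = least_squares([x1, x2], y)             # conditions on x1
print(simple[1] > 0, both[2] < 0)
```

Ignoring x1, larger x2 values go with larger responses because x2 tends to grow with x1; once x1 is in the model, the coefficient on x2 reflects the decreasing trend within each x1 group.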
28 / 40
Code-Along Session
In our second Code-Along, we will look at how to use different
types of predictors and how that changes interpretation of
coefficients. We will look at:
I creating informative bivariate plots
I fitting models to subsets of a dataset
I incorporating indicator variables and the change in
interpretation
I incorporating interaction terms and the change in
interpretation
29 / 40
Summary of interpretations
I We saw that indicator variables, depending on how they are
included in a model, change the interpretation of the coefficients.
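I Suppose X1 = height and X2 = 1{Male}.
I Including the indicator as its own term changes the intercept:
\[
\hat{y}_i = \hat{\beta}_0 + \hat{\beta}_1 x_{i1} + \hat{\beta}_2 x_{i2} =
\begin{cases}
\hat{\beta}_0 + \hat{\beta}_1 x_{i1}, & x_{i2} = 0 \\
(\hat{\beta}_0 + \hat{\beta}_2) + \hat{\beta}_1 x_{i1}, & x_{i2} = 1
\end{cases}
\]
I Including it only through an interaction with X1 changes the slope:
\[
\hat{y}_i = \hat{\beta}_0 + \hat{\beta}_1 x_{i1} + \hat{\beta}_2 x_{i2} x_{i1} =
\begin{cases}
\hat{\beta}_0 + \hat{\beta}_1 x_{i1}, & x_{i2} = 0 \\
\hat{\beta}_0 + (\hat{\beta}_1 + \hat{\beta}_2) x_{i1}, & x_{i2} = 1
\end{cases}
\]
I When a categorical predictor takes more than 2 levels, we create dummy
variables for all but 1 of the levels and interpret similarly
I If X2 takes values A, B, and C, then including X2 in our model
would effectively yield
ŷi = β̂0 + β̂1 xi1 + β̂2 1{Xi2 = A} + β̂3 1{Xi2 = B}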
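The dummy coding for a three-level predictor can be sketched as below, taking C as the reference level so that the indicators match the 1{A} and 1{B} terms in the equation. This is illustrative Python (R builds these columns automatically when a factor is passed to `lm()`); the function name `dummy_code` is an assumption.

```python
def dummy_code(value, levels, reference):
    """Return one 0/1 indicator per non-reference level, in the order of `levels`."""
    return [1 if value == level else 0 for level in levels if level != reference]

levels = ["A", "B", "C"]
for value in levels:
    # Observations at the reference level get all zeros and are absorbed by the intercept
    print(value, dummy_code(value, levels, reference="C"))
```

Each coefficient on an indicator is then interpreted as the difference in mean response between that level and the reference level, holding the other predictors fixed.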
30 / 40
Activity – ∼ 5-10 minutes
Team up in groups of 2-3 people and come up with the correct
interpretation of β1 in the linear relationship below:
\[ \widehat{\text{Price}} = -24.5 + 1.65\,\text{Food} + 1.88\,\text{Decor} \]
BUT, you can only use simple words as accepted by this XKCD
Simple Word Checker (https://xkcd.com/simplewriter/).
Once you have your best answer, go to
PollEv.com/katherinedai702 or open your app (if using) and sign in
and add your answer. If you weren’t able to come up with an
answer, you can also upvote existing answers.
31 / 40
Outline
The Linear Regression Model
Modelling Conditional Means
Least Squares Estimation
Interpreting the Parameters
Introducing the Assumptions
32 / 40
Role of Assumptions in Regression
I As with many statistical procedures and methods,
assumptions are required in order for our regression line to
have important uses.
I These assumptions are necessary in order for us to be able to
make inference about the unknown model parameters.
I This includes:
I
for creating confidence intervals about the unknown model
parameters, the elements of β
I
for building statistical tests for testing possible values of the
unknown model parameters, the elements of β
I In the case of linear regression, the assumptions we make are
regarding the random error terms, ε.
33 / 40
Assumption 1: Linearity/Mean zero errors
A1. Linearity of the Relationship
Y is related to X by the linear regression model
Y = Xβ + ε
or E (Y | X) = Xβ
or E (ε | X) = 0
I It’s important to realize that when we fit a linear model, we
are implicitly assuming that a linear relationship exists in the
population.
I But there’s more to this assumption than simply assuming
that it is appropriate to use a linear model
I This assumption also relates to the correctness of your model.
I
It also says that we are assuming only the predictors we are
including in X are actually related to the response
I
all remaining variation in the response should not be able to be
explained by any other predictors, but only due to random
variation.
34 / 40
Assumption 2: Uncorrelated Errors
A2. Covariance of the Errors
The errors are uncorrelated, namely Cov (εi , εj ) = 0 for i ≠ j, or equivalently
Cov (yi , yj ) = 0
I This just says that we require that none of the deviations from
the conditional mean be related to one another.
I analogous to wanting random variables to be independent of
one another, or observations to be sampled independently
I We don’t want the errors to be related to each other; rather,
they should appear to be independent and identically
distributed variables.
I if they are dependent/correlated, then we are working with less
information than we thought we had
I Having correlated error terms means that the predictive ability
of the model will be worse in some areas than in others.
35 / 40
Assumption 3: Common Error Variance
A3. Common Error Variance
The errors εi , i = 1, . . . , n have a common variance σ² .
I This assumption says that we assume that the population of
responses at any value of the predictors has the same spread.
I Constant error variance is sometimes also called
homoskedasticity.
I If it is violated, then our line will become less accurate as the
residuals become more variable.
I
so our line will accurately estimate conditional means in some
areas, but not in others.
I We want our regression to be equally good at predictions for
all values of X .
36 / 40
Assumption 4: Normality of Errors
A4. Normality of Errors
The errors are Normally distributed, such that ε | X ∼ Nn (0, σ²I),
or equivalently Y | X ∼ Nn (Xβ , σ²I).
I This assumption is particularly important for inference (CIs
and tests).
I If the errors are Normal, then it means that we can use all of
the handy properties of Normal distributions, such as linear
combinations of Normal random variables.
I In particular, this will allow us to determine the distribution of
the model parameters so that we may make inference about
them.
37 / 40
Notes on the Assumptions
I None of these assumptions were explicitly used to find the
least squares estimators for the model parameters β .
I We didn’t use maximum likelihood, we aren’t using
variance/covariance at all, but the equation we minimize
requires the linear equation we are estimating to be correct.
I It is very possible (and quite easy) to fit a linear regression
model that will not satisfy these assumptions.
I
e.g. nothing will stop you from fitting a straight line to a
curved relationship… it just won’t be particularly useful.
I However, when assumptions 1-3 are satisfied, the least squares
estimator of β will be unbiased and have minimum variance
among all other linear unbiased estimators (i.e. it’s the best
one).
I We will show later how to determine unbiasedness and to find
the variance of the estimator β̂ .
38 / 40
Code-Along Session
Our last short Code-Along will look into techniques to very
informally check whether we might anticipate any problems with
model assumptions. These are not formal checks, but can warn
you about potential issues down the road. We will use
I Scatterplots to inspect linearity and constant variance
I Histograms to inspect linearity and normality
I Critical thinking to inspect uncorrelated errors.
39 / 40
Wrapping up
I Linear regression models attempt to describe a statistical
relationship that is occurring in a population.
I
we can use all sorts of different predictors, but it changes how
we interpret the coefficients.
I The notion of conditioning and conditional
distributions/relationships is an important one.
I
We interpret each parameter by holding other predictors fixed.
I
We will get predicted values by conditioning on values of the
predictors.
I
Estimated coefficients will change when adding more predictors
to the model because all the predictors are conditionally
related to the response.
I We also found that we will need assumptions in order to
ensure our estimators have good properties and yield the
results we expect.
40 / 40
STA302/1001: Methods of Data Analysis 1
Instructor: Katherine Daignault
Department of Statistical Sciences
University of Toronto
Week 4 (Oct. 3 – 7)
1 / 46
Outline
Assumptions and Properties of Estimators
Assumptions for Linear Regression
Properties of Residuals
Sampling Distributions of the LS Estimators
Intervals and Hypothesis Tests
For the estimated coefficients and mean response
For an actual individual response
2 / 46
Week 4 Learning Goals
This week we will learn about the assumptions that are required in
linear regression and how these yield really nice inferential
properties in our estimators of the coefficients. We will use these
to derive sampling distributions, confidence/prediction intervals,
and hypothesis tests. To that end, the learning goals are:
I use assumptions to derive properties of estimators
I compute appropriate confidence/prediction intervals and
hypothesis tests.
I conclude and interpret the results of a confidence/prediction
interval and test.
I differentiate between using a regression model to estimate a
parameter versus a future observation
3 / 46
Outline
Assumptions and Properties of Estimators
Assumptions for Linear Regression
Properties of Residuals
Sampling Distributions of the LS Estimators
Intervals and Hypothesis Tests
For the estimated coefficients and mean response
For an actual individual response
4 / 46
Role of Assumptions in Regression
I As with many statistical procedures and methods,
assumptions are required in order for our regression line to
have important uses.
I These assumptions are necessary in order for us to be able to
make inference about the unknown model parameters.
I This includes:
I
for creating confidence intervals about the unknown model
parameters, the elements of β
I
for building statistical tests for testing possible values of the
unknown model parameters, the elements of β
I In the case of linear regression, the assumptions we make are
regarding the random error terms, ε.
5 / 46
Assumption 1: Linearity/Mean zero errors
A1. Linearity of the Relationship
Y is related to X by the linear regression model
Y = Xβ + ε
or E (Y | X) = Xβ
or E (ε | X) = 0
I It’s important to realize that when we fit a linear model, we
are implicitly assuming that a linear relationship exists in the
population.
I But there’s more to this assumption than simply assuming
that it is appropriate to use a linear model
I This assumption also relates to the correctness of your model.
I
It also says that we are assuming only the predictors we are
including in X are actually related to the response
I
all remaining variation in the response should not be able to be
explained by any other predictors, but only due to random
variation.
6 / 46
Assumption 2: Uncorrelated Errors
A2. Covariance of the Errors
The errors are uncorrelated, namely Cov (εi , εj ) = 0 for i ≠ j, or equivalently
Cov (yi , yj ) = 0
I This just says that we require that none of the deviations from
the conditional mean be related to one another.
I analogous to wanting random variables to be independent of
one another, or observations to be sampled independently
I We don’t want the errors to be related to each other; rather,
they should appear to be independent and identically
distributed variables.
I if they are dependent/correlated, then we are working with less
information than we thought we had
I Having correlated error terms means that the predictive ability
of the model will be worse in some areas than in others.
7 / 46
Assumption 3: Common Error Variance
A3. Common Error Variance
The errors εi , i = 1, . . . , n have a common variance σ² .
I This assumption says that we assume that the population of
responses at any value of the predictors has the same spread.
I Constant error variance is sometimes also called
homoskedasticity.
I If it is violated, then our line will become less accurate as the
residuals become more variable.
I
so our line will accurately estimate conditional means in some
areas, but not in others.
I We want our regression to be equally good at predictions for
all values of X .
8 / 46
Assumption 4: Normality of Errors
A4. Normality of Errors
The errors are Normally distributed, such that ε | X ∼ Nn (0, σ²I),
or equivalently Y | X ∼ Nn (Xβ , σ²I).
I This assumption is particularly important for inference (CIs
and tests).
I If the errors are Normal, then it means that we can use all of
the handy properties of Normal distributions, such as linear
combinations of Normal random variables.
I In particular, this will allow us to determine the distribution of
the model parameters so that we may make inference about
them.
9 / 46
Notes on the Assumptions
I None of these assumptions were actually needed in order to
find the least squares estimators for the model parameters β .
I This is because the least squares process is a distribution-free
estimation method.
I It therefore means that it is possible (and quite easy) to fit a
linear regression model that will not satisfy these assumptions.
I
e.g. nothing will stop you from fitting a straight line to a
curved relationship… it just won’t be particularly useful.
I However, when assumptions 1-3 are satisfied, the least squares
estimator of β will be unbiased and have minimum variance
among all other linear unbiased estimators (i.e. it’s the best
one).
I We will show later how to determine unbiasedness and to find
the variance of the estimator β̂ .
10 / 46
Code-Along Session
Our first short Code-Along will look into techniques to very
informally check whether we might anticipate any problems with
model assumptions. These are not formal checks, but can warn
you about potential issues down the road. We will use
I Scatterplots to inspect linearity and constant variance
I Histograms to inspect linearity and normality
I Critical thinking to inspect uncorrelated errors.
11 / 46
Poll Question 1
Go to PollEv.com/katherinedai702 or open your app (if using) and
sign in.
Are these preliminary checks on assumptions enough to know
for certain whether the assumptions on the errors hold?
I Yes
I No
12 / 46
Outline
Assumptions and Properties of Estimators
Assumptions for Linear Regression
Properties of Residuals
Sampling Distributions of the LS Estimators
Intervals and Hypothesis Tests
For the estimated coefficients and mean response
For an actual individual response
13 / 46
Estimator of the Error Variance
I In many of the assumptions listed in the previous section, we
are working with errors and an error variance – all elements of
the population that need to be estimated.
I We’ve already seen that the residuals of our least squares
regression model are observations of the population errors.
I namely ê is an observation for ε
I So how do we find an estimate of the error variance, σ 2 ?
I If Var (εi ) = σ² = E [(εi − E (εi ))²] = E (εi²) by the
assumptions, then a reasonable estimator would involve
averaging the square of the observed residuals.
I We actually get that the estimate of the error variance is
\[
s^2 = \frac{RSS}{n-p-1} = \frac{\hat{\mathbf{e}}'\hat{\mathbf{e}}}{n-p-1}
    = \frac{\sum_{i=1}^{n} \hat{e}_i^2}{n-p-1}
    = \frac{\sum_{i=1}^{n} (y_i - \hat{y}_i)^2}{n-p-1}
\]
where p is the number of predictors in the model.
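A small numerical sketch of this estimator: divide the RSS by n − p − 1 rather than by n. The code is Python for illustration (the course uses R); the function name `error_variance_estimate`, the responses, and the fitted values are hypothetical, with the fitted values standing in for the output of a one-predictor model.

```python
def error_variance_estimate(y, fitted, p):
    """Estimate of the error variance: RSS / (n - p - 1), with p predictors."""
    n = len(y)
    rss = sum((yi - fi) ** 2 for yi, fi in zip(y, fitted))
    return rss / (n - p - 1)

y = [2.0, 4.1, 5.9, 8.0, 10.2]          # made-up responses
fitted = [2.0, 4.0, 6.0, 8.0, 10.0]     # hypothetical fitted values, one predictor
s2 = error_variance_estimate(y, fitted, p=1)
print(round(s2, 4))
```

With n = 5 and p = 1 the divisor is 3, reflecting the two estimated parameters (intercept and slope).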
14 / 46
Estimator of the Error Variance
- You may be asking: why are we not dividing by $n$ if we are taking an average over a sample?
- We could do that, and it would be an estimate of the error variance too.
- However, $s^2$ is a better estimate because it is unbiased.
- Intuitively, we use $n - p - 1$ as a divisor because we have estimated $p + 1$ parameters in the regression model and have to account for these new values by taking information away from the sample.
  - This is the same reason why, when we compute a sample variance, we divide by $n - 1$ instead of $n$.
  - We need to account for having used the data once before to estimate the sample mean, and so we take away one data value for this newly introduced information.
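The bias argument can be summarized in one line. The block below takes as given the standard result, not derived in these slides, that the residual sum of squares has expectation $(n-p-1)\sigma^2$ under the model assumptions with a full-rank design matrix.

```latex
\[
E(RSS \mid \mathbf{X}) = (n-p-1)\,\sigma^2
\quad\Longrightarrow\quad
E(s^2 \mid \mathbf{X}) = \sigma^2,
\qquad
E\!\left(\frac{RSS}{n} \,\Big|\, \mathbf{X}\right) = \frac{n-p-1}{n}\,\sigma^2 < \sigma^2 .
\]
```

Dividing by $n$ therefore underestimates $\sigma^2$ on average, and the shortfall grows with the number of estimated parameters.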
Notes on the LS Estimator of the Error Variance
- The estimate of the error variance $s^2$ is an unbiased estimate of $\sigma^2$.
  - For details on how to prove unbiasedness of this estimate, see Rencher Chapters 2 and 5.
  - We won't go into these details here because they utilize properties of quadratic forms, which not everyone may be familiar with.
- We will soon see that the LS estimator for $\boldsymbol{\beta}$ is unbiased.
- It turns out that the LS estimators for both $\boldsymbol{\beta}$ and the error variance $\sigma^2$ are also the 'best' ones.
  - They have minimum variance among all unbiased estimators of a particular type.
- However, where the LS estimator of $\boldsymbol{\beta}$ is best among all unbiased linear estimators, $s^2$ is best among all unbiased quadratic estimators.
  - This is because it is expressed as a quadratic equation or quadratic form.

Properties of LS Estimators
- It's always important to learn about the properties of your estimators.
- Specifically, we want to know whether the LS estimator $\hat{\boldsymbol{\beta}}$ is unbiased, how variable it is, and whether we can determine its sampling distribution.
- This will involve working with the errors/residuals as well as the assumptions.
- As a reminder, our assumptions essentially can be combined to be
$$\mathbf{Y} \mid \mathbf{X} \sim N_n(\mathbf{X}\boldsymbol{\beta}, \sigma^2 \mathbf{I})$$
- We will use these assumptions to determine the sampling distribution of $\hat{\boldsymbol{\beta}}$.
- We will also use results from Review Slides of Week 0.

Covariance Matrices
- Before jumping into our derivation, let's remind ourselves of how a covariance matrix works.
- Everything we do will come down to working with the distribution of the errors
$$\boldsymbol{\epsilon} \mid \mathbf{X} \sim N_n(\mathbf{0}, \sigma^2 \mathbf{I})$$
- This says the vector of random errors has a mean vector of $\mathbf{0}$ and a covariance matrix which is a diagonal matrix with $\sigma^2$ along the main diagonal and 0's elsewhere.
  - The main diagonal elements represent $Var(\epsilon_i) = \sigma^2$ for all $i$, and the off-diagonal elements are $Cov(\epsilon_i, \epsilon_j) = 0$ for all $i \neq j$.
  - So when working with a vector of random variables, you work with a covariance matrix so that you have information about the individual variances but also how the elements of the vector co-vary with each other.
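Written out in full, the covariance matrix of the error vector described above looks as follows. This simply expands $\sigma^2 \mathbf{I}$ for $n$ observations.

```latex
\[
Cov(\boldsymbol{\epsilon} \mid \mathbf{X}) = \sigma^2 \mathbf{I} =
\begin{pmatrix}
\sigma^2 & 0 & \cdots & 0 \\
0 & \sigma^2 & \cdots & 0 \\
\vdots & \vdots & \ddots & \vdots \\
0 & 0 & \cdots & \sigma^2
\end{pmatrix}_{n \times n}
\]
```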
Expectation and Covariance of $\hat{\boldsymbol{\beta}}$
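Only the title of this slide is available here. A sketch of the standard derivation is given below, assuming $\mathbf{Y} \mid \mathbf{X} \sim N_n(\mathbf{X}\boldsymbol{\beta}, \sigma^2\mathbf{I})$ and $\hat{\boldsymbol{\beta}} = (\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}'\mathbf{Y}$, both of which are stated elsewhere in these slides. It may differ in presentation from what was shown in lecture.

```latex
\begin{align*}
E(\hat{\boldsymbol{\beta}} \mid \mathbf{X})
  &= (\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}'\,E(\mathbf{Y} \mid \mathbf{X})
   = (\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}'\mathbf{X}\boldsymbol{\beta}
   = \boldsymbol{\beta}, \\
Cov(\hat{\boldsymbol{\beta}} \mid \mathbf{X})
  &= (\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}'\,Cov(\mathbf{Y} \mid \mathbf{X})\,\mathbf{X}(\mathbf{X}'\mathbf{X})^{-1}
   = \sigma^2 (\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}'\mathbf{X}(\mathbf{X}'\mathbf{X})^{-1}
   = \sigma^2 (\mathbf{X}'\mathbf{X})^{-1}.
\end{align*}
```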
Poll Question 2
How many of the assumptions did we use in deriving these properties?
- None
- One
- Two
- Three
- Four

Coefficients are not uncorrelated
- Consider the covariance matrix of $\hat{\boldsymbol{\beta}}$ in simple linear regression:
$$Cov(\hat{\boldsymbol{\beta}}) = \sigma^2 (\mathbf{X}'\mathbf{X})^{-1} = \frac{\sigma^2}{S_{XX}} \begin{pmatrix} \sum_{i=1}^n x_i^2 / n & -\bar{x} \\ -\bar{x} & 1 \end{pmatrix}$$
  (exercise: check that you can derive this).
- Knowing that the covariance matrix will look similar (but much larger) for multiple regression, we can see that the off-diagonal elements would not necessarily equal 0.
- This tells us that the estimated coefficients of any two predictors in a multiple linear model may be correlated.
- Even in simple linear regression, the slope and intercept may have a non-zero covariance.
- This again demonstrates the conditional nature of regression and how we must always consider how the components we work with co-vary/are related.
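For readers checking the exercise, one possible route is to invert $\mathbf{X}'\mathbf{X}$ directly. The block assumes the usual definition $S_{XX} = \sum_{i=1}^n (x_i - \bar{x})^2 = \sum_{i=1}^n x_i^2 - n\bar{x}^2$.

```latex
\begin{align*}
\mathbf{X}'\mathbf{X}
  &= \begin{pmatrix} n & \sum_i x_i \\ \sum_i x_i & \sum_i x_i^2 \end{pmatrix},
\qquad
\det(\mathbf{X}'\mathbf{X}) = n\sum_i x_i^2 - \Big(\sum_i x_i\Big)^2 = n\,S_{XX}, \\
(\mathbf{X}'\mathbf{X})^{-1}
  &= \frac{1}{n\,S_{XX}}
     \begin{pmatrix} \sum_i x_i^2 & -\sum_i x_i \\ -\sum_i x_i & n \end{pmatrix}
   = \frac{1}{S_{XX}}
     \begin{pmatrix} \sum_i x_i^2 / n & -\bar{x} \\ -\bar{x} & 1 \end{pmatrix}.
\end{align*}
```

Multiplying by $\sigma^2$ gives the matrix on the slide; the off-diagonal entry $-\sigma^2\bar{x}/S_{XX}$ is zero only when $\bar{x} = 0$.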
Sampling Distribution of $\hat{\boldsymbol{\beta}}$
- Based on the assumptions, we have that $\mathbf{Y} \mid \mathbf{X} \sim N_n(\mathbf{X}\boldsymbol{\beta}, \sigma^2 \mathbf{I})$
- Even though we are working with a multivariate Normal here, it still follows the same rules regarding linearity of Normal random variables (see Week 0 Review Slides).
- We have that $\hat{\boldsymbol{\beta}}$ is a linear combination of Normal random variables because
$$\hat{\boldsymbol{\beta}} = (\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}'\mathbf{Y} = \mathbf{A}\mathbf{Y}$$
- Linearity of Normals says $\mathbf{A}\mathbf{Y} \sim N_{p+1}(\mathbf{A}\boldsymbol{\mu}_y, \mathbf{A}\boldsymbol{\Sigma}\mathbf{A}')$, since $\mathbf{A}$ here has $p + 1$ rows.
- The mean and covariance matrix for our Normal distribution were found to be $\boldsymbol{\beta}$ and $\sigma^2 (\mathbf{X}'\mathbf{X})^{-1}$ respectively (and were found doing exactly this process).
- Therefore the sampling distribution of $\hat{\boldsymbol{\beta}}$ is
$$N_{p+1}(\boldsymbol{\beta}, \sigma^2 (\mathbf{X}'\mathbf{X})^{-1})$$

Poll Question 3
Given a covariance matrix for $\hat{\boldsymbol{\beta}}$ from a model that fit 3 predictors, where would we find the variance of $\hat{\beta}_2$ in this covariance matrix?
- position (1, 1)
- position (2, 2)
- position (3, 3)
- position (4, 4)

Estimating the Variance in the Sampling Distribution
- The sampling distribution of the estimated regression coefficients will become quite useful.
- However, we can only work with the sampling distribution if we have a way to estimate the mean and variance of the Normal.
  - The mean is easy: it's simply our estimated regression coefficients.
  - For the variance, the inverse matrix $(\mathbf{X}'\mathbf{X})^{-1}$ is easily calculated using our data.
  - And we've already found an estimate of the population error variance, $s^2 = \frac{RSS}{n-p-1}$, and we can simply use this in place of $\sigma^2$.
- We do need to be careful though, because using $s^2$ gives us an estimated covariance of $\hat{\boldsymbol{\beta}}$, which means $\hat{\boldsymbol{\beta}}$ standardized by its estimated standard error will no longer be Normally distributed.
  - Instead, to account for added uncertainty from the estimate, we will use a $T_{n-p-1}$ distribution (like in Week 0 Review Slides).

Code-Along Session
In this Code-Along, we will see how to compute and extract the assorted variance terms we have discussed. We will focus on:
- extracting the estimated error variance,
- extracting the standard error of each beta estimate,
- extracting the full covariance matrix for the beta estimates.
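The three quantities listed above can also be computed by hand for a simple regression, which may help when checking what software reports. The sketch below is in Python rather than the R used in the Code-Along; the function name `slr_variance_terms` and the toy data are hypothetical, and the covariance matrix uses the closed-form simple regression expression $\frac{s^2}{S_{XX}}\begin{pmatrix} \sum x_i^2/n & -\bar{x} \\ -\bar{x} & 1 \end{pmatrix}$ with $s^2$ in place of $\sigma^2$.

```python
import math


def slr_variance_terms(x, y):
    """Return (s2, cov, se) for a simple linear regression fit by least squares.

    s2  : estimated error variance, RSS / (n - 2)
    cov : 2x2 estimated covariance matrix of (intercept, slope)
    se  : standard errors of (intercept, slope)
    """
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    b1 = sxy / sxx
    b0 = y_bar - b1 * x_bar
    rss = sum((yi - (b0 + b1 * xi)) ** 2 for xi, yi in zip(x, y))
    s2 = rss / (n - 2)  # n - p - 1 with p = 1
    scale = s2 / sxx
    cov = [
        [scale * sum(xi ** 2 for xi in x) / n, -scale * x_bar],
        [-scale * x_bar, scale],
    ]
    se = [math.sqrt(cov[0][0]), math.sqrt(cov[1][1])]
    return s2, cov, se


# Hypothetical toy data (not from the lecture).
s2, cov, se = slr_variance_terms([1.0, 2.0, 3.0, 4.0, 5.0], [2.0, 4.0, 5.0, 4.0, 5.0])
print("estimated error variance:", s2)
print("covariance matrix:", cov)
print("standard errors:", se)
```

The standard errors are the square roots of the diagonal of the covariance matrix, and the negative off-diagonal entry shows the intercept and slope estimates co-varying whenever $\bar{x} \neq 0$.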
Confidence in Estimates
- Now that we can find the least squares estimates of the model parameters, we need to determine how confident we are that we have captured the true parameters.
- Recall that a confidence interval reflects how drawing a different sample from the population will give different estimates of the parameters.
  - It is a statement about the confidence we have in our sample.
  - e.g. the 95% in a 95% CI represents the percentage of confidence intervals, created from other samples of the same size as ours, that will capture the true parameter value.
- Since we use a sample to estimate the parameters in the regression line, we must have corresponding CIs to reflect the margin of error of our estimates.

Creating Confidence Intervals (CIs) and Hypothesis Tests
- CIs and hypothesis tests are constructed from the same quantity, called a pivotal quantity:
$$\text{pivotal} = \frac{\text{estimator} - \text{truth}}{\text{standard error}}$$
- Both CIs and tests compare this pivotal quantity to the sampling distribution of the estimator, namely
$$\text{estimator} \sim N(\text{truth}, \text{standard error}^2)$$
  - CIs create a probabilistic statement that references the likelihood of obtaining an estimate a specific distance from the truth.
  - Hypothesis tests instead use the distribution to comment on the likelihood that an estimated value could have arisen from this distribution.
- For linear regression, since the standard error we work with is an estimated value, the Normal distribution is not variable enough to capture the estimation error of both $\hat{\boldsymbol{\beta}}$ and $s^2$, so we use the T distribution instead of a Normal.

CI and Test for individual $\beta_j$
CI: estimate ± (critical value)(standard error)
Test statistic: (point estimate − possible truth) / standard error

| Quantity | 100(1 − α)% interval | Test statistic | Distribution |
|---|---|---|---|
| $\beta_j$ | $\hat{\boldsymbol{\beta}}_{j+1} \pm t_{\alpha/2,\,n-p-1}\, s\sqrt{(\mathbf{X}'\mathbf{X})^{-1}_{(j+1,j+1)}}$ | $\dfrac{\hat{\boldsymbol{\beta}}_{j+1} - \beta_j^0}{s\sqrt{(\mathbf{X}'\mathbf{X})^{-1}_{(j+1,j+1)}}}$ | $T_{n-p-1}$ |

- $\alpha$ is the chosen significance level (often 0.05), while $1 - \alpha$ is the chosen confidence level (often 0.95).
- The degrees of freedom of the T distribution are the same as the denominator of our estimate $s^2$.
- Matrices begin their indexing at 1, not 0, so to extract the right element corresponding to $\hat{\beta}_j$, you increase the index by 1.
- The same test statistic is used regardless of whether testing $H_a: \beta_j \neq \beta_j^0$ or $H_a: \beta_j > \beta_j^0$ (or also $< \beta_j^0$).

Inference on individual $\beta_j$
- When conducting a hypothesis test on $\beta_j$, we can test any hypothesized value for this parameter.
- However, the default is to test whether $\beta_j = 0$.
  - This reflects testing whether there is no linear relationship between $X_j$ and $Y$ while holding other predictors fixed.
- We can also opt for one or two-sided tests, but the default is two-sided because the alternative hypothesis to no relationship is that a relationship exists.
- Rejection of the null hypothesis can be determined using a p-value (e.g. $P(|T_{n-p-1}| \geq |t^*|) < \alpha$ if two-sided) or by comparison to a critical value (e.g. $|t^*| \geq t_{\alpha/2,\,n-p-1}$).
- As with the hypothesis test, when interpreting our CI, we must also incorporate the notion that we are 95% confident that this interval captures the true linear relationship between $X_j$ and $Y$ in the presence of other fixed predictors.

Sampling distribution of mean response
- We can also perform inference on the mean response $E(\mathbf{Y} \mid \mathbf{X})$, estimated by $\hat{\mathbf{Y}} = \mathbf{X}\hat{\boldsymbol{\beta}} = \widehat{E(\mathbf{Y} \mid \mathbf{X})}$, where $\hat{y}_i = \mathbf{x}_i'\hat{\boldsymbol{\beta}}$.
- Similar to $\boldsymbol{\beta}$, we would make inference on a single mean response $y_0 = E(Y \mid \mathbf{X} = \mathbf{x}_0') = \mathbf{x}_0'\boldsymbol{\beta}$, rather than the entire vector of all mean responses.
- Here, $\mathbf{x}_0' = (1, x_1, x_2, \ldots, x_p)$ has a specific value for each predictor, and we estimate $E(Y \mid \mathbf{X} = \mathbf{x}_0')$ by $\hat{y}_0 = \mathbf{x}_0'\hat{\boldsymbol{\beta}}$.
- The sampling distribution of $\hat{y}_0 = \mathbf{x}_0'\hat{\boldsymbol{\beta}}$ is
$$\hat{y}_0 \mid \mathbf{X}, \mathbf{x}_0 \sim N(\mathbf{x}_0'\boldsymbol{\beta}, \sigma^2 \mathbf{x}_0'(\mathbf{X}'\mathbf{X})^{-1}\mathbf{x}_0)$$
  - the estimator is unbiased: $E(\hat{y}_0 \mid \mathbf{X}, \mathbf{x}_0) = E(\mathbf{x}_0'\hat{\boldsymbol{\beta}} \mid \mathbf{X}, \mathbf{x}_0) = \mathbf{x}_0' E(\hat{\boldsymbol{\beta}} \mid \mathbf{X}, \mathbf{x}_0) = \mathbf{x}_0'\boldsymbol{\beta}$
  - and has variance $Var(\hat{y}_0 \mid \mathbf{X}, \mathbf{x}_0) = Var(\mathbf{x}_0'\hat{\boldsymbol{\beta}} \mid \mathbf{X}, \mathbf{x}_0) = \mathbf{x}_0' Var(\hat{\boldsymbol{\beta}} \mid \mathbf{X}, \mathbf{x}_0)\mathbf{x}_0 = \sigma^2 \mathbf{x}_0'(\mathbf{X}'\mathbf{X})^{-1}\mathbf{x}_0$
  - $\hat{y}_0$ is a linear combination of $\mathbf{Y}$, which gives Normality.
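The interval and test statistic for a single coefficient, and the standard error of an estimated mean response, can be traced through on a small example. The sketch below uses made-up data and a critical value `T_CRIT` read from a t-table for 3 degrees of freedom; both are assumptions for illustration. Note that Python lists index from 0, so the slope sits at `[1][1]` even though it is position (2, 2) of the matrix.

```python
import math

# Hypothetical toy data (not from the lecture); simple regression, so p = 1.
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.0, 4.0, 5.0, 4.0, 5.0]
n = len(x)
T_CRIT = 3.182  # t_{0.025, 3} from a t-table (df = n - p - 1 = 3)

x_bar, y_bar = sum(x) / n, sum(y) / n
sxx = sum((xi - x_bar) ** 2 for xi in x)
b1 = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / sxx
b0 = y_bar - b1 * x_bar
s2 = sum((yi - b0 - b1 * xi) ** 2 for xi, yi in zip(x, y)) / (n - 2)

# (X'X)^{-1} for simple regression, written out as a 2x2 matrix.
xtx_inv = [
    [sum(xi ** 2 for xi in x) / (n * sxx), -x_bar / sxx],
    [-x_bar / sxx, 1 / sxx],
]

# Slope: test H0: beta_1 = 0 and build a 95% CI.
se_b1 = math.sqrt(s2 * xtx_inv[1][1])
t_star = (b1 - 0.0) / se_b1
ci_b1 = (b1 - T_CRIT * se_b1, b1 + T_CRIT * se_b1)

# Mean response at x0 = (1, 4): the variance factor is x0'(X'X)^{-1}x0.
x0 = [1.0, 4.0]
quad = sum(x0[i] * xtx_inv[i][j] * x0[j] for i in range(2) for j in range(2))
y0_hat = b0 + b1 * x0[1]
se_mean = math.sqrt(s2 * quad)
ci_mean = (y0_hat - T_CRIT * se_mean, y0_hat + T_CRIT * se_mean)

print("t* =", round(t_star, 3), "| reject H0:", abs(t_star) >= T_CRIT)
print("95% CI for slope:", ci_b1)
print("95% CI for mean response at x = 4:", ci_mean)
```

With only five observations the critical value is large, so the slope interval here contains 0 and the two-sided test does not reject, which matches the comparison of $|t^*|$ to the critical value.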
CI and Test for mean response
CI: estimate ± (critical value)(standard error)
Test statistic: (point estimate − possible truth) / standard error

| Quantity | 100(1 − α)% interval | Test statistic | Distribution |
|---|---|---|---|
| $\beta_j$ | $\hat{\boldsymbol{\beta}}_{j+1} \pm t_{\alpha/2,\,n-p-1}\, s\sqrt{(\mathbf{X}'\mathbf{X})^{-1}_{(j+1,j+1)}}$ | $\dfrac{\hat{\boldsymbol{\beta}}_{j+1} - \beta_j^0}{s\sqrt{(\mathbf{X}'\mathbf{X})^{-1}_{(j+1,j+1)}}}$ | $T_{n-p-1}$ |
| $y_0 = \mathbf{x}_0'\boldsymbol{\beta}$ | $\mathbf{x}_0'\hat{\boldsymbol{\beta}} \pm t_{\alpha/2,\,n-p-1}\, s\sqrt{\mathbf{x}_0'(\mathbf{X}'\mathbf{X})^{-1}\mathbf{x}_0}$ | $\dfrac{\mathbf{x}_0'\hat{\boldsymbol{\beta}} - y_0^0}{s\sqrt{\mathbf{x}_0'(\mathbf{X}'\mathbf{X})^{-1}\mathbf{x}_0}}$ | $T_{n-p-1}$ |

- Once again, we use the T distribution for critical values, as the Normal only works if $\sigma^2$ is known.
- Hypothesis tests for the mean response are not very common, but can be used for testing a specific value $y_0^0$.
- The simple regression version $Var(\hat{y}_0 \mid x_0, \mathbf{X}) = \sigma^2\left(\frac{1}{n} + \frac{(x_0 - \bar{x})^2}{S_{XX}}\right)$ tells us that the variance will be larger at an $x_0$ that is far from $\bar{x}$ (i.e. predictions are better in the middle of the data).

Simple Linear Regression versions
- The matrix-based formulae will work for simple linear models too, but sometimes it's easier to compute these CIs and test statistics in an algebra-based framework.
- The fundamental change is the expression of the variance.
  - We've just seen that the variance of the estimator of the mean response is $Var(\hat{y}_0 \mid x_0, \mathbf{X}) = \sigma^2\left(\frac{1}{n} + \frac{(x_0 - \bar{x})^2}{S_{XX}}\right)$
  - The covariance matrix of $\hat{\boldsymbol{\beta}}$ in simple regression can be written as
$$Cov(\hat{\boldsymbol{\beta}}) = \frac{\sigma^2}{S_{XX}} \begin{pmatrix} \sum_{i=1}^n x_i^2 / n & -\bar{x} \\ -\bar{x} & 1 \end{pmatrix}$$
- So we have
$$Var(\hat{\beta}_0) = \frac{\sigma^2}{S_{XX}} \cdot \frac{\sum_{i=1}^n x_i^2}{n} = \sigma^2\left(\frac{1}{n} + \frac{\bar{x}^2}{S_{XX}}\right), \qquad Var(\hat{\beta}_1) = \frac{\sigma^2}{S_{XX}}$$
- These can give us better insight into how the variance of our estimators depends on a multitude of factors.

Poll Question 4
True or False: if my predictor is highly variable in simple regression, the variance of my LS estimators will increase.

Exercise - Give it a try!
We have the following summary measures for the earlier data:
$$\sum_{i=1}^{20} x_i = 4035, \quad \sum_{i=1}^{20} y_i = 4041, \quad \sum_{i=1}^{20} \hat{e}_i^2 = 4753.125, \quad \sum_{i=1}^{20} x_i^2 = 1005535, \quad \sum_{i=1}^{20} x_i y_i = 864910$$
If $\hat{\beta}_1 = 0.259$, find a 95% confidence interval for the mean response at $X = 200$ (use $t_{\alpha/2,\,18} = 2.10$).

Predicting a Future Observation
- There is a distinction between predicting the mean response in the population and predicting the actual response of an individual member of the population.
- The mean response at a specific value of the predictor is a parameter, $y_0 = E(Y \mid \mathbf{X} = \mathbf{x}_0')$
  - the expected value is what we would expect a response $Y$ to be in the long run when $\mathbf{X} = \mathbf{x}_0'$
  - so $E(Y \mid \mathbf{X} = \mathbf{x}_0')$ is a fixed but unknown quantity.
- The actual response for an individual with a specific value of the predictor is a realization of the random variable, $y_0$
  - Because $y_0$ is a random variable, it can take any number of values when $\mathbf{X} = \mathbf{x}_0'$
  - It also may not lie on the population regression line.

Predicting a Future Observation
- If we want an actual response $y_0$ but we can only get an estimate from the regression line $\hat{y}_0$, then our inference will need to account for the distance that $y_0$ is from the regression line.
- Thus, we build a prediction interval to provide a range of possible values for the future observation.
- The error in this prediction from using a regression line for prediction is
$$y_0 - \hat{y}_0 = \mathbf{x}_0'\boldsymbol{\beta} + \epsilon_0 - \hat{y}_0 = (\mathbf{x}_0'\boldsymbol{\beta} - \hat{y}_0) + \epsilon_0$$
- This says that the difference between the actual response and the one we predict with our regression line is based on how well we estimate the conditional mean, $(\mathbf{x}_0'\boldsymbol{\beta} - \hat{y}_0)$, plus the natural variation in the conditional distribution, $\epsilon_0$.

Mean and Variance of Prediction Error
- In the population $y_0 = \mathbf{x}_0'\boldsymbol{\beta} + \epsilon_0$, so we can predict $y_0$ by $\hat{y}_0 = \mathbf{x}_0'\hat{\boldsymbol{\beta}}$ because this is all our regression line can provide.
- Using the prediction error as written previously, we can determine a distribution for the prediction error, which we will use to get our prediction interval.
- We find $E(y_0 - \hat{y}_0 \mid \mathbf{X}, \mathbf{x}_0) = 0$
- We also know that $y_0$ and $\hat{y}_0$ are independent because the observations that go into finding $\hat{y}_0$ are sampled randomly from the same population as $y_0$.
- This lets us find the variance of the prediction error:
$$Var(y_0 - \hat{y}_0 \mid \mathbf{X}, \mathbf{x}_0) = Var(\mathbf{x}_0'\boldsymbol{\beta} + \epsilon_0 - \mathbf{x}_0'\hat{\boldsymbol{\beta}}) = Var(\epsilon_0) + Var(\mathbf{x}_0'\hat{\boldsymbol{\beta}}) = \sigma^2 + \sigma^2 \mathbf{x}_0'(\mathbf{X}'\mathbf{X})^{-1}\mathbf{x}_0 = \sigma^2[1 + \mathbf{x}_0'(\mathbf{X}'\mathbf{X})^{-1}\mathbf{x}_0]$$

Distribution of Prediction Error
- Now, we know the average prediction error and how variable that prediction error will be.
- Thus, using the same arguments as before, we can obtain a Normality result which says the prediction error is
$$y_0 - \hat{y}_0 \mid \mathbf{X}, \mathbf{x}_0 \sim N(0, \sigma^2[1 + \mathbf{x}_0'(\mathbf{X}'\mathbf{X})^{-1}\mathbf{x}_0])$$
- Once again, in practice we do not know the value of $\sigma^2$ and so would need to estimate it using $s^2$.
- Then the distribution of prediction errors is better described by a $T_{n-p-1}$ distribution than a Normal.
- We create an interval similarly to confidence intervals (by using the distribution to measure a certain number of standard errors away from a centre).
- But because we are not trying to estimate a parameter (we are seeking an observed value), we cannot call this a confidence interval, as confidence specifically refers to parameters.
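To see how much the extra $\sigma^2$ term matters, the sketch below reuses the summary measures from the earlier exercise (n = 20, with the slope estimate and critical value as given there) and computes both the interval for the mean response and the interval for an individual response at $X = 200$. It assumes the simple regression form $\mathbf{x}_0'(\mathbf{X}'\mathbf{X})^{-1}\mathbf{x}_0 = \frac{1}{n} + \frac{(x_0 - \bar{x})^2}{S_{XX}}$, with $s^2$ in place of $\sigma^2$; variable names are illustrative, and this is one way to check an answer rather than an official solution.

```python
import math

# Summary measures quoted in the exercise (n = 20 observations).
n = 20
sum_x, sum_y = 4035.0, 4041.0
sum_x2 = 1005535.0
rss = 4753.125        # sum of squared residuals
b1 = 0.259            # slope estimate given in the exercise
t_crit = 2.10         # t_{alpha/2, 18} given in the exercise
x0 = 200.0

x_bar, y_bar = sum_x / n, sum_y / n
sxx = sum_x2 - n * x_bar ** 2
b0 = y_bar - b1 * x_bar               # LS intercept
y0_hat = b0 + b1 * x0                 # centre of both intervals
s = math.sqrt(rss / (n - 2))          # n - p - 1 with p = 1

# x0'(X'X)^{-1}x0 in simple regression.
leverage = 1 / n + (x0 - x_bar) ** 2 / sxx
se_mean = s * math.sqrt(leverage)         # for the mean response
se_pred = s * math.sqrt(1 + leverage)     # for an individual response

ci = (y0_hat - t_crit * se_mean, y0_hat + t_crit * se_mean)
pi = (y0_hat - t_crit * se_pred, y0_hat + t_crit * se_pred)

print("point estimate:", round(y0_hat, 3))
print("95% CI for mean response:", tuple(round(v, 3) for v in ci))
print("95% interval for an individual response:", tuple(round(v, 3) for v in pi))
```

Both intervals share the same centre, but the second is several times wider because the variation of a single response around its mean dominates the uncertainty in estimating that mean.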
Prediction Interval for an actual response
CI: estimate ± (critical value)(standard error)
Test statistic: (point estimate − possible truth) / standard error

| Quantity | 100(1 − α)% interval | Test statistic | Distribution |
|---|---|---|---|
| $\beta_j$ | $\hat{\boldsymbol{\beta}}_{j+1} \pm t_{\alpha/2,\,n-p-1}\, s\sqrt{(\mathbf{X}'\mathbf{X})^{-1}_{(j+1,j+1)}}$ | $\dfrac{\hat{\boldsymbol{\beta}}_{j+1} - \beta_j^0}{s\sqrt{(\mathbf{X}'\mathbf{X})^{-1}_{(j+1,j+1)}}}$ | $T_{n-p-1}$ |
| $y_0 = \mathbf{x}_0'\boldsymbol{\beta}$ | $\mathbf{x}_0'\hat{\boldsymbol{\beta}} \pm t_{\alpha/2,\,n-p-1}\, s\sqrt{\mathbf{x}_0'(\mathbf{X}'\mathbf{X})^{-1}\mathbf{x}_0}$ | $\dfrac{\mathbf{x}_0'\hat{\boldsymbol{\beta}} - y_0^0}{s\sqrt{\mathbf{x}_0'(\mathbf{X}'\mathbf{X})^{-1}\mathbf{x}_0}}$ | $T_{n-p-1}$ |
| $y_0^p = \mathbf{x}_0'\boldsymbol{\beta}$ | $\mathbf{x}_0'\hat{\boldsymbol{\beta}} \pm t_{\alpha/2,\,n-p-1}\, s\sqrt{1 + \mathbf{x}_0'(\mathbf{X}'\mathbf{X})^{-1}\mathbf{x}_0}$ | NA | $T_{n-p-1}$ |

- Note that $y_0^p$ is used simply as a way to distinguish the interval for the mean response from the interval for an actual response.
  - they are equivalent values, since the regression line can only estimate a point on itself.
- The algebraic version for simple linear regression is
$$\hat{\beta}_0 + \hat{\beta}_1 x_0 \pm t_{\alpha/2,\,n-2}\, s\sqrt{1 + \frac{1}{n} + \frac{(x_0 - \bar{x})^2}{S_{XX}}}$$

Poll Question 5
Which interval is the confidence interval for the mean response?
[Figure: Production Time (vertical axis) against Order Size (horizontal axis), with interval lines drawn in blue and in red]
- Blue lines
- Red lines

Notes on the Prediction Interval
- You may have noticed that the prediction interval results in a wider interval than a confidence interval for the conditional mean response, even though they are centred at the same value.
- We can see why this is the case by looking at the formulae for the variance.
  - We have variation due to estimating the conditional mean response (i.e. $\sigma^2 \mathbf{x}_0'(\mathbf{X}'\mathbf{X})^{-1}\mathbf{x}_0$)
  - But because we are predicting an actual observation, we also have variation in the response distribution, because the random variable could take any value from the Normal with variance $\sigma^2$.
- Therefore prediction intervals are wider than confidence intervals because they must capture $100(1 - \alpha)\%$ of the response distribution, to reflect the most likely $100(1 - \alpha)\%$ (e.g. 95%) of response values the random variable could take.

Code-Along Session
In this Code-Along, we will work through how to conduct these inferential techniques in R.
We will see how to:
- conduct hypothesis tests on each regression coefficient
- build confidence intervals on each regression coefficient
- build confidence intervals on the mean response given a set of predictor values
- build prediction intervals for an actual observed response.

Wrapping Up
- We have derived a number of important inferential tools that we will continue to use throughout the course:
  - We have hypothesis tests/CIs for determining whether a single predictor is significantly linearly related to the response in the presence of the other predictors.
  - We have hypothesis tests/CIs for determining whether a certain conditional mean response parameter value is plausible.
  - We have a prediction interval that allows us to provide a range of possible future values of an observed response.
- All of our results however rely heavily on the assumptions of linear regression being satisfied.
- Next week, we will see how to use these and other inferential tools to refine a regression model.

STA302/1001: Methods of Data Analysis 1
Instructor: Katherine Daignault
Department of Statistical Sciences
University of Toronto
Week 5 (Oct. 10-14)

Outline
Intervals and Inference
For an actual individual response (last week)
Decomposing the Variation in the Response
Sum of Squares Decomposition
Coefficient of Determination
ANOVA F Test
Partial F Test

Week 5 Learning Goals
In this week, we will see that regression models break down variation in the response into two components: that which is explained by the predictors and that which is not. We will develop two tests for determining the significance of the linear relationship, as well as how to quantify how much variation is explained by your model. To that end, the learning goals are:
- apply the appropriate test and define appropriate hypotheses for each test.
- correctly conclude tests for significance of the linear relationship.
- describe how the tests compare sources of variation and how this leads to our conclusions.
- explain the coefficient of determination and use it appropriately.

Regression Explains Variation in Response
- We have seen that a linear model is often fit because we are trying to estimate a relationship between a response and some number of predictors in the population.
- When working with a single predictor, we can talk about this as trying to use $X$ to explain the pattern that we observe in our response $Y$.
  - This can also be thought of as using $X$ to explain the variation we observe in $Y$.
- Last module, we talked about testing individual coefficients to determine if they are significantly linearly related to the response in the presence of other predictors.
- In simple regression, this is actually the same as testing whether our single $X$ significantly explains the variation/pattern in $Y$.
- We can use this idea of explaining variation to create new tests and summaries for our models.

Poll Question 2
Which graph displays data that would be more likely to yield a rejection of the null hypothesis of no linear relationship?
- Graph A
- Graph B

Variation and the Regression Line
- Looking at these, one might intuitively think that the regression line would be better at representing the linear relationship when the relationship is more visually obvious.
- In fact, the clearer relationship would indeed be more likely to yield a significant t-test on the slope than the less clear relationship.
- This is because having more variation in the response means there is more variability for the predictor to try to explain.
- Therefore we may have more variation that is unexplained by the predictor, i.e. a larger residual sum of squares.

A Decomposition of Variation
- Let's consider the linear regression model $\mathbf{Y} = \mathbf{X}\boldsymbol{\beta} + \boldsymbol{\epsilon}$.
- This inherently is saying that the value of the response is composed of two parts: the part explained by the values of the predictors, and the random variability in the distribution.
- We can think of our sample and its variability in the same way:
  - We have a certain amount of variation in our sampled responses (we can determine this with a sample variance).
  - The regression line fit through our data can be used to say that a certain amount of the variation in the responses is due to this relationship.
  - Lastly, we have the residuals that talk about how different each data point is from the model (or the pattern described by the model)
    - if we take the estimated error variance, this represents the leftover variation in the response not explained by the model.

A Decomposition of Variation
- We can write out this relationship between various sources of variation with equations.
- Variation will be expressed as sums of squares (like the RSS): sums of the squared deviations between two quantities.
- The original amount of variation we start with is our total sum of squares (SST).
  - We express it as $SST = \sum_{i=1}^n (y_i - \bar{y})^2$, or $(n - 1)s_y^2$, where $s_y^2$ is the sample variance of the response.
- The residual amount of variation left over after fitting a regression model is the residual sum of squares (RSS).
  - Recall this is $RSS = \sum_{i=1}^n \hat{e}_i^2 = \sum_{i=1}^n (y_i - \hat{y}_i)^2$
- Lastly, the variation explained by the model is the regression sum of squares (SSreg).
  - Since no relationship between $Y$ and $X$ would be a horizontal line at $\bar{y}$, we express $SSreg = \sum_{i=1}^n (\hat{y}_i - \bar{y})^2$.

Decomposition of the Sum Of Squares
- Putting these pieces together, we get the sum of squares decomposition:
$$\sum_{i=1}^n (y_i - \bar{y})^2 = \sum_{i=1}^n (\hat{y}_i - \bar{y})^2 + \sum_{i=1}^n (y_i - \hat{y}_i)^2$$
  or $SST = SSreg + RSS$.
- In the matrix framework, this can be written as
$$\mathbf{Y}'\left(\mathbf{I} - \frac{1}{n}\mathbf{J}\right)\mathbf{Y} = \left(\hat{\boldsymbol{\beta}}'\mathbf{X}'\mathbf{Y} - \frac{1}{n}\mathbf{Y}'\mathbf{J}\mathbf{Y}\right) + \left(\mathbf{Y}'\mathbf{Y} - \hat{\boldsymbol{\beta}}'\mathbf{X}'\mathbf{Y}\right)$$
  where $\mathbf{J}$ is a square matrix of ones (see Rencher Chapter 5.1 to see why SST is written like this).
- In the next section, we will see how to use this decomposition to talk about how much response variation the regression model explains.

A Numerical Example
Suppose we collect a sample of 20 observations on both a response (Y) and a single predictor (X). We find that the mean response in the sample is 202.05 while the sample variance in the response is 927.5237. A simple linear model is fit and the estimated error variance is 264.1431. Find the components of the sum of squares decomposition.

Visualizing with Venn/Euler Diagrams

Poll Question 3
If we wanted to measure the "goodness of a model", i.e. how well the model explains the initial variation in the response, what could we use?
- a hypothesis test on the slope
- the correlation between X and Y
- the estimated variance in the errors
- a ratio of regression sum of squares and total sum of squares

Quantifying Amount of Variation Explained
- We saw that fitting a regression model can also be interpreted as explaining some of the variation observed in the response.
- We found that we can take the total variation (given by SST) and partition/decompose it into two pieces:
  - The portion that the model/predictors explains (SSreg)
  - The portion that is left over/unexplained (RSS)
- When fitting different models on the same sample, the SST will be the same.
- However, consider two statisticians working on a similar problem but on two different samples of data.
- They both fit a model using two predictors, but they happen to pick two different predictors.
- While we could look at each model's SSreg to see which model explains more variation, it will be difficult to know who had the better model because the SST will be different.

Coefficient of Determination, $R^2$
- The issue with strictly comparing the SSreg values is that the data is changing.
- So we can "standardize" the SSreg by the SST so that the value no longer depends on the original variation in the responses.
- This gives us what is called the coefficient of determination ($R^2$), given by
$$R^2 = \frac{SSreg}{SST} = 1 - \frac{RSS}{SST}$$
- The coefficient of determination has some nice characteristics:
  - It can also be computed by squaring the sample correlation when working with a simple linear model
  - It actually measures the proportion of the variation in the response that is explained by the model.

Notes on Using the Coefficient of Determination
- The coefficient of determination is really just a description or summary measure that can be used to help discuss the performance of your model.
- It is not a formal test so...
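The sum of squares decomposition and $R^2$ can be traced through with the figures from the numerical example above (20 observations, sample variance of the response 927.5237, estimated error variance 264.1431, one predictor). The sketch below relies only on $SST = (n-1)s_y^2$, $RSS = (n-p-1)s^2$ and $SST = SSreg + RSS$; the variable names are illustrative, and this is one way to check the example rather than an official solution.

```python
# Values quoted in the numerical example.
n, p = 20, 1
var_y = 927.5237   # sample variance of the response
s2 = 264.1431      # estimated error variance

sst = (n - 1) * var_y        # SST = (n - 1) * s_y^2
rss = (n - p - 1) * s2       # RSS = (n - p - 1) * s^2
ssreg = sst - rss            # from SST = SSreg + RSS
r2 = ssreg / sst             # equivalently 1 - rss / sst

print(f"SST = {sst:.4f}, RSS = {rss:.4f}, SSreg = {ssreg:.4f}, R^2 = {r2:.4f}")
```

Under these figures roughly 73% of the variation in the response is explained by the single predictor.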


Admissions

Admission Essays & Business Writing Help

An admission essay is an essay or other written statement by a candidate, often a potential student enrolling in a college, university, or graduate school. You can be rest assurred that through our service we will write the best admission essay for you.

Reviews

Editing Support

Our academic writers and editors make the necessary changes to your paper so that it is polished. We also format your document by correctly quoting the sources and creating reference lists in the formats APA, Harvard, MLA, Chicago / Turabian.

Reviews

Revision Support

If you think your paper could be improved, you can request a review. In this case, your paper will be checked by the writer or assigned to an editor. You can use this option as many times as you see fit. This is free because we want you to be completely satisfied with the service offered.

Live Chat+1(978) 822-0999EmailWhatsApp

Order your essay today and save 20% with the discount code RESEARCH

slot online
seoartvin escortizmir escortelazığ escortbacklink satışbacklink saleseskişehir oto kurtarıcıeskişehir oto kurtarıcıoto çekicibacklink satışbacklink satışıbacklink satışbacklink