Introduction to Statistical Thinking
(With R, Without Calculus)
Benjamin Yakir, The Hebrew University
March, 2011
In memory of my father, Moshe Yakir, and the family he lost.
Preface
The target audience for this book is college students who are required to learn
statistics, students with little background in mathematics and often no motivation to learn more. It is assumed that the students do have basic skills in using
computers and have access to one. Moreover, it is assumed that the students
are willing to actively follow the discussion in the text, to practice, and more
importantly, to think.
Teaching statistics is a challenge. Teaching it to students who are required
to learn the subject as part of their curriculum is an art mastered by few. In
the past I have tried to master this art and failed. In desperation, I wrote this
book.
This book uses the basic structure of a generic introduction-to-statistics course.
However, in some ways I have chosen to diverge from the traditional approach.
One divergence is the introduction of R as part of the learning process. Many
have used statistical packages or spreadsheets as tools for teaching statistics.
Others have used R in advanced courses. I am not aware of attempts to use
R in introductory level courses. Indeed, mastering R requires much investment
of time and energy that may be distracting and counterproductive for learning
more fundamental issues. Yet, I believe that if one restricts the application of
R to a limited number of commands, the benefits that R provides outweigh the
difficulties that R engenders.
Another departure from the standard approach is the treatment of probability as part of the course. In this book I do not attempt to teach probability
as a subject matter, but only those specific elements of it which I feel are essential
for understanding statistics. Hence, Kolmogorov’s Axioms are out, as are
attempts to prove basic theorems and Balls-and-Urns types of discussion. On
the other hand, emphasis is given to the notion of a random variable and, in
that context, the sample space.
The first part of the book deals with descriptive statistics and provides probability concepts that are required for the interpretation of statistical inference.
Statistical inference is the subject of the second part of the book.
The first chapter is a short introduction to statistics and probability. Students are required to have access to R right from the start. Instructions regarding
the installation of R on a PC are provided.
The second chapter deals with data structures and variation. Chapter 3
provides numerical and graphical tools for presenting and summarizing the distribution of data.
The fundamentals of probability are treated in Chapters 4 to 7. The concept
of a random variable is presented in Chapter 4 and examples of special types of
random variables are discussed in Chapter 5. Chapter 6 deals with the Normal
random variable. Chapter 7 introduces sampling distribution and presents the
Central Limit Theorem and the Law of Large Numbers. Chapter 8 summarizes
the material of the first seven chapters and discusses it in the statistical context.
Chapter 9 starts the second part of the book and the discussion of statistical inference. It provides an overview of the topics that are presented in the
subsequent chapters. The material of the first half is revisited.
Chapters 10 to 12 introduce the basic tools of statistical inference, namely
point estimation, estimation with a confidence interval, and the testing of statistical hypotheses. All these concepts are demonstrated in the context of a single
measurement.
Chapters 13 to 15 discuss inference that involves the comparison of two measurements. The context in which these comparisons are carried out is that of
regression, which relates the distribution of a response to an explanatory variable.
In Chapter 13 the response is numeric and the explanatory variable is a factor
with two levels. In Chapter 14 both the response and the explanatory variable
are numeric and in Chapter 15 the response is a factor with two levels.
Chapter 16 ends the book with the analysis of two case studies. These
analyses require the application of the tools that are presented throughout the
book.
This book was originally written for a pair of courses at the University of the
People. As such, each part was restricted to 8 chapters. Due to lack of space,
some important material, especially the concepts of correlation and statistical
independence, was omitted. In future versions of the book I hope to fill this
gap.
Large portions of this book, mainly in the first chapters and some of the
quizzes, are based on material from the online book “Collaborative Statistics”
by Barbara Illowsky and Susan Dean (Connexions, March 2, 2010. http://
cnx.org/content/col10522/1.37/). Most of the material was edited by this
author, who is the only person responsible for any errors that were introduced
in the process of editing.
Case studies that are presented in the second part of the book are taken
from the Rice Virtual Lab in Statistics and can be found in its Case Studies
section. The responsibility for mistakes in the analysis of the data, if such
mistakes are found, is my own.
I would like to thank my mother Ruth who, apart from giving birth, feeding
and educating me, has also helped to improve the pedagogical structure of this
text. I would also like to thank Gary Engstrom for correcting many of the
mistakes in English that I made.
This book is open source and may be used by anyone who wishes to do so,
under the conditions of the Creative Commons Attribution License (CC-BY
3.0).
Jerusalem, March 2011
Benjamin Yakir
Contents

Preface

I Introduction to Statistics

1 Introduction
  1.1 Student Learning Objectives
  1.2 Why Learn Statistics?
  1.3 Statistics
  1.4 Probability
  1.5 Key Terms
  1.6 The R Programming Environment
      1.6.1 Some Basic R Commands
  1.7 Solved Exercises
  1.8 Summary

2 Sampling and Data Structures
  2.1 Student Learning Objectives
  2.2 The Sampled Data
      2.2.1 Variation in Data
      2.2.2 Variation in Samples
      2.2.3 Frequency
      2.2.4 Critical Evaluation
  2.3 Reading Data into R
      2.3.1 Saving the File and Setting the Working Directory
      2.3.2 Reading a CSV File into R
      2.3.3 Data Types
  2.4 Solved Exercises
  2.5 Summary

3 Descriptive Statistics
  3.1 Student Learning Objectives
  3.2 Displaying Data
      3.2.1 Histograms
      3.2.2 Box Plots
  3.3 Measures of the Center of Data
      3.3.1 Skewness, the Mean and the Median
  3.4 Measures of the Spread of Data
  3.5 Solved Exercises
  3.6 Summary

4 Probability
  4.1 Student Learning Objective
  4.2 Different Forms of Variability
  4.3 A Population
  4.4 Random Variables
      4.4.1 Sample Space and Distribution
      4.4.2 Expectation and Standard Deviation
  4.5 Probability and Statistics
  4.6 Solved Exercises
  4.7 Summary

5 Random Variables
  5.1 Student Learning Objective
  5.2 Discrete Random Variables
      5.2.1 The Binomial Random Variable
      5.2.2 The Poisson Random Variable
  5.3 Continuous Random Variable
      5.3.1 The Uniform Random Variable
      5.3.2 The Exponential Random Variable
  5.4 Solved Exercises
  5.5 Summary

6 The Normal Random Variable
  6.1 Student Learning Objective
  6.2 The Normal Random Variable
      6.2.1 The Normal Distribution
      6.2.2 The Standard Normal Distribution
      6.2.3 Computing Percentiles
      6.2.4 Outliers and the Normal Distribution
  6.3 Approximation of the Binomial Distribution
      6.3.1 Approximate Binomial Probabilities and Percentiles
      6.3.2 Continuity Corrections
  6.4 Solved Exercises
  6.5 Summary

7 The Sampling Distribution
  7.1 Student Learning Objective
  7.2 The Sampling Distribution
      7.2.1 A Random Sample
      7.2.2 Sampling From a Population
      7.2.3 Theoretical Models
  7.3 Law of Large Numbers and Central Limit Theorem
      7.3.1 The Law of Large Numbers
      7.3.2 The Central Limit Theorem (CLT)
      7.3.3 Applying the Central Limit Theorem
  7.4 Solved Exercises
  7.5 Summary

8 Overview and Integration
  8.1 Student Learning Objective
  8.2 An Overview
  8.3 Integrated Applications
      8.3.1 Example 1
      8.3.2 Example 2
      8.3.3 Example 3
      8.3.4 Example 4
      8.3.5 Example 5

II Statistical Inference

9 Introduction to Statistical Inference
  9.1 Student Learning Objectives
  9.2 Key Terms
  9.3 The Cars Data Set
  9.4 The Sampling Distribution
      9.4.1 Statistics
      9.4.2 The Sampling Distribution
      9.4.3 Theoretical Distributions of Observations
      9.4.4 Sampling Distribution of Statistics
      9.4.5 The Normal Approximation
      9.4.6 Simulations
  9.5 Solved Exercises
  9.6 Summary

10 Point Estimation
  10.1 Student Learning Objectives
  10.2 Estimating Parameters
  10.3 Estimation of the Expectation
      10.3.1 The Accuracy of the Sample Average
      10.3.2 Comparing Estimators
  10.4 Variance and Standard Deviation
  10.5 Estimation of Other Parameters
  10.6 Solved Exercises
  10.7 Summary

11 Confidence Intervals
  11.1 Student Learning Objectives
  11.2 Intervals for Mean and Proportion
      11.2.1 Examples of Confidence Intervals
      11.2.2 Confidence Intervals for the Mean
      11.2.3 Confidence Intervals for a Proportion
  11.3 Intervals for Normal Measurements
      11.3.1 Confidence Intervals for a Normal Mean
      11.3.2 Confidence Intervals for a Normal Variance
  11.4 Choosing the Sample Size
  11.5 Solved Exercises
  11.6 Summary

12 Testing Hypothesis
  12.1 Student Learning Objectives
  12.2 The Theory of Hypothesis Testing
      12.2.1 An Example of Hypothesis Testing
      12.2.2 The Structure of a Statistical Test of Hypotheses
      12.2.3 Error Types and Error Probabilities
      12.2.4 p-Values
  12.3 Testing Hypothesis on Expectation
  12.4 Testing Hypothesis on Proportion
  12.5 Solved Exercises
  12.6 Summary

13 Comparing Two Samples
  13.1 Student Learning Objectives
  13.2 Comparing Two Distributions
  13.3 Comparing the Sample Means
      13.3.1 An Example of a Comparison of Means
      13.3.2 Confidence Interval for the Difference
      13.3.3 The t-Test for Two Means
  13.4 Comparing Sample Variances
  13.5 Solved Exercises
  13.6 Summary

14 Linear Regression
  14.1 Student Learning Objectives
  14.2 Points and Lines
      14.2.1 The Scatter Plot
      14.2.2 Linear Equation
  14.3 Linear Regression
      14.3.1 Fitting the Regression Line
      14.3.2 Inference
  14.4 R-squared and the Variance of Residuals
  14.5 Solved Exercises
  14.6 Summary

15 A Bernoulli Response
  15.1 Student Learning Objectives
  15.2 Comparing Sample Proportions
  15.3 Logistic Regression
  15.4 Solved Exercises

16 Case Studies
  16.1 Student Learning Objective
  16.2 A Review
  16.3 Case Studies
      16.3.1 Physicians’ Reactions to the Size of a Patient
      16.3.2 Physical Strength and Job Performance
  16.4 Summary
      16.4.1 Concluding Remarks
      16.4.2 Discussion in the Forum
Part I

Introduction to Statistics
Chapter 1

Introduction

1.1 Student Learning Objectives
This chapter introduces the basic concepts of statistics. Special attention is
given to concepts that are used in the first part of this book, the part that
deals with graphical and numeric statistical ways to describe data (descriptive
statistics) as well as mathematical theory of probability that enables statisticians
to draw conclusions from data.
The course applies the widely used freeware programming environment for
statistical analysis, known as R. In this chapter we will discuss the installation
of the program and present very basic features of that system.
By the end of this chapter, the student should be able to:
• Recognize key terms in statistics and probability.
• Install the R program on an accessible computer.
• Learn and apply a few basic operations of the computational system R.
1.2 Why Learn Statistics?
You are probably asking yourself the question, “When and where will I use
statistics?”. If you read any newspaper or watch television, or use the Internet,
you will see statistical information. There are statistics about crime, sports,
education, politics, and real estate. Typically, when you read a newspaper
article or watch a news program on television, you are given sample information.
With this information, you may make a decision about the correctness of a
statement, claim, or “fact”. Statistical methods can help you make the “best
educated guess”.
Since you will undoubtedly be given statistical information at some point in
your life, you need to know some techniques to analyze the information thoughtfully. Think about buying a house or managing a budget. Think about your
chosen profession. The fields of economics, business, psychology, education, biology, law, computer science, police science, and early childhood development
require at least one course in statistics.
[Figure 1.1 here: a bar plot with time (in hours) on the x-axis and frequency on the y-axis.]

Figure 1.1: Frequency of Average Time (in Hours) Spent Sleeping per Night
Included in this chapter are the basic ideas and words of probability and
statistics. In the process of learning the first part of the book, and more so in
the second part of the book, you will understand that statistics and probability
work together.
1.3 Statistics
The science of statistics deals with the collection, analysis, interpretation, and
presentation of data. We see and use data in our everyday lives. To be able
to use data correctly is essential to many professions and is in your own best
self-interest.
For example, assume the average time (in hours, to the nearest half-hour) a
group of people sleep per night has been recorded. Consider the following data:
5, 5.5, 6, 6, 6, 6.5, 6.5, 6.5, 6.5, 7, 7, 8, 8, 9 .
In Figure 1.1 this data is presented in a graphical form (called a bar plot). A bar
plot consists of a number axis (the x-axis) and bars (vertical lines) positioned
above the number axis. The length of each bar corresponds to the number
of data points that obtain the given numerical value. In the given plot the
frequency of average time (in hours) spent sleeping per night is presented with
hours of sleep on the horizontal x-axis and frequency on the vertical y-axis.
Think of the following questions:
• Would the bar plot constructed from data collected from a different group
of people look the same as or different from the example? Why?
• If the same experiment were carried out in a different group of the same
size and age composition as the one used for the example, do you think
the results would be the same? Why or why not?
• Where does the data appear to cluster? How could you interpret the
clustering?
The questions above ask you to analyze and interpret your data. With this
example, you have begun your study of statistics.
In this course, you will learn how to organize and summarize data. Organizing and summarizing data is called descriptive statistics. Two ways to
summarize data are by graphing and by numbers (for example, finding an average). In the second part of the book you will also learn how to use formal
methods for drawing conclusions from “good” data. The formal methods are
called inferential statistics. Statistical inference uses probabilistic concepts to
determine if conclusions drawn are reliable or not.
Effective interpretation of data is based on good procedures for producing
data and thoughtful examination of the data. In the process of learning how
to interpret data you will probably encounter what may seem to be too many
mathematical formulae that describe these procedures. However, you should
always remember that the goal of statistics is not to perform numerous calculations using the formulae, but to gain an understanding of your data. The
calculations can be done using a calculator or a computer. The understanding
must come from you. If you can thoroughly grasp the basics of statistics, you
can be more confident in the decisions you make in life.
1.4 Probability
Probability is the mathematical theory used to study uncertainty. It provides
tools for the formalization and quantification of the notion of uncertainty. In
particular, it deals with the chance of an event occurring. For example, if the
different potential outcomes of an experiment are equally likely to occur then
the probability of each outcome is taken to be the reciprocal of the number of
potential outcomes. As an illustration, consider tossing a fair coin. There are
two possible outcomes – a head or a tail – and the probability of each outcome
is 1/2.
If you toss a fair coin 4 times, the outcomes may not necessarily be 2 heads
and 2 tails. However, if you toss the same coin 4,000 times, the outcomes will
be close to 2,000 heads and 2,000 tails. It is very unlikely to obtain more than
2,060 tails and it is similarly unlikely to obtain less than 1,940 tails. This is
consistent with the expected theoretical probability of heads in any one toss.
Even though the outcomes of a few repetitions are uncertain, there is a regular
pattern of outcomes when the number of repetitions is large. Statistics exploits
this pattern regularity in order to make extrapolations from the observed sample
to the entire population.
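This regularity can be demonstrated with a short simulation of the kind the book uses throughout. The sketch below is my own illustration, not one of the book's examples; it uses the R function “sample” (introduced later in the book) to toss a virtual fair coin 4,000 times:

```r
# A minimal sketch: tossing a fair coin 4,000 times with the function
# "sample", which picks values at random with equal probability.
tosses <- sample(c("Head", "Tail"), size = 4000, replace = TRUE)

# Count the outcomes; each count should come out close to 2,000.
table(tosses)
```

Running the code several times produces slightly different counts each time, but they all stay near 2,000, which is the regular pattern that emerges when the number of repetitions is large.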
The theory of probability began with the study of games of chance such as
poker. Today, probability is used to predict the likelihood of an earthquake, of
rain, or whether you will get an “A” in this course. Doctors use probability
to determine the chance of a vaccination causing the disease the vaccination is
supposed to prevent. A stockbroker uses probability to determine the rate of
return on a client’s investments. You might use probability to decide to buy a
lottery ticket or not.
Although probability is instrumental for the development of the theory of
statistics, in this introductory course we will not develop the mathematical theory of probability. Instead, we will concentrate on the philosophical aspects of
the theory and use computerized simulations in order to demonstrate probabilistic computations that are applied in statistical inference.
1.5 Key Terms
In statistics, we generally want to study a population. You can think of a
population as an entire collection of persons, things, or objects under study.
To study the larger population, we select a sample. The idea of sampling is
to select a portion (or subset) of the larger population and study that portion
(the sample) to gain information about the population. Data are the result of
sampling from a population.
Because it takes a lot of time and money to examine an entire population,
sampling is a very practical technique. If you wished to compute the overall
grade point average at your school, it would make sense to select a sample of
students who attend the school. The data collected from the sample would
be the students’ grade point averages. In presidential elections, opinion poll
samples of 1,000 to 2,000 people are taken. The opinion poll is supposed to
represent the views of the people in the entire country. Manufacturers of canned
carbonated drinks take samples to determine whether the manufactured 16-ounce
containers do indeed contain 16 ounces of the drink.
From the sample data, we can calculate a statistic. A statistic is a number
that is a property of the sample. For example, if we consider one math class to
be a sample of the population of all math classes, then the average number of
points earned by students in that one math class at the end of the term is an
example of a statistic. The statistic can be used as an estimate of a population
parameter. A parameter is a number that is a property of the population. Since
we considered all math classes to be the population, then the average number of
points earned per student over all the math classes is an example of a parameter.
One of the main concerns in the field of statistics is how accurately a statistic
estimates a parameter. The accuracy really depends on how well the sample
represents the population. The sample must contain the characteristics of the
population in order to be a representative sample.
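The relation between a statistic and a parameter can be illustrated with a small simulation. The population of exam scores below is hypothetical data invented for this sketch, not taken from the book:

```r
# A hypothetical population of 1,000 exam scores (artificial data).
population <- round(runif(1000, min = 40, max = 100))

# The parameter: the average over the entire population.
mean(population)

# The statistic: the average of a random sample of 40 scores,
# which serves as an estimate of the parameter.
sample.scores <- sample(population, size = 40)
mean(sample.scores)
```

The two averages are typically close but not identical; how close the statistic tends to be to the parameter is exactly the question of accuracy that the second part of the book addresses.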
Two words that come up often in statistics are average and proportion. If
you were to take three exams in your math classes and obtained scores of 86, 75,
and 92, you calculate your average score by adding the three exam scores and
dividing by three (your average score would be 84.3 to one decimal place). If, in
1.6. THE R PROGRAMMING ENVIRONMENT
7
your math class, there are 40 students and 22 are men and 18 are women, then
the proportion of men students is 22/40 and the proportion of women students
is 18/40. Average and proportion are discussed in more detail in later chapters.
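The two computations described above can be carried out in R, using the function “mean” that is discussed later in the book:

```r
# The average of the three exam scores from the text.
scores <- c(86, 75, 92)
mean(scores)  # 84.33..., i.e. 84.3 to one decimal place

# The proportions of men and women in the class of 40 students.
22/40  # proportion of men: 0.55
18/40  # proportion of women: 0.45
```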
1.6 The R Programming Environment
The R Programming Environment is a widely used open source system for statistical analysis and statistical programming. It includes thousands of functions
for the implementation of both standard and exotic statistical methods and it
is probably the most popular system in the academic world for the development
of new statistical tools. We will use R in order to apply the statistical methods that will be discussed in the book to some example data sets and in order
to demonstrate, via simulations, concepts associated with probability and its
application in statistics.
The demonstrations in the book involve very basic R programming skills and
the applications are implemented using, in most cases, simple and natural code.
A detailed explanation will accompany the code that is used.
Learning R, like the learning of any other programming language, can be
achieved only through practice. Hence, we strongly recommend that you not
only read the code presented in the book but also run it yourself, in parallel to
the reading of the provided explanations. Moreover, you are encouraged to play
with the code: introduce changes in the code and in the data and see how the
output changes as a result. One should not be afraid to experiment. At worst,
the computer may crash or freeze. In both cases, restarting the computer will
solve the problem . . .
You may download R from the R project home page http://www.r-project.
org and install it on the computer that you are using.[1]
1.6.1 Some Basic R Commands
R is an object-oriented programming system. During the session you may create and manipulate objects by the use of functions that are part of the basic
installation. You may also use the R programming language. Most of the functions that are part of the system are themselves written in the R language and
one may easily write new functions or modify existing functions to suit specific
needs.
Let us start by opening the R Console window by double-clicking on the
R icon. Type in the R Console window, immediately after the “>” prompt,
the expression “1+2” and then hit the Return key. (Do not include the double
quotation marks in the expression that you type!):
> 1+2
[1] 3
>
The prompt “>” indicates that the system is ready to receive commands. Writing an expression, such as “1+2”, and hitting the Return key sends the expression
[1] A detailed explanation of how to install the system on the Windows XP operating
system may be found here: http://pluto.huji.ac.il/~msby/StatThink/install_R_WinXP.html
to be executed. The execution of the expression may produce an object, in this
case an object that is composed of a single number, the number “3”.
Whenever required, the R system takes an action. If no other specifications
are given regarding the required action then the system will apply the preprogrammed action. This action is called the default action. In the case of
hitting the Return key after the expression that we wrote the default is to
display the produced object on the screen.
Next, let us demonstrate R in a more meaningful way by using it in order
to produce the bar-plot of Figure 1.1. First we have to input the data. We
will produce a sequence of numbers that form the data.[2] For that we will use
the function “c” that combines its arguments and produces a sequence with the
arguments as the components of the sequence. Write the expression:
> c(5,5.5,6,6,6,6.5,6.5,6.5,6.5,7,7,8,8,9)
at the prompt and hit return. The result should look like this:
> c(5,5.5,6,6,6,6.5,6.5,6.5,6.5,7,7,8,8,9)
[1] 5.0 5.5 6.0 6.0 6.0 6.5 6.5 6.5 6.5 7.0 7.0 8.0 8.0 9.0
>
The function “c” is an example of an R function. A function has a name, “c”
in this case, that is followed by brackets that include the input to the function.
We call the components of the input the arguments of the function. Arguments
are separated by commas. A function produces an output, which is typically
an R object. In the current example an object of the form of a sequence was
created and, according to the default application of the system, was sent to the
screen and not saved.
If we want to create an object for further manipulation then we should save it and give it a name. For example, if we want to save the vector of data under the name “X” we may write the following expression at the prompt (and then hit return):

> X <- c(5,5.5,6,6,6,6.5,6.5,6.5,6.5,7,7,8,8,9)

The arrow that appears after the “X” is produced by typing the less than key “<” followed by the minus key “-”, with no space in between. Notice that this time nothing is displayed on the screen: the produced object is saved under the name “X” rather than printed. Typing the name of an existing object at the prompt displays its content. Note also that R distinguishes between lowercase and uppercase letters. If we type the lowercase “x” at the prompt we obtain an error message:

> x
Error: object “x” not found

An object named “x” does not exist in the R system and we have not created such an object. The object “X”, on the other hand, does exist.
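To recap the mechanics of assignment, here is a short sketch. The function “exists”, which reports whether an object of a given name is present, is standard R but is not used elsewhere in this chapter:

```r
# Create an object named "X" by assignment; the arrow "<-" is
# typed as the less-than key "<" followed by the minus key "-".
X <- c(5, 5.5, 6, 6, 6, 6.5, 6.5, 6.5, 6.5, 7, 7, 8, 8, 9)

# Typing the name of an existing object displays its content.
X

# R distinguishes between lowercase and uppercase letters:
exists("X")  # TRUE: the object "X" was created above
exists("x")  # FALSE, unless an object named "x" was created
```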
Names of functions that are part of the system are fixed, but you are free to choose names for the objects that you create. For example, if one wants to create an object by the name “my.vector” that contains the numbers 3, 7, 3, 3, and -5 then one may write the expression “my.vector <- c(3,7,3,3,-5)” at the prompt and hit return.
Next, let us apply the function “table” to the object “X”:
> table(X)
X
  5 5.5   6 6.5   7   8   9
  1   1   3   4   2   2   1
Notice that the output of the function “table” is a table of the different levels
of the input vector and the frequency of each level. This output is yet another
type of an object.
The bar-plot of Figure 1.1 can be produced by the application of the function
“plot” to the object that is produced as an output of the function “table”:
> plot(table(X))
Observe that a graphical window was opened with the target plot. The plot that
appears in the graphical window should coincide with the plot in Figure 1.3.
This plot is practically identical to the plot in Figure 1.1. The only difference is in the names given to the axes. These names were changed in Figure 1.1 for clarity.
Clearly, if one wants to produce a bar-plot of other numerical data, all one has to do is replace the object “X” in the expression “plot(table(X))” by an object that contains the other data. For example, to plot the data in “my.vector” you
may use “plot(table(my.vector))”.
1.7 Solved Exercises
Question 1.1. A potential candidate for a political position in some state is interested in knowing her chances to win the primaries of her party and be selected as the party's candidate for the position. In order to examine the opinions of her party's voters she hires the services of a polling agency. The polling is conducted among 500 registered voters of the party. One of the questions the pollsters asked refers to the willingness of the voters to vote for a female candidate for the job. Forty-two percent of the people asked said that they prefer to have a woman running for the job. Thirty-eight percent said that the candidate's gender is irrelevant. The rest prefer a male candidate. Which of the following is (i) a population (ii) a sample (iii) a parameter and (iv) a statistic:
1. The 500 registered voters.
2. The percentage, among all registered voters of the given party, of those
that prefer a male candidate.
3. The number 42% that corresponds to the percentage of those that prefer
a female candidate.
4. The voters in the state that are registered to the given party.
Figure 1.3: The Plot Produced by the Expression “plot(table(X))”
Solution (to Question 1.1.1): According to the information in the question the polling was conducted among 500 registered voters. The 500 registered voters correspond to the sample.
Solution (to Question 1.1.2): The percentage, among all registered voters
of the given party, of those that prefer a male candidate is a parameter. This
quantity is a characteristic of the population.
Solution (to Question 1.1.3): It is given that 42% of the sample prefer a
female candidate. This quantity is a numerical characteristic of the data, of the
sample. Hence, it is a statistic.
Solution (to Question 1.1.4): The voters in the state that are registered to the given party constitute the target population.
Question 1.2. The number of customers that wait in front of a coffee shop at
the opening was reported during 25 days. The results were:
4, 2, 1, 1, 0, 2, 1, 2, 4, 2, 5, 3, 1, 5, 1, 5, 1, 2, 1, 1, 3, 4, 2, 4, 3 .
Figure 1.4: The Plot Produced by the Expression “plot(table(n.cost))”
1. Identify the number of days in which 5 customers were waiting.
2. The number of waiting customers that occurred the largest number of times.
3. The number of waiting customers that occurred the least number of times.
Solution (to Question 1.2): One may read the data into R and create a table
using the code:
> n.cost <- c(4,2,1,1,0,2,1,2,4,2,5,3,1,5,1,5,1,2,1,1,3,4,2,4,3)
> table(n.cost)
n.cost
0 1 2 3 4 5
1 8 6 3 4 3
For convenience, one may also create the bar plot of the data using the code:
> plot(table(n.cost))
The bar plot is presented in Figure 1.4.
Solution (to Question 1.2.1): The number of days in which 5 customers were waiting is 3, since the frequency of the value “5” in the data is 3. That can be seen from the table by noticing that the number below the value “5” is 3. It can also be seen from the bar plot by observing that the height of the bar above the value “5” is equal to 3.
Solution (to Question 1.2.2): The number of waiting customers that occurred the largest number of times is 1. The value “1” occurred 8 times, more than any other value. Notice that the bar above this value is the highest.
Solution (to Question 1.2.3): The value “0”, which occurred only once, occurred the least number of times.
1.8 Summary
Glossary
Data: A set of observations taken on a sample from a population.
Statistic: A numerical characteristic of the data. A statistic estimates the corresponding population parameter. For example, the average number of contributions to the course's forum for this term is an estimate of the average number of contributions in all future terms (the parameter).
Statistics: The science that deals with processing, presenting, and drawing inference from data.
Probability: A mathematical field that models and investigates the notion of
randomness.
Discuss in the forum
A sample is a subgroup of the population that is supposed to represent the
entire population. In your opinion, is it appropriate to attempt to represent the
entire population only by a sample?
When you formulate your answer to this question it may be useful to come up with an example of a question from your own field of interest that one may want to investigate. In the context of this example you may identify a target population which you think is suited for the investigation of the given question. The appropriateness of using a sample can then be discussed in the context of the example question and the population you have identified.
Chapter 2
Sampling and Data Structures
2.1 Student Learning Objectives
In this chapter we deal with issues associated with the data that is obtained from a sample. The variability associated with this data is emphasized and critical thinking about the validity of the data is encouraged. A method for the introduction of data from an external source into R is proposed and the data types used by R for storage are described. By the end of this chapter, the student should be able to:
• Recognize potential difficulties with sampled data.
• Read an external data file into R.
• Create and interpret frequency tables.
2.2 The Sampled Data
The aim in statistics is to learn the characteristics of a population on the basis
of a sample selected from the population. An essential part of this analysis
involves consideration of variation in the data.
2.2.1 Variation in Data
Variation is given a central role in statistics. To some extent, the assessment of variation and the quantification of its contribution to uncertainties in making inferences is the statistician's main concern.
Variation is present in any set of data. For example, 16-ounce cans of beverage may contain more or less than 16 ounces of liquid. In one study, eight 16-ounce cans were measured and produced the following amounts (in ounces) of beverage:

15.8, 16.1, 15.2, 14.8, 15.8, 15.9, 16.0, 15.5 .

Measurements of the amount of beverage in a 16-ounce can may vary because the conditions of measurement varied or because the exact amount, 16 ounces of
liquid, was not put into the cans. Manufacturers regularly run tests to determine
if the amount of beverage in a 16-ounce can falls within the desired range.
Be aware that if an investigator collects data, the data may vary somewhat
from the data someone else is taking for the same purpose. This is completely
natural. However, if two or more investigators are taking data from the same source and get very different results, it is time for them to reevaluate their data-collection methods and data recording accuracy.
2.2.2 Variation in Samples
Two or more samples from the same population, all having the same characteristics as the population, may nonetheless be different from each other. Suppose Doreen and Jung both decide to study the average amount of time students sleep each night and use all students at their college as the population. Doreen may decide to sample randomly a given number of students from the entire body of college students. Jung, on the other hand, may decide to sample randomly a given number of classes and survey all students in the selected classes.
Doreen’s method is called random sampling whereas Jung’s method is called
cluster sampling. Doreen’s sample will be different from Jung’s sample even
though both samples have the characteristics of the population. Even if Doreen
and Jung used the same sampling method, in all likelihood their samples would
be different. Neither would be wrong, however.
If Doreen and Jung took larger samples (i.e. the number of data values
is increased), their sample results (say, the average amount of time a student
sleeps) would be closer to the actual population average. But still, their samples
would be, most probably, different from each other.
The size of a sample (often called the number of observations) is important.
The examples you have seen in this book so far have been small. Samples of only
a few hundred observations, or even smaller, are sufficient for many purposes.
In polling, samples of 1,200 to 1,500 observations are considered large enough and good enough if the survey is random and is well done. The theory of statistical inference, which is the subject matter of the second part of this book, provides justification for these claims.
2.2.3 Frequency
The primary way of summarizing the variability of data is via the frequency
distribution. Consider an example. Twenty students were asked how many
hours they worked per day. Their responses, in hours, are listed below:
5, 6, 3, 3, 2, 4, 7, 5, 2, 3, 5, 6, 5, 4, 4, 3, 5, 2, 5, 3 .
Let us create an R object by the name “work.hours” that contains these data:
> work.hours <- c(5,6,3,3,2,4,7,5,2,3,5,6,5,4,4,3,5,2,5,3)
> table(work.hours)
work.hours
2 3 4 5 6 7
3 5 3 6 2 1
Recall that the function “table” takes as input a sequence of data and produces
as output the frequencies of the different values.
We may obtain a clearer understanding of the meaning of the output of the function “table” if we present the outcome as a frequency table, listing the different data values in ascending order together with their frequencies. To that end we may apply the function “data.frame” to the output of the “table” function and obtain:
> data.frame(table(work.hours))
  work.hours Freq
1          2    3
2          3    5
3          4    3
4          5    6
5          6    2
6          7    1
A frequency is the number of times a given datum occurs in a data set.
According to the table above, there are three students who work 2 hours, five
students who work 3 hours, etc. The total of the frequency column, 20, represents the total number of students included in the sample.
The function “data.frame” transforms its input into a data frame, which is
the standard way of storing statistical data. We will introduce data frames in
more detail in Section 2.3 below.
A relative frequency is the fraction of times a value occurs. To find the
relative frequencies, divide each frequency by the total number of students in
the sample – 20 in this case. Relative frequencies can be written as fractions,
percents, or decimals.
As an illustration let us compute the relative frequencies in our data:
> freq <- table(work.hours)
> freq
work.hours
2 3 4 5 6 7
3 5 3 6 2 1
> sum(freq)
[1] 20
> freq/sum(freq)
work.hours
   2    3    4    5    6    7
0.15 0.25 0.15 0.30 0.10 0.05
We stored the frequencies in an object called “freq”. The content of the object is the frequencies 3, 5, 3, 6, 2 and 1. The function “sum” sums the components of its input. The sum of the frequencies is the sample size, the total number of students that responded to the survey, which is 20. Hence, when we apply the function “sum” to the object “freq” we get 20 as an output.
The outcome of dividing an object by a number is the division of each element in the object by the given number. Therefore, when we divide “freq” by “sum(freq)” (the number 20) we get a sequence of relative frequencies. The first entry of this sequence is 3/20 = 0.15, the second entry is 5/20 = 0.25, and the last entry is 1/20 = 0.05. The sum of the relative frequencies should always be equal to 1:
> sum(freq/sum(freq))
[1] 1
The cumulative relative frequency is the accumulation of previous relative
frequencies. To find the cumulative relative frequencies, add all the previous
relative frequencies to the relative frequency of the current value. Alternatively,
we may apply the function “cumsum” to the sequence of relative frequencies:
> cumsum(freq/sum(freq))
   2    3    4    5    6    7
0.15 0.40 0.55 0.85 0.95 1.00
Observe that the cumulative relative frequency of the smallest value 2 is the
frequency of that value (0.15). The cumulative relative frequency of the second
value 3 is the sum of the relative frequency of the smaller value (0.15) and
the relative frequency of the current value (0.25), which produces a total of
0.15 + 0.25 = 0.40. Likewise, for the third value 4 we get a cumulative relative
frequency of 0.15 + 0.25 + 0.15 = 0.55. The last entry of the cumulative relative
frequency column is one, indicating that one hundred percent of the data has
been accumulated.
The computation of the cumulative relative frequency was carried out with
the aid of the function “cumsum”. This function takes as an input argument a
numerical sequence and produces as output a numerical sequence of the same
length with the cumulative sums of the components of the input sequence.
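As a small self-contained illustration of “cumsum”, applied here to the frequencies from the table above:

```r
freq <- c(3, 5, 3, 6, 2, 1)  # frequencies of 2, 3, ..., 7 work hours
cumsum(freq)                  # running totals: 3 8 11 17 19 20
cumsum(freq) / sum(freq)      # 0.15 0.40 0.55 0.85 0.95 1.00
```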
2.2.4 Critical Evaluation
Inappropriate methods of sampling and data collection may produce samples that do not represent the target population. A naïve application of statistical analysis to such data may produce misleading conclusions.
Consequently, it is important to evaluate critically the statistical analyses
we encounter before accepting the conclusions that are obtained as a result of
these analyses. Common problems that occur in data, and that one should be aware of, include:
Problems with Samples: A sample should be representative of the population. A sample that is not representative of the population is biased.
Biased samples may produce results that are inaccurate and not valid.
Data Quality: Avoidable errors may be introduced to the data via inaccurate
handling of forms, mistakes in the input of data, etc. Data should be
cleaned from such errors as much as possible.
Self-Selected Samples: Responses only by people who choose to respond, such as call-in surveys, are often biased.
Sample Size Issues: Samples that are too small may be unreliable. Larger
samples, when possible, are better. In some situations, small samples are
unavoidable and can still be used to draw conclusions. Examples: Crash
testing cars, medical testing for rare conditions.
Undue Influence: Collecting data or asking questions in a way that influences
the response.
Causality: A relationship between two variables does not mean that one causes
the other to occur. They may both be related (correlated) because of their
relationship to a third variable.
Self-Funded or Self-Interest Studies: A study performed by a person or
organization in order to support their claim. Is the study impartial? Read
the study carefully to evaluate the work. Do not automatically assume
that the study is good but do not automatically assume the study is bad
either. Evaluate it on its merits and the work done.
Misleading Use of Data: Improperly displayed graphs and incomplete data.
Confounding: Confounding in this context means confusing. When the effects
of multiple factors on a response cannot be separated. Confounding makes
it difficult or impossible to draw valid conclusions about the effect of each
factor.
2.3 Reading Data into R
In the examples so far the size of the data set was very small and we were able
to input the data directly into R with the use of the function “c”. In more
practical settings the data sets to be analyzed are much larger and it is very
inefficient to enter them manually. In this section we learn how to upload data
from a file in the Comma Separated Values (CSV) format.
The file “ex1.csv” contains data on the sex and height of 100 individuals.
This file is given in the CSV format. The file can be found on the internet
at http://pluto.huji.ac.il/~msby/StatThink/Datasets/ex1.csv. We will
discuss the process of reading data from a file into R and use this file as an
illustration.
2.3.1 Saving the File and Setting the Working Directory
Before the file is read into R you may find it convenient to obtain a copy of the file, store it in some directory on the computer, and read the file from that directory. We recommend that you create a special directory in which you keep all the material associated with this course. In the explanations provided below we assume that the directory in which the file is stored is called “IntroStat”. (See Figure 2.1.)
Files in the CSV format are ordinary text files. They can be created manually or as a result of converting data stored in a different format into this particular format. A convenient way to produce, browse and edit CSV files is by the use of a standard electronic spreadsheet program such as Excel or Calc. The Excel spreadsheet is part of Microsoft's Office suite. The Calc spreadsheet is part of the OpenOffice suite that is freely distributed by the OpenOffice Organization.
Opening a CSV file with a spreadsheet program displays a spreadsheet with the content of the file. Values in the cells of the spreadsheet may be modified directly. (However, when saving, one should pay attention to save the file in the CSV format.) Similarly, new CSV files may be created by entering the data in an empty spreadsheet. The first row should include the name of the variable, preferably as a single character string with no empty spaces. The
Figure 2.1: The File “read.csv”
following rows may contain the data values associated with this variable. When saving, the spreadsheet should be saved in the CSV format by the use of the “Save As” dialog, choosing there the option of CSV in the “Save as Type” selection.
After saving a file with the data in a directory, R should be notified where the file is located in order to be able to read it. A simple way of doing so is by setting the directory with the file as R's working directory. The working directory is the first place R searches for files. Files produced by R are saved in that directory. In Windows, during an active R session, one may set the working directory to be some target directory with the “File/Change Dir...” dialog. This dialog is opened by selecting the option “File” on the left hand side of the menu bar on the top of the R Console window. Selecting the option “Change Dir...” in the menu that opens will start the dialog. (See Figure 2.2.) Browsing via this dialog window to the directory of choice, selecting it, and approving the selection by clicking the “OK” button in the dialog window will set the directory of choice as the working directory of R.
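Alternatively, the working directory may be queried and set from within R itself, using the standard functions “getwd” and “setwd”. These functions are not part of the walkthrough above, and the directory name below is only a hypothetical example:

```r
getwd()                  # prints the full path of the current working directory
# setwd("C:/IntroStat")  # would set a directory named "IntroStat" (hypothetical)
```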
Rather than changing the working directory every time that R is opened, one may set a selected directory to be R's working directory on opening. Again, we demonstrate how to do this on the Windows XP operating system.
The R icon was added to the Desktop when the R system was installed. The R Console is opened by double-clicking on this icon. One may change the properties of the icon so that it sets a directory of choice as R's working directory.
In order to do so, click on the icon with the mouse's right button. A menu
Figure 2.2: Changing The Working Directory
opens in which you should select the option “Properties”. As a result, a dialog window opens. (See Figure 2.3.) Look at the line that starts with the words “Start in” and continues with the name of a directory, the current working directory. The name of this directory is enclosed in double quotes and is given with its full path, i.e. its address on the computer. This name and path should be changed to the name and path of the directory that you want to fix as the new working directory.
Consider again Figure 2.1. Imagine that one wants to fix the directory that contains the file “ex1.csv” as the permanent working directory. Notice that the full address of the directory appears in the “Address” bar on the top of the window. One may copy the address and paste it in place of the name of the current working directory that is specified in the “Properties” dialog of the R icon. One should make sure that the address of the new directory is, again, placed between double quotes. (See in Figure 2.4 the dialog window after changing the address of the working directory. Compare this to Figure 2.3 of the window before the change.) After approving the change by clicking the “OK” button the new working directory is set. Henceforth, each time that the R Console is opened by double-clicking the icon it will have the designated directory as its working directory.
In the rest of this book we assume that a designated directory is set as R’s
working directory and that all external files that need to be read into R, such
as “ex1.csv” for example, are saved in that working directory. Once a working
directory has been set then the history of subsequent R sessions is stored in that
directory. Hence, if you choose to save the image of the session when you end
the session then objects created in the session will be uploaded the next time
Figure 2.3: Setting the Working Directory (Before the Change)
Figure 2.4: Setting the Working Directory (After the Change)
the R Console is opened.
2.3.2 Reading a CSV File into R
Now that a copy of the file “ex1.csv” is placed in the working directory we
would like to read its content into R. Reading of files in the CSV format can be
carried out with the R function “read.csv”. To read the file of the example we
run the following line of code in the R Console window:
> ex.1 <- read.csv("ex1.csv")
> ex.1
         id    sex height
1   5696379 FEMALE    182
2   3019088   MALE    168
3   2038883   MALE    172
4   1920587 FEMALE    154
5   6006813   MALE    174
6   4055945 FEMALE    176
.         .      .      .
.         .      .      .
.         .      .      .
98  9383288   MALE    195
99  1582961 FEMALE    129
100 9805356   MALE    172
>
(Notice that we have erased the middle rows. In the R Console window you should obtain the full table. However, in order to see the upper part of the output you may need to scroll up the window.)
The object “ex.1”, the output of the function “read.csv”, is a data frame. Data frames are the standard tabular format for storing statistical data. The columns of the table are called variables and correspond to measurements. In this example the three variables are:
id: A 7-digit number that serves as a unique identifier of the subject.
sex: The sex of each subject. The values are either “MALE” or “FEMALE”.
height: The height (in centimeters) of each subject. A numerical value.
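A data frame can be inspected with generic functions such as “dim”, which reports the numbers of rows and columns, and “names”, which reports the variable names. A sketch using a small hand-made frame with the same structure (the full “ex.1” frame has 100 rows):

```r
# A miniature version of the ex.1 data frame, built by hand
# from the first three rows of the output shown above.
ex.small <- data.frame(
  id     = c(5696379, 3019088, 2038883),
  sex    = c("FEMALE", "MALE", "MALE"),
  height = c(182, 168, 172)
)
dim(ex.small)    # 3 3  (the full data frame would give 100 3)
names(ex.small)  # "id" "sex" "height"
```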
1 If the file is located in a different directory then the complete address, including the path to the file, should be provided. The file need not reside on the computer. One may provide, for example, a URL (an internet address) as the address. Thus, instead of saving the file of the example on the computer one may read its content into an R object by using the line of code “ex.1 <- read.csv("http://pluto.huji.ac.il/~msby/StatThink/Datasets/ex1.csv")”.

Question 2.2. The total number of calves that each of the cows in a study gave birth to was recorded, and the frequency of each total was stored in an object called “freq”. Applying the function “cumsum” to this object produces the cumulative frequencies:

> cumsum(freq)
 1  2  3  4  5  6  7
 4  7 18 28 32 38 45
1. How many cows were involved in this study?
2. How many cows gave birth to a total of 4 calves?
3. What is the relative frequency of cows that gave birth to at least 4 calves?
Solution (to Question 2.2.1): The total number of cows that were involved in this study is 45. The object “freq” contains the table of frequencies of the cows, divided according to the number of calves that they had. The cumulative frequency of all the cows that had 7 calves or less, which includes all cows in the study, is reported under the number “7” in the output of the expression “cumsum(freq)”. This number is 45.
Solution (to Question 2.2.2): The number of cows that gave birth to a total of 4 calves is 10. Indeed, the cumulative frequency of cows that gave birth to 4 calves or less is 28. The cumulative frequency of cows that gave birth to 3 calves or less is 18. The frequency of cows that gave birth to exactly 4 calves is the difference between these two numbers: 28 - 18 = 10.
Solution (to Question 2.2.3): The relative frequency of cows that gave birth to at least 4 calves is 27/45 = 0.6. Notice that the cumulative frequency of cows that gave birth to at most 3 calves is 18. The total number of cows is 45. Hence, the number of cows with 4 or more calves is the difference between these two numbers: 45 - 18 = 27. The relative frequency of such cows is the ratio between this number and the total number of cows: 27/45 = 0.6.
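The individual frequencies can also be recovered from the cumulative frequencies by taking successive differences. The function “diff” is not used in the text; the following is only a sketch:

```r
cum <- c(4, 7, 18, 28, 32, 38, 45)  # the output of cumsum(freq)
diff(c(0, cum))                      # the frequencies: 4 3 11 10 4 6 7

# The answer to Question 2.2.2, computed directly:
cum[4] - cum[3]                      # 28 - 18 = 10 cows with exactly 4 calves
```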
2.5 Summary
Glossary
Population: The collection, or set, of all individuals, objects, or measurements
whose properties are being studied.
Sample: A portion of the population under study. A sample is representative if it characterizes the population being studied.
Frequency: The number of times a value occurs in the data.
Relative Frequency: The ratio between the frequency and the size of data.
Cumulative Relative Frequency: The term applies to an ordered set of data
values from smallest to largest. The cumulative relative frequency is the
sum of the relative frequencies for all values that are less than or equal to
the given value.
Data Frame: A tabular format for storing statistical data. Columns correspond to variables and rows correspond to observations.
Variable: A measurement that may be carried out over a collection of subjects.
The outcome of the measurement may be numerical, which produces a
quantitative variable; or it may be non-numeric, in which case a factor is
produced.
Observation: The evaluation of a variable (or variables) for a given subject.
CSV Files: A digital format for storing data frames.
Factor: Qualitative data that is associated with categorization or the description of an attribute.
Quantitative: Data generated by numerical measurements.
Discuss in the forum
Factors are qualitative data that are associated with categorization or the description of an attribute. On the other hand, numeric data are generated by
numerical measurements. A common practice is to code the levels of factors
using numerical values. What do you think of this practice?
In the formulation of your answer to the question you may think of an example of a factor variable from your own field of interest. You may describe a benefit or a disadvantage that results from the use of numerical values to code the levels of this factor.
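As food for thought, note what R itself does when a factor is converted to numbers. The function “factor” is not part of this chapter; the sketch below shows that the conversion yields the internal level codes, not the original values, which is one way numerical coding of factors can mislead:

```r
sex <- factor(c("MALE", "FEMALE", "MALE"))
levels(sex)      # "FEMALE" "MALE" (levels are ordered alphabetically)
as.numeric(sex)  # 2 1 2: the internal codes, not meaningful quantities
```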
Chapter 3
Descriptive Statistics
3.1 Student Learning Objectives
This chapter deals with numerical and graphical ways to describe and display
data. This area of statistics is called descriptive statistics. You will learn to
calculate and interpret these measures and graphs. By the end of this chapter,
you should be able to:
• Use histograms and box plots in order to display data graphically.
• Calculate measures of central location: mean and median.
• Calculate measures of the spread: variance, standard deviation, and interquartile range.
• Identify outliers, which are values that do not fit the rest of the distribution.
3.2 Displaying Data
Once you have collected data, what will you do with it? Data can be described
and presented in many different formats. For example, suppose you are interested in buying a house in a particular area. You may have no clue about the
house prices, so you may ask your real estate agent to give you a sample data
set of prices. Looking at all the prices in the sample is often overwhelming. A
better way may be to look at the median price and the variation of prices. The
median and variation are just two ways that you will learn to describe data.
Your agent might also provide you with a graph of the data.
A statistical graph is a tool that helps you learn about the shape of the
distribution of a sample. The graph can be a more effective way of presenting
data than a mass of numbers because we can see where data clusters and where
there are only a few data values. Newspapers and the Internet use graphs to
show trends and to enable readers to compare facts and figures quickly.
Statisticians often start the analysis by graphing the data in order to get an
overall picture of it. Afterwards, more formal tools may be applied.
In the previous chapters we used the bar plot, in which bars whose heights indicate the frequencies of values in the data are placed over these values. In this chapter
Figure 3.1: Histogram of Height
our emphasis will be on histograms and box plots, which are other types of plots. Some of the other types of graphs that are frequently used, but will not be discussed in this book, are the stem-and-leaf plot, the frequency polygon (a type of broken line graph) and the pie chart. The types of plots that will be discussed and the types that will not are all tightly linked to the notion of frequency of the data that was introduced in Chapter 2, and are intended to give a graphical representation of this notion.
3.2.1 Histograms
The histogram is a frequently used method for displaying the distribution of
continuous numerical data. An advantage of a histogram is that it can readily
display large data sets. A rule of thumb is to use a histogram when the data
set consists of 100 values or more.
One may produce a histogram in R by the application of the function “hist” to a sequence of numerical data. Let us read into R the data frame “ex.1” that contains the data on sex and height and create a histogram of the heights:

> ex.1 <- read.csv("ex1.csv")
> hist(ex.1$height)
The outcome of the function is a plot that appears in the graphical window and is presented in Figure 3.1.
The data set, which is the content of the CSV file “ex1.csv”, was used in Chapter 2 in order to demonstrate the reading of data that is stored in an external file into R. The first line of the above script reads the data from “ex1.csv” into a data frame object named “ex.1” that maintains the data internally in R. The second line of the script produces the histogram. We will discuss below the code associated with this second line.
A histogram consists of contiguous boxes. It has both a horizontal axis and
a vertical axis. The horizontal axis is labeled with what the data represents (the
height, in this example). The vertical axis presents frequencies and is labeled
“Frequency”. By the examination of the histogram one can appreciate the shape
of the data, the center, and the spread of the data.
The histogram is constructed by dividing the range of the data (the x-axis)
into equal intervals, which are the bases for the boxes. The height of each box
represents the count of the number of observations that fall within the interval.
For example, consider the box with the base between 160 and 170. There is a total of 19 subjects with height larger than 160 but no more than 170 (that is, 160 < height ≤ 170). Consequently, the height of that box1 is 19.
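Such a count can be checked directly with a logical condition. A minimal sketch with a small made-up vector is given below; with the book's data one would apply the same condition to “ex.1$height”:

```r
# A made-up sample of heights (not the book's data):
height <- c(182, 168, 172, 154, 174, 176, 161, 170, 165, 190)

# Number of observations in the interval (160, 170]:
sum(height > 160 & height <= 170)  # 4
```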
The input to the function “hist” should be a sequence of numerical values.
In principle, one may use the function “c” to produce a sequence of data and apply the histogram plotting function to the output of the sequence-producing function. However, in the current case we already have the data stored in the data frame “ex.1”; all we need to learn is how to extract that data so it can be used as input to the function “hist” that plots the histogram.
Notice the structure of the input that we have used in order to construct
the histogram of the variable “height” in the “ex.1” data frame. One may
address the variable “variable.name” in the data frame “dataframe.name”
using the format: “dataframe.name$variable.name”. Indeed, when we type
the expression “ex.1$height” we get as an output the values of the variable
“height” from the given data frame:
> ex.1$height
[1] 182 168 172 154 174 176 193 156 157 186 143 182 194 187 171
[16] 178 157 156 172 157 171 164 142 140 202 176 165 176 175 170
[31] 169 153 169 158 208 185 157 147 160 173 164 182 175 165 194
[46] 178 178 186 165 180 174 169 173 199 163 160 172 177 165 205
[61] 193 158 180 167 165 183 171 191 191 152 148 176 155 156 177
[76] 180 186 167 174 171 148 153 136 199 161 150 181 166 147 168
[91] 188 170 189 117 174 187 141 195 129 172
This is a numeric sequence and can serve as the input to a function that expects a
numeric sequence as input, a function such as “hist”. (But also other functions,
for example, “sum” and “cumsum”.)
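For instance, a toy illustration (not part of the original text) of applying “sum” and “cumsum” to a short numeric sequence:

```r
h <- c(182, 168, 172)   # the first three heights, for illustration
sum(h)                  # total of the three values: 522
cumsum(h)               # running totals: 182 350 522
```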
1 In some books a histogram is introduced as a form of a density. In densities the area of
the box represents the frequency or the relative frequency. In the current example the height
would have been 19/10 = 1.9 if the area of the box would have represented the frequency
and it would have been (19/100)/10 = 0.019 if the area of the box would have represented
the relative frequency. However, in this book we follow the default of R in which the height
represents the frequency.
CHAPTER 3. DESCRIPTIVE STATISTICS
There are 100 observations in the variable “ex.1$height”. So many observations cannot be displayed on the screen on one line. Consequently, the
sequence of the data is wrapped and displayed over several lines. Notice that
the square brackets on the left hand side of each line indicate the position in
the sequence of the first value on that line. Hence, the number on the first
line is “[1]”. The number on the second line is “[16]”, since the second line
starts with the 16th observation in the display given in the book. Notice that the
numbers in the square brackets on your R Console window may be different,
depending on the setting of the display on your computer.
3.2.2 Box Plots
The box plot, or box-whisker plot, gives a good graphical overall impression of
the concentration of the data. It also shows how far from most of the data the
extreme values are. In principle, the box plot is constructed from five values: the
smallest value, the first quartile, the median, the third quartile, and the largest
value. The median, the first quartile, and the third quartile will be discussed
here, and then once more in the next section.
The median, a number, is a way of measuring the “center” of the data. You
can think of the median as the “middle value,” although it does not actually
have to be one of the observed values. It is a number that separates ordered
data into halves. Half the values are the same size or smaller than the median
and half the values are the same size or larger than it. For example, consider
the following data that contains 14 values:
1, 11.5, 6, 7.2, 4, 8, 9, 10, 6.8, 8.3, 2, 2, 10, 1 .
Ordered, from smallest to largest, we get:
1, 1, 2, 2, 4, 6, 6.8, 7.2, 8, 8.3, 9, 10, 10, 11.5 .
The median is between the 7th value, 6.8, and the 8th value 7.2. To find the
median, add the two values together and divide by 2:
(6.8 + 7.2) / 2 = 7
The median is 7. Half of the values are smaller than 7 and half of the values
are larger than 7.
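The same computation can be carried out in R with the function “median” (introduced formally later in this chapter); a small sketch:

```r
x <- c(1, 11.5, 6, 7.2, 4, 8, 9, 10, 6.8, 8.3, 2, 2, 10, 1)
median(x)   # the average of the 7th and 8th ordered values: (6.8 + 7.2)/2 = 7
```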
Quartiles are numbers that separate the data into quarters. Quartiles may
or may not be part of the data. To find the quartiles, first find the median or
second quartile. The first quartile is the middle value of the lower half of the
data and the third quartile is the middle value of the upper half of the data.
For illustration consider the same data set from above:
1, 1, 2, 2, 4, 6, 6.8, 7.2, 8, 8.3, 9, 10, 10, 11.5 .
The median or second quartile is 7. The lower half of the data is:
1, 1, 2, 2, 4, 6, 6.8 .
The middle value of the lower half is 2. The number 2, which is part of the data
in this case, is the first quartile which is denoted Q1. One-fourth of the values
are the same or less than 2 and three-fourths of the values are more than 2.
Figure 3.2: Box Plot of the Example
The upper half of the data is:
7.2, 8, 8.3, 9, 10, 10, 11.5
The middle value of the upper half is 9. The number 9 is the third quartile
which is denoted Q3. Three-fourths of the values are less than 9 and one-fourth
of the values2 are more than 9.
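Following the description above, the quartiles may be obtained as the medians of the two halves of the ordered data. Note that R's built-in function “quantile” applies a slightly different interpolation rule by default (see the footnote), so the sketch below carries out the textbook's method directly:

```r
x <- sort(c(1, 11.5, 6, 7.2, 4, 8, 9, 10, 6.8, 8.3, 2, 2, 10, 1))
median(x[1:7])    # first quartile Q1: middle of the lower half, 2
median(x[8:14])   # third quartile Q3: middle of the upper half, 9
```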
Outliers are values that do not fit with the rest of the data and lie outside of
the normal range. Data points with values that are much too large or much too
small in comparison to the vast majority of the observations will be identified
as outliers. In the context of the construction of a box plot we identify potential
outliers with the help of the inter-quartile range (IQR). The inter-quartile range
is the distance between the third quartile (Q3) and the first quartile (Q1), i.e.,
IQR = Q3 − Q1. A data point that is larger than the third quartile plus 1.5
times the inter-quartile range will be marked as a potential outlier. Likewise,
a data point smaller than the first quartile minus 1.5 times the inter-quartile
2 The actual computation in R of the first quartile and the third quartile may vary slightly
from the description given here, depending on the exact structure of the data.
range will also be so marked. Outliers may have a substantial effect on the
outcome of statistical analysis, therefore it is important that one is alerted to
the presence of outliers.
In the running example we obtained an inter-quartile range of size 9 − 2 = 7.
The upper threshold for defining an outlier is 9 + 1.5 × 7 = 19.5 and the lower
threshold is 2 − 1.5 × 7 = −8.5. All data points are within the two thresholds,
hence there are no outliers in this data.
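The threshold computation can be sketched in R as follows, using the quartiles obtained above:

```r
q1 <- 2
q3 <- 9
iqr <- q3 - q1            # inter-quartile range: 7
upper <- q3 + 1.5 * iqr   # upper threshold: 19.5
lower <- q1 - 1.5 * iqr   # lower threshold: -8.5
c(lower, upper)
```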
In the construction of a box plot one uses a vertical rectangular box and two
vertical “whiskers” that extend from the ends of the box to the smallest and
largest data values that are not outliers. Outlier values, if any exist, are marked
as points above or below the endpoints of the whiskers. The smallest and largest
non-outlier data values label the endpoints of the axis. The first quartile marks
one end of the box and the third quartile marks the other end of the box. The
central 50% of the data fall within the box.
One may produce a box plot with the aid of the function “boxplot”. The
input to the function is a sequence of numerical values and the output is a plot.
As an example, let us produce the box plot of the 14 data points that were used
as an illustration:
> boxplot(c(1,11.5,6,7.2,4,8,9,10,6.8,8.3,2,2,10,1))
The resulting box plot is presented in Figure 3.2. Observe that the end
points of the whiskers are 1, for the minimal value, and 11.5 for the largest
value. The end values of the box are 9 for the third quartile and 2 for the first
quartile. The median 7 is marked inside the box.
Next, let us examine the box plot for the height data:
> boxplot(ex.1$height)
The resulting box plot is presented in Figure 3.3. In order to assess the plot let
us compute quartiles of the variable:
> summary(ex.1$height)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
  117.0   158.0   171.0   170.1   180.2   208.0
The function “summary”, when applied to a numerical sequence, produces the
minimal and maximal entries, as well as the first, second and third quartiles (the
second is the Median). It also computes the average of the numbers (the Mean),
which will be discussed in the next section.
Let us compare the results with the plot in Figure 3.3. Observe that the
median 171 coincides with the thick horizontal line inside the box and that the
lower end of the box coincides with first quartile 158.0 and the upper end with
180.2, which is the third quartile. The inter-quartile range is 180.2 − 158.0 =
22.2. The upper threshold is 180.2 + 1.5 × 22.2 = 213.5. This threshold is
larger than the largest observation (208.0). Hence, the largest observation is
not an outlier and it marks the end of the upper whisker. The lower threshold
is 158.0 − 1.5 × 22.2 = 124.7. The minimal observation (117.0) is less than this
threshold. Hence it is an outlier and it is marked as a point below the end of the
lower whisker. The second smallest observation is 129. It lies above the lower
threshold and it marks the end point of the lower whisker.
Figure 3.3: Box Plot of Height
3.3 Measures of the Center of Data
The two most widely used measures of the central location of the data are the
mean (average) and the median. To calculate the average weight of 50 people
one should add together the 50 weights and divide the result by 50. To find
the median weight of the same 50 people, one may order the data and locate
a number that splits the data into two equal parts. The median is generally a
better measure of the center when there are extreme values or outliers because
it is not affected by the precise numerical values of the outliers. Nonetheless,
the mean is the most commonly used measure of the center.
We shall use small Latin letters such as x to mark the sequence of data.
In such a case we may mark the sample mean by placing a bar over the x: x̄
(pronounced “x bar”).
The mean can be calculated by averaging the data points or it also can be
calculated with the relative frequencies of the values that are present in the data.
In the latter case one multiplies each distinct value by its relative frequency and
then sums the products across all values. To see that both ways of calculating
Figure 3.4: Three Histograms
the mean are the same, consider the data:
1, 1, 1, 2, 2, 3, 4, 4, 4, 4, 4 .
In the first way of calculating the mean we get:
x̄ = (1 + 1 + 1 + 2 + 2 + 3 + 4 + 4 + 4 + 4 + 4) / 11 = 2.7 .
Alternatively, we may note that the distinct values in the sample are 1, 2, 3,
and 4 with relative frequencies of 3/11, 2/11, 1/11 and 5/11, respectively. The
alternative method of computation produces:
x̄ = 1 × 3/11 + 2 × 2/11 + 3 × 1/11 + 4 × 5/11 = 2.7 .
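The relative-frequency computation of the mean can be reproduced in R; a minimal sketch:

```r
vals <- c(1, 2, 3, 4)              # the distinct values in the sample
rel.freq <- c(3, 2, 1, 5) / 11     # their relative frequencies
sum(vals * rel.freq)               # 30/11 = 2.7272..., i.e. 2.7 after rounding
```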
3.3.1 Skewness, the Mean and the Median
Consider the following data set:
4, 5, 6, 6, 6, 7, 7, 7, 7, 7, 7, 8, 8, 8, 9, 10
This data produces the upper most histogram in Figure 3.4. Each interval has
width one and each value is located at the middle of an interval. The histogram
displays a symmetrical distribution of data. A distribution is symmetrical if a
vertical line can be drawn at some point in the histogram such that the shape
to the left and to the right of the vertical line are mirror images of each other.
Let us compute the mean and the median of this data:
> x <- c(4, 5, 6, 6, 6, 7, 7, 7, 7, 7, 7, 8, 8, 8, 9, 10)
> mean(x)
[1] 7
> median(x)
[1] 7
The mean and the median are each 7 for these data. In a perfectly symmetrical
distribution, the mean and the median are the same3.
The functions “mean” and “median” were used in order to compute the mean
and median. Both functions expect a numeric sequence as an input and produce
the appropriate measure of centrality of the sequence as an output.
The histogram for the data:
4, 5, 6, 6, 6, 7, 7, 7, 7, 7, 7, 8
is not symmetrical and is displayed in the middle of Figure 3.4. The right-hand
side seems “chopped off” compared to the left side. The shape of the distribution
is called skewed to the left because it is pulled out towards the left.
Let us compute the mean and the median for this data:
> x <- c(4, 5, 6, 6, 6, 7, 7, 7, 7, 7, 7, 8)
> mean(x)
[1] 6.416667
> median(x)
[1] 7
(Notice that the original data is replaced by the new data when object x is
reassigned.) The median is still 7, but the mean is less than 7. The relation
between the mean and the median reflects the skewing.
Consider yet another set of data:
6, 7, 7, 7, 7, 7, 7, 8, 8, 8, 9, 10
The histogram for the data is also not symmetrical and is displayed at the
bottom of Figure 3.4. Notice that it is skewed to the right. Compute the mean
and the median:
> x <- c(6, 7, 7, 7, 7, 7, 7, 8, 8, 8, 9, 10)
> mean(x)
[1] 7.583333
> median(x)
[1] 7
3 In the case of a symmetric distribution the vertical line of symmetry is located at the
mean, which is also equal to the median.
The median is yet again equal to 7, but this time the mean is greater than 7.
Again, the mean reflects the skewing.
In summary, if the distribution of data is skewed to the left then the mean
is less than the median. If the distribution of data is skewed to the right then
the median is less than the mean.
Examine the data on the height in “ex.1”:
> mean(ex.1$height)
[1] 170.11
> median(ex.1$height)
[1] 171
Observe that the histogram of the height (Figure 3.1) is skewed to the left. This
is consistent with the fact that the mean is less than the median.
3.4 Measures of the Spread of Data
One measure of the spread of the data is the inter-quartile range that was
introduced in the context of the box plot. However, the most important measure
of spread is the standard deviation.
Before dealing with the standard deviation let us discuss the calculation of
the variance. If xi is a data value for subject i and x̄ is the sample mean,
then xi − x̄ is called the deviation of subject i from the mean, or simply the
deviation. In a data set, there are as many deviations as there are data values.
The variance is in principle the average of the squares of the deviations.
Consider the following example: In a fifth grade class, the teacher was interested in the average age and the standard deviation of the ages of her students.
Here are the ages of her students to the nearest half a year:
9, 9.5, 9.5, 10, 10, 10, 10, 10.5, 10.5, 10.5, 10.5, 11, 11, 11, 11, 11, 11,
11.5, 11.5, 11.5 .
In order to explain the computation of the variance of these data let us create
an object x that contains the data:
> x <- c(9, 9.5, 9.5, 10, 10, 10, 10, 10.5, 10.5, 10.5, 10.5, 11, 11, 11, 11, 11, 11,
+ 11.5, 11.5, 11.5)
> length(x)
[1] 20
Pay attention to the fact that we did not write the “+” at the beginning of the
second line. That symbol was produced by R when moving to the next line to
indicate that the expression is not complete yet and will not be executed. Only
after inputting the right bracket and hitting the Return key does R carry
out the command and create the object “x”. When you execute this example
yourself on your own computer make sure not to copy the “+” sign. Instead, if
you hit the return key after the last comma on the first line, the plus sign will
be produced by R as a new prompt and you can go on typing in the rest of the
numbers.
The function “length” returns the length of the input sequence. Notice that
we have a total of 20 data points.
The next step involves the computation of the deviations:
> x.bar <- mean(x)
> x.bar
[1] 10.525
> x - x.bar
[1] -1.525 -1.025 -1.025 -0.525 -0.525 -0.525 -0.525 -0.025
[9] -0.025 -0.025 -0.025 0.475 0.475 0.475 0.475 0.475
[17] 0.475 0.975 0.975 0.975
The average of the observations is equal to 10.525, and when we subtract this
number from each of the components of the sequence x we obtain the deviations.
For example, the first deviation is obtained as 9 - 10.525 = -1.525, the second
deviation is 9.5 - 10.525 = -1.025, and so forth. The 20th deviation is 11.5 - 10.525 = 0.975, and this is the last number that is presented in the output.
From a more technical point of view, observe that the expression that computed the deviations, “x - x.bar”, involved the subtraction of a single value
(x.bar) from a sequence with 20 values (x). The expression resulted in the
subtraction of the value from each component of the sequence. This is an example
of the general way by which R operates on sequences. The typical behavior of
R is to apply an operation to each component of the sequence.
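This component-wise behavior can be illustrated with a toy sequence (not part of the original text):

```r
v <- c(1, 2, 3)
v - 1    # subtracts 1 from each component: 0 1 2
v^2      # squares each component: 1 4 9
v * 10   # multiplies each component by 10: 10 20 30
```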
As yet another illustration of this property consider the computation of the
squares of the deviations:
> (x - x.bar)^2
[1] 2.325625 1.050625 1.050625 0.275625 0.275625 0.275625
[7] 0.275625 0.000625 0.000625 0.000625 0.000625 0.225625
[13] 0.225625 0.225625 0.225625 0.225625 0.225625 0.950625
[19] 0.950625 0.950625
Recall that “x – x.bar” is a sequence of length 20. We apply the square function to this sequence. This function is applied to each of the components of the
sequence. Indeed, for the first component we have that (−1.525)2 = 2.325625,
for the second component (−1.025)2 = 1.050625, and for the last component
(0.975)2 = 0.950625.
For the variance we sum the square of the deviations and divide by the total
number of data values minus one (n − 1). The standard deviation is obtained
by taking the square root of the variance:
> sum((x - x.bar)^2)/(length(x)-1)
[1] 0.5125
> sqrt(sum((x - x.bar)^2)/(length(x)-1))
[1] 0.715891
If the variance is produced as a result of dividing the sum of squares by the
number of observations minus one then the variance is called the sample variance.
The function “var” computes the sample variance and the function “sd”
computes the standard deviation. The input to both functions is the sequence
of data values and the outputs are the sample variance and the standard deviation, respectively:
> var(x)
[1] 0.5125
> sd(x)
[1] 0.715891
In the computation of the variance we divide the sum of squared deviations
by the number of deviations minus one and not by the number of deviations.
The reason for that stems from the theory of statistical inference that will be
discussed in Part II of this book. Unless the size of the data is small, dividing
by n or by n − 1 does not introduce much of a difference.
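The size of this difference can be checked on the ages data; a small sketch in which the two estimates differ only in the denominator:

```r
x <- c(9, 9.5, 9.5, 10, 10, 10, 10, 10.5, 10.5, 10.5, 10.5,
       11, 11, 11, 11, 11, 11, 11.5, 11.5, 11.5)
n <- length(x)
sum((x - mean(x))^2) / n         # dividing by n:     0.486875
sum((x - mean(x))^2) / (n - 1)   # dividing by n - 1: 0.5125 (the sample variance)
```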
The variance is a squared measure and does not have the same units as
the data. Taking the square root solves the problem. The standard deviation
measures the spread in the same units as the data.
The sample standard deviation, s, is either zero or is larger than zero. When
s = 0, there is no spread and the data values are equal to each other. When s
is a lot larger than zero, the data values are very spread out about the mean.
Outliers can make s very large.
The standard deviation is a number that measures how far data values are
from their mean. For example, if the data contains the value 7 and if the mean
of the data is 5 and the standard deviation is 2, then the value 7 is one standard
deviation from its mean because 5 + 1 × 2 = 7. We say, then, that 7 is one
standard deviation larger than the mean 5 (or also say “to the right of 5”). If
the value 1 was also part of the data set, then 1 is two standard deviations
smaller than the mean (or two standard deviations to the left of 5) because
5 − 2 × 2 = 1.
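The number of standard deviations that separate a value from the mean is obtained by dividing the deviation by the standard deviation; a small sketch with the numbers from the example:

```r
m <- 5        # the mean
s <- 2        # the standard deviation
(7 - m) / s   # the value 7 is 1 standard deviation above the mean
(1 - m) / s   # the value 1 is 2 standard deviations below the mean (-2)
```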
The standard deviation, when first presented, may not be too simple to
interpret. By graphing your data, you can get a better “feel” for the deviations
and the standard deviation. You will find that in symmetrical distributions, the
standard deviation can be very helpful but in skewed distributions, the standard
deviation is less so. The reason is that the two sides of a skewed distribution
have different spreads. In a skewed distribution, it is better to look at the first
quartile, the median, the third quartile, the smallest value, and the largest value.
3.5 Solved Exercises
Question 3.1. Three sequences of data were saved in 3 R objects named “x1”,
“x2” and “x3”, respectively. The application of the function “summary” to each
of these objects is presented below:
> summary(x1)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
  0.000   2.498   3.218   3.081   3.840   4.871
> summary(x2)
     Min.   1st Qu.    Median      Mean   3rd Qu.      Max.
0.0001083 0.5772000 1.5070000 1.8420000 2.9050000 4.9880000
> summary(x3)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
  2.200   3.391   4.020   4.077   4.690   6.414
In Figure 3.5 one may find the histograms of these three data sequences, given
in a random order. In Figure 3.6 one may find the box plots of the same data,
given in yet a different order.
Figure 3.5: Three Histograms
1. Match the summary result with the appropriate histogram and the appropriate box plot.
2. Is the value 0.000 in the sequence “x1” an outlier?
3. Is the value 6.414 in the sequence “x3” an outlier?
Solution (to Question 3.1.1): Consider the data “x1”. From the summary
we see that it is distributed in the range between 0 and slightly below 5. The
central 50% of the distribution are located between 2.5 and 3.8. The mean and
median are approximately equal to each other, which suggests an approximately
symmetric distribution. Consider the histograms in Figure 3.5. Histograms 1
and 3 correspond to distributions in the appropriate range. However, the
distribution in Histogram 3 is concentrated at lower values than suggested by
the given first and third quartiles. Consequently, we match the summary of
“x1” with Histogram 1.
Consider the data “x2”. Again, the distribution is in the range between 0 and
slightly below 5. The central 50% of the distribution are located between 0.6 and
1.8. The mean is larger than the median, which suggests a distribution skewed
to the right. Therefore, we match the summary of “x2” with Histogram 3.

Figure 3.6: Three Box Plots
For the data in “x3” we may note that the distribution is in the range
between 2 and 6. The histogram that fits this description is Histogram 2.
The box plot is essentially a graphical representation of the information
presented by the function “summary”. Following the rationale of matching the
summaries with the histograms we may obtain that Histogram 1 should be
matched with Box-plot 2 in Figure 3.6, Histogram 2 matches Box-plot 3, and
Histogram 3 matches Box-plot 1. Indeed, it is easier to match the box plots
with the summaries. However, it is a good idea to practice the direct matching
of histograms with box plots.
Solution (to Question 3.1.2): The data in “x1” fits Box-plot 2 in Figure 3.6.
The value 0.000 is the smallest value in the data and it corresponds to the
smallest point in the box plot. Since this point is below the bottom whisker it
follows that it is an outlier. More directly, we may note that the inter-quartile
range is equal to IQR = 3.840 − 2.498 = 1.342. The lower threshold is equal to
2.498 − 1.5 × 1.342 = 0.485, which is larger than the given value. Consequently,
the given value 0.000 is an outlier.
Solution (to Question 3.1.3): Observe that the data in “x3” fits Box-plot 3
in Figure 3.6. The value 6.414 is the largest value in the data and it corresponds
to the endpoint of the upper whisker in the box plot and is not an outlier.
Alternatively, we may note that the inter-quartile range is equal to IQR =
4.690 − 3.391 = 1.299. The upper threshold is equal to 4.690 + 1.5 × 1.299 = 6.6385,
which is larger than the given value. Consequently, the given value 6.414 is not
an outlier.
Question 3.2. The number of toilet facilities in 30 buildings was counted.
The results are recorded in an R object by the name “x”. The frequency table
of the data “x” is:
> table(x)
x
 2  4  6  8 10
10  6 10  2  2
1. What is the mean (x̄) of the data?
2. What is the sample standard deviation of the data?
3. What is the median of the data?
4. What is the inter-quartile range (IQR) of the data?
5. How many standard deviations away from the mean is the value 10?
Solution (to Question 3.2.1): In order to compute the mean of the data we
may write the following simple R code:
> x.val <- c(2, 4, 6, 8, 10)
> freq <- c(10, 6, 10, 2, 2)
> rel.freq <- freq/sum(freq)
> x.bar <- sum(x.val*rel.freq)
> x.bar
[1] 4.666667
We created an object “x.val” that contains the unique values of the data
and an object “freq” that contains the frequencies of the values. The object
“rel.freq” contains the relative frequencies, the ratios between the frequencies
and the number of observations. The average is computed as the sum of the
products of the values with their relative frequencies.
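The remaining parts of Question 3.2 can be approached along the same lines. A hedged sketch (not the book's own solution text), which reconstructs the 30 observations from the frequency table with “rep” and then applies the functions introduced in this chapter; note that “IQR” uses R's default quantile rule:

```r
# Reconstruct the 30 observations from the frequency table
x <- rep(c(2, 4, 6, 8, 10), times = c(10, 6, 10, 2, 2))
mean(x)                 # 4.666667, in agreement with the computation above
sd(x)                   # sample standard deviation, about 2.43
median(x)               # 4
IQR(x)                  # inter-quartile range: 6 - 2 = 4
(10 - mean(x)) / sd(x)  # the value 10 is about 2.2 standard deviations above the mean
```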