Week 1 – Univariate and Bivariate Data
The first five chapters in the book address descriptive statistics, and specifically, univariate
statistics (note: we will return to chapter 6 later). The two types of variables we will consider
are categorical (counts of how many cases fall into each category) and quantitative (numerical
measurements of a quantity).
Categorical data can be illustrated in several ways – frequency tables (and relative frequency
tables), bar charts, and pie charts are some of the most common. Quantitative data can be
illustrated using histograms (and relative frequency histograms), stem-and-leaf plots, dotplots,
and boxplots. One of the key purposes of statistics is to make meaning of complex data sets, and
visual displays can help us with this meaning making. Specifically, for quantitative data, it is
helpful to understand the shape, center, and spread.
If you consult pp. 49 – 52, you will get a decent description of key terms when thinking about
shape. Is it unimodal/multimodal? To what degree does it have symmetry? Is it skewed left
(longer tail to the left) or right (longer tail to the right)? Are there possible outliers? Center is a
way to suggest a typical value, and this is typically done with the median or mean (note: consult
pp. 59-60 on mean and median; also consider the use of mode as a center). Spread gives us a
sense of how clustered or spread out our data are, and range, IQR, mean absolute deviation,
variance, and standard deviation all attempt to capture spread. Standard deviation is a key statistic,
and you can learn a bit more about how it is calculated on p. 61 (in short, it is the square root of
the average of the squared deviations from the mean).
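If you'd like to see that calculation outside of Excel, here is a minimal sketch in Python; the handful of values is made up purely for illustration, and dividing by n - 1 instead of n would give the sample version of the standard deviation.

    # A small, made-up data set, just to illustrate the calculation.
    data = [2, 4, 4, 4, 5, 5, 7, 9]

    mean = sum(data) / len(data)                     # the mean
    squared_devs = [(x - mean) ** 2 for x in data]   # squared deviations from the mean
    variance = sum(squared_devs) / len(data)         # average squared deviation (population variance)
    std_dev = variance ** 0.5                        # square root of that average

    print(mean, variance, std_dev)                   # 5.0 4.0 2.0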
Boxplots are nice because they give a detailed visual of a data set and of its key components:
shape, center, and spread. Boxplots are based on the five-number summary (the minimum, the
1st quartile, the 2nd quartile/median, the 3rd quartile, and the maximum). You can find this
summary by finding the median, then finding the median of the first half of the data and the
median of the second half of the data. Additionally, see p. 81 on how to use the 1.5*IQR rule to
determine any outliers. Boxplots are great tools for comparing data sets, as are histograms.
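Here is a rough Python sketch of that whole process (median, medians of the two halves, then the 1.5*IQR fences). The data values are invented, and note that textbooks and software packages differ slightly in how they form the halves when the count is odd.

    def median(values):
        """Middle value of a sorted list (average of the two middle values if the count is even)."""
        s = sorted(values)
        n = len(s)
        mid = n // 2
        return s[mid] if n % 2 == 1 else (s[mid - 1] + s[mid]) / 2

    def five_number_summary(values):
        """Min, Q1, median, Q3, max, with quartiles found as medians of the lower and upper halves."""
        s = sorted(values)
        n = len(s)
        lower = s[: n // 2]        # first half (overall median excluded when n is odd)
        upper = s[(n + 1) // 2:]   # second half
        return min(s), median(lower), median(s), median(upper), max(s)

    # Invented data, just to see the 1.5*IQR rule flag something.
    data = [12, 15, 17, 18, 19, 20, 21, 22, 23, 24, 26, 40]
    mn, q1, med, q3, mx = five_number_summary(data)
    iqr = q3 - q1
    outliers = [x for x in data if x < q1 - 1.5 * iqr or x > q3 + 1.5 * iqr]
    print((mn, q1, med, q3, mx), outliers)   # 40 should show up as a high outlier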
Let’s do some examples.
a) This distribution is symmetric (as observed in the histogram). The bicep procedure looks
to be skewed left, and the deltoid procedure appears to be tightly concentrated, with some
outliers.
b) The range of strength scores is approximately 4 (it is unclear, due to the nature of the bin
size). The range for the bicep procedure is just over 2, and the range for the deltoid
procedure is approximately 2.
c) We do not see the exact strength scores for the two procedures.
d) The biceps method had the higher median score.
e) The biceps method was not always best. There was a deltoid outlier that was higher than
approximately 25% of the biceps scores, and there may have been some other deltoid scores
that were marginally higher than some of the lower biceps scores as well.
f) The deltoid procedure produced the most consistent results. With the exception of the two
outliers, all of the deltoid procedures yielded strength scores in a very narrow range.
Note: Unlike what you may have experienced in previous math courses, it is essential to draw on
context as you analyze and interpret data using statistics and statistical representations.
a) Throughout this course, I am going to suggest how Excel can be utilized to do statistical
analysis. First, I will plug these data values into Excel. Assuming I did that correctly, I can then
use Excel commands to answer this question.
Using =MEDIAN(), where I highlight my data and place it inside the parentheses (so it would
actually read =MEDIAN(A1:A50), because my data are in A1 through A50), I would get 239.
To find IQR, I need to subtract the 1st quartile from the 3rd quartile. I will do this in the following
way:
=QUARTILE.EXC(A1:A50,3)-QUARTILE.EXC(A1:A50,1)
I wouldn't get too concerned about QUARTILE.EXC v. QUARTILE.INC (exclusive v.
inclusive). The quartile functions, though, ask for the data set first, followed by a comma,
followed by which quartile you are looking for. By the way, the IQR is 9.
To find mean, I would use =AVERAGE(A1:A50), which gives me 237.64.
To find standard deviation, you can use =STDEV.P() or =STDEV.S() to find population or
sample standard deviation. We will get into this in more detail later, but for now, let’s use
=STDEV.P(A1:A50), which gives us 5.63.
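If you'd rather mirror those Excel commands in Python, the built-in statistics module lines up fairly closely; QUARTILE.EXC and QUARTILE.INC correspond to the "exclusive" and "inclusive" methods of statistics.quantiles. The values below are stand-ins (I am not retyping the 50 scores here), so the printed results will not match the 239 / 9 / 237.64 / 5.63 above.

    import statistics

    # Stand-in values only -- the real strength scores live in A1:A50 of the spreadsheet.
    data = [233, 239, 242, 236, 241, 228, 247, 238, 235, 240]

    print(statistics.median(data))                                     # like =MEDIAN(A1:A50)

    q1, q2, q3 = statistics.quantiles(data, n=4, method="exclusive")   # like QUARTILE.EXC
    print(q3 - q1)                                                     # the IQR

    print(statistics.mean(data))                                       # like =AVERAGE(A1:A50)
    print(statistics.pstdev(data))                                     # like =STDEV.P(A1:A50)
    print(statistics.stdev(data))                                      # like =STDEV.S(A1:A50)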
b) Let’s go further and create a boxplot. To create a boxplot, first we need to do a five-number
summary, so we need our min, 1st quartile, median, 3rd quartile, and max. We can use =MIN(),
=QUARTILE.EXC( ,1), =MEDIAN(), =QUARTILE.EXC( ,3), and =MAX(). Our five-number
summary, then, would be 224, 233, 239, 242, and 247. Next, we need to find outliers. In order to
do so I need to check if any data points fall outside Q1 – 1.5*IQR on the low end, or Q3 +
1.5*IQR on the high end. 1.5*IQR = 1.5 * 9 = 13.5. Q1 – 13.5 = 219.5, and nothing falls below
that. Q3 + 13.5 = 255.5, and nothing falls above that. Therefore, there are no outliers. Excel can
create a boxplot if you finesse it a bit, but I can be lazy, and I don’t want to do all that work if
there is an easier program out there. If you go to
http://www.alcula.com/calculators/statistics/box-plot/, then copy and paste in your data, you
should get something that looks like this:
This is a vertically positioned boxplot, but it does the trick. Notice how nicely the boxplot shows
characteristics of the data set. You can see a longer “tail” to the left (bottom), so the data set may
be slightly skewed left. You can see that the right side (top) of the box is more condensed than
the left (bottom), and you can also clearly see the four quarters of the data (the first whisker, the first half of
the box, the second half of the box, and the second whisker), as well as the middle 50% (the box).
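If you'd rather stay in Python than use that website, matplotlib can draw a comparable boxplot (also vertical by default). As before, the values here are stand-ins rather than the real scores.

    import matplotlib.pyplot as plt

    # Stand-in values -- substitute the real scores from the spreadsheet.
    data = [224, 228, 231, 233, 234, 236, 238, 239, 240, 241, 242, 243, 245, 247]

    plt.boxplot(data, whis=1.5)   # whiskers drawn using the 1.5*IQR rule
    plt.ylabel("Strength score")
    plt.title("Boxplot of strength scores")
    plt.show()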
Next, we transition from univariate (one variable) to bivariate (two variable) statistics. One of the
best ways to understand the relationship between two variables is to create a scatterplot. When
analyzing a scatterplot, we want to pay attention to direction (positive/negative), strength
(strong/weak/none), and general shape (linear, exponential, etc.). The correlation coefficient gives
us a good deal of information on direction and strength for linear relationships (see p. 156),
and when data are not particularly linear, it is often beneficial to re-express the data so that we do
see a straighter relationship (see pp. 158-159). I would also suggest reviewing pp. 157-158,
because it is essential to understand that correlation does not imply causation. For example, if I
graphed the average global temperature as a dependent variable, and the number of pirates
worldwide as the independent variable, I would notice a strong correlation (see link for image).
That does not mean that annual global temperature is dependent on the number of pirates. More
likely, there is a lurking variable (or several lurking variables), such as population growth.
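In case you are curious, software will happily compute the correlation coefficient for you. Here is a quick numpy sketch with made-up (x, y) pairs:

    import numpy as np

    # Made-up paired data, just to show the calculation.
    x = np.array([1, 2, 3, 4, 5, 6, 7, 8])
    y = np.array([2.1, 2.9, 4.2, 4.8, 6.1, 6.8, 8.2, 8.9])

    r = np.corrcoef(x, y)[0, 1]   # Pearson correlation coefficient
    print(r)                      # close to +1: a strong, positive, linear association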
It is often helpful to create a model for bivariate data. If the data are somewhat linear, we can draw
a line of best fit (or linear regression). You will see that technology can create this line of best fit,
but the mathematics behind it has to do with residuals and least squares (see pp. 172-173 if
interested). When using linear regressions, you will have to draw on your knowledge of algebra
to utilize slopes, y-intercepts, and ordered pairs. We will do some examples at the end of these
notes to refresh your memory. There are some additional items to attend to in chapter 8. A
scatterplot of residuals should have no discernible pattern – if it does, then your original
model may not have been appropriate (e.g., you used a linear regression when the relationship is
actually curved). With regressions in general (both linear and other types), R^2 is a very useful tool.
R^2 gives the fraction (can be converted to a percent) of the data’s variation that the model
captures. An R^2 of 1 (100%) would indicate that the model perfectly accounts for the data’s
variation. An R^2 of 0 (0%) indicates that the model captures none of the data’s variation. Please
consult pp. 184-185 for assumptions and conditions when performing a regression, and it will
also be important to always consider whether the regression model is reasonable (see p. 188).
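To make the residual and R^2 ideas concrete, here is a minimal sketch with invented data: it fits a least-squares line, computes the residuals you would plot, and finds R^2 as 1 minus the ratio of leftover variation to total variation, which matches the "fraction of variation captured" description above.

    import numpy as np

    # Invented, roughly linear (x, y) data.
    x = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9, 10], dtype=float)
    y = np.array([2.3, 3.1, 4.2, 4.8, 6.1, 6.5, 7.9, 8.4, 9.6, 10.2])

    slope, intercept = np.polyfit(x, y, 1)   # least-squares line of best fit
    predicted = slope * x + intercept
    residuals = y - predicted                # actual minus predicted; plot these vs. x to check for patterns

    ss_res = np.sum(residuals ** 2)          # variation the line fails to capture
    ss_tot = np.sum((y - y.mean()) ** 2)     # total variation in y
    r_squared = 1 - ss_res / ss_tot          # fraction of variation the model captures

    print(slope, intercept, r_squared)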
Chapter 9 addresses the use of regression models to make predictions, and the care we need to
take when extrapolating (as opposed to interpolating). The chapter also discusses outliers,
as well as points with high leverage or influence, and it reiterates the extremely important fact
that correlation does not imply causation. Chapter 10 describes how to re-express data to be
linear, but we are not going to spend too much energy on this. As an aside, I completed a
capstone project for my B.A. where I explored the impact of the 1970 Clean Air Act
Amendments by analyzing the amount of pollution in different sectors as a function of time. In
order to perform this analysis effectively, I had to take the logarithm of the pollution variables
before completing my analysis. We can forgo that for now, but in the event that you may do this
sort of work, logarithms can often be your friend.
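To give a flavor of that log trick, here is a toy sketch with invented exponential-growth data; a straight line relates much more strongly to log(y) than to y itself.

    import numpy as np

    # Invented data that grow roughly exponentially (close to doubling each step).
    x = np.array([0, 1, 2, 3, 4, 5, 6], dtype=float)
    y = np.array([10, 21, 39, 83, 158, 330, 640], dtype=float)

    raw_r = np.corrcoef(x, y)[0, 1]           # correlation with the raw values
    log_r = np.corrcoef(x, np.log(y))[0, 1]   # correlation after re-expressing y with a log
    print(raw_r, log_r)                       # the log version should be noticeably closer to 1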
Examples:
Note: You may be more familiar with this line written as G = 0.11M + 2.73, where G represents
GPA-hat (the predicted GPA) and M represents the number of meals eaten with family each week.
Note 2: Since there is no pattern in the residuals, that gives us an indication that the linear
regression is appropriate.
a. The y-intercept of 2.73 represents the predicted GPA of a student who ate 0 meals with
their family each week.
b. 0.11 indicates the expected increase in GPA per meal eaten with family each week.
c. Skipped
d. The student's actual GPA is less than the GPA predicted by the model.
e. This is an example of someone believing that correlation implies causation. While it may
be true that students who eat more meals with their families tend to have a higher GPA,
that does not mean that eating more meals with families will cause a higher GPA. There
may be a lurking variable (e.g., more involved families lead both to higher GPAs and to more
meals together).
Let's use Excel! As opposed to our earlier univariate example, now we are going to enter data
in two columns. As a suggestion: when using years in statistical analysis, it often becomes
cumbersome to work with the actual year values, so I am going to suggest that we reframe our
independent variable as "years since 1980." If so, our table will look like this:
Years since 1980    Twin Births
0                   68339
1                   70049
2                   71631
3                   72287
4                   72949
5                   77102
6                   79485
7                   81778
8                   85315
9                   90118
10                  93865
11                  94779
12                  95372
13                  96445
14                  97064
15                  96736
16                  100750
17                  104137
18                  110670
19                  114307
20                  118916
21                  121246
22                  125134
23                  128665
24                  132219
Note: I used headers at the top of each column, which will prove helpful later. Also, when
entering things like years, you can type in the first two years, highlight those two cells, move to
the bottom right of the selection until you get a black cross, and then drag down. Excel is smart
enough to know that you want to create an arithmetic sequence, where each adjacent cell
increases by the same amount.
a. OK. Let’s make a scatter plot first. Highlight your data, then go to the “Insert” tab and
click on the thing that looks like a Scatter Plot, then click “Scatter.” You should get
something that looks like this:
[Scatterplot titled "Twin Births": twin births (y-axis, 0 to 140,000) vs. years since 1980 (x-axis, 0 to 30)]
When the scatter plot is selected, you will see at the top left an option that says “Add Chart
Element.” Click it, click “trendline,” then click “linear.” You should get a dotted line over your
scatter plot. When you double click on the trendline, you should get a menu. At the bottom, click
“display equation on chart” and “display r-squared value on chart.” Then, you should get this
(note: I moved the regression equation and increased the font).
[The same "Twin Births" scatterplot, now with the linear trendline and its equation displayed:
y = 2618.3x + 64555, R² = 0.97455]
The equation, then, would be y = 2618.3x + 64,555, or, if you'd prefer, Births (hat) = 64,555 +
2618.3 * (years since 1980).
b. The y-intercept, 64,555, would indicate the predicted number of twin births in 1980. The
slope, 2618.3, is the predicted increase in the number of twin births per year.
c. To predict the number of twin births in 2010, I will plug 30 into my regression equation,
and get Births (hat) = 64,555 + 2618.3 * 30, which equals approximately 143,104.
Extrapolation is always risky, but 2010 is close enough to our data set, and the number
certainly seems reasonable.
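As a sanity check on the Excel output, we can reproduce the fit (approximately) with numpy using the table above, and then redo the 2010 prediction with the rounded coefficients Excel displayed.

    import numpy as np

    # Years since 1980 and twin births, copied from the table above.
    years = np.arange(25)
    births = np.array([68339, 70049, 71631, 72287, 72949, 77102, 79485, 81778,
                       85315, 90118, 93865, 94779, 95372, 96445, 97064, 96736,
                       100750, 104137, 110670, 114307, 118916, 121246, 125134,
                       128665, 132219])

    slope, intercept = np.polyfit(years, births, 1)   # should land close to Excel's 2618.3 and 64555
    print(slope, intercept)

    # Prediction for 2010 (30 years after 1980), using the rounded Excel coefficients:
    print(64555 + 2618.3 * 30)                        # about 143,104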
Week 1 Problem Set:
1. A bakery is trying to predict how many loaves to bake. In the last 100 days, they have
   sold between 95 and 140 loaves per day. Here is the histogram of the number of
   loaves they sold for the last 100 days.
   a. Describe the distribution.
   b. Which should be larger, the mean number of sales or the median? Explain.
2. Average daily temperatures in January and July for 60 large US cities are graphed in
   the histogram below.
   a. What aspect of these histograms makes it difficult to compare the distributions?
   b. What differences do you see between the distributions of January and July
      average temperatures?
3. Roger Maris's 1961 home run record stood until Mark McGwire hit 70 in 1998. Listed
   below are the home run totals for each season McGwire played. Also listed are Babe
   Ruth's home run totals.
   a. Find the 5-number summary for McGwire's career.
   b. Do any of his seasons appear to be outliers? Explain.
   c. McGwire played in only 18 games at the end of his first big league season, and
      missed major portions of some other seasons because of injuries to his back
      and knees. Those seasons might not be representative of his abilities. They are
      marked with asterisks in the list above. Omit these values and make parallel
      boxplots comparing McGwire's career to Babe Ruth's.
   d. Write a few sentences comparing the two sluggers.
4. Here is a stem-and-leaf display showing profits as a percent of sales for 29 of the
   Forbes 500 largest US corporations. The stems are split; each stem represents a span
   of 5%, from a loss of 9% to a profit of 25%.
   a. Find the 5-number summary.
   b. Draw a boxplot for these data.
   c. Find the mean and standard deviation.
   d. Describe the distribution of profits for these corporations.
5. Every year US News and World Report publishes a special issue on many US colleges
   and universities. The scatterplots have Student/Faculty Ratio for the colleges and
   universities on the y-axes plotted against the other 4 variables. The correct
   correlations for these scatterplots appear in this list. Match them.
6. Here are the scatterplot and regression analysis for Case Prices of 36 wines from
   vineyards in NY State and the Ages of the vineyards.
   a. Does it appear that vineyards in business longer get higher prices for their
      wines? Explain.
   b. What does this analysis tell us about vineyards in the rest of the world?
   c. Write the regression equation.
   d. How valuable is this equation? Explain.
7. One Thursday, researchers gave students in a section of Spanish a set of 50 new
   vocabulary words to memorize. On Friday students took a vocabulary test. When they
   returned to class the next Monday, they were retested – without advance warning.
   Here are the test scores for the 25 students.
   a. What is the correlation between Friday and Monday scores?
   b. What does a scatterplot show about the association between the scores?
   c. Write the equation of the regression line.
   d. Predict the Monday score of a student who earned a 40 on Friday.
8. The table below shows a 50-state average of the percent of expectant mothers who
   smoked cigarettes during their pregnancies.
   a. Create a scatterplot and describe the trend you see.
   b. Find the correlation.
   c. Write a linear model and interpret the slope in context.