Week 1 – Univariate and Bivariate Data

The first five chapters in the book address descriptive statistics, and specifically, univariate

statistics (note: we will return to chapter 6 later). The two types of variables that will be

considered will be categorical (how many cases of what is measured) and quantitative (quantity

of what is measured).

Categorical data can be illustrated in several ways – frequency tables (and relative frequency

tables), bar charts, and pie charts are some of the most common. Quantitative data can be

illustrated using histograms (and relative frequency histograms), stem-and-leaf plots, dotplots,

and boxplots. One of the key purposes of statistics is to make meaning of complex data sets, and

visual displays can help us with this meaning making. Specifically, for quantitative data, it is

helpful to understand the shape, center, and spread.

If you consult pp. 49 – 52, you will get a decent description of key terms when thinking about

shape. Is it unimodal/multimodal? To what degree does it have symmetry? Is it skewed left

(longer tail to the left) or right (longer tail to the right)? Are there possible outliers? Center is a

way to suggest a typical value, and this is typically done with the median or mean (note: consult

pp. 59-60 on mean and median; also consider the use of mode as a center). Spread gives us a

sense of how clustered or spread out our data are, and range, IQR, mean absolute deviation,

variance, and standard deviation attempt to capture spread. Standard deviation is a key statistic,

and you can learn a bit more about how it is calculated on p. 61 (in short, it is the square root of

the average of the squared deviations from the mean).
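As a rough illustration of the spread measures named above, here is a minimal Python sketch. The data values are made up for illustration (they are not from the book), and the standard deviation follows the description above (square root of the average squared deviation), which is the population version.

```python
import math

# Hypothetical data, chosen only so the arithmetic is easy to follow.
data = [2, 4, 4, 4, 5, 5, 7, 9]

n = len(data)
mean = sum(data) / n

# Range: distance between the largest and smallest values.
data_range = max(data) - min(data)

# Mean absolute deviation: average distance from the mean.
mean_abs_dev = sum(abs(x - mean) for x in data) / n

# Variance: average of the squared deviations from the mean.
variance = sum((x - mean) ** 2 for x in data) / n

# Standard deviation: square root of the variance.
std_dev = math.sqrt(variance)

print(data_range, mean_abs_dev, variance, std_dev)
```

Squaring the deviations weights values far from the mean more heavily than the mean absolute deviation does, which is why the two measures usually differ.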

Boxplots are nice because they show the key components of a data set (shape, center, and spread)

in one compact visual. Boxplots are based on the five-number summary (the minimum, the

1st quartile, the 2nd quartile/median, the 3rd quartile, and the maximum). You can find this

summary by finding the median, and then finding the median of the first half of data, and the

median of the 2nd half of the data. Additionally, see p. 81 on how to use the 1.5*IQR rule to

determine any outliers. Boxplots are great tools for comparing data sets, as are histograms.
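The median-of-halves method described above can be sketched in Python as follows. The function name and the sample data are hypothetical. I am also assuming one common convention: when there is an odd number of values, the overall median is left out of both halves. Other conventions (and Excel's quartile functions) can give slightly different quartiles.

```python
from statistics import median

def five_number_summary(values):
    """Return (min, Q1, median, Q3, max) using the median-of-halves method.

    Assumes at least two values. For an odd count, the middle value is
    excluded from both halves (one convention among several).
    """
    ordered = sorted(values)
    n = len(ordered)
    half = n // 2
    lower = ordered[:half]
    upper = ordered[half + 1:] if n % 2 else ordered[half:]
    return (ordered[0], median(lower), median(ordered), median(upper), ordered[-1])

# Hypothetical data for illustration only.
sample = [3, 7, 8, 5, 12, 14, 21, 13, 18]
summary = five_number_summary(sample)
print(summary)
```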

Let’s do some examples.

a) This distribution is symmetric (as observed in the histogram). The bicep procedure looks

to be skewed left, and the deltoid procedure appears to be tightly concentrated, with some

outliers.

b) The range of strength scores is approximately 4 (it is unclear, due to the nature of the bin

size). The range for the bicep procedure is just over 2, and the range for the deltoid

procedure is approximately 2.

c) We do not see the exact strength scores for the two procedures.

d) The biceps method had the higher median score.

e) The biceps method was not always best. There was a deltoid outlier that was higher than

approximately 25% of the biceps scores, and there may have been some other deltoid scores

that were marginally higher than some of the lower biceps scores as well.

f) The deltoid procedure produced the most consistent results. With the exception of the two

outliers, all of the deltoid procedures yielded strength scores in a very narrow range.

Note: Unlike what you may have experienced in previous math courses, it is essential to draw on

context as you analyze and interpret data using statistics and statistical representations.

a) Throughout this course, I am going to suggest how Excel can be utilized to do statistical

analysis. First, I will plug these data values into Excel. Assuming I did that correctly, I can then

use Excel commands to answer this question.

Using =MEDIAN(), where I highlight my data and place it inside the parentheses (so it would

actually read =MEDIAN(A1:A50), because my data are in A1 through A50), I would get 239.

To find IQR, I need to subtract the 1st quartile from the 3rd quartile. I will do this in the following

way:

=QUARTILE.EXC(A1:A50,3)-QUARTILE.EXC(A1:A50,1)

I wouldn’t get too concerned about QUARTILE.EXC v. QUARTILE.INC (exclusive v.

inclusive). The quartile functions, though, ask for the data set first, followed by a comma, followed

by which quartile you are looking for. By the way, the IQR is 9.

To find mean, I would use =AVERAGE(A1:A50), which gives me 237.64.

To find standard deviation, you can use =STDEV.P() or =STDEV.S() to find population or

sample standard deviation. We will get into this in more detail later, but for now, let’s use

=STDEV.P(A1:A50), which gives us 5.63.
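If you would rather not use Excel, the same summary statistics can be sketched with Python's built-in statistics module. The data below are hypothetical (the 50 values from the exercise are not reproduced in these notes). My understanding is that method="exclusive" interpolates the same way as QUARTILE.EXC and method="inclusive" the same way as QUARTILE.INC, but treat that correspondence as an assumption and check it against your own spreadsheet.

```python
import statistics

# Hypothetical data standing in for cells A1:A50.
values = [12, 15, 11, 18, 14, 16, 13, 20, 17, 19, 21]

# =MEDIAN(...)
med = statistics.median(values)

# =QUARTILE.EXC(..., 3) - QUARTILE.EXC(..., 1)
q1, q2, q3 = statistics.quantiles(values, n=4, method="exclusive")
iqr = q3 - q1

# =AVERAGE(...)
mean = statistics.fmean(values)

# =STDEV.P(...) and =STDEV.S(...)
pop_sd = statistics.pstdev(values)
sample_sd = statistics.stdev(values)

print(med, iqr, mean, round(pop_sd, 2), round(sample_sd, 2))
```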

b) Let’s go further and create a boxplot. To create a boxplot, first we need to do a five-number

summary, so we need our min, 1st quartile, median, 3rd quartile, and max. Using =MIN(),

=QUARTILE.EXC( ,1), =MEDIAN(), =QUARTILE.EXC( ,3), and =MAX(), our five-number

summary would be 224, 233, 239, 242, and 247. Next, we need to find outliers. In order to

do so I need to check if any data points fall outside Q1 – 1.5*IQR on the low end, or Q3 +

1.5*IQR on the high end. 1.5*IQR = 1.5 * 9 = 13.5. Q1 – 13.5 = 219.5, and nothing falls below

that. Q3 + 13.5 = 255.5, and nothing falls above that. Therefore, there are no outliers. Excel can

create a boxplot if you finesse it a bit, but I can be lazy, and I don’t want to do all that work if

there is an easier program out there. If you go to

http://www.alcula.com/calculators/statistics/box-plot/, then copy and paste in your data, you

should get something that looks like this:

This is a vertically positioned boxplot, but it does the trick. Notice how nicely the boxplot shows

characteristics of the data set. You can see a longer “tail” to the left (bottom), so the data set may

be slightly skewed left. You can see that the right side (top) of the box is more condensed than

the left (bottom), and you can also clearly see the four quarters of the data (the first whisker, the first half of

the box, the second half of the box, and the second whisker), as well as the mid 50% (the box).
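The outlier check from part b) can be written out as a few lines of Python. The numbers are the five-number summary values stated above; the variable names are my own.

```python
# Values from the five-number summary in part b).
minimum, q1, q3, maximum = 224, 233, 242, 247

iqr = q3 - q1                 # interquartile range
step = 1.5 * iqr              # the 1.5*IQR distance
lower_fence = q1 - step       # anything below this is a possible outlier
upper_fence = q3 + step       # anything above this is a possible outlier

# Checking only the min and max is enough: if they are inside the fences,
# every other data point is too.
has_outliers = minimum < lower_fence or maximum > upper_fence

print(lower_fence, upper_fence, has_outliers)
```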

Next, we transition from univariate (one variable) to bivariate (two variable) statistics. One of the

best ways to understand the relationship between two variables is to create a scatterplot. When

analyzing a scatterplot, we want to pay attention to direction (positive/negative), strength of

correlation (strong/weak/none), and general shape (linear, exponential, etc.). The correlation coefficient gives

us a good deal of information on direction and strength for linear relationships (see p. 156),

and when data are not particularly linear, it is often beneficial to re-express the data so that we do

see a straighter relationship (see pp. 158-159). I would also suggest reviewing pp. 157-158,

because it is essential to understand that correlation does not imply causation. For example, if I

graphed the average global temperature as a dependent variable, and the number of pirates

worldwide as the independent variable, I would notice a positive correlation (see link for image).

That does not mean that annual global temperature is dependent on the number of pirates. More

likely, there is a lurking variable (or several lurking variables), such as population growth.
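To make the correlation coefficient a little more concrete, here is a minimal sketch that computes it by hand. The function name and both data sets are hypothetical. The sign of the result gives the direction, and how close it is to 1 or -1 gives the strength of the linear relationship; it says nothing about causation.

```python
import math

def pearson_r(xs, ys):
    """Correlation coefficient for paired data (assumes equal lengths)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

# Hypothetical data: a moderately strong positive relationship.
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]
r = pearson_r(x, y)

# Hypothetical data lying exactly on a downward-sloping line.
r_perfect_negative = pearson_r(x, [10 - 2 * v for v in x])

print(round(r, 3), r_perfect_negative)
```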

It is often helpful to create a model for bivariate data. If the data are somewhat linear, we can draw

a line of best fit (or linear regression). You will see that technology can create this line of best fit,

but the mathematics behind it has to do with residuals and least squares (see pp. 172-173 if

interested). When using linear regressions, you will have to draw on your knowledge of algebra

to utilize slopes, y-intercepts, and ordered pairs. We will do some examples at the end of these

notes to refresh your memory. There are some additional items to attend to in chapter 8. A

scatterplot of residuals should have no discernible pattern – if it does, then your original

regression may have been incorrect (e.g. you used a linear regression instead of a cubic

regression). With regressions in general (both linear and other types), R^2 is a very useful tool.

R^2 gives the fraction (can be converted to a percent) of the data’s variation that the model

captures. An R^2 of 1 (100%) would indicate that the model perfectly accounts for the data’s

variation. An R^2 of 0 (0%) indicates that the model captures none of the data’s variation. Please

consult pp. 184-185 for assumptions and conditions when performing a regression, and it will

also be important to always consider whether the regression model is reasonable (see p. 188).
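Here is a small sketch of why the residuals matter even when R^2 looks good. I fit a least-squares line to made-up data that actually follow a curve (y = x^2), then compute the residuals and R^2. The data and the function name are hypothetical, and this is only one way to organize the calculation.

```python
def fit_line(xs, ys):
    """Least-squares slope and intercept for paired data."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept

# Hypothetical data that are curved, not linear (y = x^2).
xs = [0, 1, 2, 3, 4]
ys = [0, 1, 4, 9, 16]

slope, intercept = fit_line(xs, ys)

# Residual = actual value minus the value predicted by the line.
residuals = [y - (intercept + slope * x) for x, y in zip(xs, ys)]

# R^2 = 1 - (variation left in the residuals) / (total variation in y).
mean_y = sum(ys) / len(ys)
ss_res = sum(e ** 2 for e in residuals)
ss_tot = sum((y - mean_y) ** 2 for y in ys)
r_squared = 1 - ss_res / ss_tot

print(slope, intercept, residuals, round(r_squared, 3))
```

The residuals come out positive, negative, negative, negative, positive. That U shape is the kind of pattern that suggests a line is the wrong model, even though R^2 is above 0.9 here.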

Chapter 9 addresses the use of regression models to make predictions, and the care that

we need to take when extrapolating (as opposed to interpolating). The chapter discusses outliers,

as well as points with high leverage or influence, and it reiterates the extremely important fact

that correlation does not imply causation. Chapter 10 describes how to re-express data to be

linear, but we are not going to spend too much energy on this. As an aside, I completed a

capstone project for my B.A. where I explored the impact of the 1970 Clean Air Act

Amendments by analyzing the amount of pollution in different sectors as a function of time. In

order to perform this analysis effectively, I had to take the logarithm of the pollution variables

before completing my analysis. We can forgo that for now, but in the event that you may do this

sort of work, logarithms can often be your friend.

Examples:

Note: You may be more familiar with this line as follows: G = 0.11M + 2.73, where G represents

GPA hat (predicted GPA), and M represents number of meals eaten with family.

Note 2: Since there is no pattern in the residuals, that gives us an indication that the linear

regression is appropriate.

a. The y-intercept of 2.73 represents the predicted GPA of a student who ate 0 meals with

their family each week.

b. 0.11 indicates the expected increase in GPA per additional meal eaten with family each week.

c. Skipped

d. The student’s actual GPA is less than the GPA predicted by the model.

e. This is an example of someone believing that correlation implies causation. While it may

be true that students who eat more meals with their families tend to have a higher GPA,

that does not mean that eating more meals with families will cause a higher GPA. There

may be a lurking variable (e.g. more involved families lead both to higher GPA and more

meals together).
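The answers to a, b, and d can be checked with a short sketch of the model G = 0.11M + 2.73. The student below is hypothetical (the exercise's actual student values are not reproduced in these notes); the point is only that a negative residual means the actual GPA sits below the prediction.

```python
def predicted_gpa(meals_per_week):
    """Predicted GPA from the regression line G = 0.11M + 2.73."""
    return 2.73 + 0.11 * meals_per_week

# Part a: the y-intercept is the prediction at 0 meals.
gpa_at_zero = predicted_gpa(0)

# Part b: each extra meal raises the prediction by the slope, 0.11.
increase_per_meal = predicted_gpa(4) - predicted_gpa(3)

# Part d, with a hypothetical student: 5 meals per week, actual GPA of 3.0.
hypothetical_meals = 5
hypothetical_actual_gpa = 3.0
residual = hypothetical_actual_gpa - predicted_gpa(hypothetical_meals)

print(gpa_at_zero, round(increase_per_meal, 2), round(residual, 2))
```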

Let’s use Excel! As opposed to our univariate example in week 1, now we are going to enter data

in two columns. As a suggestion, it turns out that when using years in statistical analysis, it often

becomes very cumbersome to use the actual numerical values. I am going to suggest that we

reframe our independent variable to be “years since 1980.” If so, our table will look like this:

Years since 1980    Twin Births
 0                   68339
 1                   70049
 2                   71631
 3                   72287
 4                   72949
 5                   77102
 6                   79485
 7                   81778
 8                   85315
 9                   90118
10                   93865
11                   94779
12                   95372
13                   96445
14                   97064
15                   96736
16                  100750
17                  104137
18                  110670
19                  114307
20                  118916
21                  121246
22                  125134
23                  128665
24                  132219

Note: I used headers at the top of each column, which will prove helpful later. Also, when

entering things like years, you could type in the first two years, highlight two years, move to the

bottom right of the cells until you get a black cross, and then drag down. Excel is smart enough

to know that you want to create an arithmetic sequence, where each adjacent cell increases by the

same amount.

a. OK. Let’s make a scatter plot first. Highlight your data, then go to the “Insert” tab and

click on the thing that looks like a Scatter Plot, then click “Scatter.” You should get

something that looks like this:

[Figure: scatter plot titled "Twin Births"; vertical axis from 0 to 140,000, horizontal axis (years since 1980) from 0 to 30.]

When the scatter plot is selected, you will see at the top left an option that says “Add Chart

Element.” Click it, click “trendline,” then click “linear.” You should get a dotted line over your

scatter plot. When you double click on the trendline, you should get a menu. At the bottom, click

“display equation on chart” and “display r-squared value on chart.” Then, you should get this

(note: I moved the regression equation and increased the font).

[Figure: scatter plot titled "Twin Births" with a linear trendline; vertical axis from 0 to 140,000, horizontal axis (years since 1980) from 0 to 30. The chart displays y = 2618.3x + 64555 and R² = 0.97455.]

The equation, then, would be y = 2618.3x + 64,555, or, if you’d prefer, Births (hat) = 64,555 +

2618.3 * years since 1980.

b. The y-intercept, 64,555, would indicate the predicted number of twin births in 1980. The

slope, 2618.3, is the predicted increase in the number of twin births per year.

c. To predict the number of twin births in 2010, I will plug 30 into my regression equation,

and get Births (hat) = 64,555 + 2618.3 * 30, which equals approximately 143,104.

Extrapolation is always risky, but 2010 is close enough to our data set, and the number

certainly seems reasonable.
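If you want to double-check Excel's trendline, the least-squares calculation can be sketched in Python using the table above. The variable names are my own, and this is a by-hand version of the fit rather than whatever Excel does internally. Note that the prediction here uses the unrounded slope and intercept, so it lands a birth or two away from the 143,104 we got from the rounded equation.

```python
# Twin births data from the table above (x = years since 1980).
years = list(range(25))
births = [
    68339, 70049, 71631, 72287, 72949, 77102, 79485, 81778, 85315,
    90118, 93865, 94779, 95372, 96445, 97064, 96736, 100750, 104137,
    110670, 114307, 118916, 121246, 125134, 128665, 132219,
]

n = len(years)
mean_x = sum(years) / n
mean_y = sum(births) / n

# Least-squares slope and intercept.
sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(years, births))
sxx = sum((x - mean_x) ** 2 for x in years)
slope = sxy / sxx
intercept = mean_y - slope * mean_x

# R^2: fraction of the variation in births captured by the line.
ss_res = sum((y - (intercept + slope * x)) ** 2 for x, y in zip(years, births))
ss_tot = sum((y - mean_y) ** 2 for y in births)
r_squared = 1 - ss_res / ss_tot

# Part c: extrapolate to 2010, which is 30 years since 1980.
prediction_2010 = intercept + slope * 30

print(round(slope, 1), round(intercept), round(r_squared, 4), round(prediction_2010))
```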

Week 1 Problem Set:

1. A bakery is trying to predict how many loaves to bake. In the last 100 days, they have
   sold between 95 and 140 loaves per day. Here is the histogram of the number of
   loaves they sold for the last 100 days.
   a. Describe the distribution.
   b. Which should be larger, the mean number of sales or the median? Explain.

2. Average daily temperatures in January and July for 60 large US cities are graphed in
   the histogram below.
   a. What aspect of these histograms makes it difficult to compare the
      distributions?
   b. What differences do you see between the distributions of January and July
      average temperatures?

3. Roger Maris’s 1961 home run record stood until Mark McGwire hit 70 in 1998. Listed
   below are the home run totals for each season McGwire played. Also listed are Babe
   Ruth’s home run totals.
   a. Find the 5-number summary for McGwire’s career.
   b. Do any of his seasons appear to be outliers? Explain.
   c. McGwire played in only 18 games at the end of his first big league season, and
      missed major portions of some other seasons because of injuries to his back
      and knees. Those seasons might not be representative of his abilities. They are
      marked with asterisks in the list above. Omit these values and make parallel
      boxplots comparing McGwire’s career to Babe Ruth’s.
   d. Write a few sentences comparing the two sluggers.

4. Here is a stem-and-leaf display showing profits as a percent of sales for 29 of the
   Forbes 500 largest US corporations. The stems are split; each stem represents a span
   of 5%, from a loss of 9% to a profit of 25%.
   a. Find the 5-number summary.
   b. Draw a boxplot for these data.
   c. Find the mean and standard deviation.
   d. Describe the distribution of profits for these corporations.

5. Every year US News and World Report published a special issue on many US colleges
   and universities. The scatterplots have Student/Faculty Ratio for the colleges and
   universities on the y-axes plotted against the other 4 variables. The correct
   correlations for these scatterplots appear in this list. Match them.

6. Here are the scatterplot and regression analysis for Case Prices of 36 wines from
   vineyards in NY State and the Ages of the vineyards.
   a. Does it appear that vineyards in business longer get higher prices for their
      wines? Explain.
   b. What does this analysis tell us about vineyards in the rest of the world?
   c. Write the regression equation.
   d. How valuable is this equation? Explain.

7. One Thursday, researchers gave students in a section of Spanish a set of 50 new
   vocabulary words to memorize. On Friday students took a vocabulary test. When they
   returned to class the next Monday, they were retested – without advance warning.
   Here are the test scores for the 25 students.
   a. What is the correlation between Friday and Monday scores?
   b. What does a scatterplot show about the association between the scores?
   c. Write the equation of the regression line.
   d. Predict the Monday score of a student who earned a 40 on Friday.

8. The table below shows a 50-state average of the percent of expectant mothers who
   smoked cigarettes during their pregnancies.
   a. Create a scatterplot and describe the trend you see.
   b. Find the correlation.
   c. Write a linear model and interpret the slope in context.
