box plots, analyzing, correlation

  • Students will be able to differentiate between categorical and quantitative variables.
  • Students will be able to describe how to visually display categorical and quantitative variables.
  • Students will be able to describe the shape, center, and spread of quantitative variables.
  • Students will be able to find mean, median, and mode.
  • Students will be able to find range, IQR, and standard deviation.
  • Students will be able to calculate a five-number summary, find outliers, and draw boxplots.
  • Students will be able to construct, analyze, and interpret scatter plots.
  • Students will be able to interpret the correlation coefficient and R^2.
  • Students will be able to create and utilize linear regressions.
  • Students will be able to discuss residuals and analyze fit.
  • Students will be able to discuss the relationship between correlation and causation.

Week 1 – Univariate and Bivariate Data
    The first five chapters in the book address descriptive statistics, and specifically, univariate
    statistics (note: we will return to chapter 6 later). The two types of variables that will be
    considered will be categorical (how many cases of what is measured) and quantitative (quantity
    of what is measured).
    Categorical data can be illustrated in several ways – frequency tables (and relative frequency
    tables), bar charts, and pie charts are some of the most common. Quantitative data can be
    illustrated using histograms (and relative frequency histograms), stem-and-leaf plots, dotplots,
    and boxplots. One of the key purposes of statistics is to make meaning of complex data sets, and
    visual displays can help us with this meaning making. Specifically, for quantitative data, it is
    helpful to understand the shape, center, and spread.
    If you consult pp. 49 – 52, you will get a decent description of key terms when thinking about
    shape. Is it unimodal/multimodal? To what degree does it have symmetry? Is it skewed left
    (longer tail to the left) or right (longer tail to the right)? Are there possible outliers? Center is a
    way to suggest a typical value, and this is typically done with the median or mean (note: consult
    pp. 59-60 on mean and median; also consider the use of mode as a center). Spread gives us a
    sense of how clustered or spread out our data are, and range, IQR, mean absolute deviation,
    variance, and standard deviation attempt to capture spread. Standard deviation is a key statistic,
    and you can learn a bit more about how it is calculated on p. 61 (in short, it is the square root of
    the average of the squared deviations from the mean).
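As a rough illustration of that calculation, here is a minimal Python sketch (Python is my choice here, not something the book or these notes use) that works through the standard deviation one step at a time. The scores are made-up values, chosen only so the arithmetic comes out cleanly.

```python
from math import sqrt
from statistics import mean, median, pstdev

# Hypothetical data set (not from the book), chosen so the numbers are tidy.
scores = [4, 8, 6, 5, 3, 8, 9, 5]

# Center: mean and median.
m = mean(scores)
med = median(scores)

# Spread: range, then variance and standard deviation built from deviations.
data_range = max(scores) - min(scores)
squared_deviations = [(x - m) ** 2 for x in scores]
variance = sum(squared_deviations) / len(scores)  # average squared deviation
sd = sqrt(variance)                               # square root of that average

print("mean:", m, "median:", med, "range:", data_range)
print("variance:", variance, "standard deviation:", sd)

# The standard library's population standard deviation should agree.
print("pstdev check:", pstdev(scores))
```

Dividing by the number of values, as above, gives the population standard deviation. The sample standard deviation divides by one less than the number of values instead, so it comes out slightly larger.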
    Boxplots are nice because they give a more detailed visual of a data set, and the key components
    of shape, center, and spread. Boxplots are based on the five-number summary (the minimum, the
    1st quartile, the 2nd quartile/median, the 3rd quartile, and the maximum). You can find this
    summary by finding the median, and then finding the median of the first half of data, and the
    median of the 2nd half of the data. Additionally, see pg. 81 on how to use the 1.5*IQR rule to
    determine any outliers. Boxplots are great tools for comparing data sets, as are histograms.
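The median-of-halves procedure and the 1.5*IQR rule described above can be sketched in Python roughly as follows. This is only a sketch under two assumptions of mine: the data values are invented, and when the number of values is odd I leave the middle value out of both halves (textbooks and software differ on this convention, so quartiles from other tools may not match exactly).

```python
from statistics import median

def five_number_summary(values):
    """Return (min, Q1, median, Q3, max) using the median-of-halves method."""
    data = sorted(values)
    n = len(data)
    half = n // 2
    lower_half = data[:half]
    # Assumption: for an odd count, the middle value is excluded from both halves.
    upper_half = data[half + (n % 2):]
    return (data[0], median(lower_half), median(data), median(upper_half), data[-1])

def find_outliers(values):
    """Return the values lying more than 1.5*IQR beyond the quartiles."""
    _, q1, _, q3, _ = five_number_summary(values)
    iqr = q3 - q1
    low_fence = q1 - 1.5 * iqr
    high_fence = q3 + 1.5 * iqr
    return [v for v in values if v < low_fence or v > high_fence]

# Hypothetical data set with one suspiciously large value.
sample = [2, 4, 5, 7, 8, 9, 10, 12, 30]
print(five_number_summary(sample))
print(find_outliers(sample))
```

For this made-up data set the quartiles are 4.5 and 11, so the IQR is 6.5 and the upper fence is 11 + 9.75 = 20.75, which flags 30 as an outlier.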
    Let’s do some examples.
    a) This distribution is symmetric (as observed in the histogram). The bicep procedure looks
    to be skewed left, and the deltoid procedure appears to be tightly concentrated, with some
    outliers.
    b) The range of strength scores is approximately 4 (it is unclear, due to the nature of the bin
    size). The range for the bicep procedure is just over 2, and the range for the deltoid
    procedure is approximately 2.
    c) We do not see the exact strength scores for the two procedures.
    d) The biceps method had the higher median score.
    e) The biceps method was not always best. There was a deltoid outlier that was higher than
    approximately 25% of the biceps scores, and there may have been some other deltoid scores
    that were marginally higher than some of the lower biceps scores as well.
    f) The deltoid procedure produced the most consistent results. With the exception of the two
    outliers, all of the deltoid procedures yielded strength scores in a very narrow range.
    Note: Unlike what you may have experienced in previous math courses, it is essential to draw on
    context as you analyze and interpret data using statistics and statistical representations.
    a) Throughout this course, I am going to suggest how Excel can be utilized to do statistical
    analysis. First, I will plug these data values into Excel. Assuming I did that correctly, I can then
    use Excel commands to answer this question.
    Using =MEDIAN(), where I highlight my data and place it inside the parentheses (so it would
    actually read =MEDIAN(A1:A50), because my data are in A1 through A50), I would get 239.
    To find IQR, I need to subtract the 1st quartile from the 3rd quartile. I will do this in the following
    way:
    =QUARTILE.EXC(A1:A50,3)-QUARTILE.EXC(A1:A50,1)
    I wouldn’t get too concerned about QUARTILE.EXC v. QUARTILE.INC (exclusive v.
    inclusive). The quartile functions, though, ask for the data set first, followed by a comma, followed
    by which quartile you are looking for. By the way, the IQR is 9.
    To find mean, I would use =AVERAGE(A1:A50), which gives me 237.64.
    To find standard deviation, you can use =STDEV.P() or =STDEV.S() to find population or
    sample standard deviation. We will get into this in more detail later, but for now, let’s use
    =STDEV.P(A1:A50), which gives us 5.63.
    b) Let’s go further and create a boxplot. To create a boxplot, first we need to do a five-number
    summary, so we need our min, 1st quartile, median, 3rd quartile, and max. We can find these with =MIN(),
    =QUARTILE.EXC( ,1), =MEDIAN(), =QUARTILE.EXC( ,3), and =MAX(). Our five-number
    summary, then, would be 224, 233, 239, 242, and 247. Next, we need to find outliers. In order to
    do so I need to check if any data points fall outside Q1 – 1.5*IQR on the low end, or Q3 +
    1.5*IQR on the high end. 1.5*IQR = 1.5 * 9 = 13.5. Q1 – 13.5 = 219.5, and nothing falls below
    that. Q3 + 13.5 = 255.5, and nothing falls above that. Therefore, there are no outliers. Excel can
    create a boxplot if you finesse it a bit, but I can be lazy, and I don’t want to do all that work if
    there is an easier program out there. If you go to
    http://www.alcula.com/calculators/statistics/box-plot/, then copy and paste in your data, you
    should get something that looks like this:
    This is a vertically positioned boxplot, but it does the trick. Notice how nicely the boxplot shows
    characteristics of the data set. You can see a longer “tail” to the left (bottom), so the data set may
    be slightly skewed left. You can see that the right side (top) of the box is more condensed than
    the left (bottom), and you can also clearly see the four quartiles (the first whisker, the first half of
    the box, the second half of the box, and the second whisker), as well as the mid 50% (the box).
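For anyone who wants to double-check the outlier arithmetic from part b) outside of Excel, here is a small Python sketch. It starts from the five-number summary reported above (224, 233, 239, 242, 247) instead of the raw data, which are not reproduced in these notes; the variable names are my own.

```python
# Five-number summary reported in part b) of the example.
minimum, q1, med, q3, maximum = 224, 233, 239, 242, 247

iqr = q3 - q1                   # interquartile range
low_fence = q1 - 1.5 * iqr      # anything below this would be an outlier
high_fence = q3 + 1.5 * iqr     # anything above this would be an outlier

# If even the min and max sit inside the fences, no value can be an outlier.
has_outliers = minimum < low_fence or maximum > high_fence

print("IQR:", iqr)
print("fences:", low_fence, high_fence)
print("any outliers?", has_outliers)
```

Because the minimum and maximum both fall inside the fences, checking those two values is enough to conclude that there are no outliers.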
    Next, we transition from univariate (one variable) to bivariate (two variable) statistics. One of the
    best ways to understand the relationship between two variables is to create a scatterplot. When
    analyzing a scatterplot, we want to pay attention to direction (positive/negative), strength
    (strong/weak/none), and general shape (linear, exponential, etc.). The correlation coefficient gives
    us a good deal of information on direction and correlation for linear relationships (see p. 156),
    and when data are not particularly linear, it is often beneficial to re-express the data so that we do
    see a straighter relationship (see pp. 158-159). I would also suggest reviewing pp. 157-158,
    because it is essential to understand that correlation does not imply causation. For example, if I
    graphed the average global temperature as a dependent variable, and the number of pirates
    worldwide as the independent variable, I would notice a positive correlation (see link for image).
    That does not mean that annual global temperature is dependent on the number of pirates. More
    likely, there is a lurking variable (or several lurking variables), such as population growth.
    It is often helpful to create a model for bivariate data. If the data are roughly linear, we can draw
    a line of best fit (or linear regression). You will see that technology can create this line of best fit,
    but the mathematics behind it has to do with residuals and least squares (see pp. 172-173 if
    interested). When using linear regressions, you will have to draw on your knowledge of algebra
    to utilize slopes, y-intercepts, and ordered pairs. We will do some examples at the end of these
    notes to refresh your memory. There are some additional items to attend to in chapter 8. A
    scatterplot of residuals should have no discernible pattern – if it does, then your original
    regression may have been incorrect (e.g. you used a linear regression instead of a cubic
    regression). With regressions in general (both linear and other types), R^2 is a very useful tool.
    R^2 gives the fraction (can be converted to a percent) of the data’s variation that the model
    captures. An R^2 of 1 (100%) would indicate that the model perfectly accounts for the data’s
    variation. An R^2 of 0 (0%) indicates that the model captures none of the data’s variation. Please
    consult pp. 184-185 for assumptions and conditions when performing a regression, and it will
    also be important to always consider whether the regression model is reasonable (see p. 188).
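To make residuals and R^2 a little more concrete, here is a minimal Python sketch that fits a least-squares line to a tiny invented data set and then computes the correlation coefficient, the residuals, and R^2. The data and variable names are assumptions of mine for illustration only; in practice Excel does this work for you.

```python
from math import sqrt

# Hypothetical bivariate data (not from the book).
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]

n = len(x)
mean_x = sum(x) / n
mean_y = sum(y) / n

# Sums of squares and cross-products about the means.
sxx = sum((xi - mean_x) ** 2 for xi in x)
syy = sum((yi - mean_y) ** 2 for yi in y)
sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))

# Correlation coefficient and least-squares line.
r = sxy / sqrt(sxx * syy)
slope = sxy / sxx
intercept = mean_y - slope * mean_x

# Residual = actual value minus the value predicted by the line.
predicted = [intercept + slope * xi for xi in x]
residuals = [yi - pi for yi, pi in zip(y, predicted)]

# R^2 = 1 - (variation left in the residuals) / (total variation in y).
sse = sum(e ** 2 for e in residuals)
r_squared = 1 - sse / syy

print("r:", round(r, 4))
print("line: y-hat =", round(intercept, 2), "+", round(slope, 2), "x")
print("residuals:", [round(e, 2) for e in residuals])
print("R^2:", round(r_squared, 4))
```

For a linear regression with one explanatory variable, R^2 is the square of the correlation coefficient, which this sketch lets you confirm. Plotting the residuals against x is the usual way to look for a leftover pattern.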
    Chapter 9 addresses the use of regression models to make predictions, and the care that we
    need to take when extrapolating (as opposed to interpolating). The chapter discusses outliers,
    as well as points with high leverage or influence, and it reiterates the extremely important fact
    that correlation does not imply causation. Chapter 10 describes how to re-express data to be
    linear, but we are not going to spend too much energy on this. As an aside, I completed a
    capstone project for my B.A. where I explored the impact of the 1970 Clean Air Act
    Amendments by analyzing the amount of pollution in different sectors as a function of time. In
    order to perform this analysis effectively, I had to take the logarithm of the pollution variables
    before completing my analysis. We can forgo that for now, but in the event that you may do this
    sort of work, logarithms can often be your friend.
    Examples:
    Note: You may be more familiar with this line as follows: G = 0.11M + 2.73, where G represents
    GPA hat (predicted GPA), and M represents number of meals eaten with family.
    Note 2: Since there is no pattern in the residuals, that gives us an indication that the linear
    regression is appropriate.
    a. The y-intercept of 2.73 represents the predicted GPA of a student who ate 0 meals with
    their family each week.
    b. The slope of 0.11 indicates the expected increase in GPA per additional meal eaten with family each week.
    c. Skipped
    d. The student’s actual GPA is less than the GPA predicted by the model.
    e. This is an example of someone believing that correlation implies causation. While it may
    be true that students who eat more meals with their families tend to have a higher GPA,
    that does not mean that eating more meals with families will cause a higher GPA. There
    may be a lurking variable (e.g. more involved families lead both to higher GPA and more
    meals together).
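Here is a short Python sketch of how the model G = 0.11M + 2.73 from the notes above can be used to make a prediction and compute a residual. The student in the sketch (5 meals per week, actual GPA of 3.0) is hypothetical; the original problem's data are not reproduced in these notes.

```python
def predicted_gpa(meals_per_week):
    """Predicted GPA from the regression line G-hat = 0.11 * M + 2.73."""
    return 0.11 * meals_per_week + 2.73

# Hypothetical student, invented for illustration.
meals = 5
actual_gpa = 3.0

predicted = predicted_gpa(meals)
residual = actual_gpa - predicted  # actual minus predicted

print("predicted GPA at 0 meals (the y-intercept):", predicted_gpa(0))
print("predicted GPA at", meals, "meals:", round(predicted, 2))
print("residual:", round(residual, 2))
```

A negative residual, as with this made-up student, means the actual GPA falls below the model's prediction, which is the situation described in part d.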
    Let’s use Excel! As opposed to our univariate example in week 1, now we are going to enter data
    in two columns. As a suggestion, it turns out that when using years in statistical analysis, it often
    becomes very cumbersome to use the actual numerical values. I am going to suggest that we
    reframe our independent variable to be “years since 1980.” If so, our table will look like this:
    Years since 1980    Twin Births
     0                   68339
     1                   70049
     2                   71631
     3                   72287
     4                   72949
     5                   77102
     6                   79485
     7                   81778
     8                   85315
     9                   90118
    10                   93865
    11                   94779
    12                   95372
    13                   96445
    14                   97064
    15                   96736
    16                  100750
    17                  104137
    18                  110670
    19                  114307
    20                  118916
    21                  121246
    22                  125134
    23                  128665
    24                  132219
    Note: I used headers at the top of each column, which will prove helpful later. Also, when
    entering things like years, you could type in the first two years, highlight two years, move to the
    bottom right of the cells until you get a black cross, and then drag down. Excel is smart enough
    to know that you want to create an arithmetic sequence, where each adjacent cell increases by the
    same amount.
    a. OK. Let’s make a scatter plot first. Highlight your data, then go to the “Insert” tab and
    click on the thing that looks like a Scatter Plot, then click “Scatter.” You should get
    something that looks like this:
    [Figure: scatter plot titled “Twin Births”; vertical axis from 0 to 140,000, horizontal axis (years since 1980) from 0 to 30]
    When the scatter plot is selected, you will see at the top left an option that says “Add Chart
    Element.” Click it, click “trendline,” then click “linear.” You should get a dotted line over your
    scatter plot. When you double click on the trendline, you should get a menu. At the bottom, click
    “display equation on chart” and “display r-squared value on chart.” Then, you should get this
    (note: I moved the regression equation and increased the font).
    y = 2618.3x + 64555
    R² = 0.97455
    [Figure: the “Twin Births” scatter plot with the linear trendline added; vertical axis from 0 to 140,000, horizontal axis from 0 to 30]
    The equation, then, would be y = 2618.3x + 64,555, or, if you’d prefer, Births (hat) = 64,555 +
    2618.3 * years since 1980.
    b. The y-intercept, 64,555, would indicate the predicted number of twin births in 1980. The
    slope, 2618.3, is the predicted increase in the number of twin births per year.
    c. To predict the number of twin births in 2010, I will plug 30 into my regression equation,
    and get Births (hat) = 64,555 + 2618.3 * 30, which equals approximately 143,104.
    Extrapolation is always risky, but 2010 is close enough to our data set, and the number
    certainly seems reasonable.
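If you would like to confirm Excel's trendline without Excel, the following Python sketch recomputes the least-squares line, R², and the 2010 prediction from the twin births data above. It uses only plain arithmetic (no statistics library), and the variable names are my own.

```python
# Twin births data from the table above: x = years since 1980, y = twin births.
years = list(range(25))
births = [68339, 70049, 71631, 72287, 72949, 77102, 79485, 81778, 85315,
          90118, 93865, 94779, 95372, 96445, 97064, 96736, 100750, 104137,
          110670, 114307, 118916, 121246, 125134, 128665, 132219]

n = len(years)
mean_x = sum(years) / n
mean_y = sum(births) / n

# Sums of squares and cross-products about the means.
sxx = sum((x - mean_x) ** 2 for x in years)
syy = sum((y - mean_y) ** 2 for y in births)
sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(years, births))

slope = sxy / sxx                      # predicted change in births per year
intercept = mean_y - slope * mean_x    # predicted births in 1980 (x = 0)
r_squared = sxy ** 2 / (sxx * syy)     # fraction of variation explained

# 2010 is 30 years after 1980, which is beyond the data (extrapolation).
prediction_2010 = intercept + slope * 30

print("slope:", round(slope, 1))
print("intercept:", round(intercept))
print("R^2:", round(r_squared, 5))
print("predicted twin births in 2010:", round(prediction_2010))
```

The slope, intercept, and R² agree with the values Excel displays on the chart. The prediction comes out one or two births below the 143,104 in part c, because this sketch uses the unrounded coefficients while the notes plug in the rounded ones.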
    Week 1 Problem Set:
    1. A bakery is trying to predict how many loaves to bake. In the last 100 days, they have
       sold between 95 and 140 loaves per day. Here is the histogram of the number of
       loaves they sold for the last 100 days.
       a. Describe the distribution.
       b. Which should be larger, the mean number of sales or the median? Explain.
    2. Average daily temperatures in January and July for 60 large US cities are graphed in
       the histogram below.
       a. What aspect of these histograms makes it difficult to compare the
          distributions?
       b. What differences do you see between the distributions of January and July
          average temperatures?
    3. Roger Maris’s 1961 home run record stood until Mark McGwire hit 70 in 1998. Listed
       below are the home run totals for each season McGwire played. Also listed are Babe
       Ruth’s home run totals.
       a. Find the 5-number summary for McGwire’s career.
       b. Do any of his seasons appear to be outliers? Explain.
       c. McGwire played in only 18 games at the end of his first big league season, and
          missed major portions of some other seasons because of injuries to his back
          and knees. Those seasons might not be representative of his abilities. They are
          marked with asterisks in the list above. Omit these values and make parallel
          boxplots comparing McGwire’s career to Babe Ruth’s.
       d. Write a few sentences comparing the two sluggers.
    4. Here is a stem-and-leaf display showing profits as a percent of sales for 29 of the
       Forbes 500 largest US corporations. The stems are split; each stem represents a span
       of 5%, from a loss of 9% to a profit of 25%.
       a. Find the 5-number summary.
       b. Draw a boxplot for these data.
       c. Find the mean and standard deviation.
       d. Describe the distribution of profits for these corporations.
    5. Every year US News and World Report publishes a special issue on many US colleges
       and universities. The scatterplots have Student/Faculty Ratio for the colleges and
       universities on the y-axes plotted against the other 4 variables. The correct
       correlations for these scatterplots appear in this list. Match them.
    6. Here are the scatterplot and regression analysis for Case Prices of 36 wines from
       vineyards in NY State and the Ages of the vineyards.
       a. Does it appear that vineyards in business longer get higher prices for their
          wines? Explain.
       b. What does this analysis tell us about vineyards in the rest of the world?
       c. Write the regression equation.
       d. How valuable is this equation? Explain.
    7. One Thursday, researchers gave students in a section of Spanish a set of 50 new
       vocabulary words to memorize. On Friday students took a vocabulary test. When they
       returned to class the next Monday, they were retested – without advance warning.
       Here are the test scores for the 25 students.
       a. What is the correlation between Friday and Monday scores?
       b. What does a scatterplot show about the association between the scores?
       c. Write the equation of the regression line.
       d. Predict the Monday score of a student who earned a 40 on Friday.
    8. The table below shows a 50-state average of the percent of expectant mothers who
       smoked cigarettes during their pregnancies.
       a. Create a scatterplot and describe the trend you see.
       b. Find the correlation.
       c. Write a linear model and interpret the slope in context.
