Assignment 5
PS 3780 Data Literacy & Visualization, Autumn 2022
Due Date: Friday, March 3, 2023 at 11:59 p.m.
Please include your visualizations and answers to these questions as one .pdf
file (use the Save As function in most word processors). Be sure to include your name,
your teammate’s name if you have one, and the assignment number. Submit the file
to Carmen by the due date. Remember we are looking for professional visualizations, so
please include a meaningful title as well as axis labels and a legend.
Analyzing TV Shows
In this assignment, we will complete a mini research program: collect data, summarize
data, visualize data. We will evaluate the theory that well-rated television shows consistently have more episodes than lower-rated ones. We will use the IMDB data collected
in Assignment 3. If you did not properly complete that assignment, collect the data by
following the posted scraping solution videos.
1 Getting Data: Webscraping IMDB
An example of what the datasets should look like before loading them into R.
1.1 Example Final Datasets
[Figure: example dataset from ParseHub]
[Figure: example dataset from webscraper.io]
2 Formatting and Summarizing the Data in R
2.1 Formatting (1 pt)
Show the code used for each step.
1. Load each dataset in separately so that there are three in your global environment. Create the tv dataset, combining the three genre datasets with the rbind( )
command.
2. Use the unique( ) command to filter out entries that are exact duplicates. Make
sure to store this refined dataset in memory.
3. Using the log( ) command, transform and save a new version of the episodes
variable. Make sure this new variable is saved within the dataset with a new name.
4. Create and save a subset of the full dataset that selects only the first time a show
appears in the list. Use indexing and the duplicated( ) command; it will look like:
tv[ !duplicated( tv$name ) , ]
(for a dataset named 'tv' and a show column named 'name')
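The four formatting steps can be sketched on toy data as follows. This is only an illustration, not the posted solution: the three small data frames stand in for your scraped genre files (which you would normally load with read.csv( ) or a similar command), and every column name other than name and episodes (genre, rating, seasons, log_episodes) is an assumption that may not match your own files.

```r
# Toy stand-ins for the three scraped genre datasets (all values invented).
comedy <- data.frame(name     = c("Show A", "Show B", "Show B"),
                     genre    = "Comedy",
                     rating   = c(8.1, 7.4, 7.4),
                     episodes = c(120, 24, 24),
                     seasons  = c(6, 2, 2))
drama  <- data.frame(name     = c("Show C", "Show A"),
                     genre    = "Drama",
                     rating   = c(9.0, 8.1),
                     episodes = c(62, 120),
                     seasons  = c(5, 6))
crime  <- data.frame(name = "Show D", genre = "Crime",
                     rating = 6.8, episodes = 10, seasons = 1)

# Step 1: stack the three datasets into one (6 rows here).
tv <- rbind(comedy, drama, crime)

# Step 2: drop rows that are exact duplicates and store the result (5 rows).
tv <- unique(tv)

# Step 3: save the natural log of episodes under a new (assumed) name.
tv$log_episodes <- log(tv$episodes)

# Step 4: keep only the first row for each show name (4 rows).
# "Show A" is listed under two genres, so only its first listing survives.
tv_first <- tv[!duplicated(tv$name), ]
```

Note that unique( ) only removes rows that match in every column, while duplicated( ) on tv$name removes repeat listings of a show even when the genre differs.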
2.2 Summarizing (1.5 pt)
For this section, use the subset created in the last step of Part 2.1.
1. Summarize the tv show ratings and the number of episodes using the summary( )
command. Does anything stand out?
2. Correlate the tv show rating with the number of episodes and number of seasons
using the cor( ) command. Does there appear to be a relationship between rating
and either episodes or seasons?
3. What is the average rating for shows that ran for 2 seasons? 3 seasons? 4 seasons?
Are these calculations consistent with the correlation found in question 2?
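A minimal sketch of these three summaries, again on invented data. The data frame name tv_first and the columns rating, episodes, and seasons are assumptions; substitute the names in your own subset. The use = "complete.obs" and na.rm = TRUE arguments are included in case your scraped data contains missing values.

```r
# Toy version of the first-appearance subset (all values invented).
tv_first <- data.frame(
  name     = c("Show A", "Show B", "Show C", "Show D", "Show E", "Show F"),
  rating   = c(8.1, 7.4, 9.0, 6.8, 7.9, 8.3),
  episodes = c(120, 24, 62, 20, 30, 45),
  seasons  = c(6, 2, 5, 2, 3, 4)
)

# Question 1: five-number summary plus the mean for each variable.
summary(tv_first$rating)
summary(tv_first$episodes)

# Question 2: correlations, skipping rows with missing values.
cor_episodes <- cor(tv_first$rating, tv_first$episodes, use = "complete.obs")
cor_seasons  <- cor(tv_first$rating, tv_first$seasons,  use = "complete.obs")

# Question 3: average rating for shows with exactly 2 seasons.
mean_2 <- mean(tv_first$rating[tv_first$seasons == 2], na.rm = TRUE)

# The same calculation for every season count at once.
by_seasons <- aggregate(rating ~ seasons, data = tv_first, FUN = mean)
```

Repeating the indexing line with seasons == 3 and seasons == 4 gives the other two averages; the aggregate( ) table is one way to check them.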
3 Plotting in R
3.1 Popularity and Episodes (2.5 pt)
For this section, use the subset created in the last step of Part 2.1.
Make a scatterplot comparing the rating of a show to the logged number of episodes.
Specify two attributes of the graph to indicate the genre of the show and the number
of seasons for the show. Make sure your graph has professional titles and labels. Write
a paragraph describing the apparent relationship (or lack thereof) between these three
variables.
Remember the primary parts of a scatterplot that we can use to represent data
include: the x-axis, the y-axis, color, size, and shape. Think about which graph
attribute will work best for the different variable types.
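One possible mapping, sketched in base R on invented data: color for genre (a categorical variable) and point size for number of seasons (a numeric variable). This is an assumption about what works well, not the required answer, and the column names, colors, and axis placement are all illustrative choices. If the class examples use a different plotting package, the same mapping idea carries over.

```r
# Toy subset (all values invented; column names are assumptions).
tv_first <- data.frame(
  rating   = c(8.1, 7.4, 9.0, 6.8, 7.9, 8.3),
  episodes = c(120, 24, 62, 20, 30, 45),
  seasons  = c(6, 2, 5, 2, 3, 4),
  genre    = c("Comedy", "Comedy", "Drama", "Crime", "Drama", "Crime")
)
tv_first$log_episodes <- log(tv_first$episodes)

# Color encodes genre: one color per factor level.
genre_f    <- factor(tv_first$genre)
palette_3  <- c("steelblue", "darkorange", "forestgreen")
genre_cols <- palette_3[as.integer(genre_f)]

# Point size encodes seasons: cex grows with the season count.
point_size <- 0.5 + tv_first$seasons / 3

plot(tv_first$log_episodes, tv_first$rating,
     col = genre_cols, pch = 19, cex = point_size,
     main = "Show Rating versus Logged Number of Episodes",
     xlab = "Number of episodes (natural log)",
     ylab = "Rating")
legend("bottomright", legend = levels(genre_f),
       col = palette_3, pch = 19, title = "Genre")
```

This sketch only draws a legend for color; a finished graph would also need to tell the reader what point size means.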
3.2 Popularity by Genre (2 pt)
For this section, use the full dataset as it stands after the log( ) step (Part 2.1, step 3).
Select just the genres with at least 100 shows in the dataset. Plot overlapping density
curves for the logged number of episodes for shows within each genre. Make sure each
density curve is visible. Write a paragraph describing the differences and similarities
between the genres.
The table( ) command can help you decide which genres to focus on. When I
completed it, there were 6 genres with more than 100 shows. To follow the example
from class, create one subset which includes only the selected genres of shows. That
subset will be the foundation of the plot.
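The genre selection and overlapping curves could be sketched as below, using simulated data and base R. The column names genre and log_episodes are assumptions, the simulated counts are invented, and the class example may build the plot differently.

```r
# Simulated full dataset: two large genres and one small one (values invented).
set.seed(1)
tv <- data.frame(
  genre    = rep(c("Comedy", "Drama", "Crime"), times = c(150, 120, 40)),
  episodes = c(rpois(150, 40), rpois(120, 60), rpois(40, 20)) + 1
)
tv$log_episodes <- log(tv$episodes)

# Count shows per genre and keep the genres with at least 100 shows.
genre_counts <- table(tv$genre)
keep   <- names(genre_counts)[genre_counts >= 100]
tv_big <- tv[tv$genre %in% keep, ]

# One density estimate per selected genre.
dens <- lapply(keep, function(g) density(tv_big$log_episodes[tv_big$genre == g]))

# Axis limits wide enough that every curve is fully visible.
x_range <- range(sapply(dens, function(d) range(d$x)))
y_max   <- max(sapply(dens, function(d) max(d$y)))
cols    <- seq_along(keep)  # one color per genre from the default palette

# Empty plot frame, then one line per genre.
plot(NA, xlim = x_range, ylim = c(0, y_max),
     main = "Logged Number of Episodes by Genre",
     xlab = "Number of episodes (natural log)", ylab = "Density")
for (i in seq_along(dens)) lines(dens[[i]], col = cols[i], lwd = 2)
legend("topright", legend = keep, col = cols, lwd = 2, title = "Genre")
```

Computing the axis limits from all of the curves before plotting is what keeps a tall or wide curve from being cut off by the frame set by the first one.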