STATS 220
SEMESTER ONE, 2022
STATISTICS
Data Technologies
This exam has been designed for you to complete without running any R or SQL code.
If you choose to develop or check your answers by running R or SQL code, please be advised that we take no responsibility for any issues you face with sourcing the data used within the exam, or with using any tools to run the code (e.g. RStudio or the STATS 220 lab task code boxes).
Also note that you are expected to use code approaches demonstrated within STATS 220. Use of other R code approaches may receive no recognition in terms of marks awarded for answers.
In Lab task 5A, you were given access to the ZOOM participation data for STATS 220 this semester.
Recall that the actual names of participants were removed and replaced with “private” names, such as student4, student77, etc.
For reference, the first 10 rows of the data frame zoom_data are shown below.
1 The data frame zoom_data contains ______ rows and ______
The result of running the R code zoom_data$private_name[4] would be ______
The result of running the R code zoom_data$guest %>% unique() %>% length() would be ______
Maximum marks: 5
2 Anna challenged STATS 220 students to determine which “student” she was in the ZOOM participation data.
Suppose that a STATS 220 student used the strategy of finding the five highest (longest) participation times for participants that were not guests.
The code that the student wrote is shown below, but some parts of the code have been replaced with numbers, e.g. {1}.
zoom_data %>%
  {1}(guest {2} "No") %>%
  arrange({3}(participation_time_minutes)) %>%
  {4}(1:5)
Use the boxes below to enter the missing function, operator, argument name or value.
{1}
{2}
{3}
{4}
Maximum marks: 4
3 The visualisation below was created to compare the mean participation times, as well as the shortest and longest participation times, for each lecture.
The data frame summary_data was used to create the visualisation above.
Describe how you could use functions from {dplyr} to manipulate the data frame zoom_data to create the data frame summary_data.
4 In no more than three sentences, describe what changes you would make to improve the visualisation shown in Q3.
Refer to the grammar of graphics in your description and explain how your proposed changes would better communicate a story visually.
Maximum marks: 3
5 The visualisation below was created using zoom_data to compare the number of participants at each ZOOM lecture.
Only participants who had times of more than 25 minutes were included in the visualisation.
The code used to create the visualisation above is shown below, but some parts of the code have been replaced with numbers.
participate_data <- zoom_data %>%
  filter(participation_time_minutes > {1}) %>%
  count(date_lecture)

ggplot(data = participate_data) +
  geom_{2}(aes(x = {3},
               y = n,
               {4} = date_lecture),
           {5} = "identity") +
  labs(title = "Participation in STATS 220 ZOOM lectures",
       subtitle = "Based on participant times of more than 25 minutes",
       x = "Date of lecture",
       y = "Number of students") +
  scale_y_continuous(breaks = seq(0, 130, 20)) +
  guides(fill = "none")
Use the boxes below to enter the missing function, operator, argument name or value.
{1}
{2}
{3}
{4}
{5}
Maximum marks: 5
6 For Assignment 3, you sourced data about books from the Google Books API in a JSON data format.
Use the example of JSON shown above to describe TWO key features of JSON syntax and ONE potential difficulty with creating a data frame from JSON data.
Maximum marks: 3
7 Data was obtained from the Spotify API for the STATS 220 birthday songs playlist.
The app from Assignment 4 was used to download JSON data about songs on the playlist, and the data was saved as a file called “spotify.json”.
Below are the JSON data for two of the songs in the playlist.
The JSON data for all the songs in the playlist was read into data frame spotify_data using a function from {jsonlite}.
R code and functions from {dplyr} and {lubridate} were then used to create a new data frame months_released by manipulating spotify_data to find the number of songs on the playlist that were released in each month of the year, and the mean popularity of the songs/tracks released in each month.
The code used to create months_released is shown below, but some parts of the code have been replaced with numbers.
spotify_data <- {1}("spotify.json")

months_released <- {2} %>%
  {3}(release_date = ymd(release_date),
      release_month = month(release_date, label = TRUE)) %>%
  {4}(release_month) %>%
  {5}(num_songs = n(),
      mean_popularity = mean(track_popularity))
Use the boxes below to enter the missing function, operator, argument name or value.
{1}
{2}
{3}
{4}
{5}
Maximum marks: 5