Assignment 4
EMATM0061: Statistical Computing and Empirical Methods, TB1, 2024
Introduction
This assignment is mainly based on Lectures 9 and 10. It is recommended that you watch the video lectures before starting.
Create an R Markdown for the assignment
It is a good practice to use R Markdown to organise your code and results. For example, you can start with the template called Assignment02_TEMPLATE.Rmd which can be downloaded via Blackboard. If you try to write mathematical
expressions in R Markdown, examples can be found in the document “Assignment_R MarkdownMathformulasandSymbolsExamples.rmd” (under the resource list tab on blackboard course webpage).
You can optionally submit this assignment by 13:00 Monday 14th October, which will help us understand your work but will not count towards your final grade. The submission point can be found under the assignment tab at Blackboards (click the title “Assignment 04”).
Load packages
Load the tidyverse package:
library(tidyverse)
Additionally, download the file called “HockeyLeague.xlsx” from Blackboards which will be needed by some of the questions in this assignment.
1. Tidy data and iteration
Tidy data and iteration has been introduced in Lecture 9.
1.1. Missing data and iteration
In this task we investigate the effect of missing data and writing iterations in R.
(Q1) The following function performs imputation by mean. What library do we need to load to run this function?
impute_by_mean<-function(x){
mu<-mean(x,na.rm=TRUE) # first compute the mean of x impute_f<-function(z){ # coordinate-wise imputation
if(is.na(z)){
return(mu) # if z is na replace with mean
}else{
return(z) # otherwise leave in place
}
}
return(map_dbl(x,impute_f)) # apply the map function to impute across
vector
}
(Q2) Create a function called “impute_by_median” which imputes missing values based on the median of the sample, rather than the mean.
You can test your function on the following sample vector:
v<-c(1,2,NA,4)
impute_by_median(v)
## [1] 1 2 2 4
(Q3) Next generate a data frame. with two variables x andy. For our first variable x we have a sequence (x1,x2, ⋯ ,xn) where x1 = 0, xn = 10 and for each i = 1, ⋯ ,n − 1, xi+1 = xi + 0.1. For our second variable y we set yi = 5xi + 1 for i = 1, ⋯ ,n.
Generate data of this form. and place within a data frame called “df_xy” .
df_xy %>% head(5)
## x y
## 1 0.0 1.0
## 2 0.1 1.5
## 3 0.2 2.0
## 4 0.3 2.5
## 5 0.4 3.0
(Q4)
map2: The “map2()” function is similar to the “map()” function but iterates over two variables in parallel rather than one. You can learn more here
https://purrr.tidyverse.org/reference/map2.html. The following simple example shows you how “map2_dbl()” can be combined with the “mutate()” function.
df_xy %>%
mutate(z=map2_dbl(x,y,~.x+.y)) %>%
head(5)
## x y z
## 1 0.0 1.0 1.0
## 2 0.1 1.5 1.6
## 3 0.2 2.0 2.2
## 4 0.3 2.5 2.8
## 5 0.4 3.0 3.4
We will now use map2_dbl() to generate a new data frame with missing data.
First create a function “sometimes_missing” with two arguments: “index” and “value”. The “function” should return “NA” if index is divisible by 5 and return “value” otherwise.
Your function should produce the following outputs:
sometimes_missing(14,25)
## [1] 25
sometimes_missing(15,25)
## [1] NA
Next generate a new data frame called “df_xy_missing” with two variables x andy, but some missing data. For the first variable x we have a sequence (x1, ⋯ ,xn), which is precisely the same as with “df_xy” . For the second variable y we have a sequence (y1, ⋯ ,yn) where yi = NA if i is divisible by 5 and otherwise yi = 5xi + 1. To generate the data frame. “d_xy_missing” you may want to make use of the functions “row_number(), map2_dbl(), mutate()” as well as “sometimes_missing()” .
Check that the first ten rows of your data frame are as follows:
df_xy_missing %>% head(10)
## x y
## 1 0.0 1.0
## 2 0.1 1.5
## 3 0.2 2.0
## 4 0.3 2.5
## 5 0.4 NA
## 6 0.5 3.5
## 7 0.6 4.0
## 8 0.7 4.5
## 9 0.8 5.0
## 10 0.9 NA
(Q5) Create a new data frame. “df_xy_imputed” with two variables x andy. For the first variable x we have a sequence (x1, ⋯ ,xn), which is precisely the same as with “df_xy” . For the second variable y we have a sequence (y′1, ⋯ ,y ′n) which is formed from (y1, ⋯ ,yn) by imputing all its missing values with the median. To generate
“df_xy_imputed” from “df_xy_missing” by applying a combination of the functions “mutate()” and “impute_by_median()” .
The first part of the data frame should look like this:
## x y
## 1 0.0 1.0
## 2 0.1 1.5
## 3 0.2 2.0
## 4 0.3 2.5
## 5 0.4 26.0
## 6 0.5 3.5
1.2 Tidying data with pivot functions
In this task you will read in data from a spreadsheet and apply some data wrangling tasks to tidy that data.
First download the excel spreadsheet entitled ““HockeyLeague.xlsx”“ . The excel file contains two spread-sheets - one with the wins for each team and one with the
losses for each team. To read this spreadsheet into R we shall make use of
the”readxl” library. You may need to install the library: install.packages(“readxl”)
The following code shows how to read in a sheet within an excel file as a data frame. You will need to edit the “folder_path” variable to be the directory which contains your copy of the spreadsheet
library(readxl) # load the readxl library
folder_path <- "./"
#folder_path<-"C:/Users/" # set this to the name of the
# directory containing "HockeyLeague.xlsx"
file_name<-"HockeyLeague.xlsx" # set the file name
file_path<-paste(folder_path,file_name,sep="") # create the file_path
wins_data_frame<-read_excel(file_path,sheet="Wins") # read of a sheet
from an xl file
Inspect the first 3 rows of the first five columns:
wins_data_frame. %>%
select(1:5)%>%
head(3)
## # A tibble: 3 × 5
## ...1 `1990` `1991` `1992` `1993`
##
## 1 Ducks 30 of 50 11 of 50 30 of 50 12 of 50
## 2 Eagles 24 of 50 12 of 50 37 of 50 14 of 50
## 3 Hawks 20 of 50 22 of 50 33 of 50 11 of 50
A cell value of the form. “a of b” means that games were won out of a total of b for that season. For example, the element for the “Ducks” row of the “1990” column is “30 of 50” meaning that 30 out of 50 games were won that season.
Is this tidy data?
(Q1) Now apply your data wrangling skills to transform the ““wins_data_frame””
data frame object into a data frame called ““wins_tidy”” which contains the same
information but has just four columns entitled ““Team”“,”“Year”“,”“Wins”“,”“Total”“ .
The”“Team”” column should contain the team name, the ““Year”” column should
contain the year, the ““Wins”” column should contain the number of wins for that
season and the ““Total”” column the total number of games for that season. The first column should be of character type and the remaining columns should be of integer
type. You can do this by combining the following functions: “rename()”, “pivot_longer()”, “mutate()” and “separate()” .
You can check the shape of your data frame. and the first five rows as follows:
wins_tidy %>% dim() # check the dimensions
## [1] 248 4
wins_tidy%>%head(5) # inspect the top 5 rows
## # A tibble: 5 × 4
## Team Year Wins Total
##
## 1 Ducks 1990 30 50
## 2 Ducks 1991 11 50
## 3 Ducks 1992 30 50
## 4 Ducks 1993 12 50
## 5 Ducks 1994 24 50
(Q2) The ““HockeyLeague.xlsx”” also contains a sheet with the losses for each team by season. Apply a similar procedure to read the data from this sheet and transform that data into a data frame called ““losses_tidy”” with four columns:
““Team”“,”“Year”“,”“Losses”“,”“Total”” which are similar to those in the ““wins_tidy”” data frame except for the ““Losses”” column gives the number of losses for a given season and team, rather than the number of losses.
Your results should look like this:
losses_tidy %>% head(5)
## # A tibble: 5 × 4
## Team Year Losses Total
##
## 1 Ducks 1990 20 50
## 2 Ducks 1991 37 50
## 3 Ducks 1992 1 50
## 4 Ducks 1993 30 50
## 5 Ducks 1994 7 50
You may notice that the number of wins plus the number of losses for a given team, in a given year does not add up to the total. This is because some of the games are neither wins nor losses but draws. That is, for a given year the number of draws is equal to the total number of games minus the sum of the wins and losses.
(Q3) Now combine your two data frames, ““wins_tidy”” and ““losses_tidy”“, into a single data frame entitled”“hockey_df”” which has 248 rows and 9 columns: A
““Team”” column which gives the name of the team as a character, the ““Year””
column which gives the season year, the ““Wins”” column which gives the number of wins for that team in the given year, the “Losses” column which gives the number of losses for that team in the given year and the ““Draws”” column which gives the
number of draws for that team in the given year, the ““Wins_rt”” which gives the wins as a proportion of the total number of games (ie. “Wins/Total”) and similarly the ““Losses_rt”” and the ““Draws_rt”” which gives the losses and draws as a
proportion of the total, respectively. To do this you can make use of the “mutate()” function. You may also want to utilise the “across()” function for a slightly neater solution.
The top five rows of your data frame. should look as follows:
hockey_df %>% head(5)
## # A tibble: 5 × 9
## Team Year Wins Total Losses Draws Wins_rt Losses_rt Draws_rt
##
## 1 Ducks 1990 30 50 20 0 0.6 0.4 0
## 2 Ducks 1991 11 50 37 2 0.22 0.74 0.04
## 3 Ducks 1992 30 50 1 19 0.6 0.02 0.38
## 4 Ducks 1993 12 50 30 8 0.24 0.6 0.16
## 5 Ducks 1994 24 50 7 19 0.48 0.14 0.38
(Q4) To conclude this task generate a summary data frame which displays, for each team, the median win rate, the mean win rate, the median loss rate, the mean loss rate, the median draw rate and the mean draw rate. The number of rows in your
summary should equal the number of teams. These should be sorted in descending order or median win rate. You may want to make use of the following functions:
“select()”, “group_by()”, “across()”, “arrange()” .
## # A tibble: 8 × 7
## Team W_md W_mn L_md L_mn D_md D_mn
##
## 1 Eagles 0.45 0.437 0.25 0.279 0.317 0.284
## 2 Penguins 0.45 0.457 0.3 0.310 0.133 0.232
## 3 Hawks 0.417 0.388 0.233 0.246 0.32 0.366
## 4 Ducks 0.383 0.362 0.34 0.333 0.25 0.305
## 5 Owls 0.32 0.333 0.3 0.33 0.383 0.337
## 6 Ostriches 0.3 0.309 0.4 0.395 0.267 0.296
## 7 Storks 0.3 0.284 0.22 0.283 0.48 0.433
## 8 Kingfishers 0.233 0.245 0.34 0.360 0.4 0.395
1.3 Simulation experiments of probabilities
(Q1) The following code was used in last week’s assignment to compare a
theoretical probability with an estimated probability in sampling with replacement. Now that we have learnt iterations with the map functions, can you rewrite the code using “map()” (and its variants)?
num_red_balls<-3
num_blue_balls<-7
total_draws<-22
prob_red_spheres<-function(z){
total_balls<-num_red_balls+num_blue_balls
log_prob<-log(choose(total_draws,z))+
z*log(num_red_balls/total_balls)+(total_draws
z)*log(num_blue_balls/total_balls)
return(exp(log_prob))
}
itermap <- function(.x, .f) {
result <- list()
for (item in .x) { result <- c(result, list(.f(item))) }
return(result)
}
itermap_dbl <- function(.x, .f) {
result <- numeric(length(.x))
for (i in 1:length(.x)) { result[i] <- .f(.x[[i]]) }
return(result)
}
num_trials<-1000 # set the number of trials
set.seed(0) # set the random seed
num_reds_in_simulation <- data.frame(trial=1:num_trials) %>%
mutate(sample_balls = itermap(.x=trial, function(x){sample(10,22,
replace = TRUE)})) %>%
mutate(num_reds = itermap_dbl( .x=sample_balls, function(.x)
sum(.x<=3) ) ) %>%
pull(num_reds)
prob_by_num_reds <- data.frame(num_reds=seq(22)) %>%
mutate(TheoreticalProbability=prob_red_spheres(num_reds)) %>%
mutate(EstimatedProbability=
itermap_dbl(.x=num_reds, function(.x)
sum(num_reds_in_simulation==.x))/num_trials)
(Q2) Next, we can make use of “pivot_longer” and “ggplot” to create a plot to compare the theoretical probability with the estimated probability in the data frame “prob_by_num_reds”. Your plot should look similar to the one below.
prob_by_num_reds %>%
pivot_longer(cols=c("EstimatedProbability","TheoreticalProbability"),
names_to="Type",values_to="count") %>%
ggplot(aes(num_reds,count)) +
geom_line(aes(linetype=Type, color=Type)) +
geom_point(aes(color=Type)) +
scale_linetype_manual(values = c("solid", "dashed"))+
theme_bw() + xlab("Number of reds") + ylab("Probabilities")
Try to make sense of each line in the above code.