IEOR 242: Applications in Data Analysis, Spring 2021
Practice Midterm Exam 1
1 True/False and Multiple Choice Questions – 48 Points
Instructions: Please circle exactly one response for each of the following 12 questions. Each question is worth 4 points. There will be no partial credit for these questions.
1. Suppose that we train a classification model that has accuracy equal to 1 (i.e., perfect 100% accuracy) on the test set, and that the test set contains at least one positive observation and at least one negative observation. Then the TPR (true positive rate) of that model on the test set is also equal to 1.
A. True
B. False
2. Suppose that we train a classification model that has accuracy equal to 0.99 on the test set, and that the test set contains at least one positive observation and at least one negative observation. Then, without any other information, the most definitive statement we can make about the TPR (true positive rate) of that model on the test set is:
A. The TPR is equal to 0.99
B. The TPR is equal to 1
C. The TPR is at least 0.90
D. The TPR is between 0 and 1
3. Consider two linear regression models trained on the same training set. Model A uses 15 independent variables and has a training set R² value of 0.79. Model B uses 10 independent variables and has a training set R² value of 0.68. Then, when comparing the two models on the same test set, Model A must have a higher value of OSR² than Model B.
A. True
B. False
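For reference, OSR² is commonly computed like R², except that the errors are measured on the test set while the baseline prediction is the mean of the training set response. The sketch below assumes that definition; the function name `osr2` and the sample numbers are illustrative and do not come from the exam.

```python
def osr2(y_test, y_pred, train_mean):
    """Out-of-sample R^2: 1 - SSE/SST, with SST measured against the training mean."""
    # Sum of squared prediction errors on the test set.
    sse = sum((y - y_hat) ** 2 for y, y_hat in zip(y_test, y_pred))
    # Sum of squared errors of the baseline that always predicts the training mean.
    sst = sum((y - train_mean) ** 2 for y in y_test)
    return 1 - sse / sst


# Illustrative values only (not from the exam).
y_test = [1.0, 2.0, 3.0]
print(osr2(y_test, [1.0, 2.0, 3.0], train_mean=2.0))  # perfect predictions
print(osr2(y_test, [2.0, 2.0, 2.0], train_mean=2.0))  # identical to the baseline
```

Under this definition, a model that merely predicts the training mean scores 0 and a model with perfect test set predictions scores 1.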
4. The main purpose of bagging (bootstrap aggregating) is to estimate the out-of-sample error.
A. True
B. False
5. Boosting is inherently sequential since each new decision tree is trained in a way that uses information from the previously trained decision trees, whereas Random Forests is inherently parallelizable since each individual decision tree is trained independently of all the others.
A. True
B. False
6. In multiple linear regression (p > 1), it is possible for a subset of the independent variables to all have large VIF values and at the same time have somewhat small pairwise correlation values with each other.
A. True
B. False
7. Suppose that L_FN = 2 and L_FP = 1. Let p denote the probability that a given observation is a positive. Then, in order to minimize expected cost, an optimal policy is to assign an observation as a positive if and only if p is greater than 1/3.
A. True
B. False
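The expected-cost comparison behind this kind of question can be sketched as follows, assuming L_FN and L_FP denote the costs of a false negative and a false positive and that correct predictions cost nothing. The function names are hypothetical, and the loss values in the example are deliberately different from those in the question.

```python
from fractions import Fraction


def expected_costs(p, l_fn, l_fp):
    """Return (expected cost of predicting positive, expected cost of predicting negative)."""
    cost_if_positive = (1 - p) * l_fp  # wrong only when the truth is negative
    cost_if_negative = p * l_fn        # wrong only when the truth is positive
    return cost_if_positive, cost_if_negative


def break_even_probability(l_fn, l_fp):
    """Probability p at which both decisions have the same expected cost."""
    return Fraction(l_fp) / (Fraction(l_fp) + Fraction(l_fn))


# Hypothetical losses, chosen to differ from the question.
p_star = break_even_probability(l_fn=3, l_fp=1)
pos, neg = expected_costs(p_star, l_fn=3, l_fp=1)
print(p_star, pos == neg)
```

Above the break-even probability, predicting a positive has the lower expected cost; below it, predicting a negative does.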
8. The Random Forests method tends to produce many uncorrelated trees (which are then averaged together) since:
A. Each individual tree is trained on a fresh bootstrap sample of the training set
B. When training each individual tree, only a randomly selected subset of the features is considered at each split
C. Both (A) and (B) are true
D. Both (A) and (B) are false
9. Suppose that we have a dataset consisting of n = 2,342 observation vectors x_i. We are interested in constructing between five and ten different clusters to assign each observation to. If we use the K-means algorithm for this task, then to select the final number of clusters K:
A. We must run the K-means algorithm twice, with K = 5 and then with K = 10
B. We must run the K-means algorithm only once with K = 10
C. We must run the K-means algorithm six times with K = 5, 6, 7, 8, 9, 10
D. The K-means algorithm will automatically choose the number of clusters K for us
10. Consider the following ROC curve based on a logistic regression model for predicting lung cancer, where having lung cancer is a “positive outcome.” The baseline is also drawn for comparison. Suppose that a doctor would like to minimize the number of times that she tells a patient that they do not have lung cancer when they actually do. At the same time, the doctor is willing to incorrectly tell a patient that they have lung cancer, when they actually do not, at most 50% of the time. Then, which point on the ROC curve should the doctor use to determine the correct threshold value?
A. A
B. B
C. C
D. D
11. Suppose that, conditioned on Y = 1, X is normally distributed with mean 4 and variance 1. Similarly, conditioned on Y = 2, X is normally distributed with mean 5.5 and variance 1. Now, given a new observation X = x, we are interested in predicting whether Y = 1 or Y = 2. A threshold value of 4.3 is chosen, so that we predict Y = 2 if x ≥ 4.3 and Y = 1 if x < 4.3. This is represented pictorially in Figure 1.
Figure 1
Figure 2 defines five different shaded regions within Figure 1, and the letters also denote their respective areas.
Figure 2
Suppose that Y = 1 corresponds to a positive outcome. Then the FPR (false positive rate) is equal to:
A. (C + D)/(A + B + C + D)
B. (C + D)/(A + B)
C. B/(D + E)
D. B/(B + D + E)
12. Figure 3 shows a time series plot for daily bike rentals in Washington DC’s Capital Bikeshare system.
Based on this plot, which of the following time series modeling methodologies is most appropriate?
A. A model with seasonality variables
B. An autoregressive model
C. A linear trend model
D. A model that incorporates all of the above
Figure 3: Time series plot of daily bike rentals (y-axis: total_rentals, ticks from 2000 to 8000; x-axis: Date, ticks from 2011-01 to 2012-07).
2 Short Answer Questions – 52 Points
Instructions: Please provide justification and/or show your work for all questions, but please try to keep your responses brief. Your grade will depend on the clarity of your answers, the reasoning you have used, as well as the correctness of your answers.
The following questions are based on data from YourCabs.com, an online platform for matching the supply and demand for taxi cabs in Bangalore, India. Riders make booking requests on the YourCabs platform, and cab drivers are independent contractors who are linked to the riders via the YourCabs platform. Occasionally a matched driver may cancel a booked trip before the scheduled pick-up time. Often, these cancellations occur at the last minute before the scheduled pick-up time, or in fact the cancellation is more aptly a “no-show” on the part of the driver. YourCabs would like to examine the use of machine learning models for predicting whether or not booking requests will ultimately result in a cancellation by the driver. YourCabs has collected data concerning 3,375 booking requests that occurred in a particular area in Bangalore during 2013, and this data is summarized in Table 1.
Table 1: Description of the dataset.
Variable            Description
VehicleModelId      Encodes the type of the driver’s vehicle (one of 14 possible values)
OnlineBooking       1 if the booking was made on the regular website, 0 if not
MobileSiteBooking   1 if the booking was made on the mobile version of the website, 0 if not
BookingDateTime     Date and time that the booking was made (stored as a timestamp string such as “1/3/2013 19:13”)
TripDateTime        Scheduled start date and time of the trip (stored as a timestamp string such as “1/3/2013 19:13”)
Cancellation        1 if this booking request resulted in a cancellation by the driver, and 0 if not
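Because BookingDateTime and TripDateTime are stored as strings, they would typically be parsed before any modeling. The sketch below shows one way to do that and to derive the lead time between booking and trip. It assumes a month/day/year ordering, which Table 1 does not specify; the helper names and the second timestamp in the example are hypothetical.

```python
from datetime import datetime

# Assumption: timestamps such as "1/3/2013 19:13" are month/day/year.
# If the data is day/month/year, use "%d/%m/%Y %H:%M" instead.
TIMESTAMP_FORMAT = "%m/%d/%Y %H:%M"


def parse_timestamp(text):
    """Convert a timestamp string from the dataset into a datetime object."""
    return datetime.strptime(text, TIMESTAMP_FORMAT)


def lead_time_hours(booking_text, trip_text):
    """Hours between the time of booking and the scheduled start of the trip."""
    delta = parse_timestamp(trip_text) - parse_timestamp(booking_text)
    return delta.total_seconds() / 3600


# The first string is the example from Table 1; the second is made up.
print(lead_time_hours("1/3/2013 19:13", "1/4/2013 7:13"))
```

A derived quantity such as this lead time is one example of a numeric feature that could be built from the two raw timestamp columns.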
1. (6 points) The dataset was randomly split into a training set and a test set, with 2,362 (about 70%) of the observations placed in the training set and 1,013 (about 30%) of the observations placed in the test set. Of the 2,362 total observations in the training set, only 80 observations correspond to cancellations while the remaining 2,282 observations were not cancellations. Of the 1,013 total observations in the test set, only 35 observations correspond to cancellations while the remaining 978 observations were not cancellations.
(a) (3 points) Consider a baseline model that does not use any features at all. What is the appropriate baseline model for this dataset?
(b) (3 points) What is the accuracy of the baseline model you selected in part (a) on the test set? What is its TPR (true positive rate) on the test set? What is its FPR (false positive rate) on the test set?
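As a reminder of how the three metrics in part (b) are defined, the sketch below computes accuracy, TPR and FPR from the four confusion matrix counts. The function name and the counts in the example are hypothetical and are not taken from the YourCabs data.

```python
def classification_rates(tp, fp, tn, fn):
    """Compute accuracy, TPR and FPR from confusion matrix counts."""
    total = tp + fp + tn + fn
    return {
        "accuracy": (tp + tn) / total,  # share of all observations predicted correctly
        "tpr": tp / (tp + fn),          # share of actual positives predicted positive
        "fpr": fp / (fp + tn),          # share of actual negatives predicted positive
    }


# Hypothetical counts for illustration only.
rates = classification_rates(tp=8, fp=5, tn=85, fn=2)
print(rates)
```

The function requires at least one actual positive and one actual negative, since otherwise the TPR or FPR denominator is zero.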
2. (8 points) YourCabs has recently begun exploring the idea of reassigning booking requests that are likely to result in cancellations to more “reliable” drivers. Towards this end, YourCabs has compiled a curated list of such “reliable” drivers, and, on average, drivers on this list cancel bookings only 0.1% of the time. However, there is a cost to reassigning booking requests. Namely, the original driver may be upset that the request was lost and may end up leaving the platform forever with a certain probability. YourCabs estimates that the average cost per reassignment is $10 (USD for simplicity). Naturally there is also a cost associated with each cancellation, due to the lost revenue that is incurred if the rider stops using the platform. YourCabs estimates that this average cost per cancellation is $100. Finally, YourCabs estimates that the average profit per successful ride is $25.
A decision tree capturing this analysis is shown in Figure 4. Determine a threshold value, p_thresh, such that it is optimal (with regard to expected profit) to reassign a booking to a more “reliable” driver if and only if the probability of cancellation p exceeds p_thresh.
Figure 4: Decision tree for possibly reassigning a booking to a more “reliable” driver. The leaf nodes represent profit values and p represents the probability that the booking request would result in a cancellation by the original driver.