STATS 2DA3 Fall 2024
ASSIGNMENT 3
Submit through Crowdmark.
Due before 5pm on Tuesday, October 29th.
1. (3 MARKS) Consider the two prediction tables below. The dataset consists of 400 observations: 200 observations from Class 1 and 200 observations from Class 2 (the “True” classes). The predicted group memberships (predicted classes) are called A, B and C.
Table 1
           A    B    C
Class 1   40   60  100
Class 2  200    0    0
Table 2
           A    B    C
Class 1   50   50  100
Class 2  200    0    0
(a) Which table gives the better clustering result, and why? (Hint: think ARI or node purity.)
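As a reminder of how the ARI is obtained from a cross-tabulation of true classes against predicted groups, here is a minimal Python sketch. The function name `adjusted_rand_index` and the small 2 x 2 tables are made up for illustration and are not the tables in this question. The sketch does not handle the degenerate case where the denominator is zero (for example, all observations in a single group).

```python
from math import comb

def adjusted_rand_index(table):
    """ARI from a contingency table (rows: true classes, columns: predicted groups)."""
    n = sum(sum(row) for row in table)
    # Pairs of observations that share both a row and a column.
    sum_cells = sum(comb(v, 2) for row in table for v in row)
    # Pairs that share a row (same true class).
    sum_rows = sum(comb(sum(row), 2) for row in table)
    # Pairs that share a column (same predicted group).
    sum_cols = sum(comb(sum(col), 2) for col in zip(*table))
    expected = sum_rows * sum_cols / comb(n, 2)
    max_index = 0.5 * (sum_rows + sum_cols)
    return (sum_cells - expected) / (max_index - expected)

# Hypothetical tables, for illustration only.
perfect = [[5, 0], [0, 5]]   # every class maps to exactly one group
mixed = [[3, 2], [2, 3]]     # groups barely separate the classes
print(adjusted_rand_index(perfect))
print(adjusted_rand_index(mixed))
```

A table whose groups line up exactly with the true classes scores 1, while a table no better than random assignment scores near 0 and can go slightly negative.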
2. (2 MARKS)
(a) Consider the classification table below. What is the misclassification rate, correct to the nearest percent?
              Predicted
            1    2    3
Actual 1   33    0    0
       2    5   25    4
       3    5    0   17
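The misclassification rate is the proportion of observations that fall off the main diagonal of the classification table. A minimal Python sketch of that calculation follows. The function name `misclassification_rate` and the 2 x 2 table are hypothetical, chosen only for illustration, and the sketch assumes a square table with rows and columns in the same class order.

```python
def misclassification_rate(table):
    """Proportion of off-diagonal counts in a square classification table."""
    total = sum(sum(row) for row in table)
    # Diagonal entries are the correctly classified observations.
    correct = sum(table[i][i] for i in range(len(table)))
    return (total - correct) / total

# Hypothetical table (rows: actual, columns: predicted), for illustration only.
example = [[8, 2],
           [1, 9]]
rate = misclassification_rate(example)
print(f"{rate:.0%}")  # 3 of 20 observations are misclassified
```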
3. (13 MARKS) Using the fgl dataset from the MASS library, complete the following tasks: (Note: you may need to load additional libraries to answer the questions.)
(a) “Explore” your data using the head and str commands. How many observations are in the dataset?
(b) Set the seed to “1” and create a training set containing 120 observations selected at random. (The remaining observations should be used as your test set in future steps.)
(c) Build a Linear Discriminant Analysis (LDA) model using the training set, with the goal of predicting type, using all other predictor variables.
(d) Predict the classes of the observations in your test set using your model, i.e. apply your model to the test set.
(e) Produce a classification table of your results.
(f) What is the ARI for this classification result, correct to 4 decimal places?
(g) What is the misclassification rate, correct to the nearest percent?
4. (2 MARKS) Consider the image below, which illustrates hierarchical clustering applied to a dataset.
(a) Which dendrogram shows the most evidence of chaining?