DSCI550: Data Science at Scale
Homework 3, Spring 2024
SHOW EACH STEP OF COMPUTATION.
1. (25 pts) (Decision Tree) Using the following training dataset, construct a decision tree using Information Gain and Entropy as discussed in class. Use attributes V1, V2, V3, and V4 to predict the class C.
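For reference, entropy and information gain can be computed as in the sketch below. The table here is a hypothetical two-attribute toy example, not the assignment's (V1..V4, C) table; substitute the real rows and labels.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy of a list of class labels, in bits."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def information_gain(rows, labels, attr_index):
    """Entropy of the parent node minus the weighted entropy of the
    subsets induced by splitting on the attribute at attr_index."""
    n = len(labels)
    splits = {}
    for row, label in zip(rows, labels):
        splits.setdefault(row[attr_index], []).append(label)
    remainder = sum(len(s) / n * entropy(s) for s in splits.values())
    return entropy(labels) - remainder

# Hypothetical toy rows standing in for the assignment's table.
rows = [("T", "T"), ("T", "F"), ("F", "T"), ("F", "F")]
labels = ["yes", "yes", "no", "no"]
print(entropy(labels))                    # 1.0 bit for a 50/50 split
print(information_gain(rows, labels, 0))  # 1.0: attribute 0 alone determines C here
```

The attribute with the highest information gain becomes the root; recurse on each branch with the remaining attributes.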
2. (20 pts) (Naïve Bayes Classifier) We have data on 1000 patients. Each was diagnosed with Flu, Allergy, or Other Disease based on three symptoms, as shown. This is our 'training set,' which we will use to predict the diagnosis of any new patient we encounter.
A new patient presents with "High Fever, No Sneezing, and Runny Nose." Is this Flu, Allergy, or Other? Use the Naïve Bayes classifier.
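The classifier scores each class by its prior times the product of the per-symptom likelihoods, then picks the largest score. The probabilities below are hypothetical placeholders; the real priors and likelihoods come from the counts in the assignment's patient table.

```python
import math

# Hypothetical values standing in for the assignment's counts:
# priors[c] = P(c); like[c][s] = P(symptom s | class c).
priors = {"Flu": 0.4, "Allergy": 0.3, "Other": 0.3}
like = {
    "Flu":     {"high_fever": 0.9, "no_sneezing": 0.7, "runny_nose": 0.6},
    "Allergy": {"high_fever": 0.1, "no_sneezing": 0.2, "runny_nose": 0.8},
    "Other":   {"high_fever": 0.5, "no_sneezing": 0.6, "runny_nose": 0.3},
}

def posterior_scores(symptoms):
    """Unnormalized P(class | symptoms) under the naive
    conditional-independence assumption."""
    return {c: priors[c] * math.prod(like[c][s] for s in symptoms)
            for c in priors}

scores = posterior_scores(["high_fever", "no_sneezing", "runny_nose"])
prediction = max(scores, key=scores.get)
```

The scores need not be normalized to compare classes, since the evidence term P(symptoms) is the same for all of them.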
3. (15 pts) Regression: A company is investigating the relationship between its advertising expenditures and the sales of its products. The following data represent a sample of 10 products. Note that AD = advertising dollars in thousands ($K) and S = sales in thousands of dollars.
1) (5 pts) Find the equation of the regression line, using Advertising dollars as the independent variable and Sales as the response variable.
2) (3 pts) Plot the scatter diagram and the regression line.
3) (5 pts) Find r² and interpret it in the context of the problem.
4) (2 pts) Use the line to predict the Sales if Advertising dollars = $50 K.
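The least-squares line, r², and the prediction can all be computed from the sums of squared deviations. The four (AD, S) pairs below are hypothetical stand-ins for the assignment's 10-product sample.

```python
# Hypothetical (AD, S) pairs; replace with the assignment's 10 products.
ad = [10, 20, 30, 40]
s  = [25, 45, 65, 85]

n = len(ad)
mean_ad, mean_s = sum(ad) / n, sum(s) / n
sxy = sum((x - mean_ad) * (y - mean_s) for x, y in zip(ad, s))  # S_xy
sxx = sum((x - mean_ad) ** 2 for x in ad)                       # S_xx
syy = sum((y - mean_s) ** 2 for y in s)                         # S_yy

b1 = sxy / sxx              # slope
b0 = mean_s - b1 * mean_ad  # intercept, so the line is S = b0 + b1 * AD
r2 = sxy ** 2 / (sxx * syy) # coefficient of determination

def predict(x):
    """Predicted sales (in $K) for advertising spend x (in $K)."""
    return b0 + b1 * x
```

r² gives the fraction of the variation in Sales explained by the linear relationship with Advertising dollars; `predict(50)` answers part 4).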
4. (20 pts) (Hierarchical Clustering) Five, two-dimensional data points are shown below with their distance matrix, i.e., the symmetric matrix that gives the pairwise distance between any two points.
Use the distance matrix to perform the following two types of hierarchical clustering: MIN (single-linkage) and MAX (complete-linkage) distance. Show your results by drawing a dendrogram. Note: the dendrogram should clearly show the order in which the points are merged.
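Both variants follow the same agglomerative loop and differ only in how cluster-to-cluster distance is defined: the minimum (MIN) or maximum (MAX) pairwise distance. A minimal sketch, using a hypothetical four-point distance matrix rather than the assignment's five-point one:

```python
# Hypothetical symmetric distance matrix for points p0..p3.
D = {
    (0, 1): 1.0, (0, 2): 4.0, (0, 3): 5.0,
    (1, 2): 2.0, (1, 3): 6.0, (2, 3): 3.0,
}
dist = lambda a, b: D[(min(a, b), max(a, b))]

def agglomerate(points, linkage):
    """Repeatedly merge the two closest clusters; linkage is the builtin
    min (MIN / single linkage) or max (MAX / complete linkage).
    Returns the merge order, which is what the dendrogram must show."""
    clusters = [frozenset([p]) for p in points]
    merges = []
    while len(clusters) > 1:
        i, j = min(
            ((i, j) for i in range(len(clusters))
                    for j in range(i + 1, len(clusters))),
            key=lambda ij: linkage(dist(a, b)
                                   for a in clusters[ij[0]]
                                   for b in clusters[ij[1]]),
        )
        merges.append((clusters[i], clusters[j]))
        merged = clusters[i] | clusters[j]
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)]
        clusters.append(merged)
    return merges

single_order = agglomerate(range(4), min)    # MIN
complete_order = agglomerate(range(4), max)  # MAX
```

Note how the two orders diverge after the first merge: with this matrix, MIN next attaches p2 to {p0, p1} (distance 2), while MAX instead pairs p2 with p3 (distance 3).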
5. (20 pts) k-Means Clustering: For the following six points,
1) Use the k-means algorithm to show the final clustering result, assuming A1 and A6 are initially assigned as the centers of the two clusters.
2) Use the k-means algorithm to show the final clustering result, assuming A3 and A4 are initially assigned as the centers of the two clusters.
3) Compute the quality of the k-means clustering using the Sum of Squared Error (SSE), a cohesion measure of how near the data points in a cluster are to the cluster centroid. Given a set of observations (x1, x2, …, xn), where each observation is a d-dimensional real vector, k-means clustering aims to partition the n observations into k (≤ n) sets S = {S1, S2, …, Sk} so as to minimize the intra-cluster sum of squares:

SSE = Σ_{i=1}^{k} Σ_{x ∈ S_i} ||x − μ_i||²,

where μ_i is the mean of the points in S_i.
Based on the SSE values from 1) and 2), which clustering is better?
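Lloyd's iteration and the SSE formula above can be sketched as follows. The six coordinates are hypothetical placeholders for the assignment's points A1..A6; rerun with each initialization from parts 1) and 2) and compare the resulting SSE values (lower is better).

```python
# Hypothetical 2-D coordinates standing in for the assignment's A1..A6.
points = {"A1": (1, 1), "A2": (2, 1), "A3": (4, 3),
          "A4": (5, 4), "A5": (1, 2), "A6": (5, 3)}

def kmeans(points, init_centers):
    """Standard Lloyd iteration: assign each point to its nearest center,
    recompute centers as cluster means, stop when centers are stable.
    Assumes no cluster ever becomes empty (true for this toy data)."""
    centers = list(init_centers)
    while True:
        clusters = [[] for _ in centers]
        for name, p in points.items():
            i = min(range(len(centers)),
                    key=lambda i: sum((a - b) ** 2
                                      for a, b in zip(p, centers[i])))
            clusters[i].append(name)
        new_centers = [
            tuple(sum(points[n][d] for n in c) / len(c) for d in range(2))
            for c in clusters
        ]
        if new_centers == centers:
            return clusters, centers
        centers = new_centers

def sse(points, clusters, centers):
    """SSE = sum over clusters i of sum over x in S_i of ||x - mu_i||^2."""
    return sum(
        sum((a - b) ** 2 for a, b in zip(points[n], c))
        for cluster, c in zip(clusters, centers) for n in cluster
    )

clusters, centers = kmeans(points, [points["A1"], points["A6"]])
total_sse = sse(points, clusters, centers)
```

Running the same code with `[points["A3"], points["A4"]]` as the initial centers answers part 2); the initialization with the smaller final SSE gives the better clustering.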