CIS 5450 Homework 4: Machine Learning
Due Date: November 15th at 10:00PM EST, 103 points total (= 85 autograded + 18 manually graded).
Imports/Setup
Run the following cells to set up the notebook.
Before you begin:
· Be sure to click "Copy to Drive" so that you're working on your own personal version of the homework
· Check the pinned FAQ post on Ed for updates! TAs work really hard to keep it updated with everything you might need to know or anything we might have failed to specify. Writing these HWs and test cases gets tricky, since students often implement solutions that we did not anticipate and thus could not have prepared the grader for.
· WARNING: You MUST check that your notebook displays ALL visualizations on the Gradescope preview AND verify that the autograder finishes running and gives you your expected score (not a 0). (Ed #251 (https://edstem.org/us/courses/44790/discussion/3426442)).
Penalty: -10: If we have to resubmit your notebook to Gradescope for you after the deadline (e.g. not naming your files correctly, not submitting .py and .ipynb , etc.).
Penalty: -5: If your notebook fails to show up in the Gradescope preview of your .ipynb (e.g. Large File Hidden Error ). If you experience this issue, please try to remove print outputs and non-plot images from the notebook.
Note: We will be manually checking your implementations and code for certain problems. If you incorrectly implemented a procedure using Scikit-learn and/or MLlib (e.g. creating predictions on the training dataset, incorrectly processing training data prior to running certain machine learning models, hardcoding values, etc.), we will be enforcing a penalty system up to the maximum value of points allocated to the problem (e.g. if your problem is worth 4 points, the maximum number of points that can be deducted is 4 points).
Note: If your plot is not run or not present after we open your notebook, we will deduct the entire manually graded point value of the plot. (e.g. if your plot is worth 4 points, we will deduct 4 points).
Note: If your .py file is hidden because it's too large, that's ok! We only care about your .ipynb file.
Please make sure you enter your 8-digit Penn ID in the student ID field below.
In [ ]:
%%capture
!pip install penngrader-client
from penngrader.grader import *
# PLEASE ENSURE YOUR PENN-ID IS ENTERED CORRECTLY. IF NOT, THE AUTOGRADER
# WON'T KNOW WHO TO ASSIGN POINTS TO IN OUR BACKEND
STUDENT_ID = # YOUR PENN-ID GOES HERE AS AN INTEGER
In [ ]:
%%writefile notebook-config.yaml
grader_api_url: 'https://23whrwph9h.execute-api.us-east-1.amazonaws.com/default/Grader23'
grader_api_key: 'flfkE736fA6Z8GxMDJe2q8Kfk8UDqjsG3GVqOFOa'
Writing notebook-config.yaml
In [ ]:
grader = PennGrader('notebook-config.yaml', 'cis5450_fall24_HW4', STUDENT_ID, STUDENT_ID)
Part 0: Set up GPU capabilities (1 point)
The cell below configures a CUDA device for use with PyTorch, if available. Remember to enable the GPU in Colab:
Go to Runtime -> Change runtime type -> GPU.
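For reference, a minimal sketch of the standard device-selection pattern (assuming PyTorch is available, as it is in Colab):
In [ ]:
import torch

# Prefer the GPU when one is available; otherwise fall back to the CPU
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
print(f'Using device: {device}')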
Part I: Preprocessing and Modeling in scikit-learn (45 points)
1.1 Data Loading and Preprocessing [0 Points]
1.1.1 Read and Load Data
We are using a CSV for this part, winequalityN.csv from a Kaggle dataset (https://www.kaggle.com/datasets/dataregress/rajyellow46/wine-quality) . The dataset contains 13 columns and over 6000 wine entries.
To get the data in here:
1. Go to this Kaggle link (https://www.kaggle.com) and create a Kaggle account (unless you already have one)
2. Go to Account and click on "Create New API Token" to get the API key in the form of a json file kaggle.json
3. Upload the kaggle.json file to the default location in your Google Drive (Please DO NOT upload the json file into any specific folder as it will be difficult for us to debug issues if you deviate from these instructions!).
This can be helpful for your project if you decide to use Kaggle for your final project or for future projects!
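A sketch of one way to wire this up in Colab. The Drive path, dataset slug, and zip file name below are assumptions inferred from the instructions and the Kaggle link above, not the official setup cell:
In [ ]:
from google.colab import drive
import pandas as pd

drive.mount('/content/drive')

# Assumed: kaggle.json sits at the top level of "My Drive"
!mkdir -p ~/.kaggle
!cp /content/drive/MyDrive/kaggle.json ~/.kaggle/
!chmod 600 ~/.kaggle/kaggle.json

# Assumed dataset slug and zip name, inferred from the Kaggle link above
!kaggle datasets download -d rajyellow46/wine-quality
!unzip -o wine-quality.zip

wine_quality_df = pd.read_csv('winequalityN.csv')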
1.1.2 Understanding Data
A good practice before approaching any data science problem is to understand the data you will be working with. This can be through descriptive statistics, datatypes, or just a quick tabular visualization. We will walk through such tasks using Pandas.
Let's also verify whether there are any null values in our dataset. We will remove all null rows in wine_quality_df .
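For example, the null check and removal might look like this (a sketch, assuming wine_quality_df is already loaded):
In [ ]:
# Count nulls per column, then drop every row that contains a null
print(wine_quality_df.isnull().sum())
wine_quality_df = wine_quality_df.dropna()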
1.2 EDA [subtotal 14 points]
Exploratory Data Analysis (EDA) is an approach to analyzing datasets to summarize their main characteristics, often with visual methods. A statistical model can be used or not, but primarily EDA is for seeing what the data can tell us beyond the formal modeling or hypothesis testing task.
1.2.1 Visualization [10 points]
(a) Quality Distribution [4 Points]
Task: Find the distribution of the quality of wine in our dataset. The values should be integers in the range 3-9, so we're expecting one bar per quality. You are required to use the Seaborn library for this problem to create a countplot (https://seaborn.pydata.org/generated/seaborn.countplot.html) .
Requirements:
You should use wine_quality_df for this problem. Your plot must:
· Be of size (8,6) and use palette = 'viridis' (ignore the deprecation warning).
· Have appropriate titles and labels.
· Be clearly legible and should not have overlapping text or bars.
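A minimal sketch that satisfies these requirements (one possible implementation, not the only valid one):
In [ ]:
import matplotlib.pyplot as plt
import seaborn as sns

plt.figure(figsize=(8, 6))
sns.countplot(data=wine_quality_df, x='quality', palette='viridis')
plt.title('Distribution of Wine Quality')
plt.xlabel('Quality')
plt.ylabel('Count')
plt.show()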
(b) 3D Scatterplot [6 Points]
Task: We want to examine the relationship between three variables: alcohol , pH , and density . We also want to examine quality as well. You are required to use the Matplotlib library for this problem to create a 3D Scatterplot (https://matplotlib.org/stable/gallery/mplot3d/scatter3d.html) .
Requirements:
You should use wine_quality_df for this problem. Your plot must:
· Be of size (6,6).
· Have each data point colored by quality . The color mapping should be: 1-5 is red and 6-10 is green.
· Have alcohol content on the x-axis, pH level on the y-axis, and density on the z-axis.
· Have appropriate titles, axes labels, and a legend.
· Be clearly legible and should not have overlapping text or bars.
Very Helpful Resources:
· 3D Scatter Plotting in Python using Matplotlib (https://www.geeksforgeeks.org/3d-scatter-plotting-in-python-using-matplotlib/)
· List of named colors (https://matplotlib.org/stable/gallery/color/named_colors.html)
Since the dataset is large, we'll first sample 200 rows (roughly 3%) from our wine_quality_df using the random seed 42 and save the sampled dataframe into sample_wine_quality_df .
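A sketch of the sampling step and one way to build the 3D scatterplot (plotting the two quality groups separately is our own choice; it makes the legend straightforward):
In [ ]:
import matplotlib.pyplot as plt

# Sample 200 rows with the required seed
sample_wine_quality_df = wine_quality_df.sample(n=200, random_state=42)

fig = plt.figure(figsize=(6, 6))
ax = fig.add_subplot(projection='3d')

# Split by quality so each group gets its own color and legend entry
low = sample_wine_quality_df[sample_wine_quality_df['quality'] <= 5]
high = sample_wine_quality_df[sample_wine_quality_df['quality'] >= 6]
ax.scatter(low['alcohol'], low['pH'], low['density'], c='red', label='Quality 1-5')
ax.scatter(high['alcohol'], high['pH'], high['density'], c='green', label='Quality 6-10')

ax.set_xlabel('Alcohol')
ax.set_ylabel('pH')
ax.set_zlabel('Density')
ax.set_title('Alcohol vs. pH vs. Density, Colored by Quality')
ax.legend()
plt.show()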
1.2.2 Correlation of Feature Variables [4 Points]
With multiple features, it can be exhausting to do bivariate analysis on every possible pair of features. While you certainly should examine pairs of interest, your first instinct should be to check the correlation between features, since certain models (e.g. Linear Regression) won't work well if we have strong multicollinearity.
Before finding our correlation matrix, we should filter out categorical features. Although quality is technically a categorical feature, we'll keep that column for now (and encode it later down the line). Drop any other categorical features and save this new dataframe into num_df .
Correlation Heatmap
Task: Create a correlation matrix using num_df and call it corr_mat . Using the correlation matrix, generate a correlation heatmap for these numeric features. You are required to use Seaborn library to create this heatmap (https://seaborn.pydata.org/generated/seaborn.heatmap.html) .
Make sure your correlation heatmap meets the following criteria:
· Ensure that your heatmap is sized (8,8): all feature labels should be visible on both the $x$-axis and $y$-axis
· Use the RdBu color map to ensure that negative correlations are red and positive correlations are blue
· Standardize the color scale so that -1 takes the darkest red color, 0 is totally white, and +1 takes the darkest blue color
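A sketch meeting the three criteria (the cell annotations are an optional extra, not required above):
In [ ]:
import matplotlib.pyplot as plt
import seaborn as sns

# Correlation matrix over the numeric features (quality kept for now)
corr_mat = num_df.corr()

plt.figure(figsize=(8, 8))
# vmin/vmax pin the color scale so -1 is darkest red, 0 white, +1 darkest blue
sns.heatmap(corr_mat, cmap='RdBu', vmin=-1, vmax=1, annot=True, fmt='.2f')
plt.title('Correlation Heatmap of Numeric Features')
plt.show()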
(2 manually graded points)
As an added exercise, based on the correlation matrix above, write down what you believe to be the two most highly correlated pairs of features (by magnitude), and briefly explain the numerical intuition of what that correlation means. Note that you don't need any scientific explanation for why those variables are correlated.
Pair #1:
Pair #2:
1.3 Feature Encoding [subtotal 8 points]
1.3.1 Encoding Wine Type [4 Points]
Encoding is a process by which categorical variables are converted into a form that can be provided to ML algorithms to do a better job in prediction.
Task:
· You should use wine_quality_df for this problem.
· Let's first determine the number of unique values for the type column and save that value in a new constant NUM_UNIQUE_TYPES
Since there are two unique values for wine type (red and white), let's write a helper function to convert a string wine type to an integer. As an example, the function can convert a wine type like "white" to 0 and "red" to 1.
Now, let's make a copy of wine_quality_df into encoded_wine_quality_df and encode the type column to numerical values, and rename the type column to red_wine .
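A sketch of the whole encoding step (the helper-function name is our own; any equivalent mapping works):
In [ ]:
# Number of unique wine types (expected: 2)
NUM_UNIQUE_TYPES = wine_quality_df['type'].nunique()

def encode_type(wine_type):
    """Map 'white' to 0 and 'red' to 1."""
    return 1 if wine_type == 'red' else 0

encoded_wine_quality_df = wine_quality_df.copy()
encoded_wine_quality_df['type'] = encoded_wine_quality_df['type'].apply(encode_type)
encoded_wine_quality_df = encoded_wine_quality_df.rename(columns={'type': 'red_wine'})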
1.3.2 Encode Classes in 'Quality' Column [4 Points]
Task: We will be predicting the quality for our classification problem. We first want to transform our target into numerical values. Map the classes in the quality column in the following way:
· 0-5: 0
· 6-10: 1
This encoding represents low and high quality respectively. These will be the two classes we will try to predict using the other features about the wine. You should use encoded_wine_quality_df for this problem. Save your results in encoded_wine_quality_df .
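One concise way to express this mapping (a sketch; an explicit dict or apply works too):
In [ ]:
# Quality 0-5 becomes class 0 (low); quality 6-10 becomes class 1 (high)
encoded_wine_quality_df['quality'] = (encoded_wine_quality_df['quality'] >= 6).astype(int)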
1.4 Random Forest Classification (sklearn) [23 points]
1.4.1 Preprocessing: Create Features and Target and Split Data into Train and Test [4 Points]
Now that we have explored and cleaned our dataset, let's prepare it for a machine learning task. In this homework, you will work with various models and attempt to predict the quality of the wine.
The features will be all the variables in the dataset except quality , which will act as the label for our problem. First, store these two as features (pd.DataFrame) and target (pd.Series), respectively.
Now, use Scikit-learn's train_test_split (https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.train_test_split.html) function to split data
for classification into training and testing sets. The split should be 80-20 meaning 80% for training and the rest for testing.
IMPORTANT: Please set the seed variable to 42, pass random_state = seed to the split, and store the resulting splits as X_train , X_test , y_train , and y_test .
If you want to understand the purpose of seed, please feel free read over this concise yet thorough explanation on StackOverflow (https://stackoverflow.com/questions/21494489/what-does-numpy-random-seed0-do).
Let's also use a StandardScaler to standardize the set of X values. Make sure that there's no data leakage: the scaler should be fit ONLY on the training data. Name the results X_train_scaled and X_test_scaled .
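A sketch of the full preprocessing step (assuming encoded_wine_quality_df from Section 1.3):
In [ ]:
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

features = encoded_wine_quality_df.drop(columns=['quality'])
target = encoded_wine_quality_df['quality']

seed = 42
X_train, X_test, y_train, y_test = train_test_split(
    features, target, test_size=0.2, random_state=seed)

# Fit the scaler on the training split only to avoid data leakage,
# then apply the same transform to the test split
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)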
1.4.2 Random Forest Classification without Grid Search [4 points]
Raw Random Forest Classifier
Fit a Random Forest classifier on the X_train and y_train with the hyperparameters provided below. Calculate the accuracy of the model on the test set using the score method and store it in a variable named rf_acc . We're later going to use grid search to tune the hyperparameters, but for now, let's use the parameters below.
Task:
· Read the Scikit-learn documentation (https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestClassifier.html) for Random Forest Classifier.
· For hyperparameters, set:
class_weight = 'balanced'
· Train the random forest classifier model and evaluate it using the score method.
· Save your score in a variable rf_acc .
· Use the scaled X data for all remaining sections
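A sketch of the fit-and-score step. Per the note above we assume the scaled features; only class_weight is pinned, so every other hyperparameter stays at its sklearn default:
In [ ]:
from sklearn.ensemble import RandomForestClassifier

# Only class_weight is specified by the handout; everything else is default
rf = RandomForestClassifier(class_weight='balanced')
rf.fit(X_train_scaled, y_train)
rf_acc = rf.score(X_test_scaled, y_test)
print(rf_acc)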
1.4.3 Random Forest Classification with Grid Search and Cross Validation [15 points]
Now, we're interested in tuning the hyperparameters of the random forest model to see if we can achieve a higher test score. We will be using sklearn's GridSearchCV utility to do this. After we define a set of parameters and their candidate values, grid search will try every combination of those parameters, evaluating each with cross validation on the training data. To learn more about GridSearchCV , we've attached the documentation here (https://scikit-learn.org/dev/modules/generated/sklearn.model_selection.GridSearchCV.html)
Complete the following:
1. First, let's define the parameter grid. We're interested in tuning the following set of hyperparameters with their corresponding ranges of values. Name the parameter grid param_grid .
· class_weight : ['balanced']
· random_state : [42]
2. Second, instantiate our random forest model in the variable random_forest_model with default initialization.
3. Then, define the GridSearchCV object, using the estimator random_forest_model , the param grid defined above, cv (cross validation) set to 5, scoring set to 'accuracy' , and verbose set to True .
4. Then, fit the model and print out the best parameters and cross-validation score.
5. Finally, we'll use our best model and evaluate it against our test data. Save the accuracy into test_accuracy .
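A sketch of the grid-search pipeline. The param grid below includes only the two entries spelled out above; the full assignment grid may list additional hyperparameters (e.g. n_estimators or max_depth):
In [ ]:
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

param_grid = {
    'class_weight': ['balanced'],
    'random_state': [42],
    # ...additional hyperparameter ranges would go here
}

random_forest_model = RandomForestClassifier()

grid_search = GridSearchCV(estimator=random_forest_model, param_grid=param_grid,
                           cv=5, scoring='accuracy', verbose=True)
grid_search.fit(X_train_scaled, y_train)

print(grid_search.best_params_)
print(grid_search.best_score_)

# Evaluate the best model on the held-out test data
test_accuracy = grid_search.best_estimator_.score(X_test_scaled, y_test)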
1.4.4 Random Forest Feature Importance [5 points]
Now, let's find the relative feature importance for predicting the quality of wine. Use the best model from above, and create a Seaborn bar plot to display the feature importance. Save the feature importances into feature_importances .
Specifications for plot:
· The feature importances should be sorted in descending order
· Use a Seaborn bar plot. If you're confused on the syntax, the documentation is here (https://seaborn.pydata.org/generated/seaborn.barplot.html)
· Use a figure size of (10, 6)
· Properly label the title and axes.
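A sketch using a horizontal bar plot (a vertical one with rotated labels is equally valid):
In [ ]:
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

best_rf = grid_search.best_estimator_

# Pair importances with feature names and sort in descending order
feature_importances = pd.Series(best_rf.feature_importances_,
                                index=features.columns).sort_values(ascending=False)

plt.figure(figsize=(10, 6))
sns.barplot(x=feature_importances.values, y=feature_importances.index)
plt.title('Random Forest Feature Importances')
plt.xlabel('Importance')
plt.ylabel('Feature')
plt.show()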
1.4.5 Random Forest Confusion Matrix [4 points]
Finally, we will make use of a confusion matrix, which consolidates the predictive performance of a model into a single table. In a binary classification scenario, it is a 2x2 table counting true positives, false positives, false negatives, and true negatives.
Evaluate the performance using sklearn's confusion_matrix utility, and display the result with a Seaborn heatmap. Save the confusion matrix into conf_matrix .
Additionally, we will use the following set of parameters for our display:
1. annot set to True
2. fmt set to 'd'
3. cmap set to 'Blues' (the colormap)
4. cbar set to False (no colorbar)
5. xticklabels and yticklabels set to True
6. Axes labels and a title
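A sketch tying the pieces together (assuming the best grid-search model and the scaled test set):
In [ ]:
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.metrics import confusion_matrix

y_pred = grid_search.best_estimator_.predict(X_test_scaled)
conf_matrix = confusion_matrix(y_test, y_pred)

plt.figure(figsize=(6, 6))
sns.heatmap(conf_matrix, annot=True, fmt='d', cmap='Blues', cbar=False,
            xticklabels=True, yticklabels=True)
plt.title('Confusion Matrix')
plt.xlabel('Predicted Label')
plt.ylabel('True Label')
plt.show()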
1.4.6 Confusion Matrix Interpretation [Manually Graded 2 points]
From the confusion matrix above, we see that the model is relatively balanced in its predictions, despite there being a class imbalance in our data. What technique that we used in previous steps helped address the class imbalance, and why did it help?