
Large-Scale Data Mining: Models and Algorithms

ECE 219 Winter 25

Project 1: End-to-End Pipeline to Classify News Articles

Due Jan 24, 2025, by 11:59 PM

Overview

Statistical classification broadly refers to the task of learning to identify the subset of categories, from a predefined (generally human-curated) larger set, that pertain to a data point (a sample of text, an image, a video clip, a time signal, etc.). The model attempts to master the task given a training dataset (kept separate from an evaluation set) in which each data point is pre-labeled with its "correct" category membership(s). In this project, we deal with the classification of text data. The project consists of building an end-to-end pipeline to classify samples of news articles and involves the following ML components:

1. Feature Extraction: Construction of TF-IDF representations of textual data;

2. Dimensionality Reduction: Principal Component Analysis (PCA) and Non-negative Matrix Factorization (NMF), generally necessary for classical ML methods;

3. Application of Simple Classification Models: Applying common classification methods, such as logistic/linear classification and Support Vector Machines, to the extracted features;

4. Evaluating the Pipeline: Evaluating and diagnosing classification results using grid search and cross-validation;

5. Replacing Corpus-Level Features with Pretrained Features: Applying pre-training to the downstream classification task and evaluating comparative performance.

Getting familiar with the dataset

Please access the dataset at this link. We are using a custom dataset that was designed specifically for this quarter. Note: Do not attempt to open the downloaded file in Excel - the formatting of the file when visualized in Excel might suggest the data is corrupted, but this is not true. Consider exploring the dataset using Pandas. You might find the following Pandas functions helpful (in no specific order): read_csv, head, hist, shape.
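For example, a minimal first look might be (a sketch only; the file name is a placeholder for wherever you saved the downloaded CSV):

import pandas as pd

df = pd.read_csv("dataset.csv")   # placeholder path -- point it at the downloaded file

print(df.shape)                        # (number of samples, number of features)
print(df.head())                       # peek at full_text, leaf_label, root_label, etc.
print(df["root_label"].value_counts()) # class counts, useful for the histograms below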

QUESTION 1: Provide answers to the following questions:

• Overview: How many rows (samples) and columns (features) are present in the dataset?

• Histograms: Plot 3 histograms: (a) the total number of alphanumeric characters per data point (row) in the feature full_text, i.e. count on the x-axis and frequency on the y-axis; (b) the column leaf_label (class on the x-axis); (c) the column root_label (class on the x-axis).

• Interpret Plots: Provide qualitative interpretations of the histograms.

The two sets of labels, leaf_label and root_label, are hierarchically arranged as follows:

root_label: sports  -> leaf_label: basketball, baseball, tennis, football, soccer

root_label: climate -> leaf_label: forest fire, flood, earthquake, drought, heatwave

Binary Classification

For the first part of the project, we will be using only the full_text column as the raw features per sample (row) and the root_label column as the label for each sample. The root labels are well-separated. Before continuing on, please set the random seed as follows to ensure consistency:

import numpy as np

import random

np.random.seed(42)

random.seed(42)

1 Splitting the entire dataset into training and testing data

In order to measure the performance of our binary classification model, we split the dataset into a training set and a testing set. The model is trained on the training set and evaluated on the testing set. Note: Do not train on the testing set. We create the sets from a Pandas dataframe input df that contains the entire dataset. Please make sure that the random seeds are set and that the fraction of the test set is 0.2:

from sklearn.model_selection import train_test_split

train, test = train_test_split(df[["full_text","root_label"]], test_size=0.2)

train and test are dataframes containing the rows assigned to the training data and the testing data, respectively.

QUESTION 2: Report the number of training and testing samples.

2 Feature Extraction

The primary step in classifying a corpus of text is choosing a good representation of each data point. Since the full_text column contains the raw features describing each data point, we seek a feature extraction module that encodes raw text features into processed, computationally compatible features.

A good representation should retain enough class-discriminating information post-processing to perform classification competitively, yet at the same time be concise enough to avoid computational intractability and overfitting.

A first model: The Bag-of-Words (BOW): One feature extraction technique is the “Bag of Words” model, where a document – in this case the full_text feature of one data point – is represented as a histogram of word frequencies, or other statistics, within a fixed vocabulary of words. The vocabulary is aggregated from the training set (the process is described below).

Compiling the Vocabulary:

(a) Each raw feature (text segment) is split into sentences;

(b) Each sentence is split into words;

(c) The list of words in each sentence is passed jointly into a part-of-speech (POS) tagger to identify the nouns, verbs, adjectives, etc.;

(d) These tags are used to lemmatize/stem each word;

(e) Words are filtered: very rare or very frequent words (measured by the number of documents they occur in, or the number of times they occur within a document) are removed; digit- and punctuation-dominant tokens are removed; and stopwords (words contained in a database of common words) are removed;

(f) The remaining words are added to the vocabulary.

Say that the selected set of words that compose the vocabulary forms a set W. For a dataset D, the processed features can be collectively represented as a data matrix X ∈ R^(|D|×|W|), so each row captures the count of each word (the histogram) per data point.

An Example: If a test data point’s raw feature contains the text,

“On Saturday, the NHL hockey team went to the school to volunteer and educate.

Outreach is required for hockey, which is dying in popularity in recent years.”

and W = [“hockey”, “volunteer”, “sport”], the row in X corresponding to this data point would be [2, 1, 0], because “hockey” appears twice, “volunteer” appears once, and “sport” does not appear at all (though it might appear in another data point; remember, the vocabulary is aggregated across the training set).

During Testing: Each text sample in the testing set is processed in the same way as during training; however, words are no longer added to W – the vocabulary was fixed during training. Instead, if a word that exists in the vocabulary occurs in the processed words from a testing sample, its count is incremented in the resulting feature matrix X.

To avoid adding to the vocabulary during testing, and to avoid training on the test set in general, please note the following rule: For most of this project you will be using the NLTK and scikit-learn (sklearn) libraries for text processing and classifier design. In sklearn, in particular, only use the functions fit_transform and fit on the training set, and only use transform on the testing set.
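As a minimal illustration of this rule (shown here with CountVectorizer; the same pattern applies to TfidfTransformer and any other sklearn transformer, and the train/test dataframes are the ones produced by train_test_split above):

from sklearn.feature_extraction.text import CountVectorizer

vectorizer = CountVectorizer()

# Learn the vocabulary from the training text only.
X_train_counts = vectorizer.fit_transform(train["full_text"])

# Re-use that fixed vocabulary for the test text -- never call fit here.
X_test_counts = vectorizer.transform(test["full_text"])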

The better model: The Term Frequency-Inverse Document Frequency Model (TF-IDF): (Here “document” and “data point” are used interchangeably.) While the Bag-of-Words model continues to be used in many feature extraction pipelines, a normalized count vector that not only counts the number of times a word occurs in a document (the term frequency) but also scales the resulting count by the number of documents the word appears in (the document frequency) might provide a more valuable feature extraction approach.

The focus on the document frequency (and more precisely its inverse) encourages the feature vector to discriminate between documents as well as represent a document. TF-IDF does not only ask “What is the frequency of different words in a document?” but rather “What is the frequency of words in a document that is specific to that document and differentiates it from other documents?” A human reading a particular news article in the sports section will usually ignore contextually dominant words such as “sport”, “competition”, and “player”, despite these words being frequent in every news article in the sports section.

Such context-based conditioning of information is widely observed. The human perception system usually applies a saturating function (such as a logarithm or square root) to the actual input values of the visual signal before passing them on to the neuronal network in the brain. This makes sure that a contextually dominant signal does not overwhelm the decision-making processes in the brain. The TF-IDF functions draw their inspiration from such neuronal systems. Here we define the TF-IDF score to be:

TF-IDF(d, t) = TF(d, t) × IDF(t)

where TF(d, t) represents the frequency of the (processed, lemmatized, otherwise filtered) word t in document d, and the inverse document frequency is defined as:

IDF(t) = log(n / DF(t)) + 1

where n is the total number of documents, and DF(t) is the document frequency, i.e. the number of documents that contain the word t.
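As a quick numeric check of the formula above (toy numbers, not from the dataset; note that sklearn's TfidfTransformer additionally smooths the IDF by default, so its values will differ slightly):

import math

n = 8        # total number of documents
df_t = 2     # DF(t): word t appears in 2 of the 8 documents
tf_dt = 3    # TF(d, t): word t occurs 3 times in document d

idf_t = math.log(n / df_t) + 1   # = log(4) + 1, approximately 2.386
print(tf_dt * idf_t)             # TF-IDF(d, t), approximately 7.159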

import re

def clean(text):
    # Strip URLs that start a line.
    text = re.sub(r'^https?:\/\/.*[\r\n]*', '', text, flags=re.MULTILINE)
    texter = re.sub(r"<br />", " ", text)
    # The next three patterns were garbled in the handout; they most likely target
    # the HTML entities for double quotes, single quotes, and ampersands.
    texter = re.sub(r"&quot;", "\"", texter)
    texter = re.sub("&#39;", "\"", texter)
    texter = re.sub('\n', " ", texter)
    texter = re.sub(' u ', " you ", texter)
    texter = re.sub('`', "", texter)
    texter = re.sub(' +', ' ', texter)         # collapse repeated spaces
    texter = re.sub(r"(!)\1+", r"!", texter)   # collapse repeated '!'
    texter = re.sub(r"(\?)\1+", r"?", texter)  # collapse repeated '?'
    texter = re.sub("&amp;", "and", texter)
    texter = re.sub('\r', ' ', texter)
    clean = re.compile('<.*?>')                # matches any remaining HTML tag
    texter = texter.encode('ascii', 'ignore').decode('ascii')
    texter = re.sub(clean, '', texter)
    if texter == "":
        texter = ""
    return texter
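For instance, the cleaning step can be applied once to the whole text column before splitting and vectorizing (a minimal sketch, assuming df is the dataframe loaded earlier):

# Clean the raw text in place; the cleaned column is then used for feature extraction.
df["full_text"] = df["full_text"].apply(clean)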

QUESTION 3: Use the following specs to extract features from the textual data:

• Before doing anything, please clean each data sample using the code block provided above. This function helps remove many but not all HTML artefacts from the crawler’s output. You can also build your own cleaning module if you find this function to be ineffective.

• Use the “english” stopwords of the CountVectorizer

• Exclude terms that are numbers (e.g. “123”, “-45”, “6.7” etc.)

• Perform lemmatization with nltk.wordnet.WordNetLemmatizer and pos_tag

• Use min_df=3

Please answer the following questions:

• What are the pros and cons of lemmatization versus stemming? How do these processes affect the dictionary size?

• min_df means minimum document frequency. How does varying min_df change the TF-IDF matrix?

• Should I remove stopwords before or after lemmatizing? Should I remove punctuation before or after lemmatizing? Should I remove numbers before or after lemmatizing? Hint: Recall that the full sentence is input to the lemmatizer, and the lemmatizer tags the part of speech of every word based on the sentence structure.

• Report the shape of the TF-IDF-processed train and test matrices. The number of rows should match the results of Question 2. The number of columns should roughly be on the order of k × 10³. This dimension will vary depending on your exact method of cleaning and lemmatizing, and that is okay.

The following sklearn functions will be useful: CountVectorizer, TfidfTransformer, and, for the daring, Pipeline; see also the linked reference on lemmatization. Please refer to the discussion section notebooks for more guidance.
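To make these pieces concrete, here is one possible way to wire them together. This is a minimal sketch, not a required implementation: the POS-tag mapping, the number-filtering regex, and tagging each document's tokens in one pass (rather than sentence by sentence) are simplifying assumptions you may refine; the required settings (min_df=3, English stopwords, fitting only on the training set) follow the specs above.

import re
import nltk
from nltk import pos_tag, word_tokenize
from nltk.corpus import wordnet
from nltk.stem import WordNetLemmatizer
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer

# You may need one-time downloads, e.g. nltk.download('punkt'),
# nltk.download('averaged_perceptron_tagger'), nltk.download('wordnet').

lemmatizer = WordNetLemmatizer()

def penn_to_wordnet(tag):
    # Map Penn Treebank tags (returned by pos_tag) to the coarse tags WordNetLemmatizer expects.
    if tag.startswith("J"):
        return wordnet.ADJ
    if tag.startswith("V"):
        return wordnet.VERB
    if tag.startswith("R"):
        return wordnet.ADV
    return wordnet.NOUN

def lemmatize_tokens(doc):
    # Tokenize, POS-tag, lemmatize, then drop tokens that are pure numbers.
    tagged = pos_tag(word_tokenize(doc))
    lemmas = [lemmatizer.lemmatize(tok, penn_to_wordnet(tag)) for tok, tag in tagged]
    return [w for w in lemmas if not re.fullmatch(r"[-+]?\d*\.?\d+", w)]

count_vect = CountVectorizer(min_df=3, stop_words="english", tokenizer=lemmatize_tokens)
tfidf_transformer = TfidfTransformer()

# Fit only on the training set; the test set is only transformed.
X_train_counts = count_vect.fit_transform(train["full_text"])
X_train_tfidf = tfidf_transformer.fit_transform(X_train_counts)

X_test_counts = count_vect.transform(test["full_text"])
X_test_tfidf = tfidf_transformer.transform(X_test_counts)

print(X_train_tfidf.shape, X_test_tfidf.shape)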

3 Dimensionality Reduction

After applying the above operations, the dimensionality of the representation vectors (TF-IDF vectors) is large. Classical learning algorithms, like the ones required in this section, may perform poorly with such high-dimensional data. Since the TF-IDF matrix is sparse and effectively low-rank, one remedy is to project the points from the high-dimensional space to a lower-dimensional one.

In this project, we use two dimensionality reduction methods: Latent Semantic Indexing (LSI) and Non-negative Matrix Factorization (NMF), both of which minimize the mean squared residual error (MSE) between the original TF-IDF data matrix and a reconstruction of the matrix from its low-dimensional approximation. Recall that our data is the term-document TF-IDF matrix, whose rows correspond to the TF-IDF representations of the documents.

Latent Semantic Indexing (LSI): The LSI representation is obtained by computing left and right singular vectors corresponding to the top k largest singular values of the term-document TF-IDF matrix X.

We perform Singular Value Decomposition (SVD) on the matrix X, resulting in X = UΣV^T, where U and V are orthogonal. Let the singular values in Σ be sorted in descending order; then the first k columns of U and V are called U_k and V_k respectively. V_k consists of the principal components of the matrix X in the feature space.

Then we use XV_k (which is also equal to U_kΣ_k) as the dimension-reduced data matrix, where rows still correspond to documents, only now each data point is represented in a (far) lower-dimensional space. In this way, the number of features is reduced. LSI is similar to Principal Component Analysis (PCA); see the lecture notes for their relationship.

Having obtained U and V, to reduce the test data we ONLY multiply the test TF-IDF matrix X_t by V_k, i.e. X_t,reduced = X_t V_k. By doing so, we project the test TF-IDF vectors onto the principal components previously learned from training, and use the projections as the dimensionality-reduced data.
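In sklearn, TruncatedSVD implements exactly this learn-then-project pattern. A minimal sketch, assuming X_train_tfidf and X_test_tfidf are the TF-IDF matrices produced earlier (the value of k here is just an example):

from sklearn.decomposition import TruncatedSVD

k = 25
lsi = TruncatedSVD(n_components=k, random_state=42)

# fit_transform learns the top-k components on the training matrix and returns X V_k.
X_train_lsi = lsi.fit_transform(X_train_tfidf)

# transform only projects: X_t V_k, using the components learned from training.
X_test_lsi = lsi.transform(X_test_tfidf)

print(X_train_lsi.shape, X_test_lsi.shape)
print(lsi.explained_variance_ratio_.sum())  # useful for the explained-variance plot in Question 4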

Non-negative Matrix Factorization (NMF): NMF tries to approximate the data matrix X ∈ R^(n×m) (n = |D| documents and m = |W| terms) with the product WH (W ∈ R^(n×r), H ∈ R^(r×m)). Concretely, it finds non-negative matrices W and H such that ∥X − WH∥_F² is minimized, where ∥A∥_F ≡ sqrt(Σ_{i,j} A_{ij}²) denotes the Frobenius norm.

Then we use W as the dimension-reduced data matrix, and in the fit step we calculate both W and H. The intuition behind this is that we are trying to describe the documents (the rows in X) as (non-negative) linear combinations of r topics: we can view the rows of H, namely h_1^T, . . . , h_r^T, as r “topics”, each of which consists of m scores indicating how important each term is in that topic. Then x_i^T ≈ w_i1 h_1^T + w_i2 h_2^T + · · · + w_ir h_r^T, for i = 1, . . . , n.

Now how do we calculate the dimension-reduced test data matrix? Again, we try to describe the document vectors (rows, by our convention here) in the test data (call it X_t) as (non-negative) linear combinations of the “topics” we learned in the fit step. The “topics”, again, are the rows of the H matrix, {h_i^T}, i = 1, . . . , r. How do we do that? Just solve the optimization problem

min over W_t ≥ 0 of ∥X_t − W_t H∥_F²

where H is fixed as the matrix we learned in the fit step. Then W_t is used as the dimension-reduced version of X_t.
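sklearn's NMF class follows the same fit/transform convention: fit_transform on the training matrix learns W and H, and transform on the test matrix solves for W_t with H held fixed. A minimal sketch under the same assumptions as above (the solver settings are illustrative, and the dense conversion is acceptable at this dataset's scale):

import numpy as np
from sklearn.decomposition import NMF

nmf = NMF(n_components=25, init="random", random_state=42, max_iter=400)

W_train = nmf.fit_transform(X_train_tfidf)   # W: document-topic weights
H = nmf.components_                          # H: topic-term scores

W_test = nmf.transform(X_test_tfidf)         # H is fixed; only non-negative weights are fitted

# Squared Frobenius reconstruction error ||X - WH||_F^2 on the training data.
nmf_residual = np.linalg.norm(X_train_tfidf.toarray() - W_train @ H, ord="fro") ** 2
print(nmf_residual)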

QUESTION 4: Reduce the dimensionality of the data using the methods above:

• Plot the explained variance ratio across the values k = [1, 5, 10, 25, 50, 100, 500, 1000] for LSI, and choose k = 25 for the next few sections. What does the explained variance ratio plot look like? What does the plot's concavity suggest?

• With k = 25 from the previous part, calculate the reconstruction residual MSE when using LSI and NMF – they both should use the same k = 25. Which one is larger, ∥X − WH∥_F² in NMF or ∥X − U_kΣ_kV_k^T∥_F² in LSI, and why?

4 Classification Algorithms

In this part, you are asked to use the dimensionality-reduced training data from LSI with your choice of k to train (different types of) classifiers, and to evaluate the trained classifiers with the test data. Your task is to classify the documents into two classes (for now, a binary classification task): sports versus climate.

Classification Measures: Classification quality can be evaluated using different measures such as precision, recall, F-score, etc. Refer to the discussion material to find their definition.

Depending on the application, the true positive rate (TPR) and the false positive rate (FPR) have different levels of significance. In order to characterize the trade-off between the two quantities, we plot the receiver operating characteristic (ROC) curve. For binary classification, the curve is created by plotting the true positive rate against the false positive rate at various threshold settings on the probabilities assigned to each class (let us assume probability p for class 0 and 1 − p for class 1). In particular, a threshold t is applied to the value of p to select between the two classes. The value of the threshold t is swept from 0 to 1, and a pair of (TPR, FPR) values is obtained for each value of t. The ROC is the curve of TPR plotted against FPR.
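sklearn can sweep the threshold for you. A minimal sketch, where y_test (true binary labels) and scores (a continuous score per test sample, e.g. a positive-class probability or a decision-function value) are placeholders for whatever your classifier produces:

import matplotlib.pyplot as plt
from sklearn.metrics import roc_curve, auc

# If your labels are strings rather than 0/1, pass pos_label=... to roc_curve.
fpr, tpr, thresholds = roc_curve(y_test, scores)

plt.plot(fpr, tpr, label="AUC = %.3f" % auc(fpr, tpr))
plt.plot([0, 1], [0, 1], linestyle="--")   # chance-level reference line
plt.xlabel("False Positive Rate")
plt.ylabel("True Positive Rate")
plt.legend()
plt.show()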

Support Vector Machines (SVM): Linear Support Vector Machines are efficient when dealing with sparse, high-dimensional datasets, including textual data. They have been shown to provide good generalization and test accuracy, while having low computational complexity.

These models learn a vector of feature weights, w, and an intercept, b, from the training dataset. Once the weights are learned, the label of a data point is determined by thresholding w^T x + b at 0, i.e. sign(w^T x + b). Alternatively, one can produce the probability that the data point belongs to either class by applying a logistic function instead of hard thresholding, i.e. calculating σ(w^T x + b).

The learning process of the parameters w and b involves solving the following optimization problem:

minimize over w, b, {ξ_i}:   (1/2)∥w∥² + γ Σ_i ξ_i

subject to:   y_i(w^T x_i + b) ≥ 1 − ξ_i  and  ξ_i ≥ 0,  for all i

where x_i is the i-th data point, y_i ∈ {−1, 1} is its class label (encoded as ±1 for this formulation), and the ξ_i are slack variables.

Minimizing the sum of the slack variables corresponds to minimizing the loss function on the training data. On the other hand, minimizing the first term, which is basically a regularization term, corresponds to maximizing the margin between the two classes. Note that in the objective function, each slack variable represents the amount of error that the classifier can tolerate for a given data sample. The trade-off parameter γ controls the relative importance of the two components of the objective function. For instance, when γ ≫ 1, misclassification of individual points is highly penalized. This is called a “Hard Margin SVM”. In contrast, a “Soft Margin SVM”, which is the case when γ ≪ 1, is very lenient towards misclassification of a few individual points as long as most data points are well separated.
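In sklearn, the trade-off parameter written as γ above corresponds to the C argument of the SVM classes. A minimal sketch, assuming X_train_lsi / X_test_lsi are the LSI-reduced matrices from earlier and using the root_label column as the target:

from sklearn.svm import LinearSVC

y_train, y_test = train["root_label"], test["root_label"]

hard_svm = LinearSVC(C=2000)     # large C: misclassification heavily penalized ("hard margin")
soft_svm = LinearSVC(C=0.0005)   # small C: lenient toward individual errors ("soft margin")

# Increase max_iter if you see convergence warnings for the large-C model.
hard_svm.fit(X_train_lsi, y_train)
soft_svm.fit(X_train_lsi, y_train)

# Signed distance to the decision boundary; usable as the score for the ROC curve.
hard_scores = hard_svm.decision_function(X_test_lsi)
soft_scores = soft_svm.decision_function(X_test_lsi)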

QUESTION 5: Compare and contrast hard-margin and soft-margin linear SVMs:

• Train two linear SVMs:

– Train one SVM with γ = 2000 (hard margin), another with γ = 0.0005 (soft margin).

– Plot the ROC curve, report the confusion matrix and calculate the accuracy, recall, precision and F-1 score of both SVM classifiers on the testing set. Which one performs better? What about for γ = 100000?

– What happens for the soft margin SVM? Why is this the case? Analyze in terms of the confusion matrix.

∗ Does the ROC curve reflect the performance of the soft-margin SVM? Why?

• Use cross-validation to choose γ (use average validation accuracy to compare): Using 5-fold cross-validation, find the best value of the parameter γ in the range {10^k | −3 ≤ k ≤ 6, k ∈ Z}. Again, plot the ROC curve, report the confusion matrix and calculate the accuracy, recall, precision and F-1 score of this best SVM. (One possible way to set up this search is sketched below.)
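A sketch of the cross-validated search mentioned above (the estimator and scoring choices here are assumptions, not requirements):

from sklearn.model_selection import GridSearchCV
from sklearn.svm import LinearSVC

param_grid = {"C": [10.0 ** k for k in range(-3, 7)]}   # 10^-3 ... 10^6
grid = GridSearchCV(LinearSVC(), param_grid, cv=5, scoring="accuracy")
grid.fit(X_train_lsi, y_train)

print(grid.best_params_, grid.best_score_)   # best gamma (C) and its average validation accuracy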

Logistic Regression: Logistic regression is a probability model that can be used for binary classification. In logistic regression, a logistic function (σ(ϕ) = 1/(1 + exp (−ϕ))) acting on a linear function of the features (ϕ(x) = wTx + b) is used to calculate the probability that a data point belongs to a particular binary class, and during the training process, parameters w and b that maximize the likelihood of predicting the correct labels on the training data are learned.

One can also add a regularization term to the objective function, so that the goal of the training process is not only in maximizing the likelihood, but also in minimizing the regularization term, which is often some norm of the parameter vector w. Adding regularization helps prevent ill-conditioned results and over-fitting, and facilitates generalization. A coefficient is used to control the trade-off between maximizing likelihood and minimizing the regularization term.
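A minimal sketch of the three variants asked for in Question 6 (the C values are placeholders to be chosen by cross-validation; in recent sklearn versions penalty=None disables regularization, while older versions instead need a very large C, since C is the inverse regularization strength):

from sklearn.linear_model import LogisticRegression

# (Approximately) unregularized: penalty=None in recent sklearn, otherwise use a huge C.
logreg_none = LogisticRegression(penalty=None, max_iter=1000)

# L1- and L2-regularized variants; C=1.0 is a placeholder.
logreg_l1 = LogisticRegression(penalty="l1", solver="liblinear", C=1.0)
logreg_l2 = LogisticRegression(penalty="l2", C=1.0)

for model in (logreg_none, logreg_l1, logreg_l2):
    model.fit(X_train_lsi, y_train)
    print(model.score(X_test_lsi, y_test))   # test accuracy; learned coefficients are in model.coef_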

QUESTION 6: Evaluate a logistic classifier:

• Train a logistic classifier without regularization (you may need to come up with some way to approximate this if you use sklearn.linear_model.LogisticRegression); plot the ROC curve, report the confusion matrix and calculate the accuracy, recall, precision and F-1 score of this classifier on the testing set.

• Find the optimal regularization coefficient:

– Using 5-fold cross-validation on the dimension-reduced-by-SVD training data, find the optimal regularization strength in the range {10^k | −5 ≤ k ≤ 5, k ∈ Z} for logistic regression with L1 regularization and logistic regression with L2 regularization, respectively.

– Compare the performance (accuracy, precision, recall and F-1 score) of 3 logistic classifiers: without regularization, with L1 regularization, and with L2 regularization (using the best parameters you found in the part above), on the test data.

– How does the regularization parameter affect the test error? How are the learned coefficients affected? Why might one be interested in each type of regularization?

– Both logistic regression and linear SVM are trying to classify data points using a linear decision boundary. What is the difference between their ways to find this boundary? Why do their performances differ? Is this difference statistically significant?

Naïve Bayes Model: Scikit-learn provides a suite of Naïve Bayes classifiers including MultinomialNB, BernoulliNB, and GaussianNB. Naïve Bayes classifiers use the assumption that features are statistically independent of each other when conditioned on the class the data point belongs to, in order to simplify the calculation for the maximum a posteriori (MAP) estimation of the labels. That is,

P(x_i | y, x_1, . . . , x_{i−1}, x_{i+1}, . . . , x_m) = P(x_i | y)   for all i ∈ {1, . . . , m}

where the x_i are the features and y is the label of the data point. MultinomialNB, BernoulliNB, and GaussianNB use different underlying probability models.
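A minimal sketch for the question below, under the same assumptions about the LSI-reduced data (GaussianNB expects dense input, which the LSI output already is):

from sklearn.naive_bayes import GaussianNB

gnb = GaussianNB()
gnb.fit(X_train_lsi, y_train)

# Probability of the positive class per test sample, usable as the ROC score.
gnb_scores = gnb.predict_proba(X_test_lsi)[:, 1]
print(gnb.score(X_test_lsi, y_test))   # test accuracy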

QUESTION 7: Evaluate and profile a Naïve Bayes classifier: Train a GaussianNB classifier; plot the ROC curve, report the confusion matrix and calculate the accuracy, recall, precision and F-1 score of this classifier on the testing set.