Statistical Machine Learning GR5241
Spring 2023
Homework 5
Due: May 1st, 2023
Homework submission: Please submit your homework electronically through Canvas by 11:59pm on the due
date. You need to submit both the pdf file and your code (either in R or Python).
Problem 1 (Boosting, 30 points)
The objective of this problem is to implement the AdaBoost algorithm. We will test the algorithm on handwritten
digits from the USPS data set.
AdaBoost: Assume we are given a training sample (x^(i), y_i), i = 1, ..., n, where the x^(i) are data values in R^d
and y_i ∈ {−1, +1} are class labels. Along with the training data, we provide the algorithm with a training routine for
some classifier c (the “weak learner”). Here is the AdaBoost algorithm for the two-class problem:
1. Initialize weights: w_i = 1/n
2. for b = 1, ..., B
(a) Train a weak learner cb on the weighted training data.
(b) Compute error: err_b := ( Σ_{i=1}^{n} w_i 1{y_i ≠ c_b(x^(i))} ) / ( Σ_{i=1}^{n} w_i )
(c) Compute voting weights: α_b := log( (1 − err_b) / err_b )
(d) Recompute weights: w_i := w_i · exp( α_b · 1{y_i ≠ c_b(x^(i))} )
3. Return classifier ĉ_B(x) := sgn( Σ_{b=1}^{B} α_b c_b(x) )
Decision stumps: Recall that a stump classifier c is defined by
c(x | j, θ, m) := { +m if x_j > θ, −m if x_j ≤ θ },   with parameters j ∈ {1, ..., d}, θ ∈ R, m ∈ {−1, +1}.   (1)
Since the stump ignores all entries of x except x_j, it is equivalent to a linear classifier defined by an affine
hyperplane. The plane is orthogonal to the j-th axis, with which it intersects at x_j = θ. The orientation of the
hyperplane is determined by m ∈ {−1, +1}. We will employ stumps as weak learners in our boosting algorithm.
To train stumps on weighted data, use the learning rule
(j∗, θ∗, m∗) := arg min_{j, θ, m} Σ_{i=1}^{n} w_i 1{y_i ≠ c(x^(i) | j, θ, m)} .   (2)
In the implementation of your training routine, first determine an optimal parameter θ∗_j for each dimension
j = 1, ..., d, and then select the j∗ for which the cost term in (2) is minimal.
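For orientation, a minimal brute-force sketch of such a training routine and the corresponding classification routine in R could look as follows; it assumes (as specified in problem 1.i below) that X stores one training vector per column, and the variable names are illustrative, not required:

train <- function(X, w, y) {
  best <- list(cost = Inf)
  for (j in 1:nrow(X)) {                          # try every coordinate ...
    for (theta in sort(unique(X[j, ]))) {         # ... every observed value as a threshold ...
      for (m in c(-1, 1)) {                       # ... and both orientations
        pred <- ifelse(X[j, ] > theta, m, -m)
        cost <- sum(w * (pred != y))              # weighted misclassification cost, cf. (2)
        if (cost < best$cost) best <- list(j = j, theta = theta, m = m, cost = cost)
      }
    }
  }
  list(j = best$j, theta = best$theta, m = best$m)
}

classify <- function(X, pars) {
  ifelse(X[pars$j, ] > pars$theta, pars$m, -pars$m)
}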
Homework problems:
1.i (8) Implement the AdaBoost algorithm in R. The algorithm requires two auxiliary functions, to train and
evaluate the weak learner. We also need a third function which implements the resulting boosting classifier.
We will use decision stumps as weak learners, but a good implementation of the boosting algorithm should
permit you to easily plug in arbitrary weak learners. To make sure that is possible, please use function calls
of the following form.
• pars <- train(X, w, y) for the weak learner training routine, where X is a matrix the columns of
which are the training vectors x^(1), ..., x^(n), and w and y are vectors containing the weights and class
labels. The output pars is a list which contains the parameters specifying the resulting classifier. (In
our case, pars will be the triplet (j, θ, m) which specifies the decision stump.)
• label <- classify(X, pars) for the classification routine, which evaluates the weak learner on X
using the parametrization pars.
• A function c_hat <- agg_class(X, alpha, allPars) which evaluates the boosting classifier (“aggregated
classifier”) on X. The argument alpha denotes the vector of voting weights and allPars contains the
parameters of all weak learners. (A schematic sketch of a boosting loop built on these calls is given
after problem 1.iv below.)
1.ii (6) Implement the functions train and classify for decision stumps.
1.iii (12) Run your algorithm on the USPS data (the digit data we used in Homework 4, use the training and
test data for the 5s and 6s) and evaluate your results using cross validation.
Your AdaBoost algorithm returns a classifier that is a combination of B weak learners. Since it is an
incremental algorithm, we can evaluate the AdaBoost classifier at every iteration b by considering the sum
up to the b-th weak learner. At each iteration, perform 5-fold cross-validation to estimate the training and
test error of the current classifier (that is, the errors measured on the cross-validation training and test
sets, respectively).
1.iv (4) Plot the training error and the test error as a function of b.
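For orientation, here is a schematic sketch of the boosting loop and the aggregated classifier in R, built on the function calls specified in 1.i; the bookkeeping shown is illustrative, not prescribed:

adaboost <- function(X, y, B) {
  n <- ncol(X)
  w <- rep(1 / n, n)                              # step 1: uniform initial weights
  alpha <- numeric(B)
  allPars <- vector("list", B)
  for (b in 1:B) {
    allPars[[b]] <- train(X, w, y)                # step 2(a): fit weak learner on weighted data
    miss <- as.numeric(classify(X, allPars[[b]]) != y)
    err <- sum(w * miss) / sum(w)                 # step 2(b): weighted error
    alpha[b] <- log((1 - err) / err)              # step 2(c): voting weight
    w <- w * exp(alpha[b] * miss)                 # step 2(d): upweight misclassified points
  }
  list(alpha = alpha, allPars = allPars)
}

agg_class <- function(X, alpha, allPars) {
  votes <- sapply(seq_along(alpha), function(b) alpha[b] * classify(X, allPars[[b]]))
  sign(rowSums(as.matrix(votes)))                 # step 3: weighted majority vote
}

For 1.iii, evaluating agg_class with alpha[1:b] and allPars[1:b] gives the classifier after b boosting iterations, which is what the error curves in 1.iv are based on.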
Submission. Please make sure your solution contains the following:
• Your implementations of train, classify and agg_class.
• Your implementation of AdaBoost.
• Plots of your results (training error and cross-validated test error).
Problem 2 (Hierarchical clustering, 5 points)
Perform single-linkage hierarchical clustering on the following 2-dimensional observations. Use the Manhattan
distance between a pair of points A and B: d(A, B) = |X_1A − X_1B| + |X_2A − X_2B|. Show your results after each step.
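If you want to check your hand computation, R's built-in functions reproduce single-linkage clustering with Manhattan distances; the data frame below is only a placeholder for the observations given above:

X <- data.frame(x1 = c(1, 2, 5, 6), x2 = c(1, 1, 4, 5))   # placeholder values, not the actual data
d <- dist(X, method = "manhattan")                        # pairwise Manhattan distances
hc <- hclust(d, method = "single")                        # single-linkage merges
hc$merge; hc$height                                       # merge order and merge distances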
Problem 3 (Back Propagation, 15 points)
The dataset "BackPropData.csv" contains two features x1, x2 and dichotomous response y (coded y = 0 or
y = 1). Consider applying the vanilla one hidden layer neural network to classify dichotomous response y.
• One hidden layer with 3 nodes
• Bias for both the hidden layer and output layer
• Sigmoid activation function at the hidden layer
• Softmax output function
• Cross-entropy cost (or objective) Q
• 17 total parameters defined by θ
• θ is the collection of weights and biases: W^[1], b^[1], W^[2], b^[2].
Required task: Compute ∇Q(θ∗) using backpropagation. The symbol ∇Q(θ∗) is the gradient of Q(θ) evaluated
at θ∗. Note that you are not solving the full gradient descent optimization problem; this problem is simplified to
a single required step of the gradient descent algorithm.
Please solve this problem manually. A loop is not required because the algorithm only iterates over k = 2, 1.
Display the full gradient ∇Q(θ∗) in your solution.
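For concreteness, one common way to organize the chain-rule bookkeeping for this architecture is sketched below; it assumes y is one-hot encoded, σ denotes the sigmoid, ⊙ denotes element-wise multiplication, and per-observation gradients are summed to form ∇Q(θ∗). Any equivalent arrangement of the same derivatives is fine.

Forward pass:  z^[1] = W^[1] x + b^[1],  a^[1] = σ(z^[1]),  z^[2] = W^[2] a^[1] + b^[2],  ŷ = softmax(z^[2])
Output layer (k = 2):  δ^[2] = ŷ − y,   ∂Q/∂W^[2] = δ^[2] (a^[1])ᵀ,   ∂Q/∂b^[2] = δ^[2]
Hidden layer (k = 1):  δ^[1] = ((W^[2])ᵀ δ^[2]) ⊙ a^[1] ⊙ (1 − a^[1]),   ∂Q/∂W^[1] = δ^[1] xᵀ,   ∂Q/∂b^[1] = δ^[1]

Stacking the four blocks ∂Q/∂W^[1], ∂Q/∂b^[1], ∂Q/∂W^[2], ∂Q/∂b^[2] gives the 17 entries of ∇Q(θ∗).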
Problem 4 (Multinomial Clustering, 30 points)
In this exercise, we will apply the EM algorithm and a finite mixture of multinomial distributions to the image
segmentation problem. Image segmentation refers to the task of dividing an input image into regions of pixels that
“belong together” (which is about as formal and precise a definition as you will be able to find in the literature.)
One way to approach image segmentation is to extract a suitable set of features from the image and apply a
clustering algorithm to them. The clusters output by the algorithm can then be regarded as the image segments.
The features we will use are histograms drawn from small regions in the image.
Histogram clustering: The term histogram clustering refers to grouping methods whose input features are
histograms. A histogram can be regarded as a vector; if all input histograms have the same number of bins d, a
set of input histograms can be regarded as a set of vectors in R^d. We can therefore perform a simple histogram
clustering by running a k-means algorithm on this set of vectors. The method we will implement in this problem
is slightly more sophisticated: Since histograms are represented by multinomial distributions, we describe the
clustering problem by a finite mixture of multinomial distributions, which we estimate with the EM algorithm.
The EM algorithm for multinomial mixtures. The input of the algorithm is:
• A matrix of histograms, denoted H, which contains one histogram vector in each row. Each column
corresponds to a histogram bin.
• An integer K which specifies the number of clusters.
• A threshold parameter τ .
The variables we need are:
• The input histograms. We denote the histogram vectors by H_1, ..., H_n.
• The centroids t_1, ..., t_K. These are vectors in R^d (just like the input features). Each centroid is the
parameter vector of a multinomial distribution and can be regarded as the center of a cluster; they are
computed as the weighted averages of all features assigned to the cluster.
• The assignment probabilities a_1, ..., a_n. Each of the vectors a_i is of length K, i.e. it contains one entry
for each cluster k. The entry a_ik specifies the probability of feature i being assigned to cluster k. The
matrix which contains the vectors a_i in its rows will be denoted A.
The algorithm alternates between computing the assignment probabilities a_i and adjusting the cluster
centroids t_k.
The algorithm:
1. Choose K of the histograms at random and normalize each. These are our initial centroids t_1, ..., t_K.
2. Iterate:
(a) E-step: Compute the components of each a_i, the assignment probabilities a_ik, as follows:
φ_ik := exp( Σ_{j=1}^{d} H_ij log(t_kj) ) ,
a_ik := c_k φ_ik / ( Σ_{l=1}^{K} c_l φ_il ) ,
where the mixture weights c_k are those from the previous M-step (in the first iteration you may take
them to be uniform, c_k = 1/K). Note that φ_ik is the multinomial probability of H_i with parameters t_k,
up to the multinomial coefficient n!/(H_i1! ··· H_id!), which cancels out in the computation of a_ik.
(b) M-step: Compute new mixture weights c_k and class centroids t_k as
c_k := ( Σ_{i=1}^{n} a_ik ) / n ,
b_k := Σ_{i=1}^{n} a_ik H_i   and   t_k := b_k / ( Σ_{j=1}^{d} b_kj ) .
(c) Compute a measure of the change of assignments during the current iteration:
δ := ∥ A − A_old ∥ ,
where A is the assignment matrix (with entries a_ik), A_old denotes the assignment matrix from the
previous step, and ∥ . ∥ is the matrix 1-norm (the largest sum of column absolute values). This is
implemented in R as norm( . ,"O").
3. Terminate the iteration when δ < τ.
4. Turn the soft assignments into a vector m of hard assignments by computing, for i = 1, ..., n,
m_i := arg max_{k=1,...,K} a_ik ,
i.e. the index k of the cluster for which the assignment probability a_ik is maximal.
The histogram data: The histograms we use have been extracted from an image by the following procedure:
1. Select a subset of pixels, called the sites, at which we will draw a histogram. Usually, this is done by
selecting the pixels at the nodes of an equidistant grid. (For example, a 2-by-2 grid means that we select
every second pixel in every second row as a site.)
2. Place a rectangle of fixed radius around the site pixel.
3. Select all pixels within the rectangle and sort their intensity values into a histogram.
The data we provide you with was drawn from the 800 × 800 grayscale image shown below (left):
The image is a radar image; more specifically, it was taken using a technique called synthetic aperture radar (or
SAR). This type of image is well-suited for segmentation by histogram clustering, because the local intensity
distributions provide distinctive information about the segments. The image on the right is an example segmentation
using K = 3 clusters, computed with the algorithm described above.
The histograms were drawn at the nodes of a 4-by-4 pixel grid. Since the image has 800 × 800 pixels, there are
200 × 200 = 40000 histograms. Each was drawn within a rectangle of edge length 11 pixels, so each histogram
contains 11 × 11 = 121 values.
Homework problems. Download the data file histograms.bin from the course homepage. You can load it in
R with the command
H<-matrix(readBin("histograms.bin", "double", 640000), 40000, 16)
which results in a 40000 × 16 matrix. Each row is a histogram with 16 bins. You can check that the matrix has
been loaded correctly by saying dim(H) (which should be 40000 rows by 16 columns) and rowSums(H) (the sum
of each row should be 121).
4.i (10) Implement the EM algorithm in R. The function call should be of the form m<-MultinomialEM(H,K,tau),
with H the matrix of input histograms, K the number of clusters, and tau the threshold parameter τ.
4.ii (10) Run the algorithm on the input data for K=3, K=4 and K=5. You may have to try different values of τ
to obtain a reasonable result.
4.iii (10) Visualize the results as an image. You can turn a hard assignment vector m returned by the EM
algorithm into an image using the image function in R.
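For instance, since the sites lie on a 200 × 200 grid, something along the following lines should work; the exact orientation may need a transpose or row reversal to match the original image:

seg <- matrix(m, nrow = 200, ncol = 200)             # put the 40000 hard assignments back on the site grid
image(seg, col = gray.colors(max(m)), axes = FALSE)  # one gray level per cluster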
Hints:
• You may encounter numerical problems due to empty histogram bins. If so, add a small constant (such as
0.01) to the input histograms.
• Both the E-step and the M-step involve sums, which are straightforward to implement as loops, but loops
are slow in interpreted programming languages (such as R or Matlab). If you want to obtain a fast algorithm,
try to avoid loops by implementing the sums by matrix/vector operations.
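Putting the pieces together, a vectorized sketch of the EM routine might look as follows. It assumes uniform initial mixture weights, subtracts the row-wise maximum of the log-probabilities before exponentiating (a per-row constant cancels in the normalization, so this only guards against underflow), and uses illustrative variable names:

MultinomialEM <- function(H, K, tau) {
  H <- H + 0.01                                   # guard against empty bins (first hint above)
  n <- nrow(H)
  Tk <- H[sample(n, K), ]                         # step 1: K histograms chosen at random ...
  Tk <- Tk / rowSums(Tk)                          # ... and normalized to give the initial centroids
  ck <- rep(1 / K, K)                             # initial mixture weights (assumed uniform)
  A <- matrix(0, n, K)
  repeat {
    # E-step: log phi_ik = sum_j H_ij log t_kj, computed as one matrix product
    logphi <- H %*% t(log(Tk))                    # n x K
    logphi <- logphi - apply(logphi, 1, max)      # stabilize; the constant cancels below
    Anew <- exp(logphi) * rep(ck, each = n)       # c_k phi_ik
    Anew <- Anew / rowSums(Anew)                  # a_ik
    # M-step: new mixture weights and centroids
    ck <- colSums(Anew) / n
    B <- t(Anew) %*% H                            # K x d matrix with rows b_k = sum_i a_ik H_i
    Tk <- B / rowSums(B)                          # t_k = b_k normalized to sum to one
    # change of assignments and termination check
    delta <- norm(Anew - A, "O")
    A <- Anew
    if (delta < tau) break
  }
  apply(A, 1, which.max)                          # hard assignments m_i
}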
Problem 5 (Implementing PageRank, 20 points)
In this problem, you will implement the PageRank algorithm. You will be experimenting with a small randomly
generated graph (assume the graph has no dead ends), which can be downloaded from Canvas.
It has n = 100 nodes (numbered 1, 2, . . . , 100) and m = 1024 edges, 100 of which form a directed cycle (through
all the nodes) that ensures the graph is connected. There may be multiple edges between a pair of nodes;
your code should handle these instead of ignoring them. The first column in graph.txt refers to the source
node, and the second column refers to the destination node.
Assume the directed graph G = (V, E) has n nodes (numbered 1, 2, . . . , n) and m edges, all nodes have positive
out-degree, and A = (A_ji)_{n×n} is an n × n matrix as defined in class such that for any i, j ∈ {1, 2, . . . , n}:
A_ji := (number of edges from node i to node j) / deg(i) .
Here, deg(i) is the number of outgoing edges of node i in G. Denote the PageRank vector by the column vector
r. By the definition of PageRank, we have the following equation:
r = (α/n) 1 + (1 − α) A r ,
where 1 is the vector of length n with all entries equal to 1.
Based on this equation, the iterative procedure to compute PageRank works as follows:
• Initialize: r^(0) = (1/n) 1.
• For i from 1 to k, iterate: r^(i) = (α/n) 1 + (1 − α) A r^(i−1).
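A compact sketch of this procedure in R, assuming the edge list is read from graph.txt into a two-column table (column order as described above) and using a dense matrix for A (the graph is small), could look like:

edges <- as.matrix(read.table("graph.txt"))       # column 1: source node, column 2: destination node
n <- 100
deg <- tabulate(edges[, 1], nbins = n)            # out-degrees, counting multiple edges
A <- matrix(0, n, n)
for (e in seq_len(nrow(edges))) {                 # A[j, i] accumulates (edge multiplicity) / deg(i)
  i <- edges[e, 1]; j <- edges[e, 2]
  A[j, i] <- A[j, i] + 1 / deg[i]
}
alpha <- 0.2
r <- rep(1 / n, n)                                # r^(0) = (1/n) 1
for (it in 1:40) {
  r <- alpha / n + (1 - alpha) * (A %*% r)        # r^(i) = (alpha/n) 1 + (1 - alpha) A r^(i-1)
}
order(r, decreasing = TRUE)[1:5]                  # ids of the five highest-ranked nodes
order(r)[1:5]                                     # ids of the five lowest-ranked nodes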
You need to do the following tasks:
1. (5) Implement the iterative procedure mentioned above to compute PageRank.
2. (5) Run the aforementioned iterative process for 40 iterations, assuming α = 0.2, and obtain the PageRank
vector r.
3. (5) List the top 5 node ids with the highest PageRank scores (submit both ids and scores).
4. (5) List the bottom 5 node ids with the lowest PageRank scores (submit both ids and scores).