Algorithm: Let us use a simple algorithm such that, for each user u, the algorithm recommends
N = 10 users who are not already friends with u but have the largest number of mutual friends
with u.
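To make the counting step concrete, here is a minimal single-machine sketch in plain Python (not Spark). It assumes the friendship graph is available as a dictionary mapping each user ID to the set of that user's friends; the names `mutual_friend_counts` and `toy_friends`, and the toy graph itself, are hypothetical and serve only as an illustration.

```python
from collections import Counter


def mutual_friend_counts(friends, user):
    """Return {candidate: number of mutual friends with `user`}.

    `friends` is assumed to map each user ID to a set of friend IDs.
    Candidates are friends-of-friends who are not already friends with `user`.
    """
    direct = friends.get(user, set())
    counts = Counter()
    for friend in direct:
        for candidate in friends.get(friend, set()):
            # Skip the user and anyone who is already a direct friend.
            if candidate != user and candidate not in direct:
                counts[candidate] += 1
    return dict(counts)


# Hypothetical toy graph (undirected), keyed by user ID.
toy_friends = {
    0: {1, 2, 3},
    1: {0, 2, 4},
    2: {0, 1, 4},
    3: {0, 5},
    4: {1, 2},
    5: {3},
}

print(mutual_friend_counts(toy_friends, 0))
```

In this toy graph, user 0 shares two friends (1 and 2) with user 4 and one friend (3) with user 5. A Spark solution would typically express the same idea as a join or a pair-emitting map followed by a count per pair, but that design is left open here.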
Output:
• The output should contain one line per user in the following format:
<User><TAB><Recommendations>
where <User> is a unique ID corresponding to a user and <Recommendations> is a
comma-separated list of unique IDs corresponding to the algorithm’s recommendations of
people that <User> might know, ordered by decreasing number of mutual friends.
• Note: The exact number of recommendations per user could be fewer than 10. If a user has
fewer than 10 second-degree friends, output all of them in decreasing order of the number of
mutual friends. If a user has no friends, you can provide an empty list of recommendations.
If several recommended users have the same number of mutual friends, output those
user IDs in numerically ascending order.
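The ordering and formatting rules above can be sketched in plain Python as follows. This is only an illustration, assuming the mutual-friend counts for one user have already been computed into a dictionary; the function name `format_recommendations` and the sample counts are hypothetical.

```python
def format_recommendations(user, counts, n=10):
    """Build one output line: <User><TAB><Recommendations>.

    `counts` is assumed to map candidate user IDs (integers) to their
    number of mutual friends with `user`.
    """
    # Sort by decreasing mutual-friend count, then by ascending numeric ID.
    ranked = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    top_ids = [str(candidate) for candidate, _ in ranked[:n]]
    # An empty `counts` yields the user ID, a tab, and an empty list.
    return f"{user}\t{','.join(top_ids)}"


# Hypothetical counts: candidates 12 and 30 tie with two mutual friends each.
print(format_recommendations(7, {30: 2, 12: 2, 5: 3, 40: 1}))
```

Sorting on the tuple `(-count, id)` handles both rules in one pass. Note that the IDs must be compared as integers, not strings, for the tie-break to be numeric.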
Pipeline sketch: Please provide a description of how you used Spark to solve this problem.
Don’t write more than 3 to 4 sentences for this: we only want a very high-level description of
your strategy to tackle this problem.
Tips:
• Use Google Colab to run Spark seamlessly, e.g., copy and adapt the setup cells from
Colab 0.
• Before submitting a complete application to Spark, you may go line by line, checking the
output of each step. The .take(X) command is helpful if you want to check the
first X elements of an RDD.
• As a sanity check, your top 10 recommendations for user ID 1571 should be: 35, 247, 716,
719, 1526, 1527, 1528, 1529, 1530, 1531.
• The execution may take a while.
• You can also create a toy test dataset (e.g., using Figure 1) to help you debug the program.
What to submit
You need to submit the following three files:
1. A short writeup containing:
• Q1(a): The total sales for the 3 types, respectively. (5 pts)
• Q1(b): The average sales on Holidays vs. Non-Holidays. (5 pts)
• Q2: A short paragraph sketching your Spark pipeline. (12 pts)
• Q2: The recommendations for the users with the following user IDs: 10, 152, 288, 603,
714, 1525, 2434, 2681. (6 pts for each, 48 pts in total)
2. Your code for Q1. (10 pts)
3. Your code for Q2. (20 pts)