Ghostwriting: Data Mining for lecture (2023-10-12)

Algorithm: Use a simple algorithm that, for each user u, recommends the N = 10 users who are not already friends with u but share the largest number of mutual friends with u.
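The counting step can be illustrated on a small in-memory graph. The following is a minimal plain-Python sketch, not a Spark solution: the function name `recommend`, the `friends` dictionary layout, and the toy graph are all assumptions made for illustration. It relies on the observation that any two friends of the same user have that user as a mutual friend.

```python
from collections import defaultdict
from itertools import combinations


def recommend(friends, n=10):
    """Return up to n recommendations per user.

    friends: dict mapping user ID -> set of friend IDs
             (hypothetical layout; friendships are assumed symmetric).
    """
    # mutual[a][b] = number of friends that a and b have in common
    mutual = defaultdict(lambda: defaultdict(int))
    for user, flist in friends.items():
        # Every pair of this user's friends shares `user` as a mutual friend.
        for a, b in combinations(sorted(flist), 2):
            mutual[a][b] += 1
            mutual[b][a] += 1

    result = {}
    for user, flist in friends.items():
        # Drop candidates who are already direct friends.
        candidates = {c: k for c, k in mutual[user].items() if c not in flist}
        # Most mutual friends first; ties broken by ascending user ID.
        ranked = sorted(candidates, key=lambda c: (-candidates[c], c))
        result[user] = ranked[:n]
    return result


# Hypothetical toy graph, not taken from the assignment data.
toy = {
    1: {2, 3},
    2: {1, 3, 4},
    3: {1, 2, 4},
    4: {2, 3, 5},
    5: {4},
}
recs = recommend(toy)
print(recs)
```

In a Spark version, the pair-generation loop would roughly correspond to a flat-map over each user's friend list and the counting to a per-key reduction, but that mapping is only one possible design.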

Output:

• The output should contain one line per user in the following format:

<User><TAB><Recommendations>

where <User> is a unique ID corresponding to a user and <Recommendations> is a comma-separated list of unique IDs corresponding to the algorithm’s recommendation of people that <User> might know, ordered by decreasing number of mutual friends.

• Note: The number of recommendations per user may be fewer than 10. If a user has fewer than 10 second-degree friends, output all of them in decreasing order of the number of mutual friends. If a user has no friends, you can provide an empty list of recommendations. If several recommended users have the same number of mutual friends, output those user IDs in numerically ascending order.
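The ordering and formatting rules above can be sketched as a small helper. This is an illustrative assumption, not a required interface: `format_line` and its `mutual_counts` argument (a mapping from candidate ID to number of mutual friends) are hypothetical names, and the sample counts are made up.

```python
def format_line(user, mutual_counts, n=10):
    """Build one output line: <User><TAB><Recommendations>.

    mutual_counts: dict mapping candidate ID -> number of mutual friends
                   (hypothetical input; direct friends already excluded).
    """
    # Sort by mutual-friend count descending, then by user ID ascending.
    ranked = sorted(mutual_counts.items(), key=lambda kv: (-kv[1], kv[0]))
    ids = [str(uid) for uid, _ in ranked[:n]]
    # A user with no candidates gets an empty recommendation list.
    return f"{user}\t{','.join(ids)}"


line = format_line(7, {12: 3, 5: 3, 40: 1, 9: 2})
empty = format_line(8, {})
print(repr(line))
print(repr(empty))
```

Sorting on the tuple `(-count, id)` handles both rules in a single pass, which avoids a separate tie-breaking step.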

Pipeline sketch: Please describe how you used Spark to solve this problem. Don’t write more than 3 to 4 sentences: we only want a very high-level description of your strategy.

Tips:

• Use Google Colab to run Spark seamlessly, e.g., copy and adapt the setup cells from Colab 0.

• Before submitting a complete application to Spark, you may go line by line, checking the output of each step. The command .take(X) is helpful for inspecting the first X elements of an RDD.

• As a sanity check, your top 10 recommendations for user ID 1571 should be: 35, 247, 716, 719, 1526, 1527, 1528, 1529, 1530, 1531.

• The execution may take a while.

• You can also create a toy test dataset (e.g., using Figure 1) to help you debug the program.
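The sanity check for user 1571 can be automated with a short script. The sketch below assumes the output format described earlier (user ID, a tab, then comma-separated IDs). The helper names `parse_line` and `check` are hypothetical, and the `sample` line is simply built from the expected values listed in the tips, not from a real run.

```python
# Expected recommendations for user 1571, as stated in the tips.
EXPECTED_1571 = [35, 247, 716, 719, 1526, 1527, 1528, 1529, 1530, 1531]


def parse_line(line):
    """Split '<User><TAB><Recommendations>' into (user, [ids])."""
    user, _, recs = line.rstrip("\n").partition("\t")
    # An empty recommendation list yields an empty Python list.
    ids = [int(x) for x in recs.split(",") if x]
    return int(user), ids


def check(line, user, expected):
    """Return True if the line belongs to `user` and matches `expected` exactly."""
    got_user, got = parse_line(line)
    return got_user == user and got == expected


# Line constructed from the expected values, for demonstration only.
sample = "1571\t35,247,716,719,1526,1527,1528,1529,1530,1531"
ok = check(sample, 1571, EXPECTED_1571)
print(ok)
```

The same check can be pointed at the corresponding line of your own output file, or at a toy dataset whose correct answer you have worked out by hand.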

What to submit

You need to submit the following three files:

1. A short writeup containing:

• Q1(a): The total sales for each of the 3 types. (5 pts)

• Q1(b): The average sales on Holidays vs. Non-Holidays. (5 pts)

• Q2: A short paragraph sketching your Spark pipeline. (12 pts)

• Q2: The recommendations for the users with the following user IDs: 10, 152, 288, 603, 714, 1525, 2434, 2681. (6 pts each, 48 pts in total)

2. Your code for Q1. (10 pts)

3. Your code for Q2. (20 pts)