COMP6311-Advanced Data Analytics
Assignment 1 (Due date: 23:59, 9 October 2023)
Introduction
Suppose there is a real estate company based in the United States, specializing in the sale of
apartments in various locations. In the real estate market, apartment prices are influenced by a
variety of factors, including location, size, noise level, air quality, etc. To help real estate
investors make informed decisions, the company regularly releases information on apartment sales
in different areas. By providing comprehensive sales data, the company empowers investors to
design accurate and effective apartment price prediction systems. By analyzing sales data, investors can
identify latent patterns and develop predictive models that are useful for making data-driven
decisions on apartment transactions.
Datasets
The datasets are described as follows:
1. Train_Data.csv contains 4000 samples of basic real estate information, and the target variable is
the Total Cost:
● Property size – number of rooms in the house.
● Community safety score – the higher the safer.
● Residence space – square feet area of the living rooms.
● Building space – square feet area of the whole building.
● Noise level – the lower the value, the greater the noise.
● Waterfront – whether the house has a waterfront or not.
● View – Number of viewings before the house is sold.
● Air quality index – the higher the value, the better the air quality.
● Aboveground area – square feet area of the house above ground.
● Basement area – square feet area of the basement in the house.
● Construction year – the year in which the house was built.
● Decoration year – the year in which the house was decorated.
● District – the address of the house.
● City – the city in which the house is located.
● Zip code – the zip code of the house.
● Region – the region of the house.
● Exchange rate – when the house is sold, the exchange rate between the US dollar and the
Hong Kong dollar.
● Unit price of residence space – the unit price of residence space (US dollar).
● Unit price of building space – the unit price of building space (US dollar).
● Total cost – the total price of residence and building space (Hong Kong dollar).
2. Test_Data.csv contains 400 samples of basic real estate information; the total cost is unknown.
Task
Task 1: Total Cost Comparison
Please compare the average total cost (in Hong Kong dollars) between the “economical houses”
and all the houses in each city. Here, total cost = (unit price of residence space * residence space
+ unit price of building space * building space) * exchange rate, and an “economical house”
should satisfy the following two requirements:
(i) its construction year is after 1995 (not including 1995), and
(ii) its residence ratio = residence space / (residence space + building space) is greater than 24%
(not including 24%).
You are required to use MapReduce to conduct the calculation with 5 mappers and 2 reducers.
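To make the Task 1 definitions concrete, the following is a minimal pure-Python sketch of the calculation, not a reference solution. It simulates the map, shuffle, and reduce phases in a single process rather than on a real cluster. The dictionary keys (`city`, `residence_space`, and so on) and all function names are hypothetical; the actual column headers in Train_Data.csv may differ.

```python
import zlib
from collections import defaultdict

NUM_MAPPERS = 5
NUM_REDUCERS = 2

def total_cost_hkd(row):
    """Total cost in HKD, following the formula given in the task."""
    usd = (row["unit_price_residence"] * row["residence_space"]
           + row["unit_price_building"] * row["building_space"])
    return usd * row["exchange_rate"]

def is_economical(row):
    """Built strictly after 1995 and residence ratio strictly above 24%."""
    ratio = row["residence_space"] / (row["residence_space"] + row["building_space"])
    return row["construction_year"] > 1995 and ratio > 0.24

def mapper(rows):
    """Emit (city, (econ_sum, econ_count, all_sum, all_count)) for each row."""
    for row in rows:
        cost = total_cost_hkd(row)
        econ = is_economical(row)
        yield row["city"], (cost if econ else 0.0, 1 if econ else 0, cost, 1)

def partition(city):
    """Stable hash so that every record of a city reaches the same reducer."""
    return zlib.crc32(city.encode("utf-8")) % NUM_REDUCERS

def reducer(pairs):
    """Sum the partial tuples per city, then turn sums into averages."""
    sums = defaultdict(lambda: [0.0, 0, 0.0, 0])
    for city, value in pairs:
        for i, v in enumerate(value):
            sums[city][i] += v
    # A city with no economical houses gets None for that average.
    return {city: (es / ec if ec else None, s / c)
            for city, (es, ec, s, c) in sums.items()}

def run_mapreduce(rows):
    """Split a list of rows across the mappers, shuffle by city, then reduce."""
    chunks = [rows[i::NUM_MAPPERS] for i in range(NUM_MAPPERS)]
    buckets = [[] for _ in range(NUM_REDUCERS)]
    for chunk in chunks:                      # map phase
        for city, value in mapper(chunk):
            buckets[partition(city)].append((city, value))  # shuffle phase
    result = {}
    for bucket in buckets:                    # reduce phase
        result.update(reducer(bucket))
    return result
```

Emitting sums and counts instead of per-row averages keeps the reduce step associative, so partial results from different mappers can be combined in any order.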
MapReduce example:
https://colab.research.google.com/drive/1cqgjCH9ZCXedswxmND5u3Ma3HIC68gkY?usp=sharing
This is an example of implementing MapReduce in Google Colab. You can freely access Colab
resources by logging in to your Google account.
Task 2: Total Cost Classification
Suppose you are a real estate investor who does not know the unit price of the house (including
both residence space and building space). You need to remove columns Unit price of residence
space and Unit price of building space from Train_Data.csv, and design a machine learning/deep
learning model that predicts the total cost of each house. Then, you need to evaluate the model
performance by using Test_Data.csv.
You are only required to predict the price range of the total cost for each sample in Test_Data.csv.
The label is organized in four classes including:
● 1: the total cost is less than 300000 HKD (i.e., 0 <= total cost < 300000).
● 2: the total cost is greater than or equal to 300000 HKD and less than 500000 HKD
(i.e., 300000 <= total cost < 500000).
● 3: the total cost is greater than or equal to 500000 HKD and less than 700000 HKD
(i.e., 500000 <= total cost < 700000).
● 4: the total cost is greater than or equal to 700000 HKD (i.e., 700000 <= total
cost).
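The four price ranges can be turned into labels with a short helper, for example when converting the training targets into classes. This is only a sketch; the function name `cost_to_label` is hypothetical, and it assumes the total cost is a non-negative amount in HKD.

```python
from bisect import bisect_right

# Lower bounds of classes 2, 3 and 4, in HKD (taken from the class definitions).
THRESHOLDS = [300000, 500000, 700000]

def cost_to_label(total_cost_hkd):
    """Map a total cost in HKD to its class label 1..4.

    bisect_right counts how many thresholds are <= the cost, so a cost
    exactly on a boundary falls into the higher class, matching the
    half-open intervals in the task statement.
    """
    return bisect_right(THRESHOLDS, total_cost_hkd) + 1
```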
Submission Format
1. For task 1, first, save your MapReduce results in a single file named “mapreduce_result.csv”,
which contains columns of: (1) city, (2) average total cost of economical houses, and (3)
average total cost of all houses. Then, package your mapreduce_result.csv and MapReduce
source code as a zip file. Please rename it as student_ID_task1.zip.
2. For task 2, use your designed model to predict the total cost, and fill your results in
Test_Data.csv. Then, package your Test_Data.csv and the source code of model training as a
zip file. Please rename it as student_ID_task2.zip.
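As an illustration of the Task 1 result file, the sketch below writes per-city averages to a CSV with the three required columns. It assumes the averages are already available as a dict mapping each city to a pair (economical average, overall average). The header strings and the function name `write_result` are hypothetical, since the assignment names the columns but does not fix their exact spelling.

```python
import csv

def write_result(result, path="mapreduce_result.csv"):
    """Write {city: (econ_avg, all_avg)} to a three-column CSV file.

    A city without any economical house (econ_avg is None) gets an
    empty cell; this is an assumption, not a stated requirement.
    """
    with open(path, "w", newline="", encoding="utf-8") as f:
        writer = csv.writer(f)
        writer.writerow(["city", "avg_total_cost_economical", "avg_total_cost_all"])
        for city in sorted(result):
            econ_avg, all_avg = result[city]
            writer.writerow([city, "" if econ_avg is None else econ_avg, all_avg])
```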
Grading Criteria
● The program needs to be clearly annotated and a detailed Readme file should be provided.
● Task 1: we will check the results and the logic with which you implement the MapReduce functions.
● Task 2: we will compare your predicted results in Test_Data.csv with the ground-truth values,
and the performance evaluation is based on Top-1 accuracy.
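For a single predicted class per sample, Top-1 accuracy is the fraction of samples whose predicted label equals the ground-truth label. A minimal sketch for checking a model on a held-out split is shown below; the function name `top1_accuracy` is hypothetical, and this is not the graders' actual evaluation script.

```python
def top1_accuracy(predicted, truth):
    """Fraction of positions where the predicted label equals the true label."""
    if len(predicted) != len(truth):
        raise ValueError("predicted and truth must have the same length")
    if not truth:
        raise ValueError("cannot compute accuracy on an empty set")
    correct = sum(p == t for p, t in zip(predicted, truth))
    return correct / len(truth)
```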