
2024-07-08 Ghostwriting listing: CSC8630/CSC8635 – Machine Learning (ghostwritten Matlab programs for international students)

CSC8630/CSC8635 - Machine Learning

Resit specification: Machine Learning project

Submission will be via Canvas

The learning objectives of this assignment are:

1.   To learn about the design of machine learning analysis pipelines

2.   To understand how to select appropriate methods given the dataset type

3.   To learn how to conduct machine learning experimentation in a rigorous and effective manner

4.   To critically evaluate the performance of the designed machine learning pipelines

5.   To learn and practice the skills of reporting machine learning experiments

For this coursework you will be given a choice of four datasets, each of a different nature:

1.   A tabular dataset (defined as a classification problem)

2.   An image dataset

3.   A text dataset

4.   A time series dataset

Your job is easy to state: you should pick one of these four options and design a range of machine learning pipelines appropriate to the nature of the selected dataset. Overall, we expect you to perform a thorough investigation involving (whenever relevant) all parts of a machine learning pipeline (exploration, preprocessing, model training, model interpretation and evaluation), evaluating a range of options for every part of the pipeline, with proper hyperparameter tuning.

You will have to write a short report that presents the experiments you did, their justification, and a detailed description of the performance of your designed pipelines using the most appropriate presentation tools (e.g., tables of results, plots). We expect you to present your work at a level of detail that would enable a fellow student to reproduce your steps.

1) Description and requirements for the tabular dataset

The dataset, called FARS, is a collection of statistics on US road traffic accidents. The class label describes the severity of the accident. It has 20 features and over 100K examples. The dataset is available on Canvas as a CSV file, in which the last column contains the class labels: https://ncl.instructure.com/courses/53509/files/7652449/download?download_frd=1

Experiments on the tabular dataset will be relatively fast compared to the other three options. To compensate, we expect you to evaluate a very broad range of options for the design of your machine learning pipelines, including (but not limited to) data normalisation, feature/instance selection, class imbalance correction, several (appropriate) machine learning models, hyperparameter tuning and cross-validation evaluation.
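As an illustration of how several of these steps can be combined, here is a minimal sketch using scikit-learn. It is not a prescribed solution: the synthetic data from make_classification stands in for the FARS CSV, and the choice of model, the grid values and the macro-F1 scoring are all assumptions made for the example.

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, StratifiedKFold
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the real data: 20 features, imbalanced classes.
X, y = make_classification(
    n_samples=600, n_features=20, n_informative=8,
    n_classes=3, weights=[0.7, 0.2, 0.1], random_state=0,
)

# Normalisation -> feature selection -> classifier. Because the steps live in
# one Pipeline, they are refitted inside every cross-validation fold, so no
# statistics leak from the validation fold into training.
pipeline = Pipeline([
    ("scale", StandardScaler()),
    ("select", SelectKBest(score_func=f_classif)),
    ("model", LogisticRegression(max_iter=1000)),
])

# Small illustrative grid. class_weight="balanced" is one simple way to
# address class imbalance; resampling is another option.
param_grid = {
    "select__k": [5, 10, 20],
    "model__C": [0.1, 1.0, 10.0],
    "model__class_weight": [None, "balanced"],
}

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
search = GridSearchCV(pipeline, param_grid, scoring="f1_macro", cv=cv)
search.fit(X, y)

best_params = search.best_params_
best_score = search.best_score_
print(best_params, round(best_score, 3))
```

The same structure extends to other models by swapping the "model" step and its grid entries, which keeps the comparison between pipelines on an equal footing.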

2) Description and requirements for the image dataset

The CIFARTile dataset is an extension of the CIFAR10 dataset. Each image consists of four CIFAR10 images tiled in a grid. The task is to predict the label, which is the number of unique CIFAR10 image classes within the tiled image minus one. For example, in Figure 1 below there are two images of birds, one of a frog and one of an automobile. That gives three unique classes, and hence the label is 2. More details on the dataset can be found at https://github.com/RobGeada/cvpr-nas-datasets. However, please download your data from:

Train

http://homepages.cs.ncl.ac.uk/stephen.mcgough/data/CIFARTile/train_x.npy

http://homepages.cs.ncl.ac.uk/stephen.mcgough/data/CIFARTile/train_y.npy

Validate

http://homepages.cs.ncl.ac.uk/stephen.mcgough/data/CIFARTile/valid_x.npy

http://homepages.cs.ncl.ac.uk/stephen.mcgough/data/CIFARTile/valid_y.npy

Test

http://homepages.cs.ncl.ac.uk/stephen.mcgough/data/CIFARTile/test_x.npy

http://homepages.cs.ncl.ac.uk/stephen.mcgough/data/CIFARTile/test_y.npy

Figure 1: Example image from the CIFARTile dataset, class label is 2
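The labelling rule described above can be sketched as follows. This only illustrates how the labels are defined (they range from 0 to 3); the function name cifartile_label is hypothetical, and the integer class ids follow the usual CIFAR10 ordering purely for the example.

```python
import numpy as np

def cifartile_label(tile_classes):
    """Return the CIFARTile label for the four CIFAR10 class ids in one tile."""
    return len(np.unique(tile_classes)) - 1

# Figure 1 example: two birds, one frog, one automobile.
# Ids use the usual CIFAR10 ordering (automobile=1, bird=2, frog=6),
# chosen here only for illustration.
example = [2, 2, 6, 1]
print(cifartile_label(example))  # 3 unique classes -> label 2
```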

Some hints:

−   There is a notebook (http://homepages.cs.ncl.ac.uk/stephen.mcgough/data/viewCIFARTile.ipynb) which shows you how to load and view the data.

−   To speed up your work, here are some hints:

o    Make sure you set the Runtime type to either GPU or TPU.

o    Copy the data to your Google drive so you don’t have to keep uploading it.

o    As the dataset is large you might want to do some of your initial testing on a subset of the data.
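One possible way to take a subset for initial testing is to sample a fixed fraction of each class, so that the subset keeps the class proportions of the full data. The sketch below assumes this stratified approach; the helper name stratified_subset is hypothetical and the labels are a synthetic stand-in for train_y.npy.

```python
import numpy as np

def stratified_subset(y, fraction, seed=0):
    """Return sorted indices of a random subset that keeps class proportions."""
    rng = np.random.default_rng(seed)
    y = np.asarray(y)
    chosen = []
    for label in np.unique(y):
        idx = np.flatnonzero(y == label)
        # Keep at least one example of every class.
        n_keep = max(1, int(round(fraction * len(idx))))
        chosen.append(rng.choice(idx, size=n_keep, replace=False))
    return np.sort(np.concatenate(chosen))

# Stand-in labels; with the real data these would come from train_y.npy.
y_demo = np.repeat([0, 1, 2, 3], [400, 300, 200, 100])
subset_idx = stratified_subset(y_demo, fraction=0.1)
print(len(subset_idx), np.bincount(y_demo[subset_idx]))
```

The returned indices can be used to slice both the image array and the label array, so the two stay aligned.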

You might consider cutting the image into four and running each part through a CIFAR10 classifier. This is not allowed and will score zero for Methods.

3) Description and requirements for the text dataset

Dataset: sentiment analysis dataset. It includes a training set (https://ncl.instructure.com/files/7666186/download?download_frd=1), a development set (https://ncl.instructure.com/files/7666193/download?download_frd=1), and a test set (https://ncl.instructure.com/files/7666197/download?download_frd=1). Each sample in the dataset represents a tweet. Each tweet has a sentiment label (Positive, Negative, Neutral).

Task description: Predict the sentiment of the test set by applying a combination of different approaches, including pre-processing techniques, shallow and deep classifiers, ensemble approaches and, where applicable, machine learning approaches beyond supervised learning and data augmentation. Try your best to improve the prediction results.

Main evaluation metric: F1 measure.
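For reference, the F1 measure for one class is the harmonic mean of precision and recall, which equals 2*TP / (2*TP + FP + FN). The specification does not say how the per-class scores should be combined across the three sentiment labels; the sketch below assumes macro-averaging (an unweighted mean), and the function names and example predictions are made up for illustration.

```python
def f1_per_class(y_true, y_pred, labels):
    """Return {label: F1}, where F1 = 2*TP / (2*TP + FP + FN)."""
    scores = {}
    for label in labels:
        pairs = list(zip(y_true, y_pred))
        tp = sum(t == label and p == label for t, p in pairs)
        fp = sum(t != label and p == label for t, p in pairs)
        fn = sum(t == label and p != label for t, p in pairs)
        denom = 2 * tp + fp + fn
        # A class that never appears in truth or predictions scores 0 here.
        scores[label] = 2 * tp / denom if denom else 0.0
    return scores

def macro_f1(y_true, y_pred, labels):
    """Unweighted mean of the per-class F1 scores (an assumed averaging)."""
    scores = f1_per_class(y_true, y_pred, labels)
    return sum(scores.values()) / len(labels)

labels = ["Positive", "Negative", "Neutral"]
y_true = ["Positive", "Negative", "Neutral", "Positive", "Neutral", "Negative"]
y_pred = ["Positive", "Neutral", "Neutral", "Positive", "Negative", "Negative"]
print(f1_per_class(y_true, y_pred, labels))
print(macro_f1(y_true, y_pred, labels))
```

Whichever averaging is used, it is worth stating it explicitly in the report, since macro, micro and weighted F1 can differ noticeably on imbalanced data.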

4) Description and requirements for the time series dataset

The Weather dataset is a time-series dataset collected by a Raspberry Pi computer at a home in Newcastle. It contains a number of weather-related features collected over a period of approximately 12 months. The features are as follows:

Column no   Feature
1           Date and time in standard Linux format
2           Temperature from the first internal sensor (Celsius)
3           Outside temperature (Celsius)
4           CPU Temperature (Celsius)
5           Count (always 1)
6           Temperature from the second internal sensor (Celsius)
7           Air Pressure (mmHg)
8           Humidity

Readings are taken at one-minute intervals between November 2021 and November 2022. Your task is to predict values 5, 10, 15 and 30 minutes ahead, as well as 1, 2, 6 and 12 hours ahead. You can do this for each of the 6 weather features (not date or count). You should hold out the last 2 months of data as a test set (the test set must be continuous and separate from the training data to prevent leakage between training and testing).
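The chronological split and the forecast horizons can be sketched as follows, assuming NumPy. The series here is synthetic, the test size of 1000 rows is illustrative only (two months of one-minute readings would be far larger), and the helper names are hypothetical. The persistence baseline, which predicts that the future value equals the current one, is included as one simple reference point rather than as a required method.

```python
import numpy as np

# Forecast horizons from the task, in minutes (one reading per minute).
HORIZONS = [5, 10, 15, 30, 60, 120, 360, 720]

def chronological_split(n_rows, test_rows):
    """Return (train_idx, test_idx) with the last `test_rows` rows held out."""
    split = n_rows - test_rows
    return np.arange(split), np.arange(split, n_rows)

def make_targets(series, horizon):
    """Pair each value x[t] with its future value x[t + horizon]."""
    series = np.asarray(series, dtype=float)
    return series[:-horizon], series[horizon:]

# Synthetic stand-in for one weather feature: a daily cycle plus noise.
rng = np.random.default_rng(0)
t = np.arange(5000)
series = 10 + 5 * np.sin(2 * np.pi * t / 1440) + rng.normal(0, 0.1, t.size)

# No shuffling: the test block is the final, contiguous part of the series.
train_idx, test_idx = chronological_split(len(series), test_rows=1000)
test_series = series[test_idx]

# Persistence baseline: predict that the value `horizon` minutes ahead
# equals the current value, and report its mean absolute error.
baseline_mae = {}
for h in HORIZONS:
    current, future = make_targets(test_series, h)
    baseline_mae[h] = float(np.mean(np.abs(future - current)))
print(baseline_mae)
```

Any scaling or other preprocessing statistics should be computed on the training indices only and then applied to the test block, for the same leakage reason.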

The dataset can be downloaded from:

http://homepages.cs.ncl.ac.uk/stephen.mcgough/data/weather.csv

Some hints:

-     There is a notebook (http://homepages.cs.ncl.ac.uk/stephen.mcgough/data/weather.ipynb) which shows you how to load and view the data to get you started.

-     To score top marks for this dataset you should demonstrate multiple models, at least one of which should not use deep learning.

-     To speed up your work, here are some hints:

o Make sure you set the Runtime type to either GPU or TPU.

o Copy the data to your Google drive so you don’t have to keep uploading it.

o As the dataset is large you might want to do some of your initial testing on a subset of the data.

Marking criteria

•   Writing Style, references, figures, etc. 10%

•   Dataset exploration 10%

•   Methods 30%

•   Results of analysis 30%

•   Discussion 20%

Deliverables

A finished report addressing the marking scheme above, together with the source code of your best pipelines for the selected dataset. The report should be 1000 to 2000 words long. The word count excludes references, tables, figures and section headers, and has a 10% leeway.