Assessment 3: Project
Assessment 3: Overview
Weight - 20%
Due - End of Week 6, Sunday 10.00 pm (Sydney time)
Expected time
Allow approximately 25–30 hours to complete this assessment. Please note that the estimated workload to complete the assessment may vary depending on the level of your technical background.
What you need
The required software for the modelling in part A is available on the slide displayed underneath this overview slide.
Instructions
Part A) You will complete this part of the assessment in Ed. Choose one of the two options given:
1. Option I: Foundations of Neural Networks
2. Option II: Tree and Ensemble Learning
You can use Python or R, or both, whichever is suitable. You can also use a Python notebook or R Markdown for coding.
You are free to use your own IDE and PC. If you are not using Ed to run the code, you just need to upload screenshots of the code.
Part B) Write a report to describe the steps performed to develop the model and evaluate its
performance. Provide written justifications, with clearly articulated reasons, for the steps you took to build the model.
How to submit
Part A (code and data) is submitted via the Ed learning platform. Part B (pdf report) is submitted through Turnitin on the Assessment submission page in Moodle.
In Ed, you can use model.py or model.r for the main code, which should read the data and run. Alternatively, you can upload your code notebook. If the code does not run on Ed and you are using a local machine to run it, upload screenshots of the console.
We recommend that you use scikit-learn if you are working in Python, since it runs faster on Ed.
Do not include any code in your report that will be submitted to Moodle.
Marking and feedback
A rubric is available on the Moodle assessment page as well as at the bottom of the lesson here on Ed. Feedback and results will be provided approximately 7-10 days after the deadline. Please note that, due to the large number of students, the feedback could be delayed.
FAQ
See Ed Discussions.
Option I - A: Foundations of Neural Networks
Imagine yourself as a data scientist and build a neural network model using the given data set.
Data set
Refer to the documentation for the given data set, available at the link given under Source:
"Predicting the age of abalone from physical measurements. The age of abalone is determined by cutting the shell through the cone, staining it, and counting the number of rings through a microscope -- a boring and time-consuming task. Other measurements, which are easier to obtain, are used to predict the age. Further information, such as weather patterns and location (hence food availability) may be required to solve the problem."
Source
http://archive.ics.uci.edu/ml/datasets/Abalone
Name / Data Type / Measurement Unit / Description
Sex / nominal / -- / M, F, and I (infant)
Length / continuous / mm / Longest shell measurement
Diameter / continuous / mm / perpendicular to length
Height / continuous / mm / with meat in shell
Whole weight / continuous / grams / whole abalone
Shucked weight / continuous / grams / weight of meat
Viscera weight / continuous / grams / gut weight (after bleeding)
Shell weight / continuous / grams / after being dried
Rings / integer / -- / gives the age in years
The readme file contains attribute statistics.
Instructions
Clean the above data set with data processing code and then prepare it for modelling using (dense) neural networks. Build a neural network model using either Keras or scikit-learn, in Python or R. Understand the given problem and identify the respective inputs and outputs of the proposed model.
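A minimal sketch of the preparation step in Python with scikit-learn, assuming the column order shown in the attribute list. The five rows are illustrative values in the dataset's format, not the full data, and one-hot encoding Sex plus standardising the continuous columns is one reasonable choice rather than a required method.

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Illustrative rows only: (Sex, 7 continuous measurements, Rings).
rows = [
    ("M", 0.455, 0.365, 0.095, 0.5140, 0.2245, 0.1010, 0.150, 15),
    ("M", 0.350, 0.265, 0.090, 0.2255, 0.0995, 0.0485, 0.070, 7),
    ("F", 0.530, 0.420, 0.135, 0.6770, 0.2565, 0.1415, 0.210, 9),
    ("M", 0.440, 0.365, 0.125, 0.5160, 0.2155, 0.1140, 0.155, 10),
    ("I", 0.330, 0.255, 0.080, 0.2050, 0.0895, 0.0395, 0.055, 7),
]

sex = [[r[0]] for r in rows]                          # nominal input
X_num = np.array([r[1:8] for r in rows], dtype=float)  # continuous inputs
rings = np.array([r[8] for r in rows])                # raw target

# One-hot encode Sex into three columns (F, I, M).
encoder = OneHotEncoder(categories=[["F", "I", "M"]])
sex_onehot = encoder.fit_transform(sex).toarray()

# Standardise the continuous columns to zero mean and unit variance.
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X_num)

# Final input matrix: 3 one-hot columns + 7 scaled measurements.
X = np.hstack([sex_onehot, X_scaled])
print(X.shape)
```

In the actual experiments, fitting the scaler on the training split only and then applying it to the test split avoids leaking test statistics into training.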
Converting into classes
You treat the project as a classification problem. You show results for the ring age classified into 4 major groups, i.e. 4 output neurons, using the following ring age groups:
· Class 1: 0 - 7 years
· Class 2: 8 - 10 years
· Class 3: 11 - 15 years
· Class 4: Greater than 15 years
Include class distribution as part of the data visualization in Step 1 below.
Note that the response variable is continuous. However, in this assessment, the problem is a classification problem with four classes.
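The class boundaries above can be sketched as a small Python helper. The function name `ring_to_class` and the sample ring counts are hypothetical and used only for illustration.

```python
from bisect import bisect_left
from collections import Counter

# Inclusive upper bounds of Classes 1-3; anything above 15 is Class 4.
UPPER_BOUNDS = [7, 10, 15]


def ring_to_class(rings: int) -> int:
    """Map a ring count to the class label 1-4 defined in the brief."""
    return bisect_left(UPPER_BOUNDS, rings) + 1


# Hypothetical ring counts, used only to show the class distribution count.
sample_rings = [15, 7, 9, 10, 7, 16, 11]
labels = [ring_to_class(r) for r in sample_rings]
distribution = Counter(labels)
print(sorted(distribution.items()))
```

The resulting counts per class are what the class distribution plot in Step 1 would display.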
Steps to execute the project
Consider the following steps to build and evaluate the model:
1. Analyse and visualise the given datasets by reporting the distribution of classes, distribution of features and any other visualisation you find appropriate.
2. Develop a dense neural network with one hidden layer. Vary the number of hidden neurons to be 5, 10, 15, and 20 in order to investigate the performance of the model using Stochastic Gradient Descent (SGD). Determine the optimal number of neurons in the hidden layer from the range of values considered.
3. Investigate the effect of learning rate (using SGD) for the selected dataset (using the optimal number of hidden neurons).
4. Investigate the effect of a different number of hidden layers: now modify the model by adding another hidden layer. Use the optimal number of hidden neurons from Step 2 for both layers and the optimal learning rate from Step 3. Investigate the effect of this change in the number of hidden layers (using SGD).
5. Investigate the effect of Adam and SGD on training and test performance.
6. Take the final optimal model among all the above cases and show the confusion matrix and ROC/AUC curve for different classes of the multi-class problem.
Evaluate the optimal model using the classification accuracy score on test data.
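For Step 6, one possible way to obtain a confusion matrix and per-class ROC/AUC values in scikit-learn is a one-vs-rest treatment of the four classes. The sketch uses synthetic four-class data from `make_classification` as a stand-in for the prepared Abalone features, and the model settings are placeholders rather than recommended values.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import confusion_matrix, roc_auc_score, roc_curve
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier
from sklearn.preprocessing import label_binarize

# Synthetic stand-in data with 4 classes (assumption, not the Abalone data).
X, y = make_classification(n_samples=600, n_features=8, n_informative=5,
                           n_classes=4, n_clusters_per_class=1,
                           random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.4, random_state=0)

# Placeholder "final" model; hyper-parameters here are illustrative only.
model = MLPClassifier(hidden_layer_sizes=(10,), solver="adam",
                      max_iter=300, random_state=0)
model.fit(X_train, y_train)

# Rows are true classes, columns are predicted classes.
cm = confusion_matrix(y_test, model.predict(X_test))
print(cm)

# One-vs-rest ROC: binarise the labels and score each class separately.
proba = model.predict_proba(X_test)
y_bin = label_binarize(y_test, classes=model.classes_)
auc_per_class = {}
for k, label in enumerate(model.classes_):
    fpr, tpr, _ = roc_curve(y_bin[:, k], proba[:, k])  # points to plot
    auc_per_class[int(label)] = roc_auc_score(y_bin[:, k], proba[:, k])
print(auc_per_class)
```

Plotting each `fpr`/`tpr` pair gives one ROC curve per class.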
Note that Steps 2 to 5 require 10 experimental runs (with different initial weights) for each case, for which you report the mean and 95% confidence interval of accuracy. You need to select the appropriate metrics for classification and report performance on the train and test datasets. Use a 60/40 percent train/test split for the given data set (the data split remains fixed across experiments). Note that there is no need to have a validation set.
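The experimental protocol can be sketched as follows, assuming scikit-learn's `MLPClassifier` and a t-based confidence interval (the brief does not prescribe how the interval is computed). Synthetic data stands in for the prepared Abalone features, and the helper names `mean_ci95` and `run_experiments` are hypothetical.

```python
import warnings
from statistics import mean, stdev

from sklearn.datasets import make_classification
from sklearn.exceptions import ConvergenceWarning
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier

warnings.filterwarnings("ignore", category=ConvergenceWarning)

# Two-sided 95% Student-t critical values, indexed by degrees of freedom.
T_CRIT_95 = {1: 12.706, 2: 4.303, 3: 3.182, 4: 2.776, 5: 2.571,
             6: 2.447, 7: 2.365, 8: 2.306, 9: 2.262}


def mean_ci95(scores):
    """Return (mean, half-width of the 95% CI) for 2 to 10 scores."""
    n = len(scores)
    half_width = T_CRIT_95[n - 1] * stdev(scores) / n ** 0.5
    return mean(scores), half_width


def run_experiments(X_train, X_test, y_train, y_test, hidden_sizes, n_runs):
    """Train one model per (hidden size, seed); only the seed changes."""
    results = {}
    for h in hidden_sizes:
        train_acc, test_acc = [], []
        for seed in range(n_runs):  # seed controls the initial weights
            model = MLPClassifier(hidden_layer_sizes=(h,), solver="sgd",
                                  learning_rate_init=0.01, max_iter=200,
                                  random_state=seed)
            model.fit(X_train, y_train)
            train_acc.append(float(model.score(X_train, y_train)))
            test_acc.append(float(model.score(X_test, y_test)))
        results[h] = {"train": mean_ci95(train_acc),
                      "test": mean_ci95(test_acc)}
    return results


# Synthetic stand-in data; the 60/40 split is made once and then reused.
X, y = make_classification(n_samples=400, n_features=8, n_informative=5,
                           n_classes=4, n_clusters_per_class=1,
                           random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.4, random_state=0)

results = run_experiments(X_train, X_test, y_train, y_test,
                          hidden_sizes=(5, 10, 15, 20), n_runs=10)
for h, r in results.items():
    print(h, "test acc: %.3f +/- %.3f" % r["test"])
```

The same loop structure carries over to Steps 3 to 5 by varying `learning_rate_init`, `hidden_layer_sizes`, or `solver` instead of the hidden layer size.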
Additional tasks (not a requirement and no extra marks will be given)
· You can also feature additional visualisation such as error plots on the train and test split for optimal model over time and any other visualisations for the training/test performance.
· Using Adam/SGD, compare L2 regularisation (weight decay) to dropout for selected hyper-parameters. Then compare with Adam with no regularisation.
· Hybrid Dropout and Weight Decay: Using Adam, compare L2 regularisation (weight decay) with dropouts. Show results for 3 different combinations (can be more) of hyper-parameters (dropout rate with weight decay hyper-parameter (λ)).
Installation: You should install the required libraries and run the experiments on your personal computer, then upload the results/code to Ed later. Note that the code will not be evaluated. Marks will be given only for your report. You can also submit a readme.txt with your submission that gives an overview of your files/code. We need your code for a plagiarism check in case we are suspicious about your report.
How to submit
Click on the submit button to submit your code.
In Ed, you can use model.py or model.r for the main code, which should read the data and run. Alternatively, you can upload your code notebook. If the code does not run on Ed and you are using a local machine to run it, upload screenshots of the console.
We recommend that you use scikit-learn if you are working in Python, since it runs faster on Ed.
DISCUSSION
What is a good model? You need to decide that with trial runs, i.e., how many iterations are needed to get good performance on the train and test datasets. You can make convergence plots and then decide when it is best to stop training.
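One way to collect the numbers for such a convergence plot, assuming scikit-learn, is to train one epoch at a time with `MLPClassifier.partial_fit` and record accuracy after each epoch. The data is synthetic and the epoch count and learning rate are placeholders.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier

# Synthetic stand-in data (assumption, not the Abalone data).
X, y = make_classification(n_samples=400, n_features=8, n_informative=5,
                           n_classes=4, n_clusters_per_class=1,
                           random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.4, random_state=0)

model = MLPClassifier(hidden_layer_sizes=(10,), solver="sgd",
                      learning_rate_init=0.01, random_state=0)
classes = np.unique(y_train)

# Each partial_fit call performs a single pass over the training data.
history = []
for epoch in range(1, 51):
    model.partial_fit(X_train, y_train, classes=classes)
    history.append((epoch,
                    float(model.score(X_train, y_train)),
                    float(model.score(X_test, y_test))))

for epoch, train_acc, test_acc in history[::10]:
    print(epoch, round(train_acc, 3), round(test_acc, 3))
```

Plotting the second and third columns of `history` against the epoch number shows where the curves flatten, which can inform the choice of training length.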
Option I - B: Report Task
Write a report to describe the steps performed in Part A to develop the model and evaluate its performance.
Provide brief justifications, with clearly articulated reasons, for the steps you took to build the model you submitted. Please note that you are free to use your own writing style, and you should provide references as needed. The following suggestions/guidelines are not mandatory and are provided mainly for informational purposes.
Suggestions/Guidelines for Presentation style/format
· Use the IEEE Conference paper template in LaTeX or Word:
https://www.ieee.org/conferences/publishing/templates.html
https://www.overleaf.com/latex/templates/ieee-conference-template-example/nsncsyjfmpxy
· Your report should have the following sections: Problem definition (abstract and introduction), methodology, results, and conclusion. To get more information on these sections, click here. You are encouraged to cite at least 10 references in your technical report. Note that the introduction highlights the literature, the aim and goals, and the general problem you are trying to solve. You are free to use different section titles or a different report style, although this style of reporting is encouraged.
· Quantitative information should be clearly described and appropriately communicated (e.g. using figures and tables that are appropriately labelled).
· There is no strict word limit.
· Your report should be written using correct spelling, grammar, and punctuation.
· Follow IEEE referencing style.
· You need to submit code that runs in Ed. If your report takes into account N=10 experiments, for example, the code you submit should have N=1 in the for loop that repeats the experiments. You should use functions/methods. You can upload the result files used to generate the plots that will be part of your report, and keep the plotting code in a separate file if you wish.
· You can also submit a readme.txt with your submission that gives an overview of your files/code.
Writing tips:
https://users.ics.aalto.fi/ntatti/howtowrite2016/tutorial.pdf
https://www.sciencedirect.com/science/article/pii/S1878764915001606
How to submit
This assessment is submitted through Turnitin on the Assessment submission page in Moodle.
Do not include any code in your report that will be submitted to Moodle.
Option II - A: Tree and Ensemble Learning
This option will feature components from decision trees, random forests, and ensemble learning.
Use the Abalone dataset given in Part A: Option I. Now you need to apply CART for the same problem and report the classification performance on the train and test set using the same train/test split. In addition to task 1 of Option I-A on analysis and visualization of the given datasets, execute the following tasks.
1. Report the Tree Visualisation (show your tree and also translate a few selected nodes and leaves into IF-THEN rules).
2. Investigate improving performance further by either pre-pruning or post-pruning the tree: https://scikit-learn.org/stable/auto_examples/tree/plot_cost_complexity_pruning.html
3. Apply Bagging of Trees via Random Forests and show performance (e.g., accuracy score) as the number of trees in the ensemble increases. Carry out 10 experiments (a minimum of 2 experiments is fine, with different random states in the train/test split) in Tasks 2 and 3 and show performance accuracy with mean and confidence interval. Note that Task 2 may have the same results for every experimental run.
4. Optional: Compare results with Adam and SGD (Neural Networks) and discuss them.
Note that performance refers to accuracy which could be either classification accuracy or F1 score.
Abalone dataset from UCI ML repository: https://archive.ics.uci.edu/ml/datasets/Abalone. In this case, provide visualisation and analysis of your data first, as required in Part A: Option I.
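A minimal sketch of Tasks 1 to 3 with scikit-learn, using synthetic four-class data as a stand-in for the prepared Abalone features. The depth limit, the subsampling of pruning strengths, and the tree counts are illustrative assumptions, not prescribed settings.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier, export_text

# Synthetic stand-in data (assumption, not the Abalone data).
X, y = make_classification(n_samples=600, n_features=8, n_informative=5,
                           n_classes=4, n_clusters_per_class=1,
                           random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.4, random_state=0)

# Task 1: a shallow CART tree (max_depth is a form of pre-pruning) whose
# text dump can be read off as IF-THEN rules.
tree = DecisionTreeClassifier(max_depth=3, random_state=0)
tree.fit(X_train, y_train)
feature_names = [f"x{i}" for i in range(X.shape[1])]
rules = export_text(tree, feature_names=feature_names)
print(rules)

# Task 2: post-pruning via cost-complexity pruning. Each alpha on the
# pruning path yields a progressively smaller tree.
path = DecisionTreeClassifier(random_state=0).cost_complexity_pruning_path(
    X_train, y_train)
pruned_scores = []
for alpha in path.ccp_alphas[::5]:  # subsample alphas to keep this quick
    alpha = max(float(alpha), 0.0)  # guard against tiny negative values
    pruned = DecisionTreeClassifier(random_state=0, ccp_alpha=alpha)
    pruned.fit(X_train, y_train)
    pruned_scores.append((alpha,
                          pruned.score(X_train, y_train),
                          pruned.score(X_test, y_test)))

# Task 3: random forest accuracy as the number of trees grows.
forest_scores = []
for n_trees in (10, 50, 100, 200):
    forest = RandomForestClassifier(n_estimators=n_trees, random_state=0)
    forest.fit(X_train, y_train)
    forest_scores.append((n_trees, forest.score(X_test, y_test)))
print(forest_scores)
```

Each `|---` branch in the printed rules corresponds to one IF condition, and each leaf line gives the THEN class.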
Click on the submit button to submit your code.
In Ed, you can use model.py or model.r for the main code, which should read the data and run. Alternatively, you can upload your code notebook. If the code does not run on Ed and you are using a local machine to run it, upload screenshots of the console.
We recommend that you use scikit-learn if you are working in Python, since it runs faster on Ed.
Option II - B: Report Task
Write a report to describe the steps performed in Part A to develop the model and evaluate its performance.
Provide brief justifications, with clearly articulated reasons, for the steps you took to build the model you submitted. Please note that you are free to use your own writing style, and you should provide references as needed. The following suggestions/guidelines are not mandatory and are provided mainly for informational purposes.
Suggestions/Guidelines for Presentation style/format
· Use the IEEE Conference paper template in LaTeX or Word:
https://www.ieee.org/conferences/publishing/templates.html
https://www.overleaf.com/latex/templates/ieee-conference-template-example/nsncsyjfmpxy
· Your report should have the following sections: Problem definition (abstract and introduction), methodology, results, and conclusion. To get more information on these sections, click here. You are encouraged to cite at least 10 references in your technical report. Note that the introduction highlights the literature, the aim and goals, and the general problem you are trying to solve. You are free to use different section titles or a different report style, although this style of reporting is encouraged.
· Quantitative information should be clearly described and appropriately communicated (e.g. using figures and tables that are appropriately labelled).
· There is no strict word limit.
· Your report should be written using correct spelling, grammar, and punctuation.
· Follow IEEE referencing style.
· You need to submit code that runs in Ed. If your report takes into account N=10 experiments, for example, the code you submit should have N=1 in the for loop that repeats the experiments. You should use functions/methods. You can upload the result files used to generate the plots that will be part of your report, and keep the plotting code in a separate file if you wish.
· You can also submit a readme.txt with your submission that gives an overview of your files/code.
Writing tips:
https://users.ics.aalto.fi/ntatti/howtowrite2016/tutorial.pdf
https://www.sciencedirect.com/science/article/pii/S1878764915001606
How to submit
This assessment is submitted through Turnitin on the Assessment submission page in Moodle.
Do not include any code in your report that will be submitted to Moodle.
Evaluation - Rubrics
A. Overall presentation
· The report has an excellent presentation. The introduction clearly defines the aim and goals of the report with a clear review of the literature. Results and discussion section has been presented very well. (25%)
· The report has a good presentation. The introduction clearly defines the aim and goals of the report. Results and discussion section has been presented well but some issues are present. (20%)
· The report has some presentation issues. The introduction does not clearly define the aim and goals of the report. Results and discussion section has not been presented very well. (15%)
· The report has a poor presentation. The introduction has missing aim and goals. Results and discussion section is questionable or not complete. (10%)
· No submission/results not correct (0%)
B. Depth of discussion and presentation of results
· In-depth discussion & elaboration in all sections of the report. (25 %)
· In-depth discussion & elaboration in most sections of the report. (20%)
· The writer has omitted pertinent content or content runs on excessively. (15%)
· Cursory discussion in all the sections of the report or brief discussion in only a few sections. (10%)
· No submission/results. (0%)
C. Cohesiveness
· Ties together information from all sources. Report flows from one issue to the next clearly. Author's writing demonstrates an understanding of the relationship among material obtained from all sources. (25%)
· For the most part, ties together information from all sources. Report flows with only some disjointedness. Author's writing demonstrates an understanding of the relationship among material obtained from all sources. (20%)
· Sometimes ties together information from all sources. Report does not flow - disjointedness is apparent. Author's writing does not demonstrate an understanding of the relationship among material obtained from all sources. (15%)
· Does not tie together information. Report does not flow and appears to be created from disparate issues. Headings are necessary to link concepts. Writing does not demonstrate understanding of any relationships. (10%)
· Not coherent/no submission (0%)
D. Sources and citations
· Relevant sources cited and provides a proper overview of sources in the text. (25%)
· Relevant sources cited and provides a proper overview of sources in discussion but some minor issues present. (20%)
· Sources are credible, however, mistakes in citations. (15%)
· Does not cite and discuss the source properly. (10%)
· No citations/submission (0%)
Note that plagiarism will automatically imply 0 marks.
Adapted from: https://www.cornellcollege.edu/library/faculty/focusing-on-assignments/tools-for-assessment/research-paper-rubric.shtml