COMP4650/6490 Document Analysis – Semester 2 / 2022
Assignment 2
Due 17:00 on Wednesday 28 September 2022 AEST (UTC +10)
Last updated August 22, 2022

Overview

In this assignment you will:
1. Develop a better understanding of how machine learning models are trained in practice, including partitioning of datasets and evaluation.
2. Become familiar with the scikit-learn (https://scikit-learn.org/) package for machine learning with text.
3. Become familiar with the PyTorch (https://pytorch.org/) framework for implementing neural network-based machine learning models.

Throughout this assignment you will make changes to the provided code to improve or complete existing models. In some cases, you will write your own code from scratch after reviewing an example.

Submission

The answers to this assignment (including your code files) have to be submitted online in Wattle. You will produce an answers file with your responses to each question. Your answers file must be a PDF file named u1234567.pdf, where u1234567 should be replaced with your Uni ID. You should submit a ZIP file containing all of the code files and your answers PDF file, BUT NO DATA.

Marking

This assignment will be marked out of 15, and it will contribute 15% of your final course mark. Your answers to coding questions (or coding parts of each question) will be marked based on the quality of your code (is it efficient, is it readable, is it extendable, is it correct). Your answers to discussion questions (or discussion parts of each question) will be marked based on how convincing your explanations are (are they sufficiently detailed, are they well-reasoned, are they backed by appropriate evidence, are they clear, do they use appropriate visual aids such as tables, charts, or diagrams).

Question 1: Movie Review Sentiment Classification (4 marks)

For this question you have been provided with a movie review dataset. The dataset consists of 50,000 review articles written for movies on IMDb, each labelled with the sentiment of the review – either positive or negative.
Your task is to apply logistic regression with dense word vectors to the movie review dataset to predict the sentiment label from the review text.

A simple approach to building a sentiment classifier is to train a logistic regression model that uses aggregated pre-trained word embeddings. While this approach, with simple aggregation, normally works best with short sequences, you will try it out on the movie reviews.

You have been provided with a Python file dense_linear_classifier.py which reads in the dataset and splits it into training, testing, and validation sets; and then loads the pre-trained word embeddings. These embeddings were extracted from the spacy-2.3.5 Python package's en_core_web_md model and, to save disk space, were filtered to only include words that occur in the movie reviews.

Your task is to use a logistic regression classifier with aggregated word embedding features to determine the sentiment labels of documents from their text. First, implement the document_to_vector function, which converts a document into a vector by first tokenising it (the TreebankWordTokenizer in the nltk package would be an excellent choice) and then aggregating the word embeddings of those words that exist in the dense word embedding dictionary. You will have to work out how to handle words that are missing from the dictionary. For aggregation, the mean is recommended, but you could also try other functions such as max. Next, implement the fit_model and test_model functions using your document_to_vector function and LogisticRegression from the scikit-learn package. Using fit_model, test_model, and your training and validation sets, you should then try several values for the regularisation parameter C and select the best based on accuracy. To try regularisation parameters, you should use an automatic hyperparameter search method. Next, re-train your classifier using the training set concatenated with the validation set and your best C value.
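The steps just described can be sketched as follows. This is a hedged illustration, not the starter code's actual interface: the embeddings dictionary, the dim argument, and the search_C helper are assumptions made for the example.

```python
import numpy as np
from nltk.tokenize import TreebankWordTokenizer
from sklearn.linear_model import LogisticRegression

# Instantiate once at module level: TreebankWordTokenizer compiles many
# regular expressions in its constructor, so building it per call is slow.
_tokenizer = TreebankWordTokenizer()

def document_to_vector(doc, embeddings, dim=300):
    """Mean-aggregate embeddings of in-vocabulary tokens; zeros if none."""
    vecs = [embeddings[tok] for tok in _tokenizer.tokenize(doc)
            if tok in embeddings]
    if not vecs:  # every token was missing from the dictionary
        return np.zeros(dim)
    return np.mean(vecs, axis=0)

def search_C(X_tr, y_tr, X_val, y_val, grid=(0.01, 0.1, 1.0, 10.0, 100.0)):
    """Pick the C with the best validation accuracy from a log-spaced grid."""
    best_C, best_acc = None, -1.0
    for C in grid:
        clf = LogisticRegression(C=C, max_iter=1000).fit(X_tr, y_tr)
        acc = clf.score(X_val, y_val)
        if acc > best_acc:
            best_C, best_acc = C, acc
    return best_C, best_acc
```

A log-spaced grid like this is one simple automatic search; scikit-learn's GridSearchCV or a random search over the same range would also satisfy the requirement.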
Evaluate the performance of your model on the test set.

Answer the following questions in your answer PDF:
1. What range of values for C did you try? Explain why this range is reasonable. Also explain what search technique you used and why it is appropriate here.
2. What was the best performing C value?
3. What was your final accuracy?

Also make sure you submit your code.

Hint: If you do the practical exercise in Lab 3 this question will be much easier.

Tip: If you use TreebankWordTokenizer then for efficiency you should instantiate the class as a global variable. The TreebankWordTokenizer compiles many regular expressions when it is initialised; doing this every time you want to tokenise a sentence is very inefficient. For more details see the documentation for TreebankWordTokenizer: https://www.nltk.org/_modules/nltk/tokenize/treebank.html

Question 2: Genre Classification (Kaggle competition: 4 marks, Write-up: 3 marks)

For this task you will design and implement a classification algorithm that identifies the genre of a piece of text. This task will be run as a competition on Kaggle. Your marks for this question will be partially based on your results in this competition, but your mark will not be affected by other students' scores; instead you will be graded against several benchmark solutions. The other part of your mark will come from your code and write-up.

The dataset consists of text sequences from English language books in the genres: horror (class id 0), science fiction (class id 1), humour (class id 2), and crime fiction (class id 3). Each text sequence is 10 contiguous sentences. Your task is to build the best classifier when evaluated with macro averaged F1 score.

Note: the training data and the test data come from different sets of books.
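As a concrete illustration of the contest metric, and of a validation split that respects the note above, here is a sketch using scikit-learn. The arrays are toy stand-ins: the real texts, labels, and per-book docids come from genre_train.json.

```python
import numpy as np
from sklearn.model_selection import GroupShuffleSplit
from sklearn.metrics import f1_score

# Toy stand-ins for the real data; values are illustrative only.
labels = np.array([0, 0, 1, 1, 2, 2, 3, 3])
docids = np.array([10, 10, 11, 11, 12, 12, 13, 13])  # one id per source book

# Keep every book entirely on one side of the split, mirroring the fact
# that the test data comes from books unseen during training.
splitter = GroupShuffleSplit(n_splits=1, test_size=0.25, random_state=0)
train_idx, val_idx = next(splitter.split(labels, labels, groups=docids))
assert set(docids[train_idx]).isdisjoint(docids[val_idx])

# The contest metric: F1 per class, averaged with equal weight per class.
y_true = [0, 0, 1, 2, 3, 3]
y_pred = [0, 1, 1, 2, 3, 0]
print(f1_score(y_true, y_pred, average="macro"))
```

Macro averaging weights each genre equally, so a classifier that ignores a rare class is penalised; a random (non-grouped) split would leak book-specific style into validation and tend to overestimate test performance.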
You have been provided with docids (examples with the same docid come from the same book) for the training data but not the test data.

You have been provided with an example solution in genre_classifier_0pc.py which shows you how to read in the training data (genre_train.json), the test data (genre_test.json), and output a CSV file that the judging system can read. This solution is provided only as an example (it is the 0% benchmark for this problem); you will want to build your own solution from scratch.

Setup

Please register for Kaggle (https://www.kaggle.com/) using any email account you like. You do not have to use your full or real name if you would prefer not to.

Submit solutions to https://www.kaggle.com/t/5daca656a1e14e159cfd31131ae1c1e0

To ensure student submissions are anonymous to other students but not the examiner, a unique Kaggle ID has been provided on Wattle as text feedback to a dummy assignment called "Kaggle Unique ID". This dummy assignment does not accept submissions, and its only purpose is to distribute unique ids. Set your "team name" to your unique id (e.g. pid123456789). This will allow us to match your submission to your assignment for marking purposes. Note that "team name" is a Kaggle term; this is an individual assignment, and so you should not work with others to complete it.

The Kaggle contest deadline is the same as the assignment deadline. For late submissions see below.

Rules

These rules are designed to ensure a degree of fairness between students with different access to computational resources and to ensure that the task is not trivial. Breaching the contest rules will likely result in 0 marks for the competition part of this assignment.

- Do not use additional supervised training data. That is, you are not allowed to collect a new genre classification dataset to use for training. Pre-training on other tasks, such as language modelling, is permitted.
- Pre-trained non-contextual word vectors (such as word2vec, GloVe, fastText) may be used even if they require an additional download (e.g. you may use word vectors from spacy, gensim, or fasttext).
- You may use the pre-trained transformers distilbert-base-cased or distilbert-base-uncased from the transformers library (https://github.com/huggingface/transformers), but other pre-trained transformers are not permitted.
- You can use the following libraries (in addition to Python standard libraries): numpy, scipy, pandas, torch, transformers, tensorflow, nltk, sklearn, xgboost, gensim, spacy, imblearn, torchnlp. If you would like to use other libraries, please ask on the Piazza forum well in advance of the assignment deadline.
- This is an individual task; do not collude with other individuals. Copying code from other people's models or models available on the internet is not permitted.

Judging system

You will upload your predictions for the test set as a CSV file for judging.

- You are allowed to submit up to 5 times per UTC day. Since you get immediate feedback from every submission, it is best to start submitting early and plan ahead.
- The results shown on the public scoreboard prior to the conclusion of the contest only include 50% of the test data. Your solution will be judged on the other 50% of the test data when computing final rankings and marks.
- The judging system allows you to choose which of all your submissions you want to be your final one.

Competition marking

Marks will be assigned based on which judge baselines you beat on the hidden 50% of the test data. If, for example, you beat the 80% baseline but do not beat the 90% baseline, you will be awarded a mark on a linear scale between these two based on your macro averaged F1 score. Exceeding the score of the 100% baseline will give you a mark of 100% for the competition component of this question.
(The 100% baseline and all other baselines can be trained in less than 24 hours on a laptop PC without GPU support – your solution may make use of any compute resources available to you.) Using Google Colab (https://cloudstor.aarnet.edu.au/plus/s/tqqb8VfcM8IpBTx) is recommended if you want to do GPU training but do not have access to a dedicated GPU.

Write-up

A fraction of your marks will be based on a write-up that at a minimum describes:
- How your final solution works.
- How you trained and tested your model (e.g. validation split(s), hyperparameter search, etc.).
- What models you tested, which worked, and which didn't. Why you think these other models didn't work.

Aim for 1 page or slightly less. Only the first 2 pages will be marked. Bullet points are acceptable if they are understandable.

What to submit

- Submit the code of your best solution (including training pipeline) to Wattle. Also make sure your "team name" is set to your unique Kaggle ID. Do not submit stored parameters or data.
- Submit your write-up in your answer PDF file.
- In one of the three cases below, you should also submit a CSV file with your model's output in the correct judging format. The CSV file should be named with your Uni ID, e.g. your_uid.csv.
  (a) You have been granted an extension, OR
  (b) You could not use Kaggle, and only if you could not use Kaggle, OR
  (c) You decide to submit within 24 hours after the assignment deadline (i.e. late submission, with 5% penalty).

Question 3: RNN Name Generator (4 marks)

Your task is to develop an autoregressive RNN model which can generate people's names. The RNN will generate each character of a person's name given all previous characters. Your model should look like the following when training:

[diagram not reproduced in this text version]

Note that the input is shown here as a sequence of characters, but in practice the input will be a sequence of character ids. There is also a softmax non-linearity after the linear layer, but this is not shown in the diagram.
The output (after the softmax) is a categorical probability distribution over the vocabulary; what is shown as the output here is the ground truth label. Notice that the input to the model is just the expected output shifted to the right one step with the (beginning of sentence token) prepended. The three dots to the right of the diagram indicate that the RNN is to be rolled out to some maximum length.

When generating sequences, rather than training, the model should look like the following:

[diagram not reproduced in this text version]

Specifically, we choose a character from the probability distribution output by the network and feed it as input to the next step. Choosing a character can be done by sampling from the probability distribution or by choosing the most likely character (otherwise known as argmax decoding).

The character vocabulary consists of the following:
- "" (the null token, used for padding)
- the beginning of sequence token
- . (the end of sequence token)
- a-z (all lowercase characters)
- A-Z (all uppercase characters)
- 0-9 (all digits)
- " " (the space character)

Starter code is provided in rnn_name_generator.py, and the list of names to use as training and validation sets is provided in names_small.json.

To complete this question you will need to complete three functions and one class method: the function seqs_to_ids, the forward method of class RNNLM, the function train_model, and the function gen_string. In each case you should read the description provided in the starter code.

seqs_to_ids: Takes as input a list of names. Returns a 2D numpy matrix containing the names represented using token ids. All output rows (each row corresponds to a name) should have the same length of max_length, achieved by either truncating the name or padding it with zeros.
Forexample, an input of: [“Bec.”, “Hannah.”, “Siqi.”] with a max_length set to 6 should return(normally we will use max_length = 20 but for this example we use 6)[[30 7 5 2 0 0][36 3 16 16 3 10][47 11 19 11 2 0]]Where the first row represents “Bec.” and two padding characters, the second row represents“Hannah”, the third row represents “Siqi.” with one padding character.forward: A method of class RNNLM. In this function you need to implement the GRU model shown in thediagram above. The layers have all been provided for you in the class initialiser.train_model: In this method you need to train the model by mini-batch stochastic gradient decent. Theoptimiser and loss function are provided to you. Note that the loss function takes logits (output ofthe linear layer before softmax is applied) as input. At the end of every epoch you should print thevalidation loss using the provided calc_val_loss function.gen_string: In this method you will need to generate a new name, one character at a time. You will alsoneed to implement both sampling and argmax decoding.For this question, please include in your answers PDF the most likely name your code generates usingargmax decoding as well as 10 different names generated using sampling. Also remember to submityour code. Your code should all be in the original rnn_name_generator.py file, other files will not bemarked. For this question you should not import additional libraries, use only those provided in thestarter code (you may uncomment the import tqdm statement if you want to use it as a progress bar).For example, one of the names sampled from the model solution was: Dasbie Miohmazie