CISC 6930 assignment help: CSV/MSE, CS/OS programming, OS assignment tutoring for international students

2018-09-27
CISC 6930: Data Mining
Fordham University, Fall 2018
Prof. Yijun Zhao

Assignment 1
Due: Sept. 28

Submission Instructions
• Your program must run on erdos.dsm.fordham.edu
• Create a README file with simple, clear instructions on how to compile and run your code. If the TA cannot run your program by following the instructions, you will receive 50% of the programming score.
• Zip all your files (code, README, written answers, etc.) in a zip file named {firstname} {lastname} CS6930 HW1.zip and upload it to Blackboard.

In this assignment, you are given the following 3 datasets. Each dataset has a training and a test file. Specifically, these files are:

dataset 1: train-100-10.csv test-100-10.csv
dataset 2: train-100-100.csv test-100-100.csv
dataset 3: train-1000-100.csv test-1000-100.csv

Start the experiment by creating 3 additional training files from train-1000-100.csv by taking the first 50, 100, and 150 instances respectively. Call them train-50(1000)-100.csv, train-100(1000)-100.csv, and train-150(1000)-100.csv. The corresponding test file for these datasets is test-1000-100.csv, and no modification is needed.

1. Implement the L2 regularized linear regression algorithm with λ ranging from 0 to 150 (integers only). For each of the 6 datasets, plot both the training set MSE and the test set MSE as a function of λ (x-axis) in one graph.
   (a) For each dataset, which λ value gives the least test set MSE?
   (b) For each of datasets 100-100, 50(1000)-100, 100(1000)-100, provide an additional graph with λ ranging from 1 to 150.
   (c) Explain why λ = 0 (i.e., no regularization) gives abnormally large MSEs for those three datasets in (b).

2. From the plots in question 1, we can tell which value of λ is best for each dataset once we know the test data and its labels. This is not realistic in real-world applications. In this part, we use cross validation (CV) to set the value for λ.
   Implement the 10-fold CV technique discussed in class (pseudo code given in Appendix A) to select the best λ value from the training set.
   (a) Using the CV technique, what is the best choice of λ value and the corresponding test set MSE for each of the six datasets?
   (b) How do the values for λ and MSE obtained from CV compare to the choice of λ and MSE in question 1(a)?
   (c) What are the drawbacks of CV?
   (d) What are the factors affecting the performance of CV?

3. Fix λ = 1, 25, 150. For each of these values, plot a learning curve for the algorithm using the dataset 1000-100.csv.
   Note: a learning curve plots the performance (i.e., test set MSE) as a function of the size of the training set. To produce the curve, you need to draw random subsets (of increasing sizes) and record performance (MSE) on the corresponding test set when training on these subsets. In order to get smooth curves, you should repeat the process at least 10 times and average the results.

Appendix A
10-Fold Cross Validation for Parameter Selection

Cross Validation is the standard method for evaluation in empirical machine learning. It can also be used for parameter selection if we make sure to use the training set only.

To select parameter λ of algorithm A(λ) over an enumerated range λ ∈ [λ1, . . . , λk] using dataset D, we do the following:

1. Split the data D into 10 disjoint folds.
2. For each value of λ ∈ [λ1, . . . , λk]:
   (a) For i = 1 to 10
       • Train A(λ) on all folds but the ith fold
       • Test on the ith fold and record the error on fold i
   (b) Compute the average performance of λ on the 10 folds.
3. Pick the value of λ with the best average performance.

Now, in the above, D only includes the training data and the parameter is chosen without knowledge of the test data. We then re-train on the entire train set D using the chosen parameter value and evaluate the result on the test set.
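The Appendix A procedure can be sketched in Python with NumPy as follows. This is a minimal illustration under several assumptions that the handout does not state: the weights come from the usual closed-form solution of (XᵀX + λI)w = Xᵀy, a column of ones is added for the intercept and is regularized like every other weight, the folds are contiguous and unshuffled, and synthetic data stands in for the CSV files. The names add_bias, ridge_fit, mse and cv_select_lambda are hypothetical helpers, not part of the assignment.

```python
import numpy as np

def add_bias(X):
    """Prepend a column of ones so the model has an intercept (an assumption)."""
    return np.hstack([np.ones((X.shape[0], 1)), X])

def ridge_fit(X, y, lam):
    """Solve (X^T X + lam * I) w = X^T y (assumed closed form).

    np.linalg.solve expects an invertible matrix; the intercept weight is
    regularized together with the others, which is a simplification.
    """
    A = X.T @ X + lam * np.eye(X.shape[1])
    return np.linalg.solve(A, X.T @ y)

def mse(X, y, w):
    """Mean squared error of the predictions X @ w against y."""
    return float(np.mean((X @ w - y) ** 2))

def cv_select_lambda(X, y, lambdas, n_folds=10):
    """Return (best lambda, average validation MSE per lambda)."""
    # Step 1: split the row indices into disjoint, contiguous folds.
    folds = np.array_split(np.arange(X.shape[0]), n_folds)
    avg_errors = []
    for lam in lambdas:                      # Step 2: loop over candidates
        errs = []
        for i in range(n_folds):             # Step 2(a): hold out fold i
            val_idx = folds[i]
            train_idx = np.concatenate(
                [folds[j] for j in range(n_folds) if j != i])
            w = ridge_fit(X[train_idx], y[train_idx], lam)
            errs.append(mse(X[val_idx], y[val_idx], w))
        avg_errors.append(float(np.mean(errs)))  # Step 2(b): average
    best = int(np.argmin(avg_errors))        # Step 3: lowest average error
    return lambdas[best], avg_errors

# Synthetic stand-in for one train/test pair (NOT the assignment data).
rng = np.random.default_rng(0)
true_w = rng.normal(size=11)
X_train = add_bias(rng.normal(size=(100, 10)))
y_train = X_train @ true_w + rng.normal(scale=0.5, size=100)
X_test = add_bias(rng.normal(size=(200, 10)))
y_test = X_test @ true_w + rng.normal(scale=0.5, size=200)

lambdas = list(range(0, 151))
best_lam, cv_errors = cv_select_lambda(X_train, y_train, lambdas)

# Re-train on the entire training set with the chosen lambda, then test.
w = ridge_fit(X_train, y_train, best_lam)
test_mse = mse(X_test, y_test, w)
print("chosen lambda:", best_lam, "test MSE:", test_mse)
```

The test set is touched only once, after λ has been chosen, which matches the closing paragraph of Appendix A. Shuffling the rows before splitting would be a reasonable variation if the file order is not random.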
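The learning-curve procedure described in the note to question 3 could be sketched along the same lines. Again this is only an illustration under assumptions: the closed-form ridge solution, subsets drawn without replacement, a hypothetical size grid of 50 to 1000 in steps of 50, and synthetic data in place of the 1000-100 files. The helper names are made up for the example, and plotting is left out.

```python
import numpy as np

def ridge_fit(X, y, lam):
    """Assumed closed form: solve (X^T X + lam * I) w = X^T y."""
    A = X.T @ X + lam * np.eye(X.shape[1])
    return np.linalg.solve(A, X.T @ y)

def mse(X, y, w):
    """Mean squared error of the predictions X @ w against y."""
    return float(np.mean((X @ w - y) ** 2))

def learning_curve(X_train, y_train, X_test, y_test, lam, sizes,
                   n_repeats=10, seed=0):
    """Average test MSE for each training-subset size in `sizes`."""
    rng = np.random.default_rng(seed)
    curve = []
    for n in sizes:
        errs = []
        for _ in range(n_repeats):
            # Draw a random subset of n training rows without replacement.
            idx = rng.choice(X_train.shape[0], size=n, replace=False)
            w = ridge_fit(X_train[idx], y_train[idx], lam)
            errs.append(mse(X_test, y_test, w))
        curve.append(float(np.mean(errs)))   # average over the repeats
    return curve

# Synthetic stand-in for the 1000-100 train/test pair (NOT the real data).
rng = np.random.default_rng(1)
true_w = rng.normal(size=20)
X_train = rng.normal(size=(1000, 20))
y_train = X_train @ true_w + rng.normal(scale=0.5, size=1000)
X_test = rng.normal(size=(500, 20))
y_test = X_test @ true_w + rng.normal(scale=0.5, size=500)

sizes = list(range(50, 1001, 50))            # hypothetical size grid
curves = {lam: learning_curve(X_train, y_train, X_test, y_test, lam, sizes)
          for lam in (1, 25, 150)}
for lam, curve in curves.items():
    print(lam, [round(v, 3) for v in curve])
```

Each entry of curves maps a λ value to one averaged test MSE per subset size, which is the series a plotting library would draw against sizes.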