Final 291, Section 2 (300 points)

1)  A school has 1600 students, and they are going to vote on whether to convert the school completely off fossil fuels. How many students would you have to poll to be 95% confident of the outcome within +/- 0.5% of the vote? (25 points)
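
One common way to set up a calculation like this is the normal-approximation sample size for a proportion, followed by a finite population correction. The sketch below is only an illustration under stated assumptions: worst-case proportion p = 0.5, a two-sided interval, and simple random sampling without replacement. The function name `sample_size` and its parameters are hypothetical, not part of the exam.

```python
import math
from statistics import NormalDist


def sample_size(population, margin, confidence=0.95, p=0.5):
    """Sample size for estimating a proportion in a finite population.

    Assumptions (not stated in the problem): normal approximation,
    worst-case proportion p = 0.5, sampling without replacement.
    """
    # Two-sided critical value, about 1.96 for 95% confidence.
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    # Sample size for an effectively infinite population.
    n0 = z ** 2 * p * (1 - p) / margin ** 2
    # Finite population correction pulls n0 below the population size.
    n = n0 / (1 + (n0 - 1) / population)
    return math.ceil(n)


n = sample_size(population=1600, margin=0.005)
print(n)
```

Because the margin is so tight, the uncorrected size n0 is far larger than the school itself, which is why the correction step matters here.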

2)  Earthquakes can be divided into two classes based on the direction the earth moves when it fractures. The classes can be compared across time: for a given region, record whether an earthquake occurs in a small time period and whether one occurs in the next small time period, then add up the counts across several time periods. Below is the table for a region off Indonesia over a 30-year period.

                                 Second period:   Second period:   Marginal
                                 earthquake       no earthquake    sums
First period: earthquake         148              274              422
First period: no earthquake      276              2626             2902
Marginal sums                    424              2900             3324

Is the occurrence of an earthquake in one time period statistically independent of an occurrence in the next time period? Test at the .01 level. (20 points)

These are earthquakes of the same type. Which cells have higher-than-expected counts if independence is true? (Use the deviation table.) (10 points)
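
A test of independence on a 2x2 table like the one above is often carried out with Pearson's chi-square statistic. The following is a minimal sketch, assuming no continuity correction and using only the standard library; the function name `chi_square_2x2` is illustrative. Expected counts are row total times column total divided by the grand total.

```python
import math


def chi_square_2x2(table):
    """Pearson chi-square test of independence for a 2x2 count table.

    Returns (expected, statistic, p_value). No continuity correction is
    applied; that choice is an assumption of this sketch.
    """
    row_sums = [sum(row) for row in table]
    col_sums = [sum(col) for col in zip(*table)]
    total = sum(row_sums)
    # Expected count under independence: row total * column total / N.
    expected = [[r * c / total for c in col_sums] for r in row_sums]
    statistic = sum(
        (table[i][j] - expected[i][j]) ** 2 / expected[i][j]
        for i in range(2)
        for j in range(2)
    )
    # With 1 degree of freedom, P(chi2 > x) = erfc(sqrt(x / 2)).
    p_value = math.erfc(math.sqrt(statistic / 2))
    return expected, statistic, p_value


# Counts from the table above (earthquakes of the same type).
same_type = [[148, 274], [276, 2626]]
expected, statistic, p_value = chi_square_2x2(same_type)
print(expected, statistic, p_value)
```

The p-value is compared with the stated .01 level, and the `expected` grid is the starting point for a deviation table (observed minus expected).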

3)  This is the same kind of table as in problem 2, except that it compares time periods in which earthquakes of different types follow one another.

                                 Second period:   Second period:   Marginal
                                 earthquake       no earthquake    sums
First period: earthquake         5                314              319
First period: no earthquake      314              2691             3005
Marginal sums                    319              3005             3324

Are they statistically independent now? (Test at the .01 level again.) (16 points)

How do the deviations from expectation under independence differ from those in the table in problem 2? (Hint: look at the pattern of pluses and minuses.) (8 points)

If you think about what each cell means, what do these differences mean in terms of the way the two types of earthquakes interact? (6 points)
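
The hint about pluses and minuses can be explored mechanically. The sketch below, with an illustrative helper named `deviation_signs`, prints the sign of observed minus expected for each cell of both tables; interpreting the patterns is left to the reader.

```python
def deviation_signs(table):
    """Sign of (observed - expected) for each cell of a two-way table."""
    row_sums = [sum(row) for row in table]
    col_sums = [sum(col) for col in zip(*table)]
    total = sum(row_sums)
    signs = []
    for i, row in enumerate(table):
        sign_row = []
        for j, observed in enumerate(row):
            # Expected count under independence for cell (i, j).
            expected = row_sums[i] * col_sums[j] / total
            deviation = observed - expected
            if deviation > 0:
                sign_row.append("+")
            elif deviation < 0:
                sign_row.append("-")
            else:
                sign_row.append("0")
        signs.append(sign_row)
    return signs


same_type = [[148, 274], [276, 2626]]  # counts from problem 2
diff_type = [[5, 314], [314, 2691]]    # counts from problem 3
for name, table in (("same type", same_type), ("different types", diff_type)):
    print(name, deviation_signs(table))
```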

4)  Using the NCI60 data set from ISLR: (100 points)

a.  Identify the cancer types with more than 3 cell lines present.

b.  From those, identify cancers with hyperactive or hypoactive genes at the 0.2 FDR level (not independent).

c.  Identify the common genes between every pair of the cancers identified in b.

d.  Are there any genes shared as unusually active among 3 cancers?
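
Parts b through d combine two mechanical steps: an FDR cutoff on per-gene p-values, and set intersections between cancers. The sketch below shows both on made-up p-values. It assumes that "(not independent)" calls for the Benjamini-Yekutieli variant of the step-up procedure, which is one reading and not something the exam states. It does not load NCI60, and how the per-gene p-values are obtained is left open. All names here are hypothetical.

```python
from itertools import combinations


def fdr_reject(p_values, q=0.2, dependent=True):
    """Indices rejected by a step-up FDR procedure at level q.

    dependent=False gives Benjamini-Hochberg; dependent=True gives
    Benjamini-Yekutieli, which divides q by sum(1/i for i = 1..m).
    """
    m = len(p_values)
    if m == 0:
        return set()
    if dependent:
        q = q / sum(1 / i for i in range(1, m + 1))
    order = sorted(range(m), key=lambda i: p_values[i])
    # Largest rank k whose sorted p-value is at most k * q / m.
    cutoff = 0
    for rank, idx in enumerate(order, start=1):
        if p_values[idx] <= rank * q / m:
            cutoff = rank
    return set(order[:cutoff])


def shared_between_pairs(hits):
    """Map each pair of keys to the intersection of their hit sets."""
    return {
        (a, b): hits[a] & hits[b]
        for a, b in combinations(sorted(hits), 2)
    }


# Made-up per-gene p-values for three placeholder cancer labels.
p_by_cancer = {
    "A": [0.001, 0.008, 0.039, 0.50],
    "B": [0.002, 0.300, 0.004, 0.70],
    "C": [0.900, 0.003, 0.001, 0.60],
}
hits = {name: fdr_reject(p) for name, p in p_by_cancer.items()}
pairs = shared_between_pairs(hits)
in_all_three = set.intersection(*hits.values())
print(hits, pairs, in_all_three)
```

Keeping the rejected genes as sets makes the pairwise and three-way comparisons a matter of intersections.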

5)  The diabetes data set is a prospective study of the onset of adult diabetes, given a number of risk factors, among the Pima Indian tribe. Using the diabetes.csv data set: (100 points)

a.  Separate the first half of the data from the second half; use the first half for training and the second half for testing.

b.  Using the training data

i.   Construct the full logistic regression model for outcome

ii.   Using backward selection, construct the logistic regression model in which every coefficient has a p-value < .05. (Show steps!)

c.  Predict the "response" (e.g., type="response") for the full logistic regression model for

i.   the training data set,

ii.   the test data set.

d.  Predict the "response" for the smallest logistic model from the backward selection exercise for

i.   the training data set,

ii.   the test data set.

e.  Using random forest, build a model on the training data

f.  You now have 3 models: full logistic, smallest logistic, and random forest. For the predictions of each, calculate and tabulate

i.   Number of correct positives

ii.   Number of false positives

iii.   Number of correct negatives

iv.   Number of false negatives

g.  Using the results of f, is there one of the 3 methods that appears best at modeling new results, or does it depend on whether it is more important to identify positives (predict diabetes) or negatives (predict health)?

h.  Now redo the analysis twice, using a random selection of 384 out of the 768 rows for training and the complement for testing. Is there anything you can conclude from this additional information about the merits of each approach?
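
Two bookkeeping steps in this problem are independent of the modeling library: splitting 768 rows into training and test halves (in order for part a, at random for part h), and tabulating the four outcome counts in part f. The sketch below covers only those steps, with made-up predictions. The helper names, the seed, and the 0.5 classification threshold are assumptions, not part of the exam.

```python
import random


def split_indices(n_rows, n_train, seed=None):
    """Return (train, test) row indices.

    seed=None keeps the original order (first n_train rows train);
    otherwise n_train rows are drawn at random without replacement.
    """
    if seed is None:
        train = list(range(n_train))
    else:
        train = sorted(random.Random(seed).sample(range(n_rows), n_train))
    chosen = set(train)
    test = [i for i in range(n_rows) if i not in chosen]
    return train, test


def confusion_counts(actual, predicted_prob, threshold=0.5):
    """Tabulate TP, FP, TN, FN for outcomes coded 0/1.

    The 0.5 threshold is an assumption; the exam does not fix one.
    """
    counts = {"TP": 0, "FP": 0, "TN": 0, "FN": 0}
    for y, prob in zip(actual, predicted_prob):
        predicted = 1 if prob >= threshold else 0
        if predicted == 1 and y == 1:
            counts["TP"] += 1
        elif predicted == 1 and y == 0:
            counts["FP"] += 1
        elif predicted == 0 and y == 0:
            counts["TN"] += 1
        else:
            counts["FN"] += 1
    return counts


train, test = split_indices(768, 384)                    # part a style
rand_train, rand_test = split_indices(768, 384, seed=1)  # part h style

# Made-up outcomes and predicted probabilities, for illustration only.
actual = [1, 0, 1, 1, 0, 0]
probs = [0.9, 0.2, 0.4, 0.7, 0.6, 0.1]
counts = confusion_counts(actual, probs)
print(counts)
```

Running the same tabulation on each model's test-set predictions gives directly comparable rows for the table in part f.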

6)  Conceptual question: Suppose you have null and alternative hypotheses that are completely defined in terms of the specific probability distributions they represent. What is the main difference between using a likelihood ratio test and using Bayes' rule to decide between the two? (20 points)
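
As a numeric illustration of the two quantities involved (not a model answer), the sketch below takes two fully specified binomial hypotheses and computes the likelihood ratio alongside the posterior probability from Bayes' rule. The success probabilities, the data, and the priors are all made-up values, and the function names are hypothetical.

```python
from math import comb


def binom_pmf(k, n, p):
    """Probability of k successes in n independent trials."""
    return comb(n, k) * p ** k * (1 - p) ** (n - k)


def likelihood_ratio(k, n, p_null, p_alt):
    """Likelihood of the data under the alternative over the null."""
    return binom_pmf(k, n, p_alt) / binom_pmf(k, n, p_null)


def posterior_alt(k, n, p_null, p_alt, prior_alt):
    """Posterior probability of the alternative via Bayes' rule."""
    weight_alt = binom_pmf(k, n, p_alt) * prior_alt
    weight_null = binom_pmf(k, n, p_null) * (1 - prior_alt)
    return weight_alt / (weight_alt + weight_null)


# Hypothetical data: 7 successes in 10 trials; null p = 0.5, alt p = 0.8.
lr = likelihood_ratio(7, 10, 0.5, 0.8)
for prior in (0.5, 0.1):
    print(prior, lr, posterior_alt(7, 10, 0.5, 0.8, prior))
```

In this sketch the likelihood ratio is the same for both loop iterations, while the posterior changes with the prior supplied to `posterior_alt`.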