MATH70071 Applied Statistics
end-of-module assignment
Submission deadline:
12:00 (noon) on Friday, 13/12/2024
Preparing your assignment
1. Use the Rmarkdown template file in the Software folder of the MSc in Statistics 2024-25 Blackboard page to write your report. Your R code should be provided in the appendix; this should be produced automatically by the template provided. Ensure your submitted file has tidy and well-documented code chunks.
2. The report should be properly structured, and should be written using complete sentences. Marks are given both for the content of the report (correctness of code, numerical answers, etc.) and the quality of the presentation (clarity of plots, explanations, etc.). Two or three sentences are sufficient for the verbal/explanatory parts of questions; longer answers are likely to be less clear.
3. At the beginning of your report you must include this statement of originality:
“I, CID [YOUR CID], certify that this assessed coursework is my own work, unless otherwise acknowledged, and includes no plagiarism. I have not discussed my coursework with anyone else except when seeking clarification with the module lecturer via email or on MS Teams. I have not shared any code underlying my coursework with anyone else prior to submission.”
Submitting your assignment
1. Before the above deadline submit a single PDF report via Blackboard (with, as above, your R code included as an appendix).
2. The filename should be MScStatistics AppliedStatistics [YOUR CID].pdf; e.g., MScStatistics AppliedStatistics 00123456.pdf.
Sociologists in Australia surveyed the public to assess the relationship between the perceived respect of different jobs and some objectively measurable attributes of these jobs. The results of the survey are in the table jobs.csv, which has these columns:
| column name | parameter | meaning | units/values |
| | i | row number | |
| job | j | type of job | |
| class | c | class of job | bc, wc, prof |
| salary | s | average annual salary of people doing this job | $1,000 |
| education | e | average education of people doing this job | years |
| frac men | f | fraction of people doing this job who are men | |
| respect | r | average perceived respect of the job | |
The job classes correspond to broadly-accepted classifications: “blue-collar” (bc, e.g., a factory worker or builder); “white-collar” (wc, e.g., an office worker or accountant); and “professional” (prof, e.g., a statistics lecturer or an astrophysicist); s, e, f and r are all numerical quantities.
The overall aim is to use this survey data to obtain a quantitative understanding of whether and how the perceived respect of a job and/or its class are linked to objective measurable quantities.
1. Plot r against each of the numerical quantities s, e, and f, indicating c by a different colour or symbol.
Based on these plots, summarize briefly i) the implications for what the perceived respect of a job might be linked to and ii) what, if any, features of the data-set might make the subsequent fitting/modelling difficult. (6 marks)
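As a hedged illustration of the kind of plot requested in Question 1, the following R sketch draws respect against each numerical quantity with one colour per job class. It uses simulated stand-in data rather than the survey table, and the data-frame column names (in particular `frac_men`) are assumptions; the names in the real file may differ.

```r
# Simulated stand-in for the survey table (NOT the real jobs.csv data).
set.seed(1)
n <- 30
jobs <- data.frame(
  class     = sample(c("bc", "wc", "prof"), n, replace = TRUE),
  salary    = runif(n, 5, 100),   # $1,000
  education = runif(n, 8, 16),    # years
  frac_men  = runif(n)            # hypothetical name for the "frac men" column
)
jobs$respect <- 5 + 6 * log(jobs$salary) + 4 * jobs$education + rnorm(n, sd = 4)

# One colour per job class, looked up by the class label.
class_cols <- c(bc = "steelblue", wc = "darkorange", prof = "forestgreen")
point_cols <- class_cols[as.character(jobs$class)]

# Three panels side by side: respect against salary, education and frac_men.
op <- par(mfrow = c(1, 3))
for (x in c("salary", "education", "frac_men")) {
  plot(jobs[[x]], jobs$respect, col = point_cols, pch = 19,
       xlab = x, ylab = "respect")
}
legend("topleft", legend = names(class_cols), col = class_cols, pch = 19)
par(op)
```

Looking colours up by class label (rather than by factor code) keeps the legend and the points consistent even if the class ordering changes.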
2. Considering the relationship between r and s alone, use the Stan package to fit the data-set using these two regression models:
Model 1: $r_i = \beta_0 + \beta_1 s_i + \epsilon_i$
and
Model 2: $r_i = \beta_0 + \beta_1 \log(s_i) + \epsilon_i$,
where $P(\epsilon_i \mid \sigma) = N(\epsilon_i; 0, \sigma^2)$, with $\sigma$ included in the fit as a parameter (i.e., along with $\beta_0$ and $\beta_1$). State what prior distribution you have assumed for $(\beta_0, \beta_1, \sigma)$ and your reasoning behind this choice.
Plot some posterior draws under the two models as curves against the data and comment on the quality of the fits under both models.
Calculate an approximate Bayes factor, $B_{1,2}$, as the ratio of the maximum likelihoods under each of the models. Is this consistent with the conclusion from the visual comparison? (10 marks)
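The ratio of maximum likelihoods described in Question 2 can be sketched in R as follows. This is a minimal illustration on simulated stand-in data, not the survey data, and it uses `lm` rather than Stan to locate the maximum: under normal errors, least squares gives the maximum-likelihood coefficients, and `logLik` evaluates the log-likelihood at the maximum-likelihood value of the error scale as well. The variable names are assumptions.

```r
# Simulated stand-in data (NOT the survey data): respect grows with log(salary).
set.seed(1)
n <- 50
s <- runif(n, 5, 100)                      # hypothetical salaries in $1,000
r <- 10 + 15 * log(s) + rnorm(n, sd = 5)   # hypothetical respect values

# Maximum-likelihood fits of the two regression models.
fit1 <- lm(r ~ s)        # Model 1: linear in salary
fit2 <- lm(r ~ log(s))   # Model 2: linear in log(salary)

# Work with log-likelihoods to avoid numerical underflow, then exponentiate.
log_B12 <- as.numeric(logLik(fit1)) - as.numeric(logLik(fit2))
B12 <- exp(log_B12)
print(c(log_B12 = log_B12, B12 = B12))
```

Values of `B12` below 1 favour Model 2 and values above 1 favour Model 1; because both models have the same number of parameters, no complexity correction enters this crude approximation.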
3. Use the glm function in R to fit the data using the model
$r_i = \beta_0 + \beta_1 \log(s_i) + \beta_2 e_i + \beta_3 f_i$,
which now includes all three numerical quantities in the regression.
Report the results of the fit and use the glm summaries to assess which, if any, of the coefficients/terms should be ignored. (10 marks)
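A minimal sketch of the fit in Question 3, again on simulated stand-in data with assumed variable names, shows where the `glm` summary reports the estimates, standard errors and p-values that can inform which terms to keep. In the simulation the coefficient of `f` is set to zero on purpose.

```r
# Simulated stand-in data (NOT the survey data).
set.seed(5)
n <- 60
s <- runif(n, 5, 100)   # salary in $1,000
e <- runif(n, 8, 16)    # education in years
f <- runif(n)           # fraction of men
r <- 5 + 6 * log(s) + 4 * e + 0 * f + rnorm(n, sd = 4)

# Gaussian GLM with identity link, i.e. ordinary linear regression.
fit <- glm(r ~ log(s) + e + f, family = gaussian())

# Rows are coefficients; columns are Estimate, Std. Error, t value, Pr(>|t|).
coef_table <- summary(fit)$coefficients
print(coef_table)
```

A large p-value for a term suggests that the data do not require it, although this is only one possible criterion.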
4. For the subsequent questions, perform the analysis assuming that the available background knowledge $K$ can be encoded in a prior distribution of the form $P(\beta_0, \beta_1, \beta_2 \mid K) = N(\beta_1; 10, 10^2) \, N(\beta_2; 5, 5^2)$.
Explain qualitatively what information is being encoded by this prior.
Identify whether this is a proper or improper prior and what the implications are for i) parameter estimation and ii) model comparison. (5 marks)
5. Fit the data-set using the mcmc package with the model
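$r_i = \beta_0 + \beta_1 e_i + \beta_2 \log(s_i) + \epsilon_i$,
with the errors following either i) a normal distribution of the form
$P(\epsilon_i \mid \sigma) = N(\epsilon_i; 0, \sigma^2) = \frac{1}{(2\pi)^{1/2} \sigma} e^{-\epsilon_i^2 / (2\sigma^2)}$
or ii) a scaled Cauchy distribution of the form
$P(\epsilon_i \mid \sigma) = \mathrm{Cauchy}(\epsilon_i; 0, \sigma) = \frac{1}{\pi \sigma (1 + \epsilon_i^2 / \sigma^2)}$,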
in each case (again) including $\sigma$ as a parameter to be fit (i.e., along with $\beta_0$, $\beta_1$ and $\beta_2$). As in Question 2, state what prior distribution you have assumed for $\sigma$ and your reasoning behind this choice.
Plot and compare i) the joint posterior distribution in $\beta_1$ and $\beta_2$ and ii) the marginal posterior distribution in $\sigma$ under both the normal and Cauchy models. Simple scatter plots and histograms are acceptable to exhibit the results; for full marks show the 39.3% and 86.3% highest (posterior) density credible regions on the joint plots and the 68.3% and 95.4% highest (posterior) density credible intervals on the marginal plots.
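For the marginal credible intervals, one common sample-based approach (an assumption here, not a prescribed method) is to take the shortest interval that contains the required fraction of the posterior draws, which approximates the highest-density interval when the marginal posterior is unimodal. The helper name `hpd_interval` and the gamma-distributed stand-in draws are hypothetical.

```r
# Shortest interval containing a fraction `prob` of the draws.
# Approximates the highest-density interval for a unimodal distribution.
hpd_interval <- function(draws, prob) {
  sorted <- sort(draws)
  n <- length(sorted)
  k <- ceiling(prob * n)                          # draws each candidate interval holds
  widths <- sorted[k:n] - sorted[1:(n - k + 1)]   # width of every window of k draws
  j <- which.min(widths)                          # index of the narrowest window
  c(lower = sorted[j], upper = sorted[j + k - 1])
}

# Stand-in for posterior draws of a positive, skewed parameter (NOT real output).
set.seed(2)
draws <- rgamma(5000, shape = 2)
print(hpd_interval(draws, 0.683))
print(hpd_interval(draws, 0.954))
```

For a skewed marginal such as this one, the interval is not symmetric about the mean or median, which is the main practical difference from quantile-based intervals.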
Comment on the differences in the results under the two models, explaining the reason(s) for these differences and which of the two models should be preferred. For full marks demonstrate this result more quantitatively by, e.g., comparing simulated data from the two best-fit models to the actual data used for the fit. (17 marks)
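The two error models in Question 5 differ only in the likelihood term of the log-posterior, which the following R sketch makes explicit. It uses simulated stand-in data, a hypothetical helper `make_log_post`, and, purely as an assumption, a prior that is uniform in $\log \sigma$; the choice of prior for $\sigma$ is left to the report. The priors on $\beta_1$ and $\beta_2$ follow Question 4, and the proposal scales passed to `metrop` are untuned guesses.

```r
# Simulated stand-in data (NOT the survey data).
set.seed(3)
n <- 60
e <- runif(n, 8, 16)     # education in years
s <- runif(n, 5, 100)    # salary in $1,000
r <- 5 + 4 * e + 6 * log(s) + rnorm(n, sd = 4)

# Returns a function of theta = (beta0, beta1, beta2, log sigma) giving the
# unnormalised log-posterior for the chosen error distribution.
make_log_post <- function(r, e, s, errors = c("normal", "cauchy")) {
  errors <- match.arg(errors)
  function(theta) {
    sigma <- exp(theta[4])
    eps <- r - (theta[1] + theta[2] * e + theta[3] * log(s))
    if (errors == "normal") {
      log_lik <- sum(dnorm(eps, mean = 0, sd = sigma, log = TRUE))
    } else {
      log_lik <- sum(dcauchy(eps, location = 0, scale = sigma, log = TRUE))
    }
    # Prior from Question 4; flat in beta0 and (ASSUMPTION) flat in log sigma.
    log_prior <- dnorm(theta[2], 10, 10, log = TRUE) +
      dnorm(theta[3], 5, 5, log = TRUE)
    log_lik + log_prior
  }
}

lp_normal <- make_log_post(r, e, s, "normal")
lp_cauchy <- make_log_post(r, e, s, "cauchy")

# Start the chain near the least-squares solution.
init <- c(unname(coef(lm(r ~ e + log(s)))), log(sd(r)))

# Random-walk Metropolis via the mcmc package, if it is installed.
if (requireNamespace("mcmc", quietly = TRUE)) {
  out <- mcmc::metrop(lp_normal, initial = init, nbatch = 5000,
                      scale = c(1, 0.1, 0.3, 0.1))
  print(out$accept)            # acceptance rate, useful for tuning `scale`
  print(colMeans(out$batch))   # posterior means of beta0, beta1, beta2, log sigma
}
```

Replacing `lp_normal` with `lp_cauchy` in the `metrop` call gives the corresponding chain for the Cauchy model; sampling in $\log \sigma$ keeps $\sigma$ positive without rejecting proposals.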
6. You now find out that r is actually the percentage of people surveyed who say they respect a particular occupation, so must be between 0 and 100 (inclusive).
Explain why the models used above are inconsistent with this new information.
Devise and describe mathematically (e.g., by specifying the sampling distribution or likelihood) a modified regression model which would correctly handle this restriction on r.
For full marks implement this algorithm using any of the glm, mcmc or stan functionality, summarising the results with parameter estimates and uncertainties. (12 marks)
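One possible way to respect the 0 to 100 restriction, offered only as a hedged sketch and not as the intended answer, is to model the expected fraction $r_i / 100$ through a logistic link, so that fitted values cannot leave the allowed range. The example below uses simulated stand-in data, assumed variable names and a quasi-binomial `glm`; other choices (for example a binomial likelihood if the number of respondents were known) are equally conceivable.

```r
# Simulated stand-in data (NOT the survey data): percentages in (0, 100).
set.seed(4)
n <- 60
s <- runif(n, 5, 100)    # salary in $1,000
e <- runif(n, 8, 16)     # education in years
eta <- -6 + 0.8 * log(s) + 0.3 * e
r <- 100 * plogis(eta + rnorm(n, sd = 0.3))

# Logistic regression on the fraction r / 100; the quasi-binomial family
# accepts non-integer responses in [0, 1] and estimates a dispersion parameter.
fit <- glm(I(r / 100) ~ log(s) + e, family = quasibinomial(link = "logit"))

coef_table <- summary(fit)$coefficients   # estimates and standard errors
print(coef_table)
print(range(100 * fitted(fit)))           # fitted percentages stay within 0 to 100
```

The coefficients are on the logit scale, so they describe changes in the log-odds of respect rather than in percentage points.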
(total: 60 marks)