MCD2080 Business Statistics
Trimester 2, 2024
Group Assignment
Problem background: Glassdoor.com
Glassdoor is a free digital platform that gathers information and reviews from current and former employees about companies, salaries, and job openings.
The dataset used for this group assignment contains a random sample of job advertisements from Glassdoor.com. It is used to analyse the current job trends in the data science field based on job positions, company size, software skills, etc.
Refer to the workbook labelled Job Advertisements.xlsx in the Group assignment section on Moodle. This data can be used to understand various software skill requirements and other factors in job advertisements for Data Analysts, Data Engineers and Data Scientists. In this assignment, your task is to investigate and report how the expected salary is associated with various factors such as job types and software skills requirements.
Data definition:
In the file “Job Advertisements.xlsx”, you are provided with both numeric and categorical data. Note that this data has already been cleaned for you, and any missing records are removed. The following table contains the data definition.
| Column | Column Name | Data Definition |
| A | Advertisement ID | The unique identifier for the job posting |
| B | Job Type | A simplified job title |
| C | Company Name | Full name of the company the advertisement is posted for |
| D | Company Size | Range of number of employees in the company |
| E | Ownership Type | Company type of ownership. 8 ownership types provided |
| F | Industry | The industry to which the organisation belongs |
| G | Min Salary | Minimum expected salary ($000 per year) for the job |
| H | Expected Salary | Average expected salary ($000 per year) for the job |
| I | Python | A binary indicator of whether the job requires Python knowledge/skills (1: Yes, 0: No) |
| J | AWS | A binary indicator of whether the job requires AWS knowledge/skills (1: Yes, 0: No) |
| K | Excel | A binary indicator of whether the job requires Excel knowledge/skills (1: Yes, 0: No) |
Purpose:
We wish to explore the relationships between the expected salary (the dependent variable) and the independent variables in the dataset. This is done by utilising the following statistical tools:
1. Pivot Tables and Charts
2. Summary Statistics
3. Confidence Intervals
4. Hypothesis Testing
5. Regression Analysis
Assignment questions:
Answer all questions.
Week 4 Checkpoint: Do question 1
1 a). Discuss and compare the average expected salary for Data Engineers and Data Analysts using the following factors:
• Ownership
• Industry
Construct appropriate charts to support your discussion. Keep your discussion succinct.
Your answer to this question should not be longer than 1-2 pages.
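The averages in part (a) are the kind of result a pivot table produces. As an optional cross-check only, the grouping logic can be sketched in Python. This is a minimal sketch under stated assumptions: the rows, the ownership labels "Private" and "Public", and the function name `average_by_group` are all invented for illustration and are not taken from Job Advertisements.xlsx.

```python
from collections import defaultdict

# Hypothetical rows: (job type, ownership type, expected salary in $000 per year).
# These values are invented for illustration and are NOT from the workbook.
rows = [
    ("Data Analyst", "Private", 70.0),
    ("Data Analyst", "Private", 80.0),
    ("Data Analyst", "Public", 90.0),
    ("Data Engineer", "Private", 100.0),
    ("Data Engineer", "Public", 110.0),
    ("Data Engineer", "Public", 130.0),
]


def average_by_group(records):
    """Return {(job_type, ownership): mean expected salary}, like a pivot table of averages."""
    totals = defaultdict(lambda: [0.0, 0])  # running [sum, count] per group
    for job_type, ownership, salary in records:
        cell = totals[(job_type, ownership)]
        cell[0] += salary
        cell[1] += 1
    return {key: total / count for key, (total, count) in totals.items()}


pivot = average_by_group(rows)
for (job_type, ownership), mean_salary in sorted(pivot.items()):
    print(f"{job_type:14s} {ownership:8s} {mean_salary:6.1f}")
```

The same function would work for Industry by passing the industry label in place of the ownership label.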
b). We wish to compare the distribution of the expected salary between data analysts and engineers.
Generate Summary statistics and histograms and use them to compare the distributions. In your discussion, include measures of central tendency, variability and shape.
When discussing, include contextual interpretations of the measures used.
Your answer to this question should not be longer than 2 pages. (14 marks)
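To illustrate how the measures in part (b) relate to each other, the following Python sketch computes the mean, median, sample standard deviation and sample skewness for one group. It is an illustration only: the salary figures and the function name `describe` are assumptions made here, not values or names from the workbook. The skewness uses the adjusted Fisher-Pearson formula, which is the one behind Excel's SKEW function.

```python
import statistics


def describe(salaries):
    """Return mean, median, sample standard deviation and sample skewness."""
    n = len(salaries)
    mean = statistics.mean(salaries)
    median = statistics.median(salaries)
    stdev = statistics.stdev(salaries)  # sample standard deviation (divides by n - 1)
    # Adjusted Fisher-Pearson skewness: n / ((n-1)(n-2)) * sum of cubed z-scores.
    skew = (n / ((n - 1) * (n - 2))) * sum(((x - mean) / stdev) ** 3 for x in salaries)
    return {"mean": mean, "median": median, "stdev": stdev, "skewness": skew}


# Invented salaries ($000 per year), NOT taken from the workbook.
analyst_salaries = [60.0, 65.0, 70.0, 75.0, 120.0]
stats = describe(analyst_salaries)
for name, value in stats.items():
    print(f"{name:9s} {value:8.2f}")
```

In this invented sample the mean exceeds the median and the skewness is positive, which is the pattern to look for when describing a right-skewed salary distribution.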
Week 7 Checkpoint: Do questions 2 & 3.
2. We will now explore the relationship between the expected salary of Data Analysts and Data Engineers.
a). Calculate the 95% Confidence Interval estimate of the true average expected salary for Data Analysts and Engineers. Report your results using the table below.
Confidence Interval Estimate of Average Expected Salary for Job Types
| Job Type | Lower Boundary / Limit | Upper Boundary / Limit |
| Data Analysts | | |
| Data Engineers | | |
b). Calculate the 95% Confidence Interval estimate of the true average expected salary for Data Analysts and Engineers who have the following software skills:
• Excel
• Python
• AWS
For each variable, report your results using the following format in the examples provided.
Confidence Interval Estimate of Average Expected Salary of Data Analysts requiring Excel Skills
| Excel Skills | Lower Boundary / Limit | Upper Boundary / Limit |
| 0 (No) | | |
| 1 (Yes) | | |
Confidence Interval Estimate of Average Expected Salary of Data Engineers requiring Excel Skills
| Excel Skills | Lower Boundary / Limit | Upper Boundary / Limit |
| 0 (No) | | |
| 1 (Yes) | | |
(Please use a similar format for Python and AWS)
c). Discuss your results obtained in (a) and (b). Remember to discuss answers for all tables produced.
For part (c) only, the expected length of the answer should be less than a page. (20 marks)
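The interval calculation used throughout question 2 can be sketched in Python as a cross-check. This is a minimal sketch with assumptions stated plainly: it uses a large-sample z critical value rather than a t critical value, the function name `mean_confidence_interval` is hypothetical, and the salaries are invented rather than taken from the workbook. For small samples, a t critical value with n - 1 degrees of freedom (for example from Excel's T.INV.2T) would replace `z` and give a wider interval.

```python
import math
import statistics


def mean_confidence_interval(values, confidence=0.95):
    """Large-sample (z-based) confidence interval for a population mean."""
    n = len(values)
    mean = statistics.mean(values)
    std_error = statistics.stdev(values) / math.sqrt(n)  # s / sqrt(n)
    # Two-sided critical value, about 1.96 when confidence is 0.95.
    z = statistics.NormalDist().inv_cdf(0.5 + confidence / 2)
    margin = z * std_error
    return mean - margin, mean + margin


# Invented salaries ($000 per year), NOT taken from the workbook.
sample = [70.0, 80.0, 90.0, 100.0, 110.0]
lower, upper = mean_confidence_interval(sample)
print(f"95% CI: ({lower:.2f}, {upper:.2f})")
```

For part (b), the same function would be applied separately to each subgroup, for example the salaries of one job type where the Excel indicator is 1 and then where it is 0.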
3. We wish to disentangle the relationship between expected salary and Excel skills in each job type.
Use your knowledge in Hypothesis Testing to answer the following questions.
a). Do a majority/minority of data analyst roles require Excel skills?
b). Do a majority/minority of data analyst roles require Python skills?
c). Do a majority/minority of data engineer roles require Excel skills?
d). Do a majority/minority of data engineer roles require Python skills?
Hint: For each test, state the hypotheses, p-value and conclusion in the context of the question. (6 marks)
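A majority/minority question is typically framed as a one-sample test of a proportion against 0.5. The Python sketch below shows one way the test statistic and p-value could be computed. It is an illustration under assumptions: the counts (60 of 100 adverts) are invented, the function name `proportion_z_test` is hypothetical, and the normal approximation is assumed to be adequate for the sample size.

```python
import math
from statistics import NormalDist


def proportion_z_test(successes, n, p0=0.5, alternative="greater"):
    """One-sample z-test for a proportion; returns (z statistic, p-value)."""
    p_hat = successes / n
    std_error = math.sqrt(p0 * (1 - p0) / n)  # standard error under H0: p = p0
    z = (p_hat - p0) / std_error
    cdf = NormalDist().cdf(z)
    if alternative == "greater":    # H1: p > p0 (a majority)
        p_value = 1 - cdf
    elif alternative == "less":     # H1: p < p0 (a minority)
        p_value = cdf
    else:                           # H1: p != p0 (two-sided)
        p_value = 2 * min(cdf, 1 - cdf)
    return z, p_value


# Invented counts: 60 of 100 hypothetical adverts require the skill.
z, p_value = proportion_z_test(60, 100, alternative="greater")
print(f"z = {z:.3f}, p-value = {p_value:.4f}")
```

Whether the alternative is "greater" or "less" depends on whether the sample proportion suggests a majority or a minority; the conclusion then compares the p-value with the chosen significance level.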
Week 11 Final presentation and report submission: Do questions 4 & 5.
4. Estimate a multiple regression model to analyse the relationship between:
Expected salary and all other variables, such as three software skills, the two job types (data analysts and data engineers), and the minimum salary. You are required to produce one multiple regression output.
This section includes an analysis of the statistical significance of various factors in the model. Highlight the key factors that the multiple regression reveals as being the driver of Expected Salary.
Your answer to this question should be approximately 1 to 1.5 pages. (15 marks)
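The structure of the model in question 4 can be sketched in Python as follows. This is an illustration only, under several assumptions: NumPy is available, the predictor values and coefficients are invented rather than taken from the workbook, and the data are restricted to the two job types so that a single 0/1 indicator (here called `is_engineer`, a hypothetical name) represents job type. Using one indicator rather than one per job type avoids perfect collinearity with the intercept.

```python
import numpy as np

# Invented predictor values, NOT from the workbook. Columns:
# min salary ($000), Python (0/1), AWS (0/1), Excel (0/1), is_engineer (0/1).
X = np.array([
    [50, 1, 0, 1, 0],
    [60, 0, 0, 1, 0],
    [55, 1, 1, 0, 1],
    [70, 1, 1, 0, 1],
    [65, 0, 1, 1, 1],
    [45, 0, 0, 0, 0],
    [80, 1, 0, 0, 1],
    [58, 1, 1, 1, 0],
], dtype=float)

# Expected salary generated from a known, made-up linear rule with no noise,
# so the fitted coefficients can be checked against the rule.
true_coefs = np.array([10.0, 1.2, 5.0, 8.0, -3.0, 6.0])  # intercept first
design = np.column_stack([np.ones(len(X)), X])            # add intercept column
y = design @ true_coefs

# Ordinary least squares: choose b to minimise the sum of squared residuals.
coefs, _, rank, _ = np.linalg.lstsq(design, y, rcond=None)

names = ["Intercept", "Min Salary", "Python", "AWS", "Excel", "is_engineer"]
for name, value in zip(names, coefs):
    print(f"{name:12s} {value:8.3f}")
```

This sketch returns coefficient estimates only. The p-values and other significance measures needed for the discussion come from a full regression output, such as the one produced by Excel's regression tool.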
5. Based on the statistical analysis and results in questions 1 to 4, draw conclusions on the following:
a). All factors associated with Expected Salary.
b). The importance of software skills for different job types
c). Recommendations for job seekers to improve their ability to obtain higher-paying employment.
Your answer to this question should be approximately 1 to 1.5 pages. (20 marks)
Assignment marks
The maximum total mark for the assignment is 175. Your total score will be composed of two parts:
• Final assignment report (Questions 1-5): maximum marks of 75.
• Presentation: maximum marks of 100.
(i). Week 4 checkpoint - 20 (staff: 10 & peer-to-peer evaluation: 10)
(ii). Week 7 checkpoint - 30 (staff: 15 & peer-to-peer evaluation: 15)
(iii). Week 11 checkpoint - 40 (staff: 20 & peer-to-peer evaluation: 20)
Please note that any group member who does not give feedback to other group members will be awarded zero marks.
You will be required to fill in the peer evaluation on Teammates to be eligible for this component.
Please note that the Unit Leader reserves the right to adjust individual report marks based on the peer evaluation. Should the feedback indicate that an individual did not contribute to the group assignment, that individual's report mark will be adjusted to zero, meaning that the group assignment will contribute 0% to their final grade.
Report requirements:
● All answers should be in font size 12pt and 1.5 spacing.
● Plots and tables must be legible, with appropriate labels to aid readers.
● Statistical results need to be summarised in succinct table formats.
● You will lose marks for poor presentation.
Presentation:
Use PowerPoint or a cloud-based app, e.g. Google Slides, Prezi or Visme.
Week 11 Final Assignment submission guidelines
• The link is set up using an Assignment Tool on Moodle. Please submit the group Report/Answers in Word document or PDF.
• If the question has sub-parts, for example, (a), (b) …, please indicate the labels for each part clearly.
• DO NOT click on "submit all and finish" before you finish all questions.
• ONLY 1 attempt is allowed for the Assignment. Group members should appoint one member to submit on behalf of the group.