In this figure, do you see masses of points forming what look like vertical lines? What does it tell you about the variety of diamond offerings? What does it tell you about diamond pricing?
Name of the aesthetic | Meaning
x | X-axis position
y | Y-axis position
color | Color of dots, outlines of other shapes
fill | Fill color
size | Diameter of points, thickness of lines
alpha | Transparency: 0 = transparent, 1 = opaque
linetype | Line dash pattern
shape | Shape of the points
label | Text (words or letters) drawn in place of points
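To illustrate how aesthetics are mapped, here is a minimal sketch. It assumes the ggplot2 package is installed and uses the mpg data set that ships with ggplot2 as a stand-in; the variables displ, hwy, and class and the name fig_example are specific to this example and are not part of the assignment.

```r
library(ggplot2)

# Map engine displacement onto x, highway mileage onto y, and vehicle class onto color.
# alpha and size are set outside aes(), so they are fixed values rather than mappings.
fig_example <- ggplot(mpg, aes(x = displ, y = hwy, color = class)) +
  geom_point(alpha = 0.6, size = 2)

print(fig_example)
```

Anything placed inside aes() varies with the data; anything set outside aes() applies equally to every point.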
In the figure in Part b, we see that diamond price varies widely for a specific carat. We want to explore the factors that explain these price differences. Let’s add the variable “clarity” to the figure to see what new information it reveals. We will map clarity onto color, meaning we will use different colors to represent different levels of clarity. Fill in the command in the R file. Paste your code and figure below. Describe what new information you get from this figure in terms of diamond pricing.
Question 3 Price by cut, color, and clarity.
In the previous question, we saw that the price generally increases with the carat. Let’s see how the price distribution changes by diamond cut, color, and clarity. To make a fair comparison across diamonds, let’s only look at diamonds of exactly 1 carat (carat==1).
Note: if we don’t fix the carat, we may be comparing a D-color 0.5-carat diamond with an H-color 1.5-carat diamond. If we find that the 0.5-carat D-color diamond is cheaper, it doesn’t mean that the D-color diamond is generally cheaper than the H-color diamond; it could just be that the difference in carat is playing a big role in determining the diamond prices.
(0.1 points) Part a In this part, let’s create a new data frame containing only diamonds that are exactly 1 carat and name it “carat1”. Use the condition carat==1 to filter the data. Recall that you can use either the subset() command or %>% with filter(). Paste your R command below.
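As a reminder of the two filtering styles, here is a minimal sketch on the mpg data set from ggplot2 (a stand-in for this example, assuming dplyr and ggplot2 are installed). The condition cyl == 4 and the names four_cyl_subset and four_cyl_filter are illustrative only.

```r
library(dplyr)
library(ggplot2)  # loaded here only for the mpg example data

# Base R: subset() keeps the rows where the condition is TRUE.
four_cyl_subset <- subset(mpg, cyl == 4)

# dplyr: the same rows selected with the pipe and filter().
four_cyl_filter <- mpg %>% filter(cyl == 4)

# Both approaches should keep the same number of rows.
nrow(four_cyl_subset) == nrow(four_cyl_filter)
```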
(1.2 points) Part b. Imagine that you need to describe to the client how the prices of 1-carat diamonds vary with the diamond cut, and you want to visualize it yourself. Let’s try using the scatter plot and the boxplot. Complete the code in the R file. Paste your R code and the two different figures below.
Why do the points in the scatter plot look like vertical lines rather than like what you saw in the previous question?
What do the first and the second boxplot (from the left) represent?
Do you prefer the scatter plot or the boxplot in this case, and why?
Now, let’s use boxplots to describe how the 1-carat diamond price varies with color and clarity. Create two additional boxplots using the carat1 data. In the first boxplot, map color onto the x-axis and price onto the y-axis. In the second boxplot, map clarity onto the x-axis and price onto the y-axis.
Paste your R code and the two box plots below.
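For reference, the two geoms can be compared on a categorical x-axis with a sketch like the following. It uses the mpg data set from ggplot2 as a stand-in (assumed installed); class, hwy, fig_points, and fig_box are names chosen for this example, not the ones in the R template.

```r
library(ggplot2)

# Scatter plot with a categorical variable on the x-axis.
fig_points <- ggplot(mpg, aes(x = class, y = hwy)) +
  geom_point()

# Boxplot of the same mapping: one box per category.
fig_box <- ggplot(mpg, aes(x = class, y = hwy)) +
  geom_boxplot()

print(fig_points)
print(fig_box)
```

Only the geom changes between the two figures; the aesthetic mapping is identical.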
(0.7 points) Part c Examine the plots done in part b and answer the following questions. Imagine you are answering the following questions asked by a client who is interested in buying a 1-carat diamond.
How does the median price vary by the diamond cut? By diamond color? By diamond clarity?
How much do median prices differ between the diamonds of the VVS1 clarity grade and IF clarity grade? (You can eyeball the rough number based on the figure. The answer just needs to be correct in the ballpark.)
Are there diamonds with a “Good” cut as expensive as diamonds with an “Ideal” cut?
Within different diamonds of a particular clarity grade, does the variation in prices differ by color, and what is the pattern? (Check the interquartile range (IQR).) Based on the graph, what is the IQR for diamonds with grade IF?
(0.3 points) Part d. The boxplot does not show the mean price for each clarity grade. Fill in the R command in the R template that reports the average price of 1-carat diamonds for each clarity grade, and paste the command below.
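One common way to compute a mean per group is group_by() followed by summarise() from dplyr. The sketch below assumes dplyr and ggplot2 are installed and uses the mpg data set as a stand-in; the R template may use a different approach, and the names mean_hwy_by_class and mean_hwy are illustrative.

```r
library(dplyr)
library(ggplot2)  # loaded here only for the mpg example data

# One row per vehicle class, holding the average highway mileage of that class.
mean_hwy_by_class <- mpg %>%
  group_by(class) %>%
  summarise(mean_hwy = mean(hwy))

print(mean_hwy_by_class)
```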
(0.3 points) Part e. Plot the mean price by clarity that you calculated in part d using a bar chart. Paste your R command and graph below.
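When the bar heights have already been computed, geom_col() draws them directly. The sketch below assumes dplyr and ggplot2 are installed and again uses mpg as a stand-in; class_means and fig_means are hypothetical names for this example.

```r
library(dplyr)
library(ggplot2)

# Summarise first: one mean per category.
class_means <- mpg %>%
  group_by(class) %>%
  summarise(mean_hwy = mean(hwy))

# geom_col() uses the y values as given for the bar heights.
fig_means <- ggplot(class_means, aes(x = class, y = mean_hwy)) +
  geom_col()

print(fig_means)
```

geom_bar() with its default settings counts rows instead, which is why geom_col() is the simpler choice for pre-computed means.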
Question 4 geom_smooth()
The scatter plots illustrate the raw pattern between two variables. Based on this pattern, R can estimate/predict the relationship between the two variables by fitting a curve. In this question, we will use the fitted curve to explore the relationships between variables. The geom (the type of graph) we will use is called geom_smooth().
(0.2 points) Part a Let’s first use a subsample to see what this fitted curve looks like. Let’s use a subsample of diamonds with the cut level of “Ideal”. Fill in the command to create this subsample, and we will name the subsample “ideal”. Paste your R code below. What percentage of all diamonds are in the “ideal” subsample?
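The share of observations in a subsample is its row count divided by the row count of the full data. A minimal sketch on the mpg data set from ggplot2 (a stand-in; suv and pct_suv are names chosen for this example):

```r
library(ggplot2)  # loaded here only for the mpg example data

# Keep only the rows whose class is "suv".
suv <- subset(mpg, class == "suv")

# Percentage of all rows that fall into the subsample.
pct_suv <- 100 * nrow(suv) / nrow(mpg)
pct_suv
```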
(0.2 points) Part b Now fill in the command to create fig_q4_b. We will only use the sample “ideal” for this exercise. Map carat onto the x-axis and price onto the y-axis and plot both a scatter plot and the fitted smoothed curve. Paste your code and graph below.
Note: The grey band around the fitted line (more obvious in the upper right corner) represents the confidence interval of the fitted line (95% by default), which is built from its standard error. Recall that the standard error is small when the sample size is large and vice versa. The band is very narrow in the lower left of the figure because the number of diamonds is large in that region.
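A scatter plot and a fitted curve can be layered by adding both geoms to the same plot. The sketch below uses the mpg data set from ggplot2 as a stand-in (assumed installed); fig_smooth is a hypothetical name. The se and level arguments of geom_smooth() control whether the band is drawn and how wide it is.

```r
library(ggplot2)

# Points first, then the smoothed curve drawn on top.
# se = TRUE draws the band; level = 0.95 is the default confidence level.
fig_smooth <- ggplot(mpg, aes(x = displ, y = hwy)) +
  geom_point() +
  geom_smooth(se = TRUE, level = 0.95)

print(fig_smooth)
```

R prints a message naming the smoothing method it picked; this is informational and not an error.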
(0.7 points) Part c: Now let’s return to the full sample and explore the relationship between price and carat for different cuts. When we map a variable onto color, the observations are grouped by this variable and the groups are distinguished visually by color. In this example, let’s use color to distinguish diamonds of different cuts.
Fill in the command in the R script file, and paste your code and the graph below. (Notice that the shape of the curve for the “Ideal” cut of diamonds should be the same as the one you got in Part b.)
What does each curve represent in the pasted figure?
Compare “Fair” cut and “Ideal” cut diamonds in the figure you just plotted. How do their patterns differ?
· First discuss how prices differ between two cuts of the diamonds with the same carat.
· Then discuss how price changes with carat for “Fair” and “Ideal” cut diamonds respectively.
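To see how a color mapping splits the fitted curve into groups, consider this sketch on the mpg data set from ggplot2 (a stand-in, assumed installed). The variable drv and the name fig_by_group are specific to this example.

```r
library(ggplot2)

# Mapping drv onto color groups the data, so geom_smooth() fits
# one curve per drive type. se = FALSE hides the bands to reduce clutter.
fig_by_group <- ggplot(mpg, aes(x = displ, y = hwy, color = drv)) +
  geom_point(alpha = 0.5) +
  geom_smooth(se = FALSE)

print(fig_by_group)
```

Because the color mapping sits in ggplot(), it applies to both layers, so the points and the curves share the same colors.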
(0.4 points) Part d The inverted-U fitted curve for the “Ideal” cut diamonds suggests that for larger diamonds, price actually declines with carat. Let’s explore the driving force behind it.
In the R code file, follow the instructions to (1) use the ideal cut of diamonds and (2) graph the scatter plot between price and carat and map clarity onto the color. Paste your code and graph below. With what you see in the figure, explain what is driving the inverted-U for the ideal cut diamond.