7SSGN110 Environmental Data Analysis | Practical 2 | Introduction to R & data exploration
1. Introduction
1.1. About this practical
This practical is focused on introducing you to the basics of R, RStudio and R Markdown. The aim of the practical is to advance your learning of new technical tools (R and RStudio) for data exploration, description and data visualisation. During the session we will be investigating the differences in characteristics in annual rainfall from four locations in the UK. The aim is to determine what impact longitude and altitude have on the amount and seasonal distribution of rainfall.
The practical uses annual rainfall totals from four locations across northern England for the 50-year period from 1941 to 1990 (MetOffice, 1993). The attributes of the four sites that we will be investigating are given in Table 1, and plan and cross-sectional views of the four locations are shown in Figures 1a and 1b.
Table 1. Summary of the spatial characteristics of the four sites used in this practical
SITE | Denton | Redmires | Sheffield | Kirk Bramwith
Altitude (m) | 93 | 305 | 131 | 7
Latitude | 53.45 | 53.37 | 53.38 | 53.60
Longitude | 2.21 | 1.57 | 1.47 | 1.07
Figure 1a. The spatial location of the four climate stations used in today’s practical. Note that this map was created using Digimap (2013), which allows access to, and customization and annotation of, Ordnance Survey maps (e.g. for reports).
Figure 1b. Altitude of the four study sites. This is one method for visualizing altitude vs. distance for the four climate stations: the sites are projected onto a straight line, with relative distance along the line given as Eastings (km).
1.2. Practical structure
The practical session comprises 4 parts:
1. Data familiarisation & producing descriptive statistics using MS Excel
2. Start R and First R Analysis – if you haven’t completed this already
3. Rainfall data exploration using R
4. Additional, optional exercise – air quality data exploration
Associated with this practical are 15 questions for you to answer. These are to test your understanding of key concepts in the practical. Answers to the questions and an example script will be posted on KEATS a few days after the practical session.
1.3. Required files & saving your data
You can access the files for this practical, RainfallData.xlsx and RainfallData.csv via KEATS. You may remember that last week we put particular emphasis on establishing the correct place to save your data. Save your data to an appropriate working directory (folder) titled “EDA_Practical2”.
2. Data familiarization & descriptive statistics using Excel
Before attempting any form of analysis, it is important to know what our objective is, what data we have available to us and where these data were collected. Assuming most of you are more comfortable with Excel than R, we’ll start with some brief data exploration in Excel. Hopefully, this way you’ll be familiar with the data before we introduce R.
Calculate some summary statistics in Excel:
1. Download the worksheet ‘RainfallData.xlsx’
2. In the worksheet containing your data click on View -> Freeze Panes -> Freeze Top Row. Now you will always be able to see the column labels if you scroll down the data in this view.
3. Scroll down to the bottom of the data.
4. Calculate the mean, median, maximum, minimum, standard deviation and range for the four locations. To do this enter the following formulas in different cells, selecting the appropriate data for ‘Data range’:
• =AVERAGE(‘Data range’)
• =MEDIAN(‘Data range’)
• =MAX(‘Data range’)
• =MIN(‘Data range’)
• =STDEV(‘Data range’)
Note: you’ll need to use some combination of these formulas to calculate the range.
Using the values you have calculated answer the following questions:
• Q1: Which site has the greatest inter-annual variation in rainfall?
• Q2: Rank the locations in order of ‘wetness’
When did you last save your Excel Workbook? If you haven’t already, you should get in the habit of saving your work (in Excel, Word, etc.) frequently. Do so now in Excel (.xlsx) format.
We’ll also save the data in Comma Separated Values (.csv) format. This can be done using Save As and selecting the appropriate file format. See online help or ask the GTAs if you need further assistance. This is good practice as csv format is generally what we will use with R (as we will now see).
3. Start R and First R Analysis
The practical instructions assume you have read and followed the instructions in the StartR and First R Analysis activities online. If you have not yet worked through these activities STOP here and work through these activities before you start the rainfall data analysis.
Once you have completed Start R and First R Analysis answer the following questions:
• Q3: What command do you use to view the contents of an object?
• Q4: What is wrong with the following line of code?
TreeDiameters(mean)
• Q5: What is the difference between the source and the console pane?
4. Rainfall data analysis using R
4.1. Getting started in R
This document contains code. In places this code is annotated. You can annotate your own code by placing the ‘#’ symbol before any annotations you make. This is very useful for remembering what you have done and why!
answer<-1+2 # Sums 1 and 2
print(answer) # Prints the answer
4.1.1. Setting the working directory
Open RStudio. First we need to set the working directory to wherever we have saved the data files (so that R knows where to look for them). This can be done in one of two ways: (1) through the RStudio user interface, or (2) through code.
Setting the working directory through the user interface: The easiest way to set the working directory is through RStudio’s user interface, as shown in Figure 2.
Figure 2. Setting the working directory through the user interface
Setting the working directory through code: If you are familiar with the path to your data you can use the set working directory command, setwd(), altering the path according to your preferred working directory.
setwd("X:/My Documents/EDA_Practical2") # Sets the working directory
getwd() # Prints the working directory
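Before reading any data it can help to confirm that R is looking in the folder you expect. The following is a minimal sketch using base R functions only. It assumes RainfallData.csv has been saved in the working directory; the object name data.file is purely illustrative.

```r
# Hypothetical check that the working directory contains the practical's data
data.file <- "RainfallData.csv"         # file name taken from section 1.3

print(getwd())                          # folder R is currently working in
print(list.files(pattern = "\\.csv$"))  # any .csv files in that folder

if (file.exists(data.file)) {
  message("Found ", data.file)
} else {
  message(data.file, " not found - check the path passed to setwd()")
}
```

If the file is reported as not found, the most likely cause is a typo in the path or a file saved to a different folder.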
4.1.3. Loading packages
Packages are optional parcels of software that are downloaded and installed directly into R. There are thousands of packages, which allow us to undertake a range of different analyses. If it is the first time you are using a package you will need to install it in RStudio. Once it is installed, you will then only need to load it when you start RStudio. The StartR page gives you guidance on the ways you can do this.
The packages we will need today are:
• tidyr
• ggplot2
A good habit to get into is loading the packages before you start running any code. You can do this by using the library function:
library(tidyr)
library(ggplot2)
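If library() reports that a package cannot be found, it has probably not been installed yet. The sketch below is one possible way to check this with base R before loading; the object names pkgs, installed and missing.pkgs are illustrative, and the install line is left commented out so that nothing is downloaded by accident.

```r
# Packages needed for this practical
pkgs <- c("tidyr", "ggplot2")

# requireNamespace() returns TRUE if a package is installed (it does not attach it)
installed <- vapply(pkgs, requireNamespace, logical(1), quietly = TRUE)
missing.pkgs <- pkgs[!installed]

if (length(missing.pkgs) > 0) {
  message("Not yet installed: ", paste(missing.pkgs, collapse = ", "))
  # install.packages(missing.pkgs)  # uncomment to install (needs internet access)
} else {
  message("All packages are installed and ready to load with library()")
}
```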
4.1.4. Reading a .csv file
Once the working directory is set correctly, you can read the data into R. This can be done with the following command:
rainfall.data <- read.csv("RainfallData.csv", header = T)
Specifically, what read.csv() does is create a data frame called rainfall.data from the csv file RainfallData.csv. You can name the data frame anything you want - in R tutorials “my_data” is frequently used. However, when you are dealing with lots of data frames in one script, it is good to name them something intuitive - my_data1, my_data2, my_data3 can get a bit confusing!
The header argument tells R whether the first row of data contains the names of the columns (in this case T, which is shorthand for TRUE, indicates that it does); if you use csv files with R it is generally a good idea for the first line of the file to contain column headers.
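To see what the header argument changes, the sketch below writes a tiny CSV to a temporary file and reads it back both ways. The temporary file and the object names (tmp, with.header, no.header) are made up for illustration; the values are copied from the first two rows of the rainfall data.

```r
# Write a tiny illustrative CSV to a temporary file
tmp <- tempfile(fileext = ".csv")
writeLines(c("Year,Denton,Kirk Bramwith",
             "1941,750,532",
             "1942,909,475"), tmp)

with.header <- read.csv(tmp, header = TRUE)   # first line used as column names
no.header   <- read.csv(tmp, header = FALSE)  # first line treated as data

print(names(with.header))  # "Year" "Denton" "Kirk.Bramwith"
print(names(no.header))    # "V1" "V2" "V3"
print(nrow(with.header))   # 2 rows of observations
print(nrow(no.header))     # 3 rows, because the header line is counted as data

unlink(tmp)                # remove the temporary file
```

Note that read.csv() replaces characters that are not allowed in column names, such as spaces, with dots.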
To print all the data to screen, you could enter the following:
print(rainfall.data)
Alternatively, you can also just enter the name of the data frame.
rainfall.data
Usually we don’t want to look at the whole data frame, rather just check to see if the data have been imported correctly. We can use the ‘head’ function to view the first few lines of the data frame. In this case, ‘head’ is the function and ‘rainfall.data’ is the object.
head(rainfall.data)
## Year Denton Redmires Sheffield Kirk..Bramwith
## 1 1941 750 1139 881 532
## 2 1942 909 938 679 475
## 3 1943 960 977 660 411
## 4 1944 1103 1236 847 661
## 5 1945 944 979 682 436
## 6 1946 1091 1400 985 701
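Besides head(), a few other base R functions are useful for checking an import. So that the sketch below runs without the CSV file, it rebuilds the six rows printed above as a small data frame; the name rainfall.head is hypothetical and is not part of the practical's script.

```r
# The six rows shown by head(), rebuilt by hand for illustration
rainfall.head <- data.frame(
  Year           = 1941:1946,
  Denton         = c(750, 909, 960, 1103, 944, 1091),
  Redmires       = c(1139, 938, 977, 1236, 979, 1400),
  Sheffield      = c(881, 679, 660, 847, 682, 985),
  Kirk..Bramwith = c(532, 475, 411, 661, 436, 701)
)

print(dim(rainfall.head))      # number of rows and columns: 6 5
print(names(rainfall.head))    # column names
str(rainfall.head)             # type of each column plus the first few values
print(tail(rainfall.head, 2))  # last two rows
```

Running the same functions on the full rainfall.data object should report 50 rows, one for each year from 1941 to 1990.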
4.1.5. Dataframes, vectors & index numbers
A data frame essentially works like a table. Each column contains the values of one variable, and each row contains one observation, with a value for each column. From the output above you can see that our data frame has five columns of data: Year, Denton, Redmires, Sheffield, and Kirk Bramwith. Note how R has changed the header of this last column to remove the spaces (it now reads Kirk..Bramwith).
These columns of data are ‘vectors’ - lists of items of the same type. In this case our vectors are numeric (rainfall values and year). In other cases they could be strings (characters or classes of data) or logical values (TRUE or FALSE).
So you can think of a data frame as a list of vectors (i.e. a list of columns), each of which has a name or numerical index.
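The idea of a data frame as a list of vectors can be checked directly. The sketch below uses a small illustrative data frame called example.df (a hypothetical name); the Year and Sheffield values come from the first three rows of the rainfall data, while the Site and Above700 columns are made up to show other vector types.

```r
# A small illustrative data frame with numeric, character and logical columns
example.df <- data.frame(
  Year      = 1941:1943,
  Sheffield = c(881, 679, 660),
  Site      = c("Sheffield", "Sheffield", "Sheffield"),  # made-up character vector
  stringsAsFactors = FALSE
)
example.df$Above700 <- example.df$Sheffield > 700        # made-up logical vector

print(is.list(example.df))        # TRUE: a data frame is a list of columns
print(length(example.df))         # 4: the number of columns (vectors)
print(sapply(example.df, class))  # the type of each vector
print(example.df$Above700)        # TRUE FALSE FALSE
```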
We can access a vector in a data frame by using the ‘$’ symbol after the data frame name, followed by the name of the vector. Let’s say we want to access the rainfall data from Sheffield:
print(rainfall.data$Sheffield) # Data frame name $ vector name
## [1] 881 679 660 847 682 985 771 735 712 773 938 693 597 982 699
## [16] 900 733 963 609 1035 717 688 736 642 1037 968 802 876 947 787
## [31] 770 810 763 775 560 675 884 824 961 879 940 833 906 867 675
## [46] 998 790 916 737 764
We can also access a vector using index numbers. Every row and column in a data frame has a number assigned. These are sequential, so the first column in the data frame has an index of ‘1’ and so on. The same is true for rows. However, if your data frame has column headings (like our rainfall.data) then row number ‘1’ is the first row that contains observations, not the header row. In our case this is the row containing the 1941 rainfall data.
Index numbers are written in the format [row, column], so to find the values in row ‘1’ the format would be [1,]. To find the values in column 3, the format would be [,3].
So to access the rainfall data for Sheffield using the index numbers, the code would be:
print(rainfall.data[, 4]) # Data frame name [index number]
## [1] 881 679 660 847 682 985 771 735 712 773 938 693 597 982 699
## [16] 900 733 963 609 1035 717 688 736 642 1037 968 802 876 947 787
## [31] 770 810 763 775 560 675 884 824 961 879 940 833 906 867 675
## [46] 998 790 916 737 764
We can combine these index numbers to access specific observations in a data frame. Let’s say we wanted to access the rainfall value for Denton (column 2) in 1946 (row number 6):
print(rainfall.data[6, 2]) # Data frame name [row index, column index]
## [1] 1091
• Q6. Adapting the code above, find the rainfall value for Redmires in 1950
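Index numbers work well, but they rely on you counting rows and columns correctly. As a sketch of two alternatives, the code below retrieves the same Denton 1946 value by column name and by matching on the Year column. It uses a hand-built copy of the first six rows (the hypothetical rainfall.head) so that it runs on its own.

```r
# First six rows of the rainfall data, rebuilt by hand for illustration
rainfall.head <- data.frame(
  Year     = 1941:1946,
  Denton   = c(750, 909, 960, 1103, 944, 1091),
  Redmires = c(1139, 938, 977, 1236, 979, 1400)
)

by.position <- rainfall.head[6, 2]                                  # row 6, column 2
by.name     <- rainfall.head[6, "Denton"]                           # row 6, column called Denton
by.year     <- rainfall.head[rainfall.head$Year == 1946, "Denton"]  # row where Year is 1946

print(c(by.position, by.name, by.year))  # 1091 1091 1091
```

Matching on the year avoids having to work out which row number a given year falls on.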
Hopefully you have an intuitive understanding of this now, but it will certainly become clearer with practice. Now, back to the descriptives…
4.2. Descriptive statistics
4.2.1. The ‘summary’ function
The ‘summary’ function is a simple function for exploring a data frame. It computes summary statistics of data and model objects.
summary(rainfall.data)
## Year Denton Redmires Sheffield
## Min. :1941 Min. : 650.0 Min. : 740.0 Min. : 560.0
## 1st Qu.:1953 1st Qu.: 827.2 1st Qu.: 977.5 1st Qu.: 713.2
## Median :1966 Median : 917.5 Median :1055.0 Median : 788.5
## Mean :1966 Mean : 916.3 Mean :1077.9 Mean : 808.0
## 3rd Qu.:1978 3rd Qu.: 976.5 3rd Qu.:1190.2 3rd Qu.: 904.5
## Max. :1990 Max. :1370.0 Max. :1400.0 Max. :1037.0
## Kirk..Bramwith
## Min. :401.0
## 1st Qu.:491.5
## Median :572.0
## Mean :577.0
## 3rd Qu.:650.2
## Max. :812.0
With these simple statistics we can compare the upper and lower limits of the measurements (min and max), central tendency (mean and median) and some indicators of the dispersion of the data around the central tendency (1st and 3rd quartiles).
Check that the values you have just calculated in R match those in Excel. If they don’t you’ve gone wrong somewhere…
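When comparing with Excel it may help to know the R counterparts of the formulas used earlier. The sketch below applies them to the first six Sheffield values only (stored in a hypothetical vector called sheffield.head), so the numbers will not match the 50-year statistics in the summary output.

```r
# First six Sheffield values (1941 to 1946), copied from the head() output
sheffield.head <- c(881, 679, 660, 847, 682, 985)

print(mean(sheffield.head))    # like =AVERAGE(): 789
print(median(sheffield.head))  # like =MEDIAN(): 764.5
print(max(sheffield.head))     # like =MAX(): 985
print(min(sheffield.head))     # like =MIN(): 660

# The range is the difference between the maximum and the minimum
print(max(sheffield.head) - min(sheffield.head))  # 325
print(diff(range(sheffield.head)))                # same result using range()
```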
4.2.2. Summaries with ‘apply()’
Two useful measures of dispersion of data not provided by summary() are the standard deviation and the interquartile range (IQR). The functions to calculate standard deviation and IQR are sd() and IQR() respectively. However, these functions can only be used on vectors. In this case, we need to specify the vector we want to run the analysis on.
For example, suppose we want to calculate the standard deviation of rainfall data from Denton. We can use either the vector name or the numerical index approach described above:
sd(rainfall.data$Denton)
## [1] 132.7956
sd(rainfall.data[, 2])
## [1] 132.7956
• Q7. Calculate the IQR for the Denton rainfall data
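To see what sd() actually computes, the sketch below works through the sample standard deviation by hand (squared deviations from the mean, divided by n - 1, then square-rooted) and compares the result with sd(). It uses only the first six Denton values, held in an illustrative vector x, so the result differs from the 50-year value above.

```r
# First six Denton values (1941 to 1946), copied from the head() output
x <- c(750, 909, 960, 1103, 944, 1091)
n <- length(x)

deviations <- x - mean(x)                        # distance of each value from the mean
manual.sd  <- sqrt(sum(deviations^2) / (n - 1))  # sample standard deviation

print(manual.sd)
print(sd(x))
print(all.equal(manual.sd, sd(x)))  # TRUE: sd() uses the n - 1 formula
```

Excel's =STDEV() also divides by n - 1, which is why the R and Excel results should agree.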
Calculating statistics for each individual vector in a data frame using the approach above becomes more time-consuming when you have a larger dataset. To overcome this we can use the apply() function. apply() runs through the vectors of a data frame, applying a function to each as it goes. For example, here’s how to use apply() to calculate the standard deviation for each of the columns in our data frame.
apply(rainfall.data, 2, sd) # apply(data, margin (2 = columns), function)
## Year Denton Redmires Sheffield Kirk..Bramwith
## 14.57738 132.79560 155.58496 122.54070 107.75085
Working backwards through the arguments, the command above tells R we want to apply the sd() function (sd argument - the function name without parentheses) to the columns (2 argument) of the rainfall.data data frame (rainfall.data argument). Note that because we used the name of the whole data frame as an argument we have been given the standard deviation of all columns, including the year (but think about why the standard deviation of the year column is not really very useful). To calculate values for only some columns we can use indexing:
apply(rainfall.data[, 2:5], 2, sd) # apply(data, margin (2 = columns), function)
## Denton Redmires Sheffield Kirk..Bramwith
## 132.7956 155.5850 122.5407 107.7508
To calculate a different function on the columns, we simply change the final argument of the apply() function.
• Q8. Calculate the IQR for each variable
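As a sketch of how the final argument can be swapped, the code below uses mean instead of sd and compares apply() with two other base R routes to the same result, sapply() and colMeans(). It runs on a hand-built copy of the first six rows (the hypothetical rainfall.head), so the means differ from the 50-year values in the summary output.

```r
# First six rows of the rainfall data, rebuilt by hand for illustration
rainfall.head <- data.frame(
  Year           = 1941:1946,
  Denton         = c(750, 909, 960, 1103, 944, 1091),
  Redmires       = c(1139, 938, 977, 1236, 979, 1400),
  Sheffield      = c(881, 679, 660, 847, 682, 985),
  Kirk..Bramwith = c(532, 475, 411, 661, 436, 701)
)

means.apply  <- apply(rainfall.head[, 2:5], 2, mean)  # column-wise with apply()
means.sapply <- sapply(rainfall.head[, 2:5], mean)    # loops over the columns of a data frame
means.col    <- colMeans(rainfall.head[, 2:5])        # dedicated function for column means

print(means.apply)  # Denton 959.5, Redmires 1111.5, Sheffield 789, Kirk..Bramwith 536
print(all.equal(means.apply, means.sapply))  # TRUE
print(all.equal(means.apply, means.col))     # TRUE
```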
The apply() function can also be used across rows of a data frame by changing the second argument from 2 to 1:
apply(rainfall.data[, 2:5], 1, sd) # apply(data, margin (1 = rows), function)
## [1] 253.7748 217.0182 269.9593 257.4120 253.5592 288.7183 244.1407 249.1270
## [9] 226.8098 227.6597 234.6591 271.7799 121.3342 357.4624 208.9186 260.6524
## [17] 208.3979 235.1430 173.8246 258.7058 214.6678 207.1077 181.0359 213.1531
## [25] 233.0186 221.3271 178.6531 172.0136 167.1175 203.4296 137.5645 210.0625
## [33] 208.6097 202.9540 147.1198 157.1143 218.2583 183.0799 215.2059 243.4124
## [41] 265.9536 225.9002 220.6913 199.8639 149.3751 279.8083 178.9010 243.1726
## [49] 181.2530 247.2259
In this case we are calculating the standard deviation of the rainfall data from ALL sites in each year. Note: in the code above it is very important that we tell R to calculate values for columns 2 to 5 only. Otherwise, we would be including the Year value in the calculation, along with the rainfall values.
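The effect of accidentally including the Year column can be seen in the sketch below, which again uses a hand-built copy of the first six rows (the hypothetical rainfall.head). The object names sd.sites, sd.with.year and sd.by.year are illustrative.

```r
# First six rows of the rainfall data, rebuilt by hand for illustration
rainfall.head <- data.frame(
  Year           = 1941:1946,
  Denton         = c(750, 909, 960, 1103, 944, 1091),
  Redmires       = c(1139, 938, 977, 1236, 979, 1400),
  Sheffield      = c(881, 679, 660, 847, 682, 985),
  Kirk..Bramwith = c(532, 475, 411, 661, 436, 701)
)

sd.sites     <- apply(rainfall.head[, 2:5], 1, sd)  # rainfall columns only
sd.with.year <- apply(rainfall.head, 1, sd)         # mistakenly includes Year

# Pair each result with its year so the output is easier to read
sd.by.year <- data.frame(Year = rainfall.head$Year,
                         SD.sites = sd.sites,
                         SD.with.year = sd.with.year)
print(sd.by.year)
```

The SD.sites values for 1941 and 1942 match the first two values printed above (253.7748 and 217.0182), while the SD.with.year values are inflated because the year is treated as if it were a rainfall total.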