R Reference Guide

Loading Data and Packages

The functions library() and require() both load packages that are already installed. They differ in how they fail: library() stops with an error if the package is missing, while require() returns a logical value (TRUE if the package loaded, FALSE with a warning if it did not). That return value can be handy when sharing code, because a script can check it and give a clearer message to a user who doesn’t have the package installed.
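A minimal sketch of checking the value returned by require(). The package name is made up for illustration (an assumption, chosen so that the load fails); swap in a real package name in practice.

```r
# "notARealPackage123" is a hypothetical name, used only to trigger the failure path
pkg <- "notARealPackage123"

# character.only = TRUE lets require() take the package name as a string;
# suppressWarnings() hides the "there is no package called ..." warning
loaded <- suppressWarnings(require(pkg, character.only = TRUE, quietly = TRUE))

# require() returned FALSE instead of stopping, so the script can respond
if (!loaded) {
  message("Package '", pkg, "' is not installed. Try install.packages(\"", pkg, "\")")
}
```

Calling library() on the same name would instead stop the script with an error.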

#Use library to load penguin data

library(palmerpenguins)

#Use require for the here package: here() builds file paths relative to the root of the RStudio project. To use, list each subfolder and the file name in quotes:

require(here)
## Loading required package: here
## here() starts at /Users/srdee/Documents/GitHub/srdee.github.io/environmental_data
#read.csv() reads in a csv file and stores it as a dataframe. 

ginkgos = read.csv(here("data","ginkgo_data_2022.csv"))
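To try read.csv() without the ginkgo file, here is a small round trip through a temporary file. The data frame and its column names are made up for illustration and are not the real ginkgo data.

```r
# Hypothetical example data, not the ginkgo file
example_df <- data.frame(site_id = c(1, 2, 3), max_width = c(85, 72, 64))

# Write it to a temporary csv file, then read it back in
tmp_path <- tempfile(fileext = ".csv")
write.csv(example_df, tmp_path, row.names = FALSE)
read_back <- read.csv(tmp_path)

# str() shows the column names and the types that read.csv() inferred
str(read_back)
```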

Data Structures

The function c() combines or concatenates its arguments into a vector (a 1-dimensional data structure consisting of 1 or more elements).

All of the elements of a vector must be of the same type. If character and numeric values are mixed in the same call to c(), R doesn’t keep both types; it coerces every element to a common type (here, character).
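A quick sketch of what happens when types are mixed in a single call to c(). The values are arbitrary and only there to show the conversion.

```r
# Mixing numeric, character, and logical values: everything becomes character
mixed <- c(1, "fish", TRUE)
mixed
class(mixed)

# Mixing numeric and logical values: the logicals become 1 and 0
num_logical <- c(1, TRUE, FALSE)
num_logical
class(num_logical)
```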

Here are two examples using numeric and character data types:

## Create a vector of numbers:
num_vec  = c(1, 4, 8, 9, 13)

## Create a vector of characters:
char_vec = c("a", "fish", "data is cool","cats","trees")

## Typing the name of the vector into the console prints the contents
num_vec
## [1]  1  4  8  9 13
## The print() function accomplishes the same task:
print(char_vec)
## [1] "a"            "fish"         "data is cool" "cats"         "trees"
#length() gets the length of any R object for which it's been defined, returning an integer of length 1:

length(char_vec)
## [1] 5

matrix() creates a matrix in R, or can be used to transform data into a matrix. It takes a number of arguments, including data, nrow (number of rows), and ncol (number of columns). Here’s how to turn the num_vec defined above into a matrix:

#Turn the numerical vector above into a 5x1 matrix:

matrix(num_vec, nrow=5, ncol=1)
##      [,1]
## [1,]    1
## [2,]    4
## [3,]    8
## [4,]    9
## [5,]   13
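With more than one column, the order in which matrix() fills the cells matters. By default it fills down the columns; the byrow argument switches that. A small sketch with the integers 1 to 6 (chosen only for illustration):

```r
# Default: values fill column by column
by_col <- matrix(1:6, nrow = 2, ncol = 3)
by_col

# byrow = TRUE: values fill row by row
by_row <- matrix(1:6, nrow = 2, ncol = 3, byrow = TRUE)
by_row

# The first row differs between the two
by_col[1, ]
by_row[1, ]
```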

A data frame is a two-dimensional data structure in R made up of named columns and uniquely named rows. Each column holds a single type of data (for example numeric, factor, or character), but different columns can hold different types. It is one of the most commonly used data structures in R. All columns must contain the same number of elements. Below is an example that combines num_vec and char_vec into a data frame:

#The data frame below combines char_vec and num_vec. Note that the names of the vectors become the column names by default, and the row names are a numerical ordering of the elements: 

data.frame(char_vec, num_vec)
##       char_vec num_vec
## 1            a       1
## 2         fish       4
## 3 data is cool       8
## 4         cats       9
## 5        trees      13
#Let's switch to the ginkgos data frame to explore more data.frame properties

#nrow() gives the number of rows, as an integer:

nrow(ginkgos)
## [1] 220
#ncol() gives the number of columns, as an integer:

ncol(ginkgos)
## [1] 6
#dim() gives the dimensions of a data frame or matrix, returned as an integer vector of rows then columns (for a plain vector, dim() returns NULL):
dim(ginkgos)
## [1] 220   6
d <- dim(ginkgos)

#To access just number of rows:
d[1]
## [1] 220
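The same functions can be tried without the ginkgo file. The small data frame below is made up for illustration; its name and columns are assumptions, not part of the ginkgo data.

```r
# Hypothetical data: each column has its own type, all columns have 3 elements
leaves <- data.frame(
  leaf_id   = c("a", "b", "c"),
  width_mm  = c(85, 72, 64),
  has_seeds = c(FALSE, TRUE, FALSE),
  stringsAsFactors = FALSE
)

nrow(leaves)
ncol(leaves)
dim(leaves)

# The type stored in each column
col_types <- sapply(leaves, class)
col_types

# A plain vector has no dim attribute, so dim() returns NULL
vec_dim <- dim(c(1, 4, 8))
```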

Subsetting

There are a variety of ways to subset data in R:

#Subset the ginkgo data to just access the max_depth column, store as variable to not take up whole window:

md <- ginkgos$max_depth

#Show the head of this:
head(md)
## [1] 59 54 42 48 50 46
#Subset just first row of ginkgo data using []:

ginkgos[1, ]
##   site_id seeds_present max_width max_depth notch_depth petiole_length
## 1     216         FALSE        85        59          12             91
#Show data in row 2, column 3

ginkgos[2,3]
## [1] 72
#Select data in 3rd column of ginkgo data (leaving the row position blank returns all rows in the 3rd column):

ginkgos[ ,3]
##   [1]  85  72  64  76  85  75  72  81  87  87  94  92  70  84  61 102  82  64
##  [19]  53  66  69  55  56  64  60  59  57  53  68  56  75  63  82  83  60  67
##  [37]  75  92 110  54  77  59  51 100  92  77  99  62  80  86  63  79  91  72
##  [55]  64  53  71  61  70  93  75  87  86  87  95  88  85  95  95  52  84  74
##  [73]  87  76  57  92  94  79  97  50 100  97  63  96  97  70  83  80  85 100
##  [91] 112 115  83 106  95  80 120  76  50  90  94  69  93  59  60 103  95  90
## [109]  80  90 106  98  73  89  65  83 116  90  86 104  92 103 100  91 102  99
## [127]  98 104  80  68  62  77  86  72  70  63  72  65  69  67  35  67  63  71
## [145]  77  60  70  69  55  72  84 106  61 114  95  86  60  78  86  91  79  95
## [163]  72 102  72 107  95  91  94  90  98  63  58  67  92  75  74  73  73  94
## [181]  76  86  78  74  89  77  80  71  69  81  39  60  50  47  52  61  56  45
## [199]  64  42 103  84 118 120 100 111 104  94  78  70  92  48  79  91 111  44
## [217] 117  93 101 145
#Use subset() to retrieve data for only Adelie penguins

adelie <- subset(penguins, species =="Adelie")
head(adelie)
## # A tibble: 6 × 8
##   species island bill_length_mm bill_depth_mm flipper_length_… body_mass_g sex  
##   <fct>   <fct>           <dbl>         <dbl>            <int>       <int> <fct>
## 1 Adelie  Torge…           39.1          18.7              181        3750 male 
## 2 Adelie  Torge…           39.5          17.4              186        3800 fema…
## 3 Adelie  Torge…           40.3          18                195        3250 fema…
## 4 Adelie  Torge…           NA            NA                 NA          NA <NA> 
## 5 Adelie  Torge…           36.7          19.3              193        3450 fema…
## 6 Adelie  Torge…           39.3          20.6              190        3650 male 
## # … with 1 more variable: year <int>
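Rows can also be selected with a logical condition inside [], which does the same job as subset(). A sketch on a small made-up data frame (the names and values are hypothetical):

```r
# Hypothetical data for illustration
leaves <- data.frame(
  width_mm  = c(85, 72, 64, 91),
  has_seeds = c(FALSE, TRUE, FALSE, TRUE)
)

# Keep all columns for rows where width is over 80
wide <- leaves[leaves$width_mm > 80, ]

# Same rows, but only the width column (returned as a plain vector)
wide_widths <- leaves[leaves$width_mm > 80, "width_mm"]

# subset() with two conditions and a column selection (returns a data frame)
wide_with_seeds <- subset(leaves, width_mm > 80 & has_seeds, select = width_mm)
```

Inside subset() the column names can be used directly, while inside [] they need the leaves$ prefix.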

Numerical Data Exploration

R provides a number of functions for quickly retrieving basic summary statistics from data.

#summary() shows a range of statistics for each column of a dataframe: counts for factor columns, and the min, quartiles, median, mean, max, and number of NAs for numeric columns:

summary(penguins)
##       species          island    bill_length_mm  bill_depth_mm  
##  Adelie   :152   Biscoe   :168   Min.   :32.10   Min.   :13.10  
##  Chinstrap: 68   Dream    :124   1st Qu.:39.23   1st Qu.:15.60  
##  Gentoo   :124   Torgersen: 52   Median :44.45   Median :17.30  
##                                  Mean   :43.92   Mean   :17.15  
##                                  3rd Qu.:48.50   3rd Qu.:18.70  
##                                  Max.   :59.60   Max.   :21.50  
##                                  NA's   :2       NA's   :2      
##  flipper_length_mm  body_mass_g       sex           year     
##  Min.   :172.0     Min.   :2700   female:165   Min.   :2007  
##  1st Qu.:190.0     1st Qu.:3550   male  :168   1st Qu.:2007  
##  Median :197.0     Median :4050   NA's  : 11   Median :2008  
##  Mean   :200.9     Mean   :4202                Mean   :2008  
##  3rd Qu.:213.0     3rd Qu.:4750                3rd Qu.:2009  
##  Max.   :231.0     Max.   :6300                Max.   :2009  
##  NA's   :2         NA's   :2
#mean() can be used to calculate the mean of a particular dataframe column. na.rm=TRUE drops missing values (NA) before calculating; without it, any NA in the column makes the result NA

mean(penguins$bill_depth_mm, na.rm=TRUE)
## [1] 17.15117
#sd() is used to compute the standard deviation, or the amount of dispersion of the data. Lower standard deviation means values are clustered closer to the mean. 

sd(penguins$bill_depth_mm, na.rm=TRUE)
## [1] 1.974793
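A short sketch of why na.rm matters, using a made-up vector with one missing value:

```r
# Hypothetical measurements with one missing value
depths <- c(18.7, 17.4, NA, 19.3)

# Without na.rm, the NA propagates to the result
mean_with_na <- mean(depths)
mean_with_na

# With na.rm = TRUE, the NA is dropped first
mean_no_na <- mean(depths, na.rm = TRUE)
sd_no_na <- sd(depths, na.rm = TRUE)

# Count how many values were missing
n_missing <- sum(is.na(depths))
```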

Graphical Data Exploration

Base R provides a number of ways to make graphics, including scatterplots, histograms, and boxplots.

require(here)
ginkgos = read.csv(here("data","ginkgo_data_2022.csv"))
library(palmerpenguins)

#Create a scatterplot: pch sets the point symbol and cex scales the size of the points (in this case to 75% of default); axis labels are scaled separately with cex.lab and cex.axis

plot(x=ginkgos$max_depth, y=ginkgos$max_width, col="red",pch=9, cex=.75, main="Ginkgo Leaf Depth By Width", xlab="Max Depth (mm)", ylab="Max Width (mm)", xlim=c(19,95), ylim=c(19,130))

#Create a histogram:

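#breaks controls the bins in which the data is presented; a single number is treated as a suggestion, so R may choose a slightly different number of bins to get round break points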

hist(penguins$flipper_length_mm, xlab="Flipper Length (mm)", main="Histogram of Penguin Flipper Lengths", breaks=6)
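A single number passed to breaks is only a suggestion to hist(), so the bin count can differ from what was asked for. A sketch on made-up data (the integers 1 to 100), using plot = FALSE so that hist() returns the bin information without drawing:

```r
# Hypothetical data for illustration
x <- 1:100

# A single number: R picks round break points near this many bins
h_suggested <- hist(x, breaks = 6, plot = FALSE)
h_suggested$breaks
length(h_suggested$counts)

# A vector of break points: R uses exactly these bin edges
h_exact <- hist(x, breaks = seq(0, 100, by = 10), plot = FALSE)
length(h_exact$counts)
```

Passing explicit break points is the more dependable choice when the exact bins matter.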

#Create boxplots using the ginkgo data:

boxplot(ginkgos$petiole_length, main="Ginkgo Petiole Length", ylab="Petiole Length (mm)")

#It is also possible to condition boxplots based on other data:

boxplot(data=ginkgos, max_depth ~ seeds_present, ylab="Max Depth (mm)", xlab="Seeds Present", main="Ginkgo Leaf Depth from Trees \n with and without Seeds Found", cex=.5, ylim=c(20,100))

#You can arrange multiple plots on a single page in R using the par() function to set rows and columns for display:

par(mfrow=c(2,2))
hist(penguins$flipper_length_mm, xlab="Flipper Length (mm)", main="Penguin Flipper Length", breaks=6)
hist(penguins$bill_depth_mm, xlab="Bill Depth (mm)", main="Penguin Bill Depth", breaks=6)
hist(penguins$bill_length_mm, xlab="Bill Length (mm)", main="Penguin Bill Length", breaks=6)
hist(penguins$body_mass_g, xlab="Body Mass (g)", main="Penguin Body Mass", breaks=6)
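A layout set with par(mfrow=...) stays in effect for later plots on the same device. One common pattern, sketched here with made-up values, is to save the old settings and restore them afterwards. The pdf(NULL) call is an assumption made so the sketch runs without a display; leave it and dev.off() out when plotting interactively.

```r
# Open a null graphics device so nothing is written to disk
pdf(NULL)

# par() returns the previous settings, so they can be saved while setting new ones
old_par <- par(mfrow = c(1, 2))

values <- c(1, 2, 2, 3, 3, 3)  # hypothetical data
hist(values, main = "Left panel", xlab = "Value")
boxplot(values, main = "Right panel")

# Restore the saved layout and check it
par(old_par)
layout_after <- par("mfrow")

invisible(dev.off())
```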

Distribution Functions

dnorm() and pnorm() operate on normal distributions. Here I’m using them on the penguin data without first checking for normality; with ‘real’ data, it would be important to first confirm a normal distribution.

#calculate mean and standard deviation for sample data (penguin bill depth)

m = mean(na.omit(penguins$bill_depth_mm))
sd = sd(na.omit(penguins$bill_depth_mm))

#dnorm() calculates the probability density at a single value, where the first argument is the value, the second the mean, and the third the standard deviation

dnorm(14.5, mean=m, sd=sd)
## [1] 0.08203888
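#this is a density, not a probability: for a continuous variable the chance of any single exact value is zero. A density of ~0.082 at 14.5mm indicates how relatively likely values near 14.5mm are, assuming the data is normally distributed (this data may not be)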

#pnorm() calculates the cumulative probability of observing a value less than or equal to the first argument

pnorm(14.5, mean=m, sd=sd)
## [1] 0.08971616
#this means we have a ~9.0% chance of observing a bill depth of 14.5mm or less, assuming a normal distribution
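One caution when reading dnorm() output: for a continuous distribution the value is a density rather than a probability, and it can even be greater than 1. Probabilities come from pnorm(), for example as the difference between two cumulative values. A sketch using standard normal values chosen only for illustration:

```r
# A narrow normal distribution has a density above 1 at its mean
narrow_density <- dnorm(0, mean = 0, sd = 0.1)
narrow_density

# Probability of falling within one standard deviation of the mean,
# computed as the difference of two cumulative probabilities
within_one_sd <- pnorm(1, mean = 0, sd = 1) - pnorm(-1, mean = 0, sd = 1)
within_one_sd
```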

Binomial distribution functions include dbinom(), pbinom(), and qbinom(). The binomial is a discrete distribution, so dbinom() returns actual probabilities rather than densities:

#Calculate probability of exactly 4 successes during 25 trials with the probability of success on each trial being .5:

dbinom(4,size = 25,prob = .5)
## [1] 0.0003769994
#Calculate the probability of 4 successes or fewer during 25 trials with the probability of success on each trial being .5:

pbinom(4, size = 25, prob = .5) 
## [1] 0.0004552603
#Calculate probability of greater than 4 successes with all parameters being equal to those above:

1- pbinom(4, size = 25, prob = .5) 
## [1] 0.9995447
#Calculate the 40th percentile (the first argument, .4, is the cumulative probability) of a binomial distribution with 25 trials and a success probability of .5:

qbinom(.4, size=25, prob=.5)
## [1] 12
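The three binomial functions are linked, which a short check can show. This sketch reuses the same size and prob values as the examples above; the variable names are mine.

```r
# pbinom() is the running total of dbinom(): P(X <= 4) = P(0) + ... + P(4)
cum_manual  <- sum(dbinom(0:4, size = 25, prob = .5))
cum_builtin <- pbinom(4, size = 25, prob = .5)
all.equal(cum_manual, cum_builtin)

# lower.tail = FALSE gives P(X > 4) directly, instead of 1 - pbinom(...)
upper_tail <- pbinom(4, size = 25, prob = .5, lower.tail = FALSE)

# qbinom() returns the smallest count whose cumulative probability reaches .4
q40 <- qbinom(.4, size = 25, prob = .5)
pbinom(q40, size = 25, prob = .5)      # at least .4
pbinom(q40 - 1, size = 25, prob = .5)  # below .4
```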