The functions library() and require() can both be used to load packages that are already installed. However, require() returns a boolean (TRUE if the package loaded, FALSE with a warning if it did not) instead of stopping with an error. This can be handy when sharing code: if the user doesn’t already have the package installed, the script can check the return value and respond with a clearer error message or fall back to other code.
#Use library to load penguin data
library(palmerpenguins)
#Use require for the here package: here() builds file paths relative to the RProject root directory. To use, list subfolders and the file name in quotes:
require(here)
## Loading required package: here
## here() starts at /Users/srdee/Documents/GitHub/srdee.github.io/environmental_data
#read.csv() reads in a csv file and stores it as a dataframe.
ginkgos = read.csv(here("data","ginkgo_data_2022.csv"))
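Because require() returns FALSE rather than stopping, its result can be tested directly. The sketch below wraps that check in a small helper; load_or_stop is a hypothetical name used only for illustration, not a function from base R or any package.

```r
# load_or_stop() is a hypothetical helper, not part of base R.
load_or_stop <- function(pkg) {
  # character.only = TRUE lets us pass the package name as a string
  ok <- suppressWarnings(require(pkg, character.only = TRUE, quietly = TRUE))
  if (!ok) {
    stop("Package '", pkg, "' is not installed. ",
         "Try install.packages(\"", pkg, "\") first.", call. = FALSE)
  }
  invisible(TRUE)
}

# stats ships with R, so this call succeeds silently:
load_or_stop("stats")
```

Calling the helper with a package that is not installed stops with the custom message instead of R's default one.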
The function c() combines or concatenates its arguments into a vector (a 1-dimensional data structure consisting of 1 or more elements).
All of the elements of a vector share the same type. If character and numeric values are mixed in the same call to c(), R does not fail; it coerces everything to a common type (here, character).
Here are two examples using numeric and character data types:
## Create a vector of numbers:
num_vec = c(1, 4, 8, 9, 13)
## Create a vector of characters:
char_vec = c("a", "fish", "data is cool","cats","trees")
## Typing the name of the vector into the console prints the contents
num_vec
## [1] 1 4 8 9 13
## The print() function accomplishes the same task:
print(char_vec)
## [1] "a" "fish" "data is cool" "cats" "trees"
#length() gets the length of any R object for which it's been defined, returning an integer of length 1:
length(char_vec)
## [1] 5
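As a quick check of what c() does when its arguments differ in type, the short sketch below mixes types on purpose. The variable name mixed is made up for this illustration.

```r
# Numeric, character, and logical values in one call:
mixed <- c(1, "fish", TRUE)
class(mixed)
## [1] "character"
mixed
## [1] "1"    "fish" "TRUE"
# Numeric and logical values are coerced to numeric instead:
c(1.5, TRUE)
## [1] 1.5 1.0
```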
matrix() creates a matrix in R, or can be used to transform data into a matrix. It takes a number of arguments, including data, nrow (number of rows), and ncol (number of columns). Here’s how to turn the num_vec defined above into a matrix:
#Turn the numerical vector above into a 5x1 matrix:
matrix(num_vec, nrow=5, ncol=1)
## [,1]
## [1,] 1
## [2,] 4
## [3,] 8
## [4,] 9
## [5,] 13
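With more than one column, the order in which matrix() places the data starts to matter. The sketch below uses the byrow argument of matrix() to show both fill orders; by_row is a name chosen only for this example.

```r
# By default matrix() fills column by column:
matrix(1:6, nrow = 2, ncol = 3)
##      [,1] [,2] [,3]
## [1,]    1    3    5
## [2,]    2    4    6
# byrow = TRUE fills row by row instead:
by_row <- matrix(1:6, nrow = 2, ncol = 3, byrow = TRUE)
by_row
##      [,1] [,2] [,3]
## [1,]    1    2    3
## [2,]    4    5    6
```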
A data frame is a data structure in R with column names and unique row names, in which each column can hold a different type of data, such as numeric, factor, or character. It is the fundamental data structure used by most of R’s modeling functions. Every column must contain the same number of elements. Below is an example that combines num_vec and char_vec into a data frame:
#The data frame below combines the char_vec and num_vec into a data frame. Note that the names of the lists/vectors become the column names by default, and row names are a numerical ordering of the elements:
data.frame(char_vec, num_vec)
## char_vec num_vec
## 1 a 1
## 2 fish 4
## 3 data is cool 8
## 4 cats 9
## 5 trees 13
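Column names can also be given explicitly, and data.frame() refuses columns whose lengths cannot be matched up. A minimal sketch, where df, label, and value are names invented for the example:

```r
# Name the columns explicitly instead of relying on the variable names
df <- data.frame(label = c("a", "fish", "cats"), value = c(1, 4, 9))
names(df)
## [1] "label" "value"
# Columns of length 3 and 2 cannot be lined up, so data.frame() raises an error;
# try() captures it so the script keeps running
res <- try(data.frame(x = 1:3, y = 1:2), silent = TRUE)
inherits(res, "try-error")
## [1] TRUE
```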
#Let's switch to the ginkgos data frame to explore more data.frame properties
#nrow() gives the number of rows, as an integer:
nrow(ginkgos)
## [1] 220
#ncol() gives the number of columns, as an integer:
ncol(ginkgos)
## [1] 6
#dim() gives the dimensions of a data.frame, matrix, or array, returned as an integer vector of rows then columns (for a plain vector it returns NULL):
dim(ginkgos)
## [1] 220 6
d <- dim(ginkgos)
#To access just number of rows:
d[1]
## [1] 220
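The same functions behave differently on objects without a dimension attribute. The sketch below compares a matrix with a plain vector; m and v are throwaway names for the illustration.

```r
m <- matrix(1:6, nrow = 2)
dim(m)
## [1] 2 3
# A plain vector has no dimensions, so dim() returns NULL; use length() instead
v <- c(1, 4, 8)
dim(v)
## NULL
length(v)
## [1] 3
```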
There are a variety of ways to subset data in R:
#Subset the ginkgo data to just access the max_depth column, store as variable to not take up whole window:
md <- ginkgos$max_depth
#Show the head of this:
head(md)
## [1] 59 54 42 48 50 46
#Subset just first row of ginkgo data using []:
ginkgos[1, ]
## site_id seeds_present max_width max_depth notch_depth petiole_length
## 1 216 FALSE 85 59 12 91
#Show data in row 2, column 3
ginkgos[2,3]
## [1] 72
#Select data in 3rd column of ginkgo data (leaving row space blank allows all rows in the 3rd column):
ginkgos[ ,3]
## [1] 85 72 64 76 85 75 72 81 87 87 94 92 70 84 61 102 82 64
## [19] 53 66 69 55 56 64 60 59 57 53 68 56 75 63 82 83 60 67
## [37] 75 92 110 54 77 59 51 100 92 77 99 62 80 86 63 79 91 72
## [55] 64 53 71 61 70 93 75 87 86 87 95 88 85 95 95 52 84 74
## [73] 87 76 57 92 94 79 97 50 100 97 63 96 97 70 83 80 85 100
## [91] 112 115 83 106 95 80 120 76 50 90 94 69 93 59 60 103 95 90
## [109] 80 90 106 98 73 89 65 83 116 90 86 104 92 103 100 91 102 99
## [127] 98 104 80 68 62 77 86 72 70 63 72 65 69 67 35 67 63 71
## [145] 77 60 70 69 55 72 84 106 61 114 95 86 60 78 86 91 79 95
## [163] 72 102 72 107 95 91 94 90 98 63 58 67 92 75 74 73 73 94
## [181] 76 86 78 74 89 77 80 71 69 81 39 60 50 47 52 61 56 45
## [199] 64 42 103 84 118 120 100 111 104 94 78 70 92 48 79 91 111 44
## [217] 117 93 101 145
#Use subset() to retrieve data for only Adelie penguins
adelie <- subset(penguins, species =="Adelie")
head(adelie)
## # A tibble: 6 × 8
## species island bill_length_mm bill_depth_mm flipper_length_… body_mass_g sex
## <fct> <fct> <dbl> <dbl> <int> <int> <fct>
## 1 Adelie Torge… 39.1 18.7 181 3750 male
## 2 Adelie Torge… 39.5 17.4 186 3800 fema…
## 3 Adelie Torge… 40.3 18 195 3250 fema…
## 4 Adelie Torge… NA NA NA NA <NA>
## 5 Adelie Torge… 36.7 19.3 193 3450 fema…
## 6 Adelie Torge… 39.3 20.6 190 3650 male
## # … with 1 more variable: year <int>
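subset() and logical indexing with [] usually select the same rows, but they differ when the condition evaluates to NA. The sketch below uses a made-up data frame called toy (not one of the datasets above) to show the difference.

```r
# toy is a made-up data frame used only for this illustration
toy <- data.frame(species = c("A", "B", NA, "A"), mass = c(10, 20, 30, 40))
# subset() silently drops rows where the condition is NA:
nrow(subset(toy, species == "A"))
## [1] 2
# Logical indexing keeps those rows, filled with NA:
nrow(toy[toy$species == "A", ])
## [1] 3
# Wrapping the condition in which() gives the same rows as subset():
nrow(toy[which(toy$species == "A"), ])
## [1] 2
```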
There are a number of methods in R that allow the user to quickly retrieve basic statistics concerning the data.
#Summary shows a range of statistics including counts, means, mins, maxes, and medians for each column of a dataframe:
summary(penguins)
## species island bill_length_mm bill_depth_mm
## Adelie :152 Biscoe :168 Min. :32.10 Min. :13.10
## Chinstrap: 68 Dream :124 1st Qu.:39.23 1st Qu.:15.60
## Gentoo :124 Torgersen: 52 Median :44.45 Median :17.30
## Mean :43.92 Mean :17.15
## 3rd Qu.:48.50 3rd Qu.:18.70
## Max. :59.60 Max. :21.50
## NA's :2 NA's :2
## flipper_length_mm body_mass_g sex year
## Min. :172.0 Min. :2700 female:165 Min. :2007
## 1st Qu.:190.0 1st Qu.:3550 male :168 1st Qu.:2007
## Median :197.0 Median :4050 NA's : 11 Median :2008
## Mean :200.9 Mean :4202 Mean :2008
## 3rd Qu.:213.0 3rd Qu.:4750 3rd Qu.:2009
## Max. :231.0 Max. :6300 Max. :2009
## NA's :2 NA's :2
#mean() can be used to calculate the mean of a particular dataframe column. na.rm=TRUE removes missing values (NA) before the calculation:
mean(penguins$bill_depth_mm, na.rm=TRUE)
## [1] 17.15117
#sd() is used to compute the standard deviation, or the amount of dispersion of the data. Lower standard deviation means values are clustered closer to the mean.
sd(penguins$bill_depth_mm, na.rm=TRUE)
## [1] 1.974793
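To see what na.rm changes, and what sd() computes, it helps to use a vector small enough to check by hand. The values in x below are invented for the illustration.

```r
x <- c(2, 4, NA, 6)
# Without na.rm, a single NA makes the result NA:
mean(x)
## [1] NA
mean(x, na.rm = TRUE)
## [1] 4
sd(x, na.rm = TRUE)
## [1] 2
# sd() uses the sample formula with n - 1 in the denominator:
vals <- x[!is.na(x)]
sqrt(sum((vals - mean(vals))^2) / (length(vals) - 1))
## [1] 2
```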
Base R provides a number of ways to make graphics, including scatterplots, histograms, and boxplots.
require(here)
ginkgos = read.csv(here("data","ginkgo_data_2022.csv"))
library(palmerpenguins)
#Create a scatterplot: pch sets the point symbol and cex scales the size of the points (in this case to 75% of the default)
plot(x=ginkgos$max_depth, y=ginkgos$max_width, col="red",pch=9, cex=.75, main="Ginkgo Leaf Depth By Width", xlab="Max Depth (mm)", ylab="Max Width (mm)", xlim=c(19,95), ylim=c(19,130))
#Create a histogram:
#breaks suggests the number of bins; R treats a single number as a suggestion and picks nearby "pretty" break points
hist(penguins$flipper_length_mm, xlab="Flipper Length (mm)", main="Histogram of Penguin Flipper Lengths", breaks=6)
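The breaks argument can be inspected without drawing anything by setting plot = FALSE. A small sketch on made-up data (x, h1, and h2 are illustrative names):

```r
x <- 1:100
# A single number is only a suggestion; inspect what R actually chose:
h1 <- hist(x, breaks = 6, plot = FALSE)
h1$breaks
# A vector of break points is used exactly as given (10 bins here):
h2 <- hist(x, breaks = seq(0, 100, by = 10), plot = FALSE)
length(h2$counts)
## [1] 10
```

Passing a vector of break points is the more predictable choice when a specific number of bins is needed.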
#Create boxplots using the ginkgo data:
boxplot(ginkgos$petiole_length, main="Ginkgo Petiole Length", ylab="Petiole Length (mm)")
#It is also possible to condition boxplots based on other data:
boxplot(data=ginkgos, max_depth ~ seeds_present, ylab="Max Depth (mm)", xlab="Seeds Present", main="Ginkgo Leaf Depth from Trees \n with and without Seeds Found", cex=.5, ylim=c(20,100))
#You can arrange multiple plots on a single page in R using the par() function to set rows and columns for display:
par(mfrow=c(2,2))
hist(penguins$flipper_length_mm, xlab="Flipper Length (mm)", main="Penguin Flipper Length", breaks=6)
hist(penguins$bill_depth_mm, xlab="Bill Depth (mm)", main="Penguin Bill Depth", breaks=6)
hist(penguins$bill_length_mm, xlab="Bill Length (mm)", main="Penguin Bill Length", breaks=6)
hist(penguins$body_mass_g, xlab="Body Mass (g)", main="Penguin Body Mass", breaks=6)
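par() settings stay in effect for later plots, so it is often worth restoring them afterwards. The sketch below saves the old settings, draws two panels on made-up data to contrast cex (point size) with cex.lab (axis label size), and then restores the defaults. The name op is just a convention, not a requirement.

```r
# par() returns the previous settings, so they can be restored later
op <- par(mfrow = c(1, 2))
x <- 1:10
# cex scales the plotting symbols
plot(x, x^2, cex = 2, main = "cex = 2: larger points")
# cex.lab scales the axis labels
plot(x, x^2, cex.lab = 1.5, main = "cex.lab = 1.5: larger axis labels")
# Return to a single plot per page
par(op)
```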
dnorm() and pnorm() operate on normal distributions. Here I’m using them on the penguin data without first checking for normality; with ‘real’ data, it would be important to first confirm a normal distribution.
#calculate mean and standard deviation for sample data (penguin bill depth)
m = mean(na.omit(penguins$bill_depth_mm))
#store the standard deviation as s so the variable does not share a name with the sd() function
s = sd(na.omit(penguins$bill_depth_mm))
#dnorm() calculates the probability density at a single value, where the first argument is the value, the second the mean, and the third the standard deviation
dnorm(14.5, mean=m, sd=s)
## [1] 0.08203888
#this is the height of the normal curve at 14.5mm (about 0.082 per mm), not a probability: for a continuous variable the chance of observing exactly 14.5mm is zero, and probabilities come from areas under the curve
#pnorm() calculates the probability of observing a given value or less
pnorm(14.5, mean=m, sd=s)
## [1] 0.08971616
#this means we have a ~9.0% chance of observing a bill depth of 14.5mm or less, assuming a normal distribution
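The difference between a density and a probability is easier to see on the standard normal distribution (mean 0, standard deviation 1, the defaults for both functions). In this sketch p_within is a name made up for the example.

```r
# Height of the standard normal curve at 0, equal to 1/sqrt(2*pi)
dnorm(0)
## [1] 0.3989423
# Half of the area under the curve lies at or below the mean
pnorm(0)
## [1] 0.5
# The probability of falling within an interval is a difference of two pnorm() values
p_within <- pnorm(1.96) - pnorm(-1.96)
round(p_within, 2)
## [1] 0.95
```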
Binomial distribution functions include pbinom(), dbinom(), and qbinom(); they work with a discrete distribution:
#Calculate probability of exactly 4 successes during 25 trials with the probability of success on each trial being .5:
dbinom(4,size = 25,prob = .5)
## [1] 0.0003769994
#Calculate the probability of 4 successes or fewer during 25 trials with the probability of success on each trial being .5:
pbinom(4, size = 25, prob = .5)
## [1] 0.0004552603
#Calculate probability of greater than 4 successes with all parameters being equal to those above:
1- pbinom(4, size = 25, prob = .5)
## [1] 0.9995447
#Calculate the 40th percentile (the 0.4 quantile) of a binomial distribution with 25 trials and probability of success .5:
qbinom(.4, size=25, prob=.5)
## [1] 12
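The three binomial functions are tied together: pbinom() accumulates dbinom(), and qbinom() inverts pbinom(). The checks below are a sketch of those relationships using the same size and prob as above; q is a name chosen for the example.

```r
# pbinom() is the running total of dbinom():
all.equal(sum(dbinom(0:4, size = 25, prob = .5)), pbinom(4, size = 25, prob = .5))
## [1] TRUE
# qbinom(p, ...) returns the smallest count q whose cumulative probability is at least p:
q <- qbinom(.4, size = 25, prob = .5)
pbinom(q, size = 25, prob = .5) >= .4
## [1] TRUE
pbinom(q - 1, size = 25, prob = .5) < .4
## [1] TRUE
```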