Chapter 5 Basics of functions
Now we’ll work through examples demonstrating the basics of writing functions.
5.1 First, load the packages we’ll need
Most of what we do in this section will not require specific packages, but you should at least load the tidyverse in order to follow along.
library(tidyverse)
5.2 Now load the dataset
You’ll need a copy of penguin_data.csv, which you can find on the OSF page for this example project, here. Once you have it, load it into R.
penguin_data <- read.csv(file = "penguin_data.csv", stringsAsFactors = FALSE)
This dataset is derived from the palmerpenguins package. The palmerpenguins site describes the dataset as follows:
“The palmerpenguins data contains size measurements for three penguin species observed on three islands in the Palmer Archipelago, Antarctica.”
We’re using a raw version of the dataset, which looks like this:
head(penguin_data)
study_name | sample_number | species | region | island | stage | individual_id | clutch_completion | date_egg | culmen_length_mm | culmen_depth_mm | flipper_length_mm | body_mass_g | sex | delta_15_n_o_oo | delta_13_c_o_oo | comments |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
PAL0708 | 1 | Adelie Penguin (Pygoscelis adeliae) | Anvers | Torgersen | Adult, 1 Egg Stage | N1A1 | Yes | 2007-11-11 | 39.1 | 18.7 | 181 | 3750 | MALE | NA | NA | Not enough blood for isotopes. |
PAL0708 | 2 | Adelie Penguin (Pygoscelis adeliae) | Anvers | Torgersen | Adult, 1 Egg Stage | N1A2 | Yes | 2007-11-11 | 39.5 | 17.4 | 186 | 3800 | FEMALE | 8.94956 | -24.69454 | NA |
PAL0708 | 3 | Adelie Penguin (Pygoscelis adeliae) | Anvers | Torgersen | Adult, 1 Egg Stage | N2A1 | Yes | 2007-11-16 | 40.3 | 18.0 | 195 | 3250 | FEMALE | 8.36821 | -25.33302 | NA |
PAL0708 | 4 | Adelie Penguin (Pygoscelis adeliae) | Anvers | Torgersen | Adult, 1 Egg Stage | N2A2 | Yes | 2007-11-16 | NA | NA | NA | NA | NA | NA | NA | Adult not sampled. |
PAL0708 | 5 | Adelie Penguin (Pygoscelis adeliae) | Anvers | Torgersen | Adult, 1 Egg Stage | N3A1 | Yes | 2007-11-16 | 36.7 | 19.3 | 193 | 3450 | FEMALE | 8.76651 | -25.32426 | NA |
PAL0708 | 6 | Adelie Penguin (Pygoscelis adeliae) | Anvers | Torgersen | Adult, 1 Egg Stage | N3A2 | Yes | 2007-11-16 | 39.3 | 20.6 | 190 | 3650 | MALE | 8.66496 | -25.29805 | NA |
5.3 Intro to functions
Now we’ll build an example function to illustrate what functions do.
Suppose that you wanted to create summary stats for a column in this dataset, including the minimum, maximum, and median body mass (in grams) for all penguins in the dataset. You might start out like this:
col_min <- min(penguin_data[, "body_mass_g"], na.rm = TRUE)
col_min
## [1] 2700
col_max <- max(penguin_data[, "body_mass_g"], na.rm = TRUE)
col_max
## [1] 6300
col_med <- median(penguin_data[, "body_mass_g"], na.rm = TRUE)
col_med
## [1] 4050
We use the functions min(), max(), and median() to generate individual summaries.
We now know that the body_mass_g column has a min of 2700, max of 6300, and median of 4050. But if we wanted to determine this for multiple columns, we’d pretty soon be copying and pasting the same three lines of code over and over again. For example, to summarize four columns using the method above, we would end up with 12 lines of code (three calculations for each of four columns).
We can combine these into a function that can be used in one line of code on any numeric column using the following ingredients:
function() {
# Contents
return()
}
We just need to:
1) Give the function a name
2) Define arguments for the function
3) Write the body of the function (the code it should execute)
4) Describe what the function should return to the user
# A function that returns the min, max, and median of any numeric column
summarize_col <- function(data, col_name){
  col_min <- min(data[, col_name], na.rm = TRUE)
  col_max <- max(data[, col_name], na.rm = TRUE)
  col_med <- median(data[, col_name], na.rm = TRUE)
  # Return all three values to the function user
  return(c(col_min, col_max, col_med))
}
In the above code chunk, we insert code very similar to that in the body_mass_g example above. We ask the user to provide the dataset (data) and column of interest (col_name) as arguments.
This time, instead of hard-coding a column name, we refer to the argument col_name, so the same code works for any column. The return() function indicates that the function will hand a vector with multiple values back to us, the users. If you want to return multiple values, you will need to put them in a list or vector like this.
When the function is run, it returns a vector with the min, max, and median of the values of whatever column is indicated.
For example:
summarize_col(data = penguin_data, col_name = "body_mass_g")
## [1] 2700 6300 4050
summarize_col(data = penguin_data, col_name = "flipper_length_mm")
## [1] 172 231 197
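The vector returned above is unnamed, so the caller has to remember which position holds which statistic. As a minimal sketch of one alternative, the variant below labels each value. The name summarize_col_named and the small toy_data data frame are hypothetical, with made-up values so the example runs without penguin_data.csv.

```r
# Toy data standing in for penguin_data (values are made up for illustration)
toy_data <- data.frame(
  body_mass_g = c(3750, 3800, NA, 3450),
  flipper_length_mm = c(181, 186, 195, NA)
)

# Hypothetical variant of summarize_col() that returns a named vector
summarize_col_named <- function(data, col_name) {
  values <- data[, col_name]
  c(
    min = min(values, na.rm = TRUE),
    max = max(values, na.rm = TRUE),
    median = median(values, na.rm = TRUE)
  )
}

result <- summarize_col_named(data = toy_data, col_name = "body_mass_g")

# Individual values can now be pulled out by name rather than by position
result[["median"]]
```

Whether names are worth the extra typing is a judgment call; they mainly help when the result is passed on to other code.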
Now we have a function that greatly reduces the amount of code we have to write in order to get multiple output values for one column.
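The same function can also be applied to several columns in one call. The sketch below is one possible approach using base R's sapply(); the toy_data data frame and its values are made up for illustration, and summarize_col() is restated only so the block runs on its own.

```r
# Toy data standing in for penguin_data (values are made up for illustration)
toy_data <- data.frame(
  body_mass_g = c(3750, 3800, NA, 3450),
  flipper_length_mm = c(181, 186, 195, NA)
)

# Same definition as in the text, repeated so this block is self-contained
summarize_col <- function(data, col_name) {
  col_min <- min(data[, col_name], na.rm = TRUE)
  col_max <- max(data[, col_name], na.rm = TRUE)
  col_med <- median(data[, col_name], na.rm = TRUE)
  return(c(col_min, col_max, col_med))
}

# Columns we want to summarize
cols <- c("body_mass_g", "flipper_length_mm")

# Call summarize_col() once per column; sapply() binds the results
# into a matrix with one column per summarized variable
summaries <- sapply(cols, function(col) summarize_col(data = toy_data, col_name = col))

# Label the rows so each statistic is easy to identify
rownames(summaries) <- c("min", "max", "median")
summaries
```

Each column of the resulting matrix holds the three statistics for one variable, so adding another variable only means adding its name to cols.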
5.4 Challenge question!
Try writing a function named summarize_island() that will return summary statistics only for a specified island in the dataset.
Answer:
We can reuse the function we created above, but add a new piece of code that filters the data as needed. We also edit the function’s arguments to include isl_name. And lastly, the calculations must all refer to the new, filtered dataset.
summarize_island <- function(data, col_name, isl_name){
  # We filter the dataset to only include the island specified by the user
  new_data <- data %>%
    filter(island == isl_name)
  col_min <- min(new_data[, col_name], na.rm = TRUE)
  col_max <- max(new_data[, col_name], na.rm = TRUE)
  col_med <- median(new_data[, col_name], na.rm = TRUE)
  return(c(col_min, col_max, col_med))
}
Let’s test it out:
summarize_island(data = penguin_data,
col_name = "body_mass_g",
isl_name = "Biscoe")
## [1] 2850 6300 4775
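One thing to watch for: if isl_name is misspelled, the filter keeps zero rows, and min() and max() then return Inf and -Inf with a warning rather than stopping. The sketch below shows one possible guard. The name summarize_island_checked and the toy_data values are hypothetical, and base R subsetting is used in place of filter() only so the block runs without loading any packages.

```r
# Toy data standing in for penguin_data (values are made up for illustration)
toy_data <- data.frame(
  island = c("Biscoe", "Biscoe", "Dream", "Torgersen"),
  body_mass_g = c(4500, 5000, 3700, NA)
)

# Hypothetical variant of summarize_island() that checks the island name first
summarize_island_checked <- function(data, col_name, isl_name) {
  # Stop early if the island name does not appear in the data
  if (!isl_name %in% data$island) {
    stop("No rows found for island: ", isl_name)
  }
  # Base R equivalent of filter(island == isl_name)
  new_data <- data[data$island == isl_name, ]
  values <- new_data[, col_name]
  c(min(values, na.rm = TRUE), max(values, na.rm = TRUE), median(values, na.rm = TRUE))
}

biscoe_summary <- summarize_island_checked(toy_data, "body_mass_g", "Biscoe")
biscoe_summary

# A misspelled island (note the lowercase "b") now raises an error;
# tryCatch() captures the message so the script keeps running
err_msg <- tryCatch(
  summarize_island_checked(toy_data, "body_mass_g", "biscoe"),
  error = function(e) conditionMessage(e)
)
err_msg
```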