Chapter 5 Basics of functions

Now we’ll work through examples demonstrating the basics of writing functions.

5.1 First, load the packages we’ll need

Most of what we do in this section will not require specific packages, but you should at least load the tidyverse in order to follow along.
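
Attaching the tidyverse usually takes a single call. The lines below are a minimal sketch, assuming the package is already installed; the commented-out install line is only needed once.

```r
# Assumption: the tidyverse is already installed; if not, run the next line once
# install.packages("tidyverse")
library(tidyverse)  # attaches dplyr, which provides filter() and %>%
```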


5.2 Now load the dataset

You’ll need a copy of penguin_data.csv, which you can find on the OSF page for this example project. Once you have it, place it in your working directory and load it into R.

penguin_data <- read.csv(file = "penguin_data.csv", stringsAsFactors = FALSE)

This dataset is derived from the palmerpenguins package. The palmerpenguins site describes the dataset as follows:

“The palmerpenguins data contains size measurements for three penguin species observed on three islands in the Palmer Archipelago, Antarctica.”

We’re using a raw version of the dataset, which looks like this:

study_name | sample_number | species | region | island | stage | individual_id | clutch_completion | date_egg | culmen_length_mm | culmen_depth_mm | flipper_length_mm | body_mass_g | sex | delta_15_n_o_oo | delta_13_c_o_oo | comments
PAL0708 | 1 | Adelie Penguin (Pygoscelis adeliae) | Anvers | Torgersen | Adult, 1 Egg Stage | N1A1 | Yes | 2007-11-11 | 39.1 | 18.7 | 181 | 3750 | MALE | NA | NA | Not enough blood for isotopes.
PAL0708 | 2 | Adelie Penguin (Pygoscelis adeliae) | Anvers | Torgersen | Adult, 1 Egg Stage | N1A2 | Yes | 2007-11-11 | 39.5 | 17.4 | 186 | 3800 | FEMALE | 8.94956 | -24.69454 | NA
PAL0708 | 3 | Adelie Penguin (Pygoscelis adeliae) | Anvers | Torgersen | Adult, 1 Egg Stage | N2A1 | Yes | 2007-11-16 | 40.3 | 18.0 | 195 | 3250 | FEMALE | 8.36821 | -25.33302 | NA
PAL0708 | 4 | Adelie Penguin (Pygoscelis adeliae) | Anvers | Torgersen | Adult, 1 Egg Stage | N2A2 | Yes | 2007-11-16 | NA | NA | NA | NA | NA | NA | NA | Adult not sampled.
PAL0708 | 5 | Adelie Penguin (Pygoscelis adeliae) | Anvers | Torgersen | Adult, 1 Egg Stage | N3A1 | Yes | 2007-11-16 | 36.7 | 19.3 | 193 | 3450 | FEMALE | 8.76651 | -25.32426 | NA
PAL0708 | 6 | Adelie Penguin (Pygoscelis adeliae) | Anvers | Torgersen | Adult, 1 Egg Stage | N3A2 | Yes | 2007-11-16 | 39.3 | 20.6 | 190 | 3650 | MALE | 8.66496 | -25.29805 | NA

5.3 Intro to functions

Now we’ll write an example function to illustrate what functions do.

Suppose that you wanted to create summary stats for a column in this dataset, including the minimum, maximum, and median body mass (in grams) for all penguins in the dataset. You might start out like this:

col_min <- min(penguin_data[, "body_mass_g"], na.rm = TRUE)
col_min
## [1] 2700
col_max <- max(penguin_data[, "body_mass_g"], na.rm = TRUE)
col_max
## [1] 6300
col_med <- median(penguin_data[, "body_mass_g"], na.rm = TRUE)
col_med
## [1] 4050

We use the functions min(), max(), and median() to generate individual summaries.

We now know that the body_mass_g column has a min of 2700, max of 6300, and median of 4050. But if we wanted these summaries for multiple columns, we’d soon be copying and pasting the same three lines of code over and over again. For example, summarizing four columns with the method above would take 12 lines of code (three per column).

We can combine these into a function that can be used in one line of code on any numeric column using the following ingredients:

function() {
  # Contents
}

We just need to:
1) Give the function a name
2) Define arguments for the function
3) Write the body of the function (the code it should execute)
4) Describe what the function should return to the user
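
As a quick illustration of these four steps before the full example, here is a minimal sketch. The function name range_width and its argument x are hypothetical, chosen only for this illustration; the input values are the body masses from the six rows shown earlier.

```r
# 1) Name: range_width   2) Argument: x, a numeric vector
range_width <- function(x) {
  # 3) Body: the code the function executes
  width <- max(x, na.rm = TRUE) - min(x, na.rm = TRUE)
  # 4) Return value handed back to the user
  return(width)
}

range_width(c(3750, 3800, 3250, NA, 3450, 3650))
## [1] 550
```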

# A function that returns the min, max, and median of any numeric column
summarize_col <- function(data, col_name){
  col_min <- min(data[, col_name], na.rm = TRUE)
  col_max <- max(data[, col_name], na.rm = TRUE)
  col_med <- median(data[, col_name], na.rm = TRUE)
  # Return all three values to the function user
  return(c(col_min, col_max, col_med))
}

In the above code chunk, we insert code very similar to that in the body_mass_g example above. We ask the user to provide the dataset (data) and column of interest (col_name) as arguments.

This time, instead of hard-coding "body_mass_g", the code refers to the argument col_name, so the same function works on any numeric column. A function returns a single object, so to return several values you need to bundle them in a vector or list, as done here with c() inside return().
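
One possible refinement is to name the elements of the returned vector so that each value is self-describing. The sketch below assumes a hypothetical variant called summarize_col_named() and a toy data frame built from the six rows shown earlier; neither is part of the original example.

```r
# Toy data: body_mass_g values from the six rows shown earlier
toy_data <- data.frame(
  body_mass_g = c(3750, 3800, 3250, NA, 3450, 3650)
)

# Hypothetical variant of summarize_col() that labels each returned value
summarize_col_named <- function(data, col_name) {
  values <- data[, col_name]
  result <- c(min = min(values, na.rm = TRUE),
              max = max(values, na.rm = TRUE),
              median = median(values, na.rm = TRUE))
  return(result)
}

stats <- summarize_col_named(toy_data, "body_mass_g")
stats[["median"]]  # look up a value by name rather than by position
## [1] 3650
```

Keep in mind that a vector holds values of a single type. If the results mix types, such as a number and a character label, a list is the safer container.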

When the function is run, it returns a vector with the min, max, and median of the values of whatever column is indicated.

For example:

summarize_col(data = penguin_data, col_name = "body_mass_g")
## [1] 2700 6300 4050
summarize_col(data = penguin_data, col_name = "flipper_length_mm")
## [1] 172 231 197

We now have a function that greatly reduces the amount of code we need to write to get several summary values for a column.
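
To extend the same idea to several columns at once, one option is to pass the function to sapply(). This is a sketch under stated assumptions: toy_data is a hypothetical data frame built from the six rows shown earlier, so the numbers differ from those for the full dataset.

```r
# Toy data: two columns from the six rows shown earlier
toy_data <- data.frame(
  body_mass_g = c(3750, 3800, 3250, NA, 3450, 3650),
  flipper_length_mm = c(181, 186, 195, NA, 193, 190)
)

# Same function as defined above
summarize_col <- function(data, col_name){
  col_min <- min(data[, col_name], na.rm = TRUE)
  col_max <- max(data[, col_name], na.rm = TRUE)
  col_med <- median(data[, col_name], na.rm = TRUE)
  return(c(col_min, col_max, col_med))
}

cols <- c("body_mass_g", "flipper_length_mm")
# One call per column; each result (min, max, median) becomes a column of a matrix
all_stats <- sapply(cols, function(col) summarize_col(toy_data, col))
all_stats
```

The result is a matrix with one column per summarized variable and rows in the order min, max, median.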

5.4 Challenge question!

Try writing a function named summarize_island() that will return summary statistics only for a specified island in the dataset.

We can reuse the function we created above, but add a new piece of code that filters the data as needed. We also edit the function’s arguments to include isl_name. And lastly, the calculations must all refer to the new, filtered dataset.

summarize_island <- function(data, col_name, isl_name){
  # Keep only the rows for the island specified by the user
  # (filter() and %>% come from dplyr, loaded with the tidyverse)
  new_data <- data %>%
    filter(island == isl_name)
  # Compute the summaries on the filtered data, not the original
  col_min <- min(new_data[, col_name], na.rm = TRUE)
  col_max <- max(new_data[, col_name], na.rm = TRUE)
  col_med <- median(new_data[, col_name], na.rm = TRUE)
  # Return all three values to the function user
  return(c(col_min, col_max, col_med))
}

Let’s test it out:

summarize_island(data = penguin_data,
                 col_name =  "body_mass_g",
                 isl_name = "Biscoe")
## [1] 2850 6300 4775
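
One thing to watch for: if isl_name matches no island (a typo, for example), the filtered data has no rows, and min() and max() return Inf and -Inf with a warning rather than stopping. The sketch below shows one way to guard against that. The name summarize_island_checked() and the toy data are hypothetical, and the filtering uses base R subsetting so the block runs without dplyr.

```r
# Toy data: island and body_mass_g from the six rows shown earlier
toy_data <- data.frame(
  island = rep("Torgersen", 6),
  body_mass_g = c(3750, 3800, 3250, NA, 3450, 3650)
)

# Hypothetical variant of summarize_island() that fails loudly on an unknown island
summarize_island_checked <- function(data, col_name, isl_name){
  # Base R equivalent of filter(island == isl_name)
  new_data <- data[data$island == isl_name, , drop = FALSE]
  # Stop with a clear message if no rows match
  if (nrow(new_data) == 0) {
    stop("No rows found for island: ", isl_name)
  }
  col_min <- min(new_data[, col_name], na.rm = TRUE)
  col_max <- max(new_data[, col_name], na.rm = TRUE)
  col_med <- median(new_data[, col_name], na.rm = TRUE)
  return(c(col_min, col_max, col_med))
}

summarize_island_checked(toy_data, "body_mass_g", "Torgersen")
## [1] 3250 3800 3650
```

Calling it with a misspelled island name, such as "Biscoee", raises an error instead of silently returning infinite values.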