Chapter 5 Basics of functions

Now we’ll work through examples demonstrating the basics of writing functions.

5.1 First, load the packages we’ll need

Most of what we do in this section will not require specific packages, but you should at least load the tidyverse in order to follow along.

library(tidyverse)

5.2 Now load the dataset

You’ll need a copy of penguin_data.csv, which you can find on the OSF page for this example project. Once you have it, load it into R.

penguin_data <- read.csv(file = "penguin_data.csv", stringsAsFactors = FALSE)

This dataset is derived from the palmerpenguins package. The palmerpenguins site describes the dataset as follows:

“The palmerpenguins data contains size measurements for three penguin species observed on three islands in the Palmer Archipelago, Antarctica.”

We’re using a raw version of the dataset, which looks like this:

head(penguin_data)

study_name	sample_number	species	region	island	stage	individual_id	clutch_completion	date_egg	culmen_length_mm	culmen_depth_mm	flipper_length_mm	body_mass_g	sex	delta_15_n_o_oo	delta_13_c_o_oo	comments
PAL0708	1	Adelie Penguin (Pygoscelis adeliae)	Anvers	Torgersen	Adult, 1 Egg Stage	N1A1	Yes	2007-11-11	39.1	18.7	181	3750	MALE	NA	NA	Not enough blood for isotopes.
PAL0708	2	Adelie Penguin (Pygoscelis adeliae)	Anvers	Torgersen	Adult, 1 Egg Stage	N1A2	Yes	2007-11-11	39.5	17.4	186	3800	FEMALE	8.94956	-24.69454	NA
PAL0708	3	Adelie Penguin (Pygoscelis adeliae)	Anvers	Torgersen	Adult, 1 Egg Stage	N2A1	Yes	2007-11-16	40.3	18.0	195	3250	FEMALE	8.36821	-25.33302	NA
PAL0708	4	Adelie Penguin (Pygoscelis adeliae)	Anvers	Torgersen	Adult, 1 Egg Stage	N2A2	Yes	2007-11-16	NA	NA	NA	NA	NA	NA	NA	Adult not sampled.
PAL0708	5	Adelie Penguin (Pygoscelis adeliae)	Anvers	Torgersen	Adult, 1 Egg Stage	N3A1	Yes	2007-11-16	36.7	19.3	193	3450	FEMALE	8.76651	-25.32426	NA
PAL0708	6	Adelie Penguin (Pygoscelis adeliae)	Anvers	Torgersen	Adult, 1 Egg Stage	N3A2	Yes	2007-11-16	39.3	20.6	190	3650	MALE	8.66496	-25.29805	NA

5.3 Intro to functions

Now we’ll build an example function to illustrate what functions do.

Suppose that you wanted to create summary stats for a column in this dataset, including the minimum, maximum, and median body mass (in grams) for all penguins in the dataset. You might start out like this:

col_min <- min(penguin_data[, "body_mass_g"], na.rm = TRUE)
col_min

## [1] 2700

col_max <- max(penguin_data[, "body_mass_g"], na.rm = TRUE)
col_max

## [1] 6300

col_med <- median(penguin_data[, "body_mass_g"], na.rm = TRUE)
col_med

## [1] 4050

We use the functions min(), max(), and median() to generate individual summaries.

We now know that the body_mass_g column has a min of 2700, max of 6300, and median of 4050. But if we wanted to determine this for multiple columns, we’d soon be copying and pasting the same three lines of code over and over again. For example, summarizing four columns this way would take 12 lines of code (three per column), not counting the lines that print the results.

We can combine these into a function that can be used in one line of code on any numeric column using the following ingredients:

function() {
  
  # Contents
  
  return()
  
}

We just need to:
1) Give the function a name
2) Define arguments for the function
3) Write the body of the function (the code it should execute)
4) Describe what the function should return to the user
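
To make the four steps concrete before applying them to the penguin data, here is a minimal sketch. The function name range_width and its argument x are hypothetical and chosen only for illustration; the input values are the first six body masses shown in the head() output above.

```r
# 1) Name: range_width (hypothetical); 2) Argument: a numeric vector x
range_width <- function(x) {

  # 3) Body: the difference between the largest and smallest value
  width <- max(x, na.rm = TRUE) - min(x, na.rm = TRUE)

  # 4) Return the result to the function user
  return(width)

}

range_width(c(3750, 3800, 3250, NA, 3450, 3650))

## [1] 550
```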

# A function that returns the min, max, and median of any numeric column
summarize_col <- function(data, col_name){
  
  col_min <- min(data[, col_name], na.rm = TRUE)
  
  col_max <- max(data[, col_name], na.rm = TRUE)
  
  col_med <- median(data[, col_name], na.rm = TRUE)
  
  # Return all three values to the function user
  return(c(col_min, col_max, col_med))
  
}

In the above code chunk, we insert code very similar to that in the body_mass_g example above. We ask the user to provide the dataset (data) and column of interest (col_name) as arguments.

This time, instead of hard-coding "body_mass_g", the code refers to the argument col_name, so the same calculations work for whichever column the user names. The return() call hands a vector of three values back to us, the users. To return multiple values from a function, you need to combine them into a single vector or list like this.
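
A vector works well when every returned value has the same type. When the values differ in type, a list keeps each one intact, whereas c() would coerce them all to a common type. The sketch below illustrates this with a hypothetical helper, describe_col(), and a small made-up data frame, toy_data; neither is part of the original example.

```r
# Hypothetical helper: returns values of different types in one list
describe_col <- function(data, col_name) {

  # [[ ]] extracts the column as a plain vector
  values <- data[[col_name]]

  return(list(
    column    = col_name,                   # character
    n_missing = sum(is.na(values)),         # count of NA values
    range     = range(values, na.rm = TRUE) # numeric vector: min, max
  ))

}

# Toy data: the first six body masses from the head() output above
toy_data <- data.frame(body_mass_g = c(3750, 3800, 3250, NA, 3450, 3650))

result <- describe_col(data = toy_data, col_name = "body_mass_g")

result$n_missing

## [1] 1

result$range

## [1] 3250 3800
```

Each element of the list can then be accessed by name with $, which is often easier to read than remembering positions in an unnamed vector.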

When the function is run, it returns a vector with the min, max, and median of the values of whatever column is indicated.

For example:

summarize_col(data = penguin_data, col_name = "body_mass_g")

## [1] 2700 6300 4050

summarize_col(data = penguin_data, col_name = "flipper_length_mm")

## [1] 172 231 197

Now we have a function that greatly reduces the amount of code we have to write in order to get multiple output values for one column.
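
The saving grows when the function is applied to several columns at once. The following is one possible sketch using base R’s sapply(); the toy_data data frame and the cols vector are made up for illustration, and summarize_col() is repeated so the example runs on its own.

```r
# summarize_col() as defined above, repeated so this example is self-contained
summarize_col <- function(data, col_name){
  col_min <- min(data[, col_name], na.rm = TRUE)
  col_max <- max(data[, col_name], na.rm = TRUE)
  col_med <- median(data[, col_name], na.rm = TRUE)
  return(c(col_min, col_max, col_med))
}

# Toy data: the first six rows of two columns from the head() output above
toy_data <- data.frame(
  body_mass_g       = c(3750, 3800, 3250, NA, 3450, 3650),
  flipper_length_mm = c(181, 186, 195, NA, 193, 190)
)

# Call summarize_col() once per column name and collect the results
cols <- c("body_mass_g", "flipper_length_mm")
summaries <- sapply(cols, function(col) summarize_col(data = toy_data, col_name = col))

# Label the rows so the order of the statistics is explicit
rownames(summaries) <- c("min", "max", "median")
summaries
```

The result is a matrix with one column per summarized variable and one row per statistic. For the toy data, the body_mass_g column holds 3250, 3800, and 3650, and the flipper_length_mm column holds 181, 195, and 190.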

5.4 Challenge question!

Try writing a function named summarize_island() that will return summary statistics only for a specified island in the dataset.

Answer:

We can reuse the code from the function we created above, but add a new step that filters the data to the requested island. We also add isl_name to the function’s arguments. Lastly, the calculations must all refer to the new, filtered dataset rather than the original one.

summarize_island <- function(data, col_name, isl_name){
  
  # We filter the dataset to only include the island specified by the user
  new_data <- data %>%
    filter(island == isl_name)
  
  col_min <- min(new_data[, col_name], na.rm = TRUE)
  
  col_max <- max(new_data[, col_name], na.rm = TRUE)
  
  col_med <- median(new_data[, col_name], na.rm = TRUE)
  
  return(c(col_min, col_max, col_med))
  
}

Let’s test it out:

summarize_island(data = penguin_data,
                 col_name = "body_mass_g",
                 isl_name = "Biscoe")

## [1] 2850 6300 4775
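
The answer above copies the three calculations into the new function. An alternative, sketched here under some assumptions, is to filter first and then call summarize_col() on the filtered data, so the calculations live in one place. The name summarize_island_v2 and the toy_data values are hypothetical, and base R subsetting stands in for filter() so the example runs without loading any packages.

```r
# summarize_col() as defined earlier, repeated so this example is self-contained
summarize_col <- function(data, col_name){
  col_min <- min(data[, col_name], na.rm = TRUE)
  col_max <- max(data[, col_name], na.rm = TRUE)
  col_med <- median(data[, col_name], na.rm = TRUE)
  return(c(col_min, col_max, col_med))
}

# Hypothetical variant that delegates the calculations to summarize_col()
summarize_island_v2 <- function(data, col_name, isl_name){

  # Keep only the rows for the requested island; which() skips rows
  # where island is NA, much as filter() would
  new_data <- data[which(data$island == isl_name), , drop = FALSE]

  return(summarize_col(data = new_data, col_name = col_name))

}

# Made-up values for illustration only
toy_data <- data.frame(
  island      = c("Torgersen", "Torgersen", "Biscoe", "Biscoe", "Biscoe"),
  body_mass_g = c(3750, 3800, 3400, 5000, 4500)
)

summarize_island_v2(data = toy_data,
                    col_name = "body_mass_g",
                    isl_name = "Biscoe")

## [1] 3400 5000 4500
```

With this design, a later change to the summary statistics only needs to be made in summarize_col().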