Chapter 8 Building the workflow around functions

8.1 Writing functions

Next let’s get into the meat of our analysis by writing the functions.R script. This is going to be our longest script in the workflow.

As a reminder, this is what our target list currently looks like:

list(
  
  tar_target(name = penguin_data,
             command = ),
  
  tar_file(penguin_file,
  ),
  
  tar_target(cleaned_data,
  ),
  
  tar_target(exploratory_plot,
  ),
  
  tar_target(penguin_model,
  ),
  
  tar_render(markdown_summary,
  )
  
)

We can start by building the function that creates the target called penguin_data. Below is the code that was used to download the dataset in the function introduction section. It can be repurposed for our target.

# Instructions adapted from code written by Julien Brun and posted on the
# palmerpenguins package site:
# https://allisonhorst.github.io/palmerpenguins/articles/download.html

library(tidyverse)
library(janitor)

# Download and compile datasets into a single data frame
penguin_data <- map_dfr(.x = 
                          c(
                            # Adelie penguin data
                            "https://portal.edirepository.org/nis/dataviewer?packageid=knb-lter-pal.219.3&entityid=002f3893385f710df69eeebe893144ff",
                            # Gentoo penguin data
                            "https://portal.edirepository.org/nis/dataviewer?packageid=knb-lter-pal.220.3&entityid=e03b43c924f226486f2f0ab6709d2381",
                            # Chinstrap penguin data
                            "https://portal.edirepository.org/nis/dataviewer?packageid=knb-lter-pal.221.2&entityid=fe853aa8f7a59aa84cdd3197619ef462"
                          ),
                        .f = read_csv)

# Export data frame as .csv after cleaning up column names for easier use
write_csv(x = clean_names(penguin_data),
          file = "example_project/data/penguin_data.csv")

Remade as a function for use in the pipeline:

download_penguins <- function(out_path){
  
  # Instructions adapted from code written by Julien Brun and posted on the
  # palmerpenguins package site:
  # https://allisonhorst.github.io/palmerpenguins/articles/download.html
  
  # Download and compile datasets into a single data frame
  penguin_data <- map_dfr(.x = 
                            c(
                              # Adelie penguin data
                              "https://portal.edirepository.org/nis/dataviewer?packageid=knb-lter-pal.219.3&entityid=002f3893385f710df69eeebe893144ff",
                              # Gentoo penguin data
                              "https://portal.edirepository.org/nis/dataviewer?packageid=knb-lter-pal.220.3&entityid=e03b43c924f226486f2f0ab6709d2381",
                              # Chinstrap penguin data
                              "https://portal.edirepository.org/nis/dataviewer?packageid=knb-lter-pal.221.2&entityid=fe853aa8f7a59aa84cdd3197619ef462"
                            ),
                          .f = read_csv)
  
  # Export data frame as .csv after cleaning up column names for easier use
  write_csv(x = clean_names(penguin_data),
            file = out_path)
  
  return(out_path)
  
}

Let’s inspect the changes that were made to the script in order to turn it into a function in our workflow. The first step was to decide on a function name, using a verb to describe the action the function performs, and to wrap the code in a call to function(){}. Next we added an argument to the function, out_path, which asks the user to provide a file path for where they want the penguin .csv file to be saved. This path is used in the write_csv() function and is then returned as output from download_penguins() so that the file path can be tracked by the pipeline. We’ve also removed the library() calls from this function; packages can instead be loaded in the main _targets.R script, for example with tar_option_set().
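The pattern of writing a file and then returning its path can be tried out without downloading anything. Below is a minimal sketch, assuming a hypothetical helper named write_and_return_path() and a small made-up data frame. Neither is part of the pipeline, and base R’s write.csv() stands in for write_csv() so that no packages are needed.

```r
# Hypothetical helper (not part of the pipeline) illustrating the pattern:
# write a file, then return its path so the path can be used downstream
write_and_return_path <- function(data, out_path){
  
  write.csv(x = data, file = out_path, row.names = FALSE)
  
  return(out_path)
  
}

# Made-up example data written to a temporary file
toy_data <- data.frame(species = c("Adelie", "Gentoo"),
                       body_mass_g = c(3750, 5000))

toy_path <- write_and_return_path(data = toy_data,
                                  out_path = tempfile(fileext = ".csv"))

# The returned value is just the path, which a later step can read from
toy_reloaded <- read.csv(file = toy_path)
```

Because the function returns the path rather than the data, a downstream step such as clean_dataset() only needs to be told where the file lives.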

You can copy the above function definition as-is into your functions.R script in the R/ folder. Let’s move on to a function that cleans up the data.

This dataset doesn’t need a ton of cleaning before we begin using it. However, we might want to make some changes that will simplify it.

clean_dataset <- function(file_path){
  
  cleaned_data <- read_csv(file = file_path) %>%
    mutate(
      # Remove unneeded text from column
      species = str_extract(string = species,
                            pattern = "Chinstrap|Adelie|Gentoo"),
      # Extract year from date
      year = year(date_egg)) %>%
    # Select columns of interest
    select(species, island, year, culmen_length_mm, culmen_depth_mm, 
           flipper_length_mm, body_mass_g, sex) %>%
    # Remove rows with incomplete records
    filter(!if_any(everything(),
                   is.na))
  
  return(cleaned_data)
  
}

This new version of the dataset should be a little more user-friendly for analysis. We can paste this function into our functions.R script. Also note that we return the data frame cleaned_data from the function but do not write it to a new .csv file. Writing it out is optional; in this case we’ll only keep the original, “raw” data in .csv format.
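To see what the two main cleaning steps do in isolation, here is a small sketch on made-up rows. The toy_raw data frame and its values are assumptions for illustration only, loosely imitating the long species labels in the raw file; the str_extract() and filter() calls are the same ones used in clean_dataset().

```r
library(dplyr)
library(stringr)

# Made-up rows imitating long species labels and one incomplete record
toy_raw <- tibble(
  species = c("Adelie Penguin (Pygoscelis adeliae)",
              "Gentoo penguin (Pygoscelis papua)",
              "Chinstrap penguin (Pygoscelis antarctica)"),
  body_mass_g = c(3750, NA, 3500)
)

toy_clean <- toy_raw %>%
  # Keep only the short species name
  mutate(species = str_extract(string = species,
                               pattern = "Chinstrap|Adelie|Gentoo")) %>%
  # Drop any row that has an NA in at least one column
  filter(!if_any(everything(), is.na))
```

In this sketch the Gentoo row is dropped because its body mass is missing, and the remaining species labels are shortened to "Adelie" and "Chinstrap".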

Now, with clean data in hand, we want to investigate a hypothetical relationship between body mass, flipper length, and penguin species. We’ll put together a figure that displays flipper_length_mm on the x-axis and body_mass_g on the y-axis, with points and overlaid lines colored by species. Because we’re going to put together an R Markdown report at the end of the pipeline, we won’t export this as a separate image file. However, you could do so if desired, following the steps we used for exporting the .csv file previously.

plot_body_mass <- function(cleaned_data){
  
  # Plot the relationship between flipper length, body mass, and species
  bm_plot <- ggplot(data = cleaned_data,
                    aes(x = flipper_length_mm, y = body_mass_g)) +
    geom_point(aes(fill = species),
               shape = 21, color = "black", size = 2.5, alpha = 0.6) +
    geom_smooth(aes(color = species), method = "lm", se = FALSE) +
    scale_fill_viridis_d("Species", option = "viridis") +
    scale_color_viridis_d("Species", option = "viridis") +
    xlab("Flipper length (mm)") +
    ylab("Body mass (g)") +
    theme_bw()
  
  return(bm_plot)
  
}

Notice that we are now referring to cleaned_data as the input to this function. You could name this argument whatever you want, but being clear about which upstream targets or data sources are being fed into downstream steps can help keep the workflow organized.
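If you did want to export the figure as an image file, the same "write, then return the path" approach used for the .csv file could apply. The following is a rough sketch, assuming a hypothetical helper called save_plot() and a made-up three-row data frame; it is not part of the pipeline built in this chapter.

```r
library(ggplot2)

# Hypothetical helper (not part of the pipeline): save a plot to disk and
# return the file path so that it could be tracked like the .csv file
save_plot <- function(plot, out_path){
  
  ggsave(filename = out_path, plot = plot,
         width = 6, height = 4, units = "in")
  
  return(out_path)
  
}

# Made-up data, used only to have something to draw
toy_plot <- ggplot(data = data.frame(flipper_length_mm = c(181, 195, 210),
                                     body_mass_g = c(3750, 3800, 4500)),
                   aes(x = flipper_length_mm, y = body_mass_g)) +
  geom_point() +
  theme_bw()

plot_path <- save_plot(plot = toy_plot,
                       out_path = tempfile(fileext = ".png"))
```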

Now that we have coded a plot that visualizes the relationship we’re interested in, let’s describe it with a model as well. This function will be pretty minimal:

run_model <- function(cleaned_data, model_string){
  
  penguin_model <- lm(data = cleaned_data,
                      formula = model_string)
  
  return(penguin_model)
  
}

Here we are asking the user for a model_string as input, by which we mean a character string written in modeling formula format. For example, when we use this function in the next section we will use the string "body_mass_g ~ flipper_length_mm * species". If you wanted to allow the user to provide response and explanatory variables as separate arguments, you could include a step in the function that combines them into a formula string using paste() or paste0().
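As one possible way to do that, the sketch below builds the formula string from separate arguments. The function name run_model_from_parts() and its response and explanatory arguments are hypothetical, and R’s built-in iris dataset is used in place of the penguin data so that the example runs on its own. It also converts the string with as.formula() before passing it to lm(), which is a choice made for this sketch rather than something the pipeline requires.

```r
# Hypothetical variant of run_model() that takes the response and the
# explanatory variables as separate character arguments
run_model_from_parts <- function(cleaned_data, response, explanatory){
  
  # Join explanatory variables with "*" to include their interaction,
  # e.g. "Sepal.Length ~ Petal.Length * Species"
  model_string <- paste(response, "~",
                        paste(explanatory, collapse = " * "))
  
  penguin_model <- lm(data = cleaned_data,
                      formula = as.formula(model_string))
  
  return(penguin_model)
  
}

# Stand-in example using the built-in iris data
iris_model <- run_model_from_parts(cleaned_data = iris,
                                   response = "Sepal.Length",
                                   explanatory = c("Petal.Length", "Species"))
```

With the penguin data, the equivalent call would pass "body_mass_g" as the response and c("flipper_length_mm", "species") as the explanatory variables.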

With this complete, we have finished putting our functions.R script together. Congratulations!

Here is what the final script will look like:

download_penguins <- function(out_path){
  
  # Instructions adapted from code written by Julien Brun and posted on the
  # palmerpenguins package site:
  # https://allisonhorst.github.io/palmerpenguins/articles/download.html
  
  # Download and compile datasets into a single data frame
  penguin_data <- map_dfr(.x = 
                            c(
                              # Adelie penguin data
                              "https://portal.edirepository.org/nis/dataviewer?packageid=knb-lter-pal.219.3&entityid=002f3893385f710df69eeebe893144ff",
                              # Gentoo penguin data
                              "https://portal.edirepository.org/nis/dataviewer?packageid=knb-lter-pal.220.3&entityid=e03b43c924f226486f2f0ab6709d2381",
                              # Chinstrap penguin data
                              "https://portal.edirepository.org/nis/dataviewer?packageid=knb-lter-pal.221.2&entityid=fe853aa8f7a59aa84cdd3197619ef462"
                            ),
                          .f = read_csv)
  
  # Export data frame as .csv after cleaning up column names for easier use
  write_csv(x = clean_names(penguin_data),
            file = out_path)
  
  return(out_path)
  
}


clean_dataset <- function(file_path){
  
  cleaned_data <- read_csv(file = file_path) %>%
    mutate(
      # Remove unneeded text from column
      species = str_extract(string = species,
                            pattern = "Chinstrap|Adelie|Gentoo"),
      # Extract year from date
      year = year(date_egg)) %>%
    # Select columns of interest
    select(species, island, year, culmen_length_mm, culmen_depth_mm, 
           flipper_length_mm, body_mass_g, sex) %>%
    # Remove rows with incomplete records
    filter(!if_any(everything(),
                   is.na))
  
  return(cleaned_data)
  
}


plot_body_mass <- function(cleaned_data){
  
  # Plot the relationship between flipper length, body mass, and species
  bm_plot <- ggplot(data = cleaned_data,
                    aes(x = flipper_length_mm, y = body_mass_g)) +
    geom_point(aes(fill = species),
               shape = 21, color = "black", size = 2.5, alpha = 0.6) +
    geom_smooth(aes(color = species), method = "lm", se = FALSE) +
    scale_fill_viridis_d("Species", option = "viridis") +
    scale_color_viridis_d("Species", option = "viridis") +
    xlab("Flipper length (mm)") +
    ylab("Body mass (g)") +
    theme_bw()
  
  return(bm_plot)
  
}


run_model <- function(cleaned_data, model_string){
  
  penguin_model <- lm(data = cleaned_data,
                      formula = model_string)
  
  return(penguin_model)
  
}