Chapter 8 Building the workflow around functions
8.1 Writing functions
Next let’s get into the meat of our analysis by writing the functions.R script. This is going to be our longest script in the workflow.
As a reminder, this is what our target list currently looks like:
list(
tar_target(name = penguin_data,
command = ),
tar_file(penguin_file,
),
tar_target(cleaned_data,
),
tar_target(exploratory_plot,
),
tar_target(penguin_model,
),
tar_render(markdown_summary,
)
)
We can start by building the function that creates the target called penguin_data. Below is the code that was used to download the dataset we used in the function introduction section. This can be repurposed into our target.
# Instructions adapted from code written by Julien Brun and posted on the
# palmerpenguins package site:
# https://allisonhorst.github.io/palmerpenguins/articles/download.html
library(tidyverse)
library(janitor)

# Download and compile datasets into a single data frame
penguin_data <- map_dfr(.x = c(
  # Adelie penguin data
  "https://portal.edirepository.org/nis/dataviewer?packageid=knb-lter-pal.219.3&entityid=002f3893385f710df69eeebe893144ff",
  # Gentoo penguin data
  "https://portal.edirepository.org/nis/dataviewer?packageid=knb-lter-pal.220.3&entityid=e03b43c924f226486f2f0ab6709d2381",
  # Chinstrap penguin data
  "https://portal.edirepository.org/nis/dataviewer?packageid=knb-lter-pal.221.2&entityid=fe853aa8f7a59aa84cdd3197619ef462"
),
.f = read_csv)

# Export data frame as .csv after cleaning up column names for easier use
write_csv(x = clean_names(penguin_data),
          file = "example_project/data/penguin_data.csv")
Remade as a function for use in the pipeline:
download_penguins <- function(out_path){

  # Instructions adapted from code written by Julien Brun and posted on the
  # palmerpenguins package site:
  # https://allisonhorst.github.io/palmerpenguins/articles/download.html

  # Download and compile datasets into a single data frame
  penguin_data <- map_dfr(.x = c(
    # Adelie penguin data
    "https://portal.edirepository.org/nis/dataviewer?packageid=knb-lter-pal.219.3&entityid=002f3893385f710df69eeebe893144ff",
    # Gentoo penguin data
    "https://portal.edirepository.org/nis/dataviewer?packageid=knb-lter-pal.220.3&entityid=e03b43c924f226486f2f0ab6709d2381",
    # Chinstrap penguin data
    "https://portal.edirepository.org/nis/dataviewer?packageid=knb-lter-pal.221.2&entityid=fe853aa8f7a59aa84cdd3197619ef462"
  ),
  .f = read_csv)

  # Export data frame as .csv after cleaning up column names for easier use
  write_csv(x = clean_names(penguin_data),
            file = out_path)

  # Return the file path so the pipeline can track the file
  return(out_path)

}
Let’s inspect the changes made to turn the script into a function in our workflow. The first step was to decide on a function name, using a verb to describe the action the function performs, and to wrap the code in a call to function(){}. Next we added an argument, out_path, which asks the user for the file path where the penguin .csv file should be saved. This path is used in the write_csv() call and then returned as the output of download_penguins() so that the pipeline can track the file. We’ve also removed the library() calls from this function; packages can instead be declared in the main _targets.R script, for example with tar_option_set() or the packages argument of tar_target().
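The "write a file, then return its path" pattern can be tried out without downloading anything. The sketch below is only an illustration in base R: write_example_data() and its made-up data frame are hypothetical stand-ins for download_penguins(), not part of the workflow itself.

```r
# Hypothetical stand-in for download_penguins(): writes a small, made-up
# data frame to disk and returns the path so a pipeline could track the file
write_example_data <- function(out_path){

  example_data <- data.frame(id = 1:3, value = c(2.5, 3.1, 4.8))

  write.csv(x = example_data, file = out_path, row.names = FALSE)

  # Return the path, not the data, so the file itself is the output
  return(out_path)

}

# Usage: write to a temporary location and confirm the file was created
tmp_path <- tempfile(fileext = ".csv")
returned_path <- write_example_data(out_path = tmp_path)
file.exists(returned_path)
```

Because the function returns the path it was given, a downstream step can read the file back in from that same path.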
You can copy the above function definition as-is into your functions.R script in the R/ folder. Let’s move on to a function that cleans up the data.
This dataset doesn’t need a ton of cleaning before we begin using it. However, we might want to make some changes that will simplify it.
clean_dataset <- function(file_path){

  cleaned_data <- read_csv(file = file_path) %>%
    mutate(
      # Remove unneeded text from column
      species = str_extract(string = species,
                            pattern = "Chinstrap|Adelie|Gentoo"),
      # Extract year from date
      year = year(date_egg)) %>%
    # Select columns of interest
    select(species, island, year, culmen_length_mm, culmen_depth_mm,
           flipper_length_mm, body_mass_g, sex) %>%
    # Remove rows with incomplete records
    filter(!if_any(everything(),
                   is.na))

  return(cleaned_data)

}
This new version of the dataset should be a little more user friendly for analysis. We can paste this function into our functions.R script. Also note that we are returning the data frame cleaned_data from the function, but not writing it to a new .csv file. Writing the cleaned data to disk is optional; in this case we’ll only keep the original, “raw” data in .csv format.
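To see what the cleaning steps do, it can help to run them on a tiny example. The sketch below uses a made-up three-row tibble (the values are invented for illustration, with species labels written to resemble the long names in the raw data) and assumes the dplyr, stringr, and lubridate packages are installed.

```r
library(dplyr)
library(stringr)
library(lubridate)

# Toy data, made up for illustration only
toy_penguins <- tibble(
  species = c("Adelie Penguin (Pygoscelis adeliae)",
              "Gentoo penguin (Pygoscelis papua)",
              "Chinstrap penguin (Pygoscelis antarctica)"),
  date_egg = as.Date(c("2007-11-11", "2008-11-02", "2009-11-21")),
  body_mass_g = c(3750, NA, 3500)
)

toy_cleaned <- toy_penguins %>%
  mutate(
    # Keep only the short species name
    species = str_extract(string = species,
                          pattern = "Chinstrap|Adelie|Gentoo"),
    # Extract year from date
    year = year(date_egg)) %>%
  # Drop any row that has an NA in any column
  filter(!if_any(everything(), is.na))

toy_cleaned
```

The Gentoo row is dropped because its body mass is missing, and the remaining species values are shortened to "Adelie" and "Chinstrap".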
Now with clean data in hand we want to investigate a hypothetical relationship between body mass, flipper length, and penguin species. We’ll put together a figure that displays flipper_length_mm on the x-axis and body_mass_g on the y-axis, with points and overlaid lines colored by species. Because we’re going to put together an R Markdown report at the end of the pipeline, we won’t export this as a separate image file. However, you could do so if desired, following the steps we used for exporting the .csv file previously.
plot_body_mass <- function(cleaned_data){

  # Plot the relationship between flipper length, body mass, and species
  bm_plot <- ggplot(data = cleaned_data,
                    aes(x = flipper_length_mm, y = body_mass_g)) +
    geom_point(aes(fill = species),
               shape = 21, color = "black", size = 2.5, alpha = 0.6) +
    geom_smooth(aes(color = species), method = "lm", se = FALSE) +
    scale_fill_viridis_d("Species", option = "viridis") +
    scale_color_viridis_d("Species", option = "viridis") +
    xlab("Flipper length (mm)") +
    ylab("Body mass (g)") +
    theme_bw()

  return(bm_plot)

}
Notice that we are now referring to cleaned_data as the input to this function. You could name this argument whatever you want, but being clear about which upstream targets or data sources are being fed into downstream steps can help keep the workflow organized.
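If you did want to export the figure as an image file, one possible approach mirrors the .csv export: save the file, then return its path. The sketch below is an assumption about how that could look, not part of the workflow in this chapter; save_plot() is a hypothetical helper, and the built-in mtcars data stands in for the penguin data.

```r
library(ggplot2)

# Hypothetical helper (not part of the workflow above): save a plot to disk
# and return the file path so the pipeline could track the image
save_plot <- function(plot_object, out_path){

  ggsave(filename = out_path, plot = plot_object,
         width = 6, height = 4, units = "in")

  return(out_path)

}

# Usage with a simple stand-in plot and a temporary file
example_plot <- ggplot(data = mtcars, aes(x = wt, y = mpg)) +
  geom_point()

saved_path <- save_plot(plot_object = example_plot,
                        out_path = tempfile(fileext = ".png"))
```

The width and height used here are arbitrary choices; adjust them to suit the figure.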
Now having coded up a plot that visualizes the relationship we’re interested in, let’s describe it with a model as well. This function will be pretty minimal:
run_model <- function(cleaned_data, model_string){

  penguin_model <- lm(data = cleaned_data,
                      formula = model_string)

  return(penguin_model)

}
Here we are asking the user for a model_string as input, by which we mean a character string written in modeling formula format. For example, when we use this function in the next section we will use the string "body_mass_g ~ flipper_length_mm * species". If you wanted to allow the user to provide response and explanatory variables as separate arguments, you could include a step in the function that combines them into a formula string using paste() or paste0().
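As a rough sketch of that alternative, the function below builds the formula string from separate arguments. The function name run_model_from_vars() and its response and predictors arguments are hypothetical, and joining the predictors with * (so that interactions are included) is just one possible choice.

```r
# Hypothetical variant of run_model(): build the formula string from separate
# response and explanatory variable arguments
run_model_from_vars <- function(cleaned_data, response, predictors){

  # Combine the pieces into a string such as "y ~ x1 * x2"
  model_string <- paste(response, "~", paste(predictors, collapse = " * "))

  # Convert the string to a formula before fitting
  penguin_model <- lm(formula = as.formula(model_string),
                      data = cleaned_data)

  return(penguin_model)

}

# Usage with a built-in dataset standing in for the penguin data
example_model <- run_model_from_vars(cleaned_data = mtcars,
                                     response = "mpg",
                                     predictors = c("wt", "hp"))
```

With the penguin data, response = "body_mass_g" and predictors = c("flipper_length_mm", "species") would produce the same formula string used in the next section.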
With this complete, we are now finished putting our functions.R script together. Congratulations!
Here is what the final script will look like:
download_penguins <- function(out_path){

  # Instructions adapted from code written by Julien Brun and posted on the
  # palmerpenguins package site:
  # https://allisonhorst.github.io/palmerpenguins/articles/download.html

  # Download and compile datasets into a single data frame
  penguin_data <- map_dfr(.x = c(
    # Adelie penguin data
    "https://portal.edirepository.org/nis/dataviewer?packageid=knb-lter-pal.219.3&entityid=002f3893385f710df69eeebe893144ff",
    # Gentoo penguin data
    "https://portal.edirepository.org/nis/dataviewer?packageid=knb-lter-pal.220.3&entityid=e03b43c924f226486f2f0ab6709d2381",
    # Chinstrap penguin data
    "https://portal.edirepository.org/nis/dataviewer?packageid=knb-lter-pal.221.2&entityid=fe853aa8f7a59aa84cdd3197619ef462"
  ),
  .f = read_csv)

  # Export data frame as .csv after cleaning up column names for easier use
  write_csv(x = clean_names(penguin_data),
            file = out_path)

  return(out_path)

}

clean_dataset <- function(file_path){

  cleaned_data <- read_csv(file = file_path) %>%
    mutate(
      # Remove unneeded text from column
      species = str_extract(string = species,
                            pattern = "Chinstrap|Adelie|Gentoo"),
      # Extract year from date
      year = year(date_egg)) %>%
    # Select columns of interest
    select(species, island, year, culmen_length_mm, culmen_depth_mm,
           flipper_length_mm, body_mass_g, sex) %>%
    # Remove rows with incomplete records
    filter(!if_any(everything(),
                   is.na))

  return(cleaned_data)

}

plot_body_mass <- function(cleaned_data){

  # Plot the relationship between flipper length, body mass, and species
  bm_plot <- ggplot(data = cleaned_data,
                    aes(x = flipper_length_mm, y = body_mass_g)) +
    geom_point(aes(fill = species),
               shape = 21, color = "black", size = 2.5, alpha = 0.6) +
    geom_smooth(aes(color = species), method = "lm", se = FALSE) +
    scale_fill_viridis_d("Species", option = "viridis") +
    scale_color_viridis_d("Species", option = "viridis") +
    xlab("Flipper length (mm)") +
    ylab("Body mass (g)") +
    theme_bw()

  return(bm_plot)

}

run_model <- function(cleaned_data, model_string){

  penguin_model <- lm(data = cleaned_data,
                      formula = model_string)

  return(penguin_model)

}