Chapter 11 Running and updating the pipeline
11.1 Run the pipeline
We now have a complete targets pipeline written, but it still needs to be run. We can make sure that the pipeline is ready to use with tar_validate().
tar_validate()
This function returns a NULL value if no problems are found in the targets pipeline. So if you just get a blank response, everything is working as intended.
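If you’d rather see an explicit confirmation than a blank line, note that tar_validate() returns its NULL invisibly and throws an error when it finds a problem, so a small check like the following works. This is just a convenience sketch, not part of the pipeline itself:
# tar_validate() errors out if anything is wrong, so reaching this
# line at all means the pipeline is valid; is.null() makes that visible
is.null(tar_validate())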
Now that we have this reassurance, let’s run tar_make(). This will build the pipeline.
tar_make()
As the pipeline is built, progress messages will be printed to the R console. It should look something like this if everything is working correctly:
* start target penguin_data
Rows: 152 Columns: 17
-- Column specification --------------------------------------------------------
Delimiter: ","
chr (9): studyName, Species, Region, Island, Stage, Individual ID, Clutch C...
dbl (7): Sample Number, Culmen Length (mm), Culmen Depth (mm), Flipper Leng...
date (1): Date Egg
i Use `spec()` to retrieve the full column specification for this data.
i Specify the column types or set `show_col_types = FALSE` to quiet this message.
Rows: 124 Columns: 17
-- Column specification --------------------------------------------------------
Delimiter: ","
chr (9): studyName, Species, Region, Island, Stage, Individual ID, Clutch C...
dbl (7): Sample Number, Culmen Length (mm), Culmen Depth (mm), Flipper Leng...
date (1): Date Egg
i Use `spec()` to retrieve the full column specification for this data.
i Specify the column types or set `show_col_types = FALSE` to quiet this message.
Rows: 68 Columns: 17
-- Column specification --------------------------------------------------------
Delimiter: ","
chr (9): studyName, Species, Region, Island, Stage, Individual ID, Clutch C...
dbl (7): Sample Number, Culmen Length (mm), Culmen Depth (mm), Flipper Leng...
date (1): Date Egg
i Use `spec()` to retrieve the full column specification for this data.
i Specify the column types or set `show_col_types = FALSE` to quiet this message.
* built target penguin_data
* start target penguin_file
* built target penguin_file
* start target cleaned_data
Rows: 344 Columns: 17
-- Column specification --------------------------------------------------------
Delimiter: ","
chr (9): study_name, species, region, island, stage, individual_id, clutch_...
dbl (7): sample_number, culmen_length_mm, culmen_depth_mm, flipper_length_m...
date (1): date_egg
i Use `spec()` to retrieve the full column specification for this data.
i Specify the column types or set `show_col_types = FALSE` to quiet this message.
* built target cleaned_data
* start target exploratory_plot
* built target exploratory_plot
* start target penguin_model
* built target penguin_model
* start target analysis_report
* built target analysis_report
* end pipeline
A lot of what’s printed above is just the output from read_csv() calls. Below are the highlights of the main things happening in the targets build process:
* start target penguin_data
* built target penguin_data
* start target penguin_file
* built target penguin_file
* start target cleaned_data
* built target cleaned_data
* start target exploratory_plot
* built target exploratory_plot
* start target penguin_model
* built target penguin_model
* start target analysis_report
* built target analysis_report
* end pipeline
Hopefully these messages make sense to you. They are the steps being taken by targets to build the pipeline. One by one, in an order determined by each step’s dependencies, targets are started, and a message is returned when each is completed (“built”). Once everything has been run, including our R Markdown report, the pipeline is complete: * end pipeline.
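Incidentally, the read_csv() messages themselves tell you how to silence them: set show_col_types = FALSE. A minimal sketch, using a hypothetical quiet_read() helper (not part of our pipeline) to show the argument in isolation:
library(readr)

quiet_read <- function(file_path) {
  # show_col_types = FALSE suppresses the column specification
  # message that read_csv() prints by default
  read_csv(file = file_path, show_col_types = FALSE)
}
You could pass show_col_types = FALSE the same way inside the functions in R/functions.R if the console noise bothers you.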
11.2 Review outputs
Go ahead and find R/analysis_report.html and open it up. It should look something like the image below if things worked properly.
If you have any questions about your pipeline outputs, or just want to investigate them manually, you can use the tar_load() and tar_read() functions again. For example, maybe we want to review what the cleaned_data target looks like.
tar_read(cleaned_data) %>%
  head()
species | island    | culmen_length_mm | culmen_depth_mm | flipper_length_mm | body_mass_g | sex
--------|-----------|------------------|-----------------|-------------------|-------------|-------
Adelie  | Torgersen | 39.1             | 18.7            | 181               | 3750        | MALE
Adelie  | Torgersen | 39.5             | 17.4            | 186               | 3800        | FEMALE
Adelie  | Torgersen | 40.3             | 18.0            | 195               | 3250        | FEMALE
Adelie  | Torgersen | 36.7             | 19.3            | 193               | 3450        | FEMALE
Adelie  | Torgersen | 39.3             | 20.6            | 190               | 3650        | MALE
Adelie  | Torgersen | 38.9             | 17.8            | 181               | 3625        | FEMALE
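tar_load() works similarly, but assigns the target’s value into your environment under its own name, which is convenient for inspecting objects like the fitted model. A quick sketch using the penguin_model target from our pipeline (output omitted):
# Load penguin_model into the global environment, then inspect it
tar_load(penguin_model)
summary(penguin_model)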
11.3 Updating the pipeline
Inevitably you will want to make changes to this pipeline after you finish your initial draft of it. All research and data science projects receive edits at some point, whether due to a change in plans or an error that needs to be fixed.
targets is built to make it easy to change complex analytical workflows. It will notice when we’ve made changes and will rebuild any parts of our pipeline that are affected by upstream changes.
Right now, our completed pipeline looks like this, with green symbols denoting targets that were completed:
Let’s say that upon our review of cleaned_data above, we decided that we don’t need to include a year column, because we aren’t actually using it. We can go back and change the function definition in R/functions.R to this:
clean_dataset <- function(file_path){
  cleaned_data <- read_csv(file = file_path) %>%
    mutate(
      # Remove unneeded text from column
      species = str_extract(string = species,
                            pattern = "Chinstrap|Adelie|Gentoo")) %>%
    # Select columns of interest
    select(species, island, culmen_length_mm, culmen_depth_mm,
           flipper_length_mm, body_mass_g, sex) %>%
    # Remove rows with incomplete records
    filter(!if_any(everything(), is.na))
  return(cleaned_data)
}
We can also remove packages = c("tidyverse", "lubridate") from the cleaned_data target in _targets.R, because it no longer needs lubridate, and tidyverse will be loaded by default.
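For illustration, the simplified target might then look something like this. The command assumes cleaned_data is built from the penguin_file target, as in our pipeline; match the names to whatever you actually wrote in _targets.R:
# cleaned_data now relies only on the default packages,
# so no packages argument is needed
tar_target(name = cleaned_data,
           command = clean_dataset(file_path = penguin_file))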
Let’s check in on our pipeline with tar_visnetwork() again:
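tar_visnetwork()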
Now there are light blue symbols to indicate targets that are out of date due to our changes to the clean_dataset() function. targets sees that we made these changes and tracks their downstream consequences in the workflow.
Run tar_make() and see what’s remade:
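tar_make()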
v skip target penguin_data
v skip target penguin_file
* start target cleaned_data
Rows: 344 Columns: 17
-- Column specification --------------------------------------------------------
Delimiter: ","
chr (9): study_name, species, region, island, stage, individual_id, clutch_...
dbl (7): sample_number, culmen_length_mm, culmen_depth_mm, flipper_length_m...
date (1): date_egg
i Use `spec()` to retrieve the full column specification for this data.
i Specify the column types or set `show_col_types = FALSE` to quiet this message.
* built target cleaned_data
* start target exploratory_plot
* built target exploratory_plot
* start target penguin_model
* built target penguin_model
* start target analysis_report
* built target analysis_report
* end pipeline
Note the skip messages above. targets is able to detect that penguin_data and penguin_file haven’t been changed during our edits, and they’re upstream of cleaned_data, so they can be left alone. This kind of dependency tracking can save a lot of time in larger analytical projects.
11.4 Invalidating the pipeline
In some cases you may want to rebuild a completed pipeline from scratch. For example, maybe you want to check whether values have changed, or whether you can reproduce someone else’s work. To do this you can run tar_destroy(). Alternatively, to force just some (or all) of your targets to re-run, you can use tar_invalidate() and provide the names of the targets you’d like to rebuild.
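For example, to force the model target to be rebuilt on the next tar_make(); tar_invalidate() also accepts tidyselect helpers, and the target names here are the ones from our pipeline:
# Mark penguin_model as invalid so the next tar_make() reruns it
tar_invalidate(penguin_model)

# tidyselect helpers work too, e.g. every target starting with "penguin"
tar_invalidate(starts_with("penguin"))
Here, though, let’s go the full-rebuild route with tar_destroy():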
tar_destroy()
Delete _targets? (Set the TAR_ASK environment variable to "false" to disable this menu, e.g. usethis::edit_r_environ().)
1: yes
2: no
Selection: 1
Now we can check to confirm what needs rebuilding using tar_outdated():
tar_outdated()
[1] "analysis_report" "cleaned_data" "penguin_file" "penguin_data" "exploratory_plot"
[6] "penguin_model"