Chapter 11 Running and updating the pipeline
11.1 Run the pipeline
We now have a complete targets pipeline written, but it still needs to be run. We can make sure that the pipeline is ready to use with tar_validate().
tar_validate()
This function returns a NULL value if no problems are found in the targets pipeline. So if you just get a blank response, everything is working as intended.
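If you’d rather see an explicit confirmation than a blank line, note that tar_validate() returns its NULL invisibly and throws an error when it finds a problem, so a small check like the following works. This is just a convenience sketch, not part of the pipeline itself:
# tar_validate() errors out if anything is wrong, so reaching this
# line at all means the pipeline is valid; is.null() makes that visible
is.null(tar_validate())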
Now that we have this reassurance, let’s run tar_make(). This will build the pipeline.
tar_make()
As the pipeline is built, progress messages will be printed to the R console. It should look something like this if everything is working correctly:
* start target penguin_data
Rows: 152 Columns: 17
-- Column specification --------------------------------------------------------
Delimiter: ","
chr (9): studyName, Species, Region, Island, Stage, Individual ID, Clutch C...
dbl (7): Sample Number, Culmen Length (mm), Culmen Depth (mm), Flipper Leng...
date (1): Date Egg
i Use `spec()` to retrieve the full column specification for this data.
i Specify the column types or set `show_col_types = FALSE` to quiet this message.
Rows: 124 Columns: 17
-- Column specification --------------------------------------------------------
Delimiter: ","
chr (9): studyName, Species, Region, Island, Stage, Individual ID, Clutch C...
dbl (7): Sample Number, Culmen Length (mm), Culmen Depth (mm), Flipper Leng...
date (1): Date Egg
i Use `spec()` to retrieve the full column specification for this data.
i Specify the column types or set `show_col_types = FALSE` to quiet this message.
Rows: 68 Columns: 17
-- Column specification --------------------------------------------------------
Delimiter: ","
chr (9): studyName, Species, Region, Island, Stage, Individual ID, Clutch C...
dbl (7): Sample Number, Culmen Length (mm), Culmen Depth (mm), Flipper Leng...
date (1): Date Egg
i Use `spec()` to retrieve the full column specification for this data.
i Specify the column types or set `show_col_types = FALSE` to quiet this message.
* built target penguin_data
* start target penguin_file
* built target penguin_file
* start target cleaned_data
Rows: 344 Columns: 17
-- Column specification --------------------------------------------------------
Delimiter: ","
chr (9): study_name, species, region, island, stage, individual_id, clutch_...
dbl (7): sample_number, culmen_length_mm, culmen_depth_mm, flipper_length_m...
date (1): date_egg
i Use `spec()` to retrieve the full column specification for this data.
i Specify the column types or set `show_col_types = FALSE` to quiet this message.
* built target cleaned_data
* start target exploratory_plot
* built target exploratory_plot
* start target penguin_model
* built target penguin_model
* start target analysis_report
* built target analysis_report
* end pipeline
A lot of what’s printed above is just the output from read_csv() calls. Below are the highlights of the main things happening in the targets build process:
* start target penguin_data
* built target penguin_data
* start target penguin_file
* built target penguin_file
* start target cleaned_data
* built target cleaned_data
* start target exploratory_plot
* built target exploratory_plot
* start target penguin_model
* built target penguin_model
* start target analysis_report
* built target analysis_report
* end pipeline
Hopefully these messages make sense to you. They are the steps being taken by targets to build the pipeline. One by one, in an order determined by each step’s dependencies, targets are started, and a message is returned when each is completed (“built”). Once everything has been run, including our R Markdown report, the pipeline is complete: * end pipeline.
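Incidentally, the read_csv() messages themselves tell you how to silence them: set show_col_types = FALSE. A minimal sketch, using a hypothetical quiet_read() helper (not part of our pipeline) to show the argument in isolation:
library(readr)

quiet_read <- function(file_path) {
  # show_col_types = FALSE suppresses the column specification
  # message that read_csv() prints by default
  read_csv(file = file_path, show_col_types = FALSE)
}
You could pass show_col_types = FALSE the same way inside the functions in R/functions.R if the console noise bothers you.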
11.2 Review outputs
Go ahead and find R/analysis_report.html and open it up. It should look something like the image below if things worked properly.
If you have any questions about your pipeline outputs, or just want to investigate them manually, you can use the tar_load() and tar_read() functions again. For example, maybe we want to review what the cleaned_data target looks like.
tar_read(cleaned_data) %>%
  head()
species | island    | culmen_length_mm | culmen_depth_mm | flipper_length_mm | body_mass_g | sex
--------|-----------|------------------|-----------------|-------------------|-------------|-------
Adelie  | Torgersen | 39.1             | 18.7            | 181               | 3750        | MALE
Adelie  | Torgersen | 39.5             | 17.4            | 186               | 3800        | FEMALE
Adelie  | Torgersen | 40.3             | 18.0            | 195               | 3250        | FEMALE
Adelie  | Torgersen | 36.7             | 19.3            | 193               | 3450        | FEMALE
Adelie  | Torgersen | 39.3             | 20.6            | 190               | 3650        | MALE
Adelie  | Torgersen | 38.9             | 17.8            | 181               | 3625        | FEMALE
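tar_load() works similarly, but assigns the target’s value into your environment under its own name, which is convenient for inspecting objects like the fitted model. A quick sketch using the penguin_model target from our pipeline (output omitted):
# Load penguin_model into the global environment, then inspect it
tar_load(penguin_model)
summary(penguin_model)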
11.3 Updating the pipeline
Inevitably you will want to make changes to this pipeline after you finish your initial draft of it. All research and data science projects receive edits at some point, whether due to a change in plans or an error that needs to be fixed.
targets is built to make it easy to change complex analytical workflows. It will notice when we’ve made changes and will rebuild any parts of our pipeline that are affected by upstream changes.
Right now, our completed pipeline looks like this, with green symbols denoting targets that were completed:
Let’s say that upon our review of cleaned_data above, we decided that we don’t need to include a year column, because we aren’t actually using it. We can go back and change the function definition in R/functions.R to this:
clean_dataset <- function(file_path){
  cleaned_data <- read_csv(file = file_path) %>%
    mutate(
      # Remove unneeded text from column
      species = str_extract(string = species,
                            pattern = "Chinstrap|Adelie|Gentoo")) %>%
    # Select columns of interest
    select(species, island, culmen_length_mm, culmen_depth_mm,
           flipper_length_mm, body_mass_g, sex) %>%
    # Remove rows with incomplete records
    filter(!if_any(everything(), is.na))
  return(cleaned_data)
}
We can also remove packages = c("tidyverse", "lubridate") from the cleaned_data target in _targets.R, because it no longer needs lubridate, and tidyverse will be loaded by default.
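For illustration, the simplified target might then look something like this. The command assumes cleaned_data is built from the penguin_file target, as in our pipeline; match the names to whatever you actually wrote in _targets.R:
# cleaned_data now relies only on the default packages,
# so no packages argument is needed
tar_target(name = cleaned_data,
           command = clean_dataset(file_path = penguin_file))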
Let’s check in on our pipeline with tar_visnetwork() again:
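tar_visnetwork()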
Now there are light blue symbols to indicate targets that are out of date due to our changes to the clean_dataset() function. targets sees that we made these changes and tracks their downstream consequences in the workflow.
Run tar_make() and see what’s remade:
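tar_make()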
v skip target penguin_data
v skip target penguin_file
* start target cleaned_data
Rows: 344 Columns: 17
-- Column specification --------------------------------------------------------
Delimiter: ","
chr (9): study_name, species, region, island, stage, individual_id, clutch_...
dbl (7): sample_number, culmen_length_mm, culmen_depth_mm, flipper_length_m...
date (1): date_egg
i Use `spec()` to retrieve the full column specification for this data.
i Specify the column types or set `show_col_types = FALSE` to quiet this message.
* built target cleaned_data
* start target exploratory_plot
* built target exploratory_plot
* start target penguin_model
* built target penguin_model
* start target analysis_report
* built target analysis_report
* end pipeline
Note the skip messages above. targets is able to detect that penguin_data and penguin_file haven’t been changed during our edits, and they’re upstream of cleaned_data, so they can be left alone. This kind of dependency tracking can save a lot of time in larger analytical projects.
11.4 Invalidating the pipeline
In some cases you may want to rebuild a completed pipeline from scratch. For example, maybe you want to check whether values have changed, or whether you can reproduce someone else’s work. To do this you can run tar_destroy(). Alternatively, to force just some (or all) of your targets to re-run, you can use tar_invalidate() and provide the names of the targets you’d like to rebuild.
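For example, to force the model target to be rebuilt on the next tar_make(); tar_invalidate() also accepts tidyselect helpers, and the target names here are the ones from our pipeline:
# Mark penguin_model as invalid so the next tar_make() reruns it
tar_invalidate(penguin_model)

# tidyselect helpers work too, e.g. every target starting with "penguin"
tar_invalidate(starts_with("penguin"))
Here, though, let’s go the full-rebuild route with tar_destroy():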
tar_destroy()
Delete _targets? (Set the TAR_ASK environment variable to "false" to disable this menu, e.g. usethis::edit_r_environ().)
1: yes
2: no
Selection: 1
Now we can check to confirm what needs rebuilding using tar_outdated():
tar_outdated()
[1] "analysis_report" "cleaned_data" "penguin_file" "penguin_data" "exploratory_plot"
[6] "penguin_model"