31
Manage your analyses workows with the drake R package Grenoble R Users Group Xavier Laviron December 6, 2018

Manage your analyses work˚ows with thedrakeR package · 2020-02-14 · I Cheatsheet I FAQ The package is in active development and there are a lot of other functionnalities 19/23

  • Upload
    others

  • View
    2

  • Download
    0

Embed Size (px)

Citation preview

Page 1: Manage your analyses work˚ows with thedrakeR package · 2020-02-14 · I Cheatsheet I FAQ The package is in active development and there are a lot of other functionnalities 19/23

Manage your analyses work�owswith the drake R package

Grenoble R Users Group

Xavier Laviron

December 6, 2018

Page 2: Manage your analyses work˚ows with thedrakeR package · 2020-02-14 · I Cheatsheet I FAQ The package is in active development and there are a lot of other functionnalities 19/23

A data analyst’s job

1/23

Page 3: Manage your analyses work˚ows with thedrakeR package · 2020-02-14 · I Cheatsheet I FAQ The package is in active development and there are a lot of other functionnalities 19/23

A data analyst’s job

2/23

Page 4: Manage your analyses work˚ows with thedrakeR package · 2020-02-14 · I Cheatsheet I FAQ The package is in active development and there are a lot of other functionnalities 19/23

A data analyst’s job

3/23

Page 5: Manage your analyses work˚ows with thedrakeR package · 2020-02-14 · I Cheatsheet I FAQ The package is in active development and there are a lot of other functionnalities 19/23

A data analyst’s job

4/23

Page 6: Manage your analyses work˚ows with thedrakeR package · 2020-02-14 · I Cheatsheet I FAQ The package is in active development and there are a lot of other functionnalities 19/23

A data analyst’s job

2 options:

I Run everything from scratch (simple, but can be too long. . . )I Track the dependencies between your objects (boring, perfect

job for a pipeline toolkit. . . )

5/23

Page 7: Manage your analyses work˚ows with thedrakeR package · 2020-02-14 · I Cheatsheet I FAQ The package is in active development and there are a lot of other functionnalities 19/23

The drake package is here to help youWhy use drake?

I Keeps track of dependencies in your codeI Keeps track of changes in your codeI Runs only what needs to be run, and skip the restI It has a cool name :-)

In other words, ‘drake‘ can save a lot of time!**more time for co�ee breaks

6/23

Page 8: Manage your analyses work˚ows with thedrakeR package · 2020-02-14 · I Cheatsheet I FAQ The package is in active development and there are a lot of other functionnalities 19/23

The drake package is here to help youWhy use drake?

I Keeps track of dependencies in your codeI Keeps track of changes in your codeI Runs only what needs to be run, and skip the restI It has a cool name :-)

In other words, ‘drake‘ can save a lot of time!**more time for co�ee breaks

6/23

Page 9: Manage your analyses work˚ows with thedrakeR package · 2020-02-14 · I Cheatsheet I FAQ The package is in active development and there are a lot of other functionnalities 19/23

drake tracks changes in functions

Encapsulate your code in functions:

# Process the dataprocess_data <- function(raw.data) {

raw.data[raw.data$Sepal.Length > 5, ]}

# fit a modelfit_model <- function(data) {

lm(Sepal.Length ~ Petal.Width + Species, data = data)}

# create plotscreate_plot <- function(data) {

ggplot(data, aes(x = Petal.Width, fill = Species)) +geom_histogram()

}

7/23

Page 10: Manage your analyses work˚ows with thedrakeR package · 2020-02-14 · I Cheatsheet I FAQ The package is in active development and there are a lot of other functionnalities 19/23

The plan

The central piece of ‘drake‘: the work�ow plan

The plan is a simple data.frame with two columns:

I target: the objects you want to buildI command: the functions to build them

Di�erent ways to create the plan:

I Like any data.frame: data.frame(), expand.grid(), . . .I With one of drake’s helper functions: drake_plan(),

evaluate_plan(), . . .

8/23

Page 11: Manage your analyses work˚ows with thedrakeR package · 2020-02-14 · I Cheatsheet I FAQ The package is in active development and there are a lot of other functionnalities 19/23

The plan

The central piece of ‘drake‘: the work�ow plan

The plan is a simple data.frame with two columns:

I target: the objects you want to buildI command: the functions to build them

Di�erent ways to create the plan:

I Like any data.frame: data.frame(), expand.grid(), . . .I With one of drake’s helper functions: drake_plan(),

evaluate_plan(), . . .

8/23

Page 12: Manage your analyses work˚ows with thedrakeR package · 2020-02-14 · I Cheatsheet I FAQ The package is in active development and there are a lot of other functionnalities 19/23

The drake_plan() function

Usage:

drake_plan(target1 = command1, target2 = command2, ...)

my.plan <-drake_plan(raw.data = read.csv(file_in("data/raw_data.csv")),

proc.data = process_data(raw.data),

plot = create_plot(proc.data),

model = fit_model(proc.data),

report = render(input = knitr_in("report.Rmd"),output_file = file_out("report.pdf"),quiet = TRUE))

9/23

Page 13: Manage your analyses work˚ows with thedrakeR package · 2020-02-14 · I Cheatsheet I FAQ The package is in active development and there are a lot of other functionnalities 19/23

The drake_plan() function

Usage:

drake_plan(target1 = command1, target2 = command2, ...)

my.plan <-drake_plan(raw.data = read.csv(file_in("data/raw_data.csv")),

proc.data = process_data(raw.data),

plot = create_plot(proc.data),

model = fit_model(proc.data),

report = render(input = knitr_in("report.Rmd"),output_file = file_out("report.pdf"),quiet = TRUE))

9/23

Page 14: Manage your analyses work˚ows with thedrakeR package · 2020-02-14 · I Cheatsheet I FAQ The package is in active development and there are a lot of other functionnalities 19/23

The drake_plan() function

print(my.plan)

## # A tibble: 5 x 2## target command## * <chr> <chr>## 1 raw.data read.csv(file_in('data/raw_data.csv'))## 2 proc.data process_data(raw.data)## 3 plot create_plot(proc.data)## 4 model fit_model(proc.data)## 5 report "render(input = knitr_in('report.Rmd'), output_file = file_ou~

10/23

Page 15: Manage your analyses work˚ows with thedrakeR package · 2020-02-14 · I Cheatsheet I FAQ The package is in active development and there are a lot of other functionnalities 19/23

Files dependencies

Files are not tracked by drake, you have to declare them explicitlyas dependencies:

I file_in("some_data.csv"): an input �leI file_out("some_data.Rds"): an output �leI knitr_in("report.Rmd"): an rmarkdown �le, drake will

scan it to �nd its dependencies

11/23

Page 16: Manage your analyses work˚ows with thedrakeR package · 2020-02-14 · I Cheatsheet I FAQ The package is in active development and there are a lot of other functionnalities 19/23

The dependency graphvis_drake_graph(drake_config(my.plan), from = "raw.data")

Dependencygraph

12/23

Page 17: Manage your analyses work˚ows with thedrakeR package · 2020-02-14 · I Cheatsheet I FAQ The package is in active development and there are a lot of other functionnalities 19/23

The make() commandThe central command of drake, runs everything that needs to run.

make(my.plan)

vis_drake_graph(drake_config(my.plan), from = "raw.data")

Dependencygraph

13/23

Page 18: Manage your analyses work˚ows with thedrakeR package · 2020-02-14 · I Cheatsheet I FAQ The package is in active development and there are a lot of other functionnalities 19/23

The make() commandThe central command of drake, runs everything that needs to run.

make(my.plan)

vis_drake_graph(drake_config(my.plan), from = "raw.data")

Dependencygraph

13/23

Page 19: Manage your analyses work˚ows with thedrakeR package · 2020-02-14 · I Cheatsheet I FAQ The package is in active development and there are a lot of other functionnalities 19/23

Accessing the objects

All objects are stored in a hidden cache (.drake/). To access them:

loadd(model)model <- readd(model)

print(readd(model))

#### Call:## lm(formula = Sepal.Length ~ Petal.Width + Species, data = data)#### Coefficients:## (Intercept) Petal.Width Speciesversicolor## 5.13118 0.65802 -0.01955## Speciesvirginica## 0.15373

14/23

Page 20: Manage your analyses work˚ows with thedrakeR package · 2020-02-14 · I Cheatsheet I FAQ The package is in active development and there are a lot of other functionnalities 19/23

Accessing the objects

All objects are stored in a hidden cache (.drake/). To access them:

loadd(model)model <- readd(model)

print(readd(model))

#### Call:## lm(formula = Sepal.Length ~ Petal.Width + Species, data = data)#### Coefficients:## (Intercept) Petal.Width Speciesversicolor## 5.13118 0.65802 -0.01955## Speciesvirginica## 0.15373

14/23

Page 21: Manage your analyses work˚ows with thedrakeR package · 2020-02-14 · I Cheatsheet I FAQ The package is in active development and there are a lot of other functionnalities 19/23

An update in the code!What happens if we modify a function?create_plot <- function(data) {

ggplot(data, aes(x = Petal.Width, y = Sepal.Width, fill = Species)) +geom_point()

}

vis_drake_graph(drake_config(my.plan), from = "raw.data")

Dependencygraph

15/23

Page 22: Manage your analyses work˚ows with thedrakeR package · 2020-02-14 · I Cheatsheet I FAQ The package is in active development and there are a lot of other functionnalities 19/23

An update in the code!What happens if we modify a function?create_plot <- function(data) {

ggplot(data, aes(x = Petal.Width, y = Sepal.Width, fill = Species)) +geom_point()

}

vis_drake_graph(drake_config(my.plan), from = "raw.data")

Dependencygraph

15/23

Page 23: Manage your analyses work˚ows with thedrakeR package · 2020-02-14 · I Cheatsheet I FAQ The package is in active development and there are a lot of other functionnalities 19/23

Other advantages

ReproducibiltyYou have proof of what is done:

make(my.plan)

## All targets are already up to date.

16/23

Page 24: Manage your analyses work˚ows with thedrakeR package · 2020-02-14 · I Cheatsheet I FAQ The package is in active development and there are a lot of other functionnalities 19/23

Other advantagesIndependant replication is made easy

I Your code is separated into functions: more readability andmaintainability

I The plan allows an independent user to easily understand theanalyses

I Restart everything from scratch easily:

outdated(drake_config(my.plan))

## character(0)

clean()outdated(drake_config(my.plan))

## [1] "model" "plot" "proc.data" "raw.data" "report"

17/23

Page 25: Manage your analyses work˚ows with thedrakeR package · 2020-02-14 · I Cheatsheet I FAQ The package is in active development and there are a lot of other functionnalities 19/23

Parallelization

I drake can manage multi-core computing (on a local machine ora HPC)

I Simply change the jobs argument of make():

make(my.plan, jobs = 2)

I drake will automatically know which targets can be run inparallel and which cannot

18/23

Page 26: Manage your analyses work˚ows with thedrakeR package · 2020-02-14 · I Cheatsheet I FAQ The package is in active development and there are a lot of other functionnalities 19/23

Parallelization

I drake can manage multi-core computing (on a local machine ora HPC)

I Simply change the jobs argument of make():

make(my.plan, jobs = 2)

I drake will automatically know which targets can be run inparallel and which cannot

18/23

Page 27: Manage your analyses work˚ows with thedrakeR package · 2020-02-14 · I Cheatsheet I FAQ The package is in active development and there are a lot of other functionnalities 19/23

Ressources

To go furtherhttps://github.com/ropensci/drake

I Online documentationI CheatsheetI FAQ

The package is in active development and there are a lot of otherfunctionnalities

19/23

Page 28: Manage your analyses work˚ows with thedrakeR package · 2020-02-14 · I Cheatsheet I FAQ The package is in active development and there are a lot of other functionnalities 19/23

Exercices

There exists a bunch of built-in examples, you can list them with:

drake_examples()

And then load one with:

drake_example("example_name")

This will create a directory with all the necessary �les, that you canopen in the IDE of your choice (Rstudio, vim, . . . ).

20/23

Page 29: Manage your analyses work˚ows with thedrakeR package · 2020-02-14 · I Cheatsheet I FAQ The package is in active development and there are a lot of other functionnalities 19/23

Exercice: The basic example

The most accessible example for beginners

drake_example("main")

21/23

Page 30: Manage your analyses work˚ows with thedrakeR package · 2020-02-14 · I Cheatsheet I FAQ The package is in active development and there are a lot of other functionnalities 19/23

Exercice: The mtcars example

This chapter is a walkthrough of drake’s main functionality basedon the mtcars example. It sets up the project and runs it repeatedlyto demonstrate drake’s most important functionality.

drake_example("mtcars")

22/23

Page 31: Manage your analyses work˚ows with thedrakeR package · 2020-02-14 · I Cheatsheet I FAQ The package is in active development and there are a lot of other functionnalities 19/23

Exercice: An analysis of R package downloadtrends

This example explores R package download trends using thecranlogs package, and it shows how drake’s custom triggers canhelp with work�ows with remote data sources.

drake_example("packages")

23/23