R has {drake} which makes it easy to make reproducible data pipelines. Does Python have a similar package?

Question

See R's {drake}. It allows you to define a reproducible pipeline:

plan <- drake_plan(
  raw_data = readxl::read_excel(file_in("raw_data.xlsx")),
  data = raw_data %>%
    mutate(Species = forcats::fct_inorder(Species)),
  hist = create_plot(data),
  fit = lm(Sepal.Width ~ Petal.Width + Species, data),
  report = rmarkdown::render(
    knitr_in("report.Rmd"),
    output_file = file_out("report.html"),
    quiet = TRUE
  )
)

# call the pipeline
make(plan)

The great thing about drake is that you can reload any of raw_data, data, hist, fit, or report at any point. And if you change part of the code and call make(plan) again, {drake} will figure out which targets have changed and rerun only those.
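To make the behaviour I am after concrete, here is a minimal Python sketch of the same idea, using only the standard library. It is not drake's implementation or any existing package's API: the `Plan` class, its `target`, `make`, and `load` methods, and the toy targets are all hypothetical names made up for illustration. It assumes each target is a function whose parameter names are the targets it depends on, and it treats a target as outdated when its compiled code or any upstream fingerprint has changed.

```python
import hashlib
import inspect
import pickle
import tempfile
from pathlib import Path


class Plan:
    """Hypothetical mini-plan: each target is a function, and its
    parameter names are the names of the targets it depends on."""

    def __init__(self, cache_dir):
        self.cache_dir = Path(cache_dir)
        self.targets = {}

    def target(self, func):
        # Register (or replace) a target under the function's name.
        self.targets[func.__name__] = func
        return func

    def _fingerprint(self, func, dep_keys):
        # Simplification: hash the compiled bytecode, constants, and names
        # rather than the source text, then mix in upstream fingerprints.
        code = func.__code__
        h = hashlib.sha256()
        h.update(code.co_code)
        h.update(repr((code.co_consts, code.co_names)).encode())
        for key in dep_keys:
            h.update(key.encode())
        return h.hexdigest()

    def _build(self, name, done, rebuilt):
        if name in done:
            return done[name]
        func = self.targets[name]
        deps = [self._build(dep, done, rebuilt)
                for dep in inspect.signature(func).parameters]
        key = self._fingerprint(func, [k for k, _ in deps])
        path = self.cache_dir / f"{name}.pkl"
        if path.exists():
            cached_key, value = pickle.loads(path.read_bytes())
        else:
            cached_key, value = None, None
        if cached_key != key:
            # Outdated or never built: run the step and store the result.
            value = func(*[v for _, v in deps])
            path.write_bytes(pickle.dumps((key, value)))
            rebuilt.append(name)
        done[name] = (key, value)
        return done[name]

    def make(self):
        """Bring every target up to date; return the names that were rebuilt."""
        done, rebuilt = {}, []
        for name in self.targets:
            self._build(name, done, rebuilt)
        return rebuilt

    def load(self, name):
        """Reload a cached target, loosely like drake's readd()."""
        return pickle.loads((self.cache_dir / f"{name}.pkl").read_bytes())[1]


with tempfile.TemporaryDirectory() as tmp:
    plan = Plan(tmp)

    @plan.target
    def raw_data():
        return [3, 1, 2]

    @plan.target
    def data(raw_data):
        return sorted(raw_data)

    @plan.target
    def total(data):
        return sum(data)

    first = plan.make()    # everything is built
    second = plan.make()   # nothing changed, so nothing is rebuilt

    @plan.target
    def data(raw_data):    # edit one step of the plan
        return sorted(raw_data, reverse=True)

    third = plan.make()    # data and its downstream target are rebuilt
    reloaded = plan.load("data")
```

In this sketch `first` lists all three targets, `second` is empty, and `third` contains only `data` and `total`. A real tool would also need to track input and output files (what file_in and file_out do in the plan above), which this sketch leaves out.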

I see a contributor with a similar name for this project. Are you advertising your project here? Is this not against the rules? – Valentas, Oct 01 '19 at 08:29
Wow! I am asking for a Python equivalent because I want to do the same. Also, you can see that I only made one contribution, which was minor. Also, the author of drake included disk.frame, which is one of my packages. – xiaodai, Oct 01 '19 at 08:47
The question is still open. Can you accept my answer? If it's not satisfactory, you can comment and I can try to help you more ^^ – Ilker Kurtulus, Nov 27 '19 at 08:18

score 3 · Answer 1 · Ilker Kurtulus · 2019-11-27T08:17:01.633

scikit-learn has pipelines. If your processing steps are objects that implement fit and transform, you can chain them into a single pipeline with the Pipeline class in sklearn.pipeline.

Read the docs:

https://scikit-learn.org/stable/modules/generated/sklearn.pipeline.Pipeline.html

Additionally, you can save and load a pipeline object with joblib.dump and joblib.load.
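A minimal sketch of what that looks like, assuming scikit-learn and joblib are installed. The choice of steps (StandardScaler followed by LogisticRegression), the bundled iris dataset, and the file name pipeline.joblib are only illustrative and not taken from the question:

```python
import tempfile
from pathlib import Path

import joblib
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Small built-in dataset, loaded as feature matrix X and labels y.
X, y = load_iris(return_X_y=True)

# Each step is a (name, estimator) pair; every step except the last
# must implement fit and transform.
pipe = Pipeline([
    ("scale", StandardScaler()),
    ("model", LogisticRegression(max_iter=1000)),
])
pipe.fit(X, y)

# Persist the fitted pipeline to disk and load it back.
with tempfile.TemporaryDirectory() as tmp:
    path = str(Path(tmp) / "pipeline.joblib")  # illustrative file name
    joblib.dump(pipe, path)
    restored = joblib.load(path)

# The restored pipeline should predict exactly like the original one.
same_predictions = bool((restored.predict(X) == pipe.predict(X)).all())
```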

edited Nov 27 '19 at 08:17

answered Oct 01 '19 at 08:13

Ilker Kurtulus


it's not the same as drake at all – xiaodai Nov 27 '19 at 23:00

score 2 · Answer 2 · answered Oct 03 '19 at 19:25

For larger projects, snakemake is the way to go for Python (it extends Python syntax, so valid Python is valid snakemake). It originates in bioinformatics and even has its own publication; it is widely adopted and used by many projects (see the literature list in the first link or the citations for the linked article).

For Jupyter notebook based projects, I made an experiment called nbpipeline which you may be interested in.

score 1 · Answer 3 · answered Feb 23 '20 at 23:20

Ploomber works the same way. It keeps track of your source code and runs only the outdated steps to bring your pipeline up to date: https://github.com/ploomber/ploomber

Disclaimer: I'm the project's author

Eduardo
