5

See R's {drake}. It allows you to define a reproducible pipeline like this:

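library(drake)
library(dplyr)  # provides %>% and mutate()

# create_plot() is assumed to be a user-defined plotting function
plan <- drake_plan(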
  raw_data = readxl::read_excel(file_in("raw_data.xlsx")),
  data = raw_data %>%
    mutate(Species = forcats::fct_inorder(Species)),
  hist = create_plot(data),
  fit = lm(Sepal.Width ~ Petal.Width + Species, data),
  report = rmarkdown::render(
    knitr_in("report.Rmd"),
    output_file = file_out("report.html"),
    quiet = TRUE
  )
)

# run the pipeline: builds every target that is missing or out of date
make(plan)

The great thing about drake is that you can reload any of raw_data, data, hist, fit, or report at any point (for example with loadd() or readd()). And if you change part of the code and run make(plan) again, {drake} will figure out which targets are out of date and rerun only those.
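
To make the requested behaviour concrete in Python terms, here is a minimal standard-library sketch of the same idea: each step is cached on disk and reruns only when its own code or an upstream step has changed. It is not how {drake} is implemented. MiniPipeline, step, make and read are hypothetical names invented for this illustration, and fingerprinting a function by its bytecode is a crude stand-in for real code-change detection.

```python
import hashlib
import pickle
import tempfile
from pathlib import Path


class MiniPipeline:
    """Toy make-style pipeline (hypothetical, for illustration only)."""

    def __init__(self, cache_dir):
        self.cache_dir = Path(cache_dir)
        self.cache_dir.mkdir(parents=True, exist_ok=True)
        self.steps = {}  # name -> (function, list of dependency names)

    def step(self, name, func, deps=()):
        # Assumption: steps are registered in dependency order.
        self.steps[name] = (func, list(deps))

    def _fingerprint(self, name, dep_prints):
        # Hash the step's bytecode, constants and referenced names,
        # plus the fingerprints of everything it depends on.
        code = self.steps[name][0].__code__
        payload = repr((code.co_code, code.co_consts, code.co_names, dep_prints))
        return hashlib.sha256(payload.encode()).hexdigest()

    def make(self):
        """Bring every step up to date; return the names that actually ran."""
        ran, prints, values = [], {}, {}
        for name, (func, deps) in self.steps.items():
            prints[name] = self._fingerprint(name, [prints[d] for d in deps])
            meta = self.cache_dir / f"{name}.sha"
            data = self.cache_dir / f"{name}.pkl"
            if meta.exists() and data.exists() and meta.read_text() == prints[name]:
                # Cached result is still valid: load it instead of recomputing.
                values[name] = pickle.loads(data.read_bytes())
            else:
                values[name] = func(*[values[d] for d in deps])
                data.write_bytes(pickle.dumps(values[name]))
                meta.write_text(prints[name])
                ran.append(name)
        return ran

    def read(self, name):
        """Reload a cached result, loosely analogous to reloading a target."""
        return pickle.loads((self.cache_dir / f"{name}.pkl").read_bytes())


def load():
    return [1, 2, 3, 4]


def total(raw):
    return sum(raw)


def total_doubled(raw):
    return 2 * sum(raw)


with tempfile.TemporaryDirectory() as tmp:
    pipe = MiniPipeline(tmp)
    pipe.step("raw_data", load)
    pipe.step("total", total, deps=["raw_data"])
    first = pipe.make()   # first run: every step executes
    second = pipe.make()  # nothing changed: nothing executes
    # Simulate editing the code of one step.
    pipe.step("total", total_doubled, deps=["raw_data"])
    third = pipe.make()   # only the edited step executes
    result = pipe.read("total")
```

Here first lists both steps, second is empty, and third contains only "total", which is the behaviour being asked for. A real tool would also need to track input files and handle steps declared out of order.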

xiaodai
  • I see a contributor with a similar name for this project. Are you advertising your project here? Is this not against the rules? – Valentas Oct 01 '19 at 08:29
  • Wow! I am asking for a Python equivalent because I want to do the same. Also, you saw that I only made one contribution, which was minor. The author of drake also included disk.frame, which is one of my packages. – xiaodai Oct 01 '19 at 08:47
  • The question is still open. Can you accept my answer? If it's not satisfactory, you can comment and I can try to help you more ^^ – Ilker Kurtulus Nov 27 '19 at 08:18

3 Answers

3

scikit-learn has pipelines. If your steps are objects that implement fit and transform, you can chain them in sequence with the Pipeline class in sklearn.pipeline.

Read the docs:

https://scikit-learn.org/stable/modules/generated/sklearn.pipeline.Pipeline.html

Additionally, you can save and load a pipeline object with joblib.dump and joblib.load.
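
A minimal sketch of both points, assuming scikit-learn and joblib are installed. The iris dataset, the step names "scale" and "clf", and the file name pipeline.joblib are arbitrary choices for illustration, not part of the original answer.

```python
import tempfile
from pathlib import Path

import joblib
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_iris(return_X_y=True)

# Each intermediate step implements fit/transform; the last step is an estimator.
pipe = Pipeline([
    ("scale", StandardScaler()),
    ("clf", LogisticRegression(max_iter=1000)),
])
pipe.fit(X, y)

# Persist the fitted pipeline and load it back.
with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "pipeline.joblib"
    joblib.dump(pipe, path)
    restored = joblib.load(path)

# The restored pipeline should predict exactly like the original.
same_predictions = bool((restored.predict(X) == pipe.predict(X)).all())
```

Note that this chains and persists model steps; on its own it does not track which steps are out of date the way make(plan) does, although Pipeline accepts a memory argument for caching fitted transformers.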

Ilker Kurtulus
2

For larger projects, snakemake is a good way to go for Python (it extends Python syntax, so valid Python is valid snakemake). It originates in bioinformatics and even has its own publication; it is widely adopted and used by many projects (see the literature list in the first link or the citations for the linked article).

For Jupyter notebook based projects, I made an experiment called nbpipeline which you may be interested in.

krassowski
1

Ploomber works the same way: it keeps track of your source code and runs only the outdated steps to bring your pipeline up to date. See https://github.com/ploomber/ploomber

Disclaimer: I'm the project's author

Eduardo