Questions tagged [pipelines]

A pipeline is a sequence of functions (or the equivalent thereof), composed so that the output of one is the input of the next, in order to create a compound transformation. Famously, a shell pipeline looks like "command | command2 | command3" (but use the tag "pipe" for this). The term is also used in computer architecture to describe a sequence of serial stages that execute in parallel over the elements being fed into a pipe, in order to increase overall throughput.
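
As a minimal sketch of the first sense (function composition), assuming plain Python callables: the helper name `make_pipeline_fn` and the three stage functions below are hypothetical, chosen only for illustration, and do not come from any particular library.

```python
from functools import reduce


def make_pipeline_fn(*stages):
    """Compose stages left to right: the output of each stage feeds the next."""
    def run(value):
        return reduce(lambda acc, stage: stage(acc), stages, value)
    return run


# Hypothetical stages, for illustration only.
def strip_whitespace(text):
    return text.strip()


def to_lower(text):
    return text.lower()


def split_words(text):
    return text.split()


clean = make_pipeline_fn(strip_whitespace, to_lower, split_words)
result = clean("  Hello Pipeline World  ")
print(result)
```

Each stage only needs to accept the previous stage's output, which is what lets the stages be rearranged or swapped independently.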

In a command-line interface or shell, a pipeline uses the pipe operator ("|") to take the output of one function or command and feed it as input to another. This is done in a series like "command1 | function1 | command2". For questions related to the pipe operator, use the pipe tag.

In computer architecture, a pipeline is a process consisting of a sequence of stages that must be performed in serial order over each element passing through the pipe, but may execute in parallel over the elements inside it, such that the overall throughput does not depend on the length of the pipe. Most CPUs use this technique in hardware to process instructions.
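
A back-of-the-envelope sketch of why throughput does not depend on pipe length, assuming an idealized pipeline in which every stage takes exactly one cycle and nothing ever stalls. Both are simplifying assumptions, and the function names are hypothetical.

```python
def unpipelined_cycles(num_items, num_stages):
    """Each item passes through every stage before the next item starts."""
    return num_items * num_stages


def pipelined_cycles(num_items, num_stages):
    """Idealized pipeline: once the pipe is full, one item finishes per cycle."""
    if num_items == 0:
        return 0
    # The first item needs num_stages cycles; each later item adds one cycle.
    return num_stages + num_items - 1


for stages in (5, 20):
    total = pipelined_cycles(1000, stages)
    print(stages, "stages:", total, "cycles,", 1000 / total, "items per cycle")
```

Under these assumptions, a longer pipe increases the latency of a single item, while the rate for a long stream of items stays close to one item per cycle.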

A similar technique is also applied in software (software pipelining) to increase the parallelism of a given loop by reordering it so that its data dependencies are arranged in a pipelined manner.

100 questions

votes

2 answers

Possible harm in standardizing one-hot encoded features

While there may not be any added value in standardizing one-hot encoded features prior to applying linear models, is there any harm in doing so (i.e., affecting model performance)? Standardizing definition: applying (x - mean) / std to make the…

asked Aug 13 '20 at 14:27

thereandhere1

votes

1 answer

How to pass parameters across the stages of an sklearn pipeline?

I'm working on a deep neural model for text classification using Keras. To fine-tune some hyperparameters, I'm using the Keras wrappers for the Scikit-Learn API. So I built a Sklearn Pipeline for that: def create_model(optimizer="adam",…

python scikit-learn hyperparameter-tuning grid-search pipelines

asked Sep 12 '19 at 09:49

Amine Benatmane

votes

0 answers

Tensorflow v1 Dataset API AttributeError with ndim

I'd like to make a pipeline for optimizing GPU and CPU usage. The dataset has about 10,000 data points and 4 descriptive variables for a regression problem. df = pd.read_csv("dataset") X_train, X_test, y_train, y_test = train_test_split(df.iloc[:, :-1].values,…

keras tensorflow pipelines

asked Feb 22 '19 at 17:39

AutomaKen

votes

1 answer

Difference between sklearn make_pipeline and imblearn make_pipeline

Can anybody please explain the difference between sklearn.pipeline.make_pipeline and imblearn.pipeline.make_pipeline?

predictive-modeling pandas class-imbalance pipelines imbalanced-learn

asked Aug 21 '19 at 06:45

boredaf

votes

2 answers

Model ensemble with Spark or Scikit Learn

I am using Spark MLlib to make predictions, and I would like to know if it is possible to create custom Estimators. Here is a reproducible example of what I would like my model to do with the Spark API: from sklearn.datasets import load_diabetes import…

pyspark pipelines

asked Apr 15 '19 at 14:04

Robin Nicole

votes

3 answers

What is the meaning of the term "pipeline" within data science?

People often refer to pipelines when talking about models, data and even layers in a neural network. What can be meant by a pipeline?

data ensemble-modeling pipelines definitions

asked Jul 20 '18 at 15:02

n1k31t4


votes

1 answer

Data Science Pipelines vs Common CI/CD

What is the advantage of Data Science Specific CI/CD (kubeflow, Algo, TFX, mlflow, sagemaker pipelines) vs the already baked flavors that are more generic: Jenkins, Bamboo, Airflow, Google Cloud Build, ... My guess is the Data Science ones give…

machine-learning data-science-model pipelines automation

asked Jan 29 '20 at 00:51

brianray

votes

3 answers

R has {drake} which makes it easy to make reproducible data pipelines. Does Python have a similar package?

See R's {drake}. It allows you to define a reproducible pipeline plan <- drake_plan( raw_data = readxl::read_excel(file_in("raw_data.xlsx")), data = raw_data %>% mutate(Species = forcats::fct_inorder(Species)), hist = create_plot(data), …

python r pipelines

asked Oct 01 '19 at 04:09

xiaodai

votes

1 answer

How to apply dataset balancing techniques whilst using Pipeline in Sklearn?

I am new to machine learning and am trying to construct machine learning models that adhere to good practice and are not susceptible to biases. I have decided to use Sklearn's Pipeline class to ensure that my model is not prone to data leakage. I am…

machine-learning scikit-learn class-imbalance pipelines

asked Apr 04 '20 at 18:04

Hamish Gibson

votes

2 answers

Scikit-learn pipeline with scaling, dimensionality reduction, average prediction of multiple regression models, and grid search cross validation

I would like to use an sklearn pipeline that does this: scale the data (StandardScaler); reduce dimensionality (PCA); make a prediction with GradientBoostingRegressor() and GridSearchCV() (to get the model with the best parameters from…

scikit-learn prediction dimensionality-reduction feature-scaling pipelines

asked Apr 29 '19 at 17:48

Fabrice BOUCHAREL

votes

1 answer

What can I do when my test and validation scores are good, but the submission is terrible?

This is a very broad question, I understand, and I'm totally fine if someone believes it's not appropriate to ask it. But it's killing me not to understand this... Here's the thing: I'm building a machine learning model to predict a tweet's topic. I'm…

nlp overfitting pipelines data-leakage

asked Sep 27 '21 at 15:37

Yuxxxxxx

votes

1 answer

Why does GridSearchCV return nan?

I am using GridSearchCV to tune the parameters of my model, and I also use a pipeline and cross-validation. When I run it to tune the parameters of XGBoost, it returns nan. However, when I use the same code for other classifiers like random…

cross-validation pipelines gridsearchcv

asked Mar 27 '21 at 00:02

Amin

votes

2 answers

Is it good practice to include data cleaning or feature engineering steps in an sklearn pipeline to create a scalable pipeline?

I am working on implementing a scalable pipeline for cleaning my data and pre-processing it before modeling. I am pretty comfortable with the sklearn Pipeline object that I use for pre-processing but I am not sure if I should include data cleaning,…

python scikit-learn data-cleaning preprocessing pipelines

asked Dec 11 '20 at 18:25

LazyEval

votes

1 answer

How to restrict the columns to be passed to final classifier in PMML Pipeline

I am working on building an XGBoost PMML using SKLearn and SKLearn2PMML. I have some numerical, some categorical, and datetime columns from which I am creating new features inside the pipeline. When I try to train the model, it fails as…

python scikit-learn pipelines

asked Jul 15 '20 at 08:19

Akshay Tilekar

votes

1 answer

How to use ColumnTransformer and FunctionTransformer to apply the same function to many columns, but separately?

I want to apply pd.cut as a transformer in a pipeline, like this: numerical_preprocessing = Pipeline([('cut_into_bins', FunctionTransformer(pd.cut, kw_args={'bins': [10, 100, 1000]}))]) However, I get an error:…

scikit-learn pandas feature-engineering pipelines

asked Jun 10 '20 at 14:48

JohnnyQ
