Questions tagged [pipelines]

A pipeline is a sequence of functions (or the equivalent thereof), composed so that the output of one is the input of the next, creating a compound transformation. Famously, a shell pipeline looks like "command | command2 | command3" (but use the tag "pipe" for this). The term is also used in computer architecture for a sequence of serial stages that execute in parallel over the elements fed into the pipe, in order to increase overall throughput.

In a command line interface or shell, a pipeline uses the pipe operator ("|") to pass the output of one function or command to another as its input. Commands are chained in a series like "command1 | function1 | command2". For questions related to the pipe operator, use the pipe tag.
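The same chaining idea can be sketched in Python as plain function composition. This is only an illustrative sketch: `compose_pipeline` and the three stage functions are hypothetical names invented here, not part of any library mentioned on this page.

```python
from functools import reduce
from typing import Any, Callable, List


def compose_pipeline(*stages: Callable[[Any], Any]) -> Callable[[Any], Any]:
    """Compose stages left to right: each stage receives the previous output."""
    def run(value: Any) -> Any:
        # Feed the running value through every stage in order.
        return reduce(lambda acc, stage: stage(acc), stages, value)
    return run


# Hypothetical stages, chosen only for illustration.
def strip_whitespace(text: str) -> str:
    return text.strip()


def to_lower(text: str) -> str:
    return text.lower()


def split_words(text: str) -> List[str]:
    return text.split()


# Roughly analogous to "strip_whitespace | to_lower | split_words" in a shell.
clean = compose_pipeline(strip_whitespace, to_lower, split_words)
result = clean("  Hello Pipeline World  ")
print(result)
```

Composing left to right keeps the reading order the same as the shell form, where data flows from the first command to the last.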

In computer architecture, a pipeline is a process consisting of a sequence of stages that must be performed in serial order on each element passing through the pipe, but that may execute in parallel across the elements inside it, such that the steady-state throughput does not depend on the length of the pipe. Most CPUs use this technique in hardware to process instructions.
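The throughput claim can be checked with a small cycle count. The sketch below assumes an idealized pipeline in which every stage takes exactly one cycle and there are no stalls; real hardware deviates from this, and the function names are hypothetical.

```python
def sequential_cycles(num_items: int, num_stages: int) -> int:
    """Cycles needed when each item finishes all stages before the next starts."""
    return num_items * num_stages


def pipelined_cycles(num_items: int, num_stages: int) -> int:
    """Cycles needed when a new item enters the pipe every cycle (ideal case)."""
    if num_items == 0:
        return 0
    # The first item needs num_stages cycles to drain through the pipe;
    # every later item completes one cycle after the previous one.
    return num_stages + num_items - 1


items = 1000
for stages in (5, 20):
    seq = sequential_cycles(items, stages)
    pipe = pipelined_cycles(items, stages)
    print(f"{stages} stages: sequential={seq}, pipelined={pipe}, "
          f"throughput={items / pipe:.3f} items/cycle")
```

Under these assumptions the pipe length only adds a fixed fill latency (stages minus one cycles), so for long input streams the throughput approaches one item per cycle regardless of the number of stages.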

A similar technique is applied in software (software pipelining) to increase the parallelism of a given loop by reordering it so that its data dependencies are arranged in a pipelined manner.

100 questions
7
votes
2 answers

Possible harm in standardizing one-hot encoded features

While there may not be any added value in standardizing one-hot encoded features prior to applying linear models, is there any harm in doing so (i.e., affecting model performance)? Standardizing definition: applying (x - mean) / std to make the…
7
votes
1 answer

how to pass parameters over sklearn pipeline's stages?

I'm working on a deep neural model for text classification using Keras. To fine-tune some hyperparameters, I'm using Keras Wrappers for the Scikit-Learn API. So I built a Sklearn Pipeline for that: def create_model(optimizer="adam",…
7
votes
0 answers

Tensorflow v1 Dataset API AttributeError with ndim

I'd like to build a pipeline that makes efficient use of the GPU and CPU. Dataset: about 10000 data points and 4 descriptive variables for a regression problem. df = pd.read_csv("dataset") X_train, X_test, y_train, y_test = train_test_split(df.iloc[:, :-1].values,…
AutomaKen
6
votes
1 answer

Difference between sklearn make_pipeline and imblearn make_pipeline

Can anybody please explain the difference between sklearn.pipeline.make_pipeline and imblearn.pipeline.make_pipeline?
6
votes
2 answers

Model ensemble with Spark or Scikit Learn

I am using Spark MLlib to make predictions and I would like to know if it is possible to create custom Estimators. Here is a reproducible example of what I would like my model to do with the Spark API: from sklearn.datasets import load_diabetes import…
Robin Nicole
6
votes
3 answers

What is the meaning of the term "pipeline" within data science?

People often refer to pipelines when talking about models, data and even layers in a neural network. What can be meant by a pipeline?
n1k31t4
5
votes
1 answer

Data Science Pipelines vs Common CI/CD

What is the advantage of Data Science Specific CI/CD (kubeflow, Algo, TFX, mlflow, sagemaker pipelines) vs the already baked flavors that are more generic: Jenkins, Bamboo, Airflow, Google Cloud Build, ... My guess is the Data Science ones give…
5
votes
3 answers

R has {drake} which makes it easy to make reproducible data pipelines. Does Python have a similar package?

See R's {drake}. It allows you to define a reproducible pipeline plan <- drake_plan( raw_data = readxl::read_excel(file_in("raw_data.xlsx")), data = raw_data %>% mutate(Species = forcats::fct_inorder(Species)), hist = create_plot(data), …
xiaodai
4
votes
1 answer

How to apply dataset balancing techniques whilst using Pipeline in Sklearn?

I am new to Machine Learning and trying to construct machine learning models that adhere to good practice and are not susceptible to biases. I have decided to use Sklearn's Pipeline class to ensure that my model is not prone to data leakage. I am…
4
votes
2 answers

Scikit-learn pipeline with scaling, dimensionality reduction, average prediction of multiple regression models, and grid search cross validation

I would like to use a sklearn pipeline that does the following: (1) scale the data (StandardScaler); (2) reduce dimensionality (PCA); (3) make a prediction with GradientBoostingRegressor() and GridSearchCV() (to get the model with best parameters from…
4
votes
1 answer

What can I do when my test and validation scores are good, but the submission is terrible?

I understand this is a very broad question, and I'm totally fine if someone believes it's not appropriate to ask it. But it's killing me not to understand this... Here's the thing: I'm building a machine learning model to predict the tweet topic. I'm…
Yuxxxxxx
3
votes
1 answer

Why does GridSearchCV return nan?

I am using GridSearchCV to tune the parameters of my model, and I also use a pipeline and cross-validation. When I run the model to tune the parameters of XGBoost, it returns nan. However, when I use the same code for other classifiers like random…
Amin
2
votes
2 answers

Is it good practice to include data cleaning or feature engineering steps in an sklearn pipeline to create a scalable pipeline?

I am working on implementing a scalable pipeline for cleaning my data and pre-processing it before modeling. I am pretty comfortable with the sklearn Pipeline object that I use for pre-processing but I am not sure if I should include data cleaning,…
2
votes
1 answer

How to restrict the columns to be passed to final classifier in PMML Pipeline

I am working on building an XGBoost PMML using SKLearn and SKLearn2PMML. I have some numerical, some categorical and datetime columns from which I am creating new features inside the pipeline. When I try to train the model, it fails as…
2
votes
1 answer

How to use ColumnTransformer and FunctionTransformer to apply the same function to many columns, but separately?

I want to apply pd.cut as a transformer in a pipeline, like this: numerical_preprocessing = Pipeline([('cut_into_bins', FunctionTransformer(pd.cut, kw_args={'bins': [10, 100, 1000]}))]) However, I get an error:…
JohnnyQ