Questions tagged [scikit-learn]

scikit-learn is a popular machine learning package for Python that has simple and efficient tools for predictive data analysis. Topics include classification, regression, clustering, dimensionality reduction, model selection, and preprocessing.

What is scikit-learn?

scikit-learn is an open-source machine learning package for Python, built on NumPy, SciPy, and matplotlib and released under the BSD license. It provides simple and efficient tools for predictive data analysis, covering classification, regression, clustering, dimensionality reduction, model selection, and preprocessing. It is part of the scientific Python ecosystem and is suitable for both individual and commercial use.


New to scikit-learn?

Various resources, including books, tutorials, and workshops, are available for those learning to use scikit-learn.

A popular introductory tutorial is:

SciPy 2018 Conference Tutorial:

A popular introductory book is:

Introduction to Machine Learning with Python, by Andreas C. Müller and Sarah Guido.


Tag usage

When posting questions about scikit-learn, please take the following into consideration:

  • When tagging questions with the scikit-learn tag, do not also use the sklearn tag. It is marked as a synonym and will automatically be retagged.

  • Questions that are purely about programming are more suitable for Stack Overflow and should not be posted on Data Science Stack Exchange.

  • Questions should include enough detail and clarity for others to help with the problem at hand. This includes linking to the underlying data, providing the code used to build the model, highlighting relevant outputs, etc.


External Resources

scikit-learn: Documentation page

scikit-learn: GitHub page


2291 questions
240
votes
10 answers

What's the difference between fit and fit_transform in scikit-learn models?

I do not understand the difference between the fit and fit_transform methods in scikit-learn. Can anybody explain simply why we might need to transform data? What does it mean, fitting a model on training data and transforming to test data? Does it…
Kaggle
  • 2,877
  • 5
  • 13
  • 8
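For the fit versus fit_transform question above, a minimal sketch may help. It assumes StandardScaler as the transformer and uses made-up toy arrays; the variable names are illustrative only. The idea it shows is that statistics are learned from the training data and then reused, unchanged, on the test data.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Toy data (hypothetical values, chosen only for illustration).
X_train = np.array([[1.0], [2.0], [3.0]])
X_test = np.array([[4.0]])

scaler = StandardScaler()

# fit_transform learns the mean and standard deviation from X_train
# and returns the scaled training data in one call.
X_train_scaled = scaler.fit_transform(X_train)

# Equivalent two-step form: fit() learns the statistics,
# transform() applies them.
X_train_two_step = StandardScaler().fit(X_train).transform(X_train)

# The test set is only transformed, so it is scaled with the
# statistics learned from the training set.
X_test_scaled = scaler.transform(X_test)
```

Calling fit_transform on the test set instead would recompute the statistics from test data, which is usually not what is wanted.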
191
votes
16 answers

Train/Test/Validation Set Splitting in Sklearn

How could I randomly split a data matrix and the corresponding label vector into X_train, X_test, X_val, y_train, y_test, and y_val with scikit-learn? As far as I know, sklearn.model_selection.train_test_split is only capable of splitting into two not…
Hendrik
  • 8,377
  • 17
  • 40
  • 55
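One common way to get three subsets, sketched here under the assumption of a 60/20/20 split, is to call train_test_split twice. The toy arrays and the split proportions are illustrative choices, not part of the original question.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy data: 100 samples, 2 features; row i of X pairs with label i.
X = np.arange(200).reshape(100, 2)
y = np.arange(100)

# First split: hold out 20% of all samples as the test set.
X_rest, X_test, y_rest, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

# Second split: 25% of the remaining 80% is 20% of the total,
# which becomes the validation set.
X_train, X_val, y_train, y_val = train_test_split(
    X_rest, y_rest, test_size=0.25, random_state=0
)
```

Fixing random_state makes the split reproducible; rows of X and entries of y stay aligned because both are shuffled together.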
172
votes
4 answers

When to use One Hot Encoding vs LabelEncoder vs DictVectorizer?

I have been building models with categorical data for a while now and when in this situation I basically default to using scikit-learn's LabelEncoder function to transform this data prior to building a model. I understand the difference between OHE,…
anthr
  • 1,843
  • 3
  • 11
  • 11
114
votes
12 answers

SVM using scikit learn runs endlessly and never completes execution

I am trying to run SVR using scikit-learn (Python) on a training dataset that has 595605 rows and 5 columns (features) while the test dataset has 397070 rows. The data has been pre-processed and regularized. I am able to successfully run the test…
tejaskhot
  • 3,935
  • 7
  • 20
  • 18
94
votes
10 answers

ValueError: Input contains NaN, infinity or a value too large for dtype('float32')

I got ValueError when predicting test data using a RandomForest model. My code: clf = RandomForestClassifier(n_estimators=10, max_depth=6, n_jobs=1, verbose=2) clf.fit(X_fit, y_fit) df_test.fillna(df_test.mean()) X_test = df_test.values y_pred =…
Edamame
  • 2,705
  • 5
  • 23
  • 32
87
votes
6 answers

strings as features in decision tree/random forest

I am working on some problems that apply decision trees/random forests. I am trying to fit a problem which has numbers as well as strings (such as country name) as features. Now the library, scikit-learn, takes only numbers as parameters, but I…
59
votes
8 answers

Does scikit-learn have a forward selection/stepwise regression algorithm?

I am working on a problem with too many features and training my models takes way too long. I implemented a forward selection algorithm to choose features. However, I was wondering does scikit-learn have a forward selection/stepwise regression…
Maksud
  • 715
  • 1
  • 7
  • 6
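As a hedged illustration for the forward selection question above: recent scikit-learn releases (0.24 and later) include SequentialFeatureSelector, which performs greedy forward or backward selection based on cross-validated score. This is not classical p-value-based stepwise regression. The synthetic dataset and the choice of three features below are assumptions made for the sketch.

```python
from sklearn.datasets import make_regression
from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.linear_model import LinearRegression

# Synthetic regression data: 8 features, 3 of them informative.
X, y = make_regression(
    n_samples=100, n_features=8, n_informative=3, random_state=0
)

# Greedily add one feature at a time, keeping the feature that most
# improves the 3-fold cross-validated score, until 3 are selected.
selector = SequentialFeatureSelector(
    LinearRegression(),
    n_features_to_select=3,
    direction="forward",
    cv=3,
)
selector.fit(X, y)

# Boolean mask of the chosen columns, and the reduced feature matrix.
mask = selector.get_support()
X_selected = selector.transform(X)
```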
56
votes
4 answers

Difference between OrdinalEncoder and LabelEncoder

I was going through the official scikit-learn documentation after reading a book on ML and came across the following: the documentation describes sklearn.preprocessing.OrdinalEncoder(), whereas in the book it was given…
53
votes
2 answers

train_test_split() error: Found input variables with inconsistent numbers of samples

Fairly new to Python but building out my first RF model based on some classification data. I've converted all of the labels into int64 numerical data and loaded into X and Y as a numpy array, but I am hitting an error when I am trying to train the…
josh_gray
  • 633
  • 1
  • 5
  • 4
50
votes
3 answers

Understanding predict_proba from MultiOutputClassifier

I'm following this example on the scikit-learn website to perform a multioutput classification with a Random Forest model. from sklearn.datasets import make_classification from sklearn.multioutput import MultiOutputClassifier from sklearn.ensemble…
Harpal
  • 903
  • 1
  • 7
  • 13
48
votes
3 answers

StandardScaler before or after splitting data - which is better?

When I was reading about using StandardScaler, most of the recommendations said that you should use StandardScaler before splitting the data into train/test, but when I was checking some of the code posted online (using sklearn) there were…
tsumaranaina
  • 695
  • 1
  • 6
  • 17
46
votes
6 answers

Calculating KL Divergence in Python

I am rather new to this and can't say I have a complete understanding of the theoretical concepts behind this. I am trying to calculate the KL Divergence between several lists of points in Python. I am using this to try and do this. The problem that…
Nanda
  • 773
  • 1
  • 7
  • 8
46
votes
5 answers

How to force weights to be non-negative in Linear regression

I am using a standard linear regression using scikit-learn in Python. However, I would like to force the weights to be all non-negative for every feature. Is there any way I can accomplish that? I was looking in the documentation but could not find…
user
  • 1,971
  • 6
  • 20
  • 36
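For the non-negative weights question above, one option in scikit-learn 0.24 and later is the positive parameter of LinearRegression. The sketch below assumes that version and uses synthetic data whose "true" weights are hypothetical, including one negative weight to show the constraint in effect.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))

# Hypothetical generating weights; the second one is negative.
true_weights = np.array([2.0, -1.0, 0.5])
y = X @ true_weights + rng.normal(scale=0.1, size=200)

# positive=True constrains every fitted coefficient to be >= 0.
model = LinearRegression(positive=True).fit(X, y)
coefficients = model.coef_
```

Because the second feature's true effect is negative, the constrained fit cannot recover it; the constraint trades fit quality for the sign guarantee.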
36
votes
6 answers

Sentence similarity prediction

I'm looking to solve the following problem: I have a set of sentences as my dataset, and I want to be able to type a new sentence, and find the sentence that the new one is the most similar to in the dataset. An example would look like: New…
lte__
  • 1,310
  • 5
  • 18
  • 26
36
votes
1 answer

Why is xgboost so much faster than sklearn GradientBoostingClassifier?

I'm trying to train a gradient boosting model over 50k examples with 100 numeric features. XGBClassifier handles 500 trees within 43 seconds on my machine, while GradientBoostingClassifier handles only 10 trees(!) in 1 minute and 2 seconds :( I…
ihadanny
  • 1,357
  • 2
  • 11
  • 19