Questions tagged [scikit-learn]

What is scikit-learn?

scikit-learn is a popular machine learning package for Python that provides simple and efficient tools for predictive data analysis. It covers classification, regression, clustering, dimensionality reduction, model selection, and preprocessing. It is built on NumPy, SciPy, and matplotlib, and is open source under the BSD License. It is part of the scientific computing ecosystem and is suitable for both individual and commercial use.
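As a rough illustration of the kind of workflow described above, here is a minimal sketch that combines preprocessing, a train/test split, and classification. The dataset (iris), the choice of LogisticRegression, and all parameter values are assumptions picked for the example, not recommendations.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Small built-in dataset, used here only for illustration.
X, y = load_iris(return_X_y=True)

# Hold out a quarter of the samples for evaluation.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0, stratify=y
)

# Preprocessing and the classifier are chained so that the scaler
# is fitted on the training data only.
model = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
model.fit(X_train, y_train)

# Mean accuracy on the held-out samples.
accuracy = model.score(X_test, y_test)
print(f"test accuracy: {accuracy:.3f}")
```

Most estimators follow the same fit / predict / score pattern, so swapping in a different model usually means changing only the last step of the pipeline.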

New to scikit-learn?

Various resources, including books, tutorials, and workshops, are available for those learning to use scikit-learn.

A popular introductory tutorial is:

SciPy 2018 Conference Tutorial:

A popular introductory book is:

Introduction to Machine Learning with Python, by Andreas C. Müller and Sarah Guido.

scikit-learn Tag usage

When posting questions about scikit-learn, please take the following into consideration:

When tagging questions, use scikit-learn rather than sklearn: the latter is marked as a synonym, and questions tagged with it are automatically retagged.
Purely programming-related questions are more suitable for Stack Overflow and should not be posted on Data Science Stack Exchange.
Questions should be detailed and clear enough for others to help with the problem at hand. This includes linking to the underlying data, providing the code used to construct the model, highlighting relevant outputs, etc.
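One possible shape for the code in a question is a short, self-contained script that others can run as-is. The sketch below is only an assumption about what such an example might look like: it uses synthetic data from make_classification in place of a private dataset, and the model and its parameters are placeholders.

```python
import numpy as np
import sklearn
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Report library versions so others can reproduce the environment.
print("scikit-learn", sklearn.__version__, "| numpy", np.__version__)

# Synthetic stand-in for the real data; the fixed seed makes it repeatable.
X, y = make_classification(n_samples=200, n_features=5, random_state=0)

# The model construction the question is about (placeholder settings).
clf = RandomForestClassifier(n_estimators=10, max_depth=6, random_state=0)
clf.fit(X, y)

# Show the relevant output, and state what was expected instead if it differs.
pred = clf.predict(X[:5])
print("predictions:", pred)
print("prediction shape:", pred.shape)
```

Fixing random_state and printing versions makes it easier for answerers to see the same output as the asker.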

External Resources

scikit-learn: Documentation page

scikit-learn: GitHub page

Important links

HTML documentation (development version): http://scikit-learn.org/dev/
Download releases: http://sourceforge.net/projects/scikit-learn/files/
Issue tracker: https://github.com/scikit-learn/scikit-learn/issues
Mailing list: https://lists.sourceforge.net/lists/listinfo/scikit-learn-general

2291 questions

240 votes, 10 answers

What's the difference between fit and fit_transform in scikit-learn models?

I do not understand the difference between the fit and fit_transform methods in scikit-learn. Can anybody explain simply why we might need to transform data? What does it mean, fitting a model on training data and transforming to test data? Does it…

python scikit-learn

asked Jun 21 '16 at 10:05

Kaggle

191 votes, 16 answers

Train/Test/Validation Set Splitting in Sklearn

How could I randomly split a data matrix and the corresponding label vector into X_train, X_test, X_val, y_train, y_test, y_val with scikit-learn? As far as I know, sklearn.model_selection.train_test_split is only capable of splitting into two not…

machine-learning scikit-learn cross-validation

asked Nov 15 '16 at 14:55

Hendrik

172 votes, 4 answers

When to use One Hot Encoding vs LabelEncoder vs DictVectorizer?

I have been building models with categorical data for a while now and when in this situation I basically default to using scikit-learn's LabelEncoder function to transform this data prior to building a model. I understand the difference between OHE,…

scikit-learn categorical-data feature-engineering

asked Dec 19 '15 at 19:30

anthr

114 votes, 12 answers

SVM using scikit learn runs endlessly and never completes execution

I am trying to run SVR using scikit-learn (python) on a training dataset that has 595605 rows and 5 columns (features) while the test dataset has 397070 rows. The data has been pre-processed and regularized. I am able to successfully run the test…

python svm scikit-learn

asked Aug 18 '14 at 10:46

tejaskhot

10 answers

ValueError: Input contains NaN, infinity or a value too large for dtype('float32')

I got ValueError when predicting test data using a RandomForest model. My code: clf = RandomForestClassifier(n_estimators=10, max_depth=6, n_jobs=1, verbose=2) clf.fit(X_fit, y_fit) df_test.fillna(df_test.mean()) X_test = df_test.values y_pred =…

python scikit-learn pandas random-forest python-3.x

asked May 26 '16 at 04:13

Edamame

6 answers

strings as features in decision tree/random forest

I am doing some problems on an application of decision tree/random forest. I am trying to fit a problem which has numbers as well as strings (such as country name) as features. Now the library, scikit-learn takes only numbers as parameters, but I…

machine-learning python scikit-learn random-forest decision-trees

asked Feb 25 '15 at 01:07

user3001408

8 answers

Does scikit-learn have a forward selection/stepwise regression algorithm?

I am working on a problem with too many features and training my models takes way too long. I implemented a forward selection algorithm to choose features. However, I was wondering does scikit-learn have a forward selection/stepwise regression…

feature-selection scikit-learn

asked Aug 07 '14 at 15:33

Maksud

4 answers

Difference between OrdinalEncoder and LabelEncoder

I was going through the official documentation of scikit-learn after going through a book on ML and came across the following thing: In the Documentation it is given about sklearn.preprocessing.OrdinalEncoder() whereas in the book it was given…

machine-learning python scikit-learn preprocessing encoding

asked Oct 07 '18 at 18:55

Saurabh Singh

2 answers

train_test_split() error: Found input variables with inconsistent numbers of samples

Fairly new to Python but building out my first RF model based on some classification data. I've converted all of the labels into int64 numerical data and loaded into X and Y as a numpy array, but I am hitting an error when I am trying to train the…

python scikit-learn sampling

asked Jul 06 '17 at 05:17

josh_gray

3 answers

Understanding predict_proba from MultiOutputClassifier

I'm following this example on the scikit-learn website to perform a multioutput classification with a Random Forest model. from sklearn.datasets import make_classification from sklearn.multioutput import MultiOutputClassifier from sklearn.ensemble…

scikit-learn random-forest multilabel-classification

asked Sep 01 '17 at 10:57

Harpal

3 answers

StandardScaler before or after splitting data - which is better?

When I was reading about using StandardScaler, most of the recommendations were saying that you should use StandardScaler before splitting the data into train/test, but when I was checking some of the code posted online (using sklearn) there were…

machine-learning scikit-learn preprocessing

asked Sep 18 '18 at 02:35

tsumaranaina

6 answers

Calculating KL Divergence in Python

I am rather new to this and can't say I have a complete understanding of the theoretical concepts behind this. I am trying to calculate the KL Divergence between several lists of points in Python. I am using this to try and do this. The problem that…

python clustering scikit-learn

asked Dec 08 '15 at 10:37

Nanda

5 answers

How to force weights to be non-negative in Linear regression

I am using a standard linear regression using scikit-learn in Python. However, I would like to force the weights to be all non-negative for every feature. Is there any way I can accomplish that? I was looking in the documentation but could not find…

python scikit-learn linear-regression

asked Apr 11 '17 at 03:02

user

6 answers

Sentence similarity prediction

I'm looking to solve the following problem: I have a set of sentences as my dataset, and I want to be able to type a new sentence, and find the sentence that the new one is the most similar to in the dataset. An example would look like: New…

python nlp scikit-learn similarity text

asked Oct 22 '17 at 07:36

lte__

1 answer

Why is xgboost so much faster than sklearn GradientBoostingClassifier?

I'm trying to train a gradient boosting model over 50k examples with 100 numeric features. XGBClassifier handles 500 trees within 43 seconds on my machine, while GradientBoostingClassifier handles only 10 trees(!) in 1 minute and 2 seconds :( I…

scikit-learn xgboost gbm

asked Mar 29 '16 at 14:14

ihadanny
