Highest Voted Questions - Data Science Stack Exchange

37

votes

2 answers

How to use the output of GridSearch?

I'm currently working with Python and scikit-learn for classification purposes, and after doing some reading on GridSearch I thought this was a great way to optimise my estimator parameters to get the best results. My methodology is this: Split my…

machine-learning cross-validation

asked Aug 01 '17 at 13:20

Dan Carter

1,732
1
11
26

37

votes

1 answer

RNNs with multiple features

I have a bit of self-taught knowledge working with Machine Learning algorithms (the basic Random Forest and Linear Regression type stuff). I decided to branch out and begin learning RNNs with Keras. When looking at most of the examples, which…

machine-learning neural-network keras

asked Feb 16 '17 at 19:35

Rjay155

1,205
2
12
9

37

votes

4 answers

Do Random Forests overfit?

I have been reading about Random Forests but I cannot find a definitive answer about the problem of overfitting. According to Breiman's original paper, they should not overfit when increasing the number of trees in the forest, but…

machine-learning random-forest

asked Aug 23 '14 at 16:54

papafe

585
1
5
9

36

votes

5 answers

Is it necessary to standardize your data before clustering?

Is it necessary to standardize your data before clustering? The scikit-learn DBSCAN example does this with the line: X = StandardScaler().fit_transform(X) But I do not understand why it is necessary. After all, clustering does…

python clustering anomaly-detection

asked Aug 06 '15 at 20:58

makansij

809
2
11
15
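The excerpt above asks why scaling matters for a distance-based method such as DBSCAN. As a rough illustration only (this is not taken from the thread's answers, and the data and helper names below are made up), the following sketch shows how a feature measured in large units can swamp a Euclidean distance until each column is standardized to zero mean and unit variance, which is what `StandardScaler` does by default.

```python
import math
import statistics

def standardize(column):
    """Rescale a column to zero mean and unit (population) standard deviation."""
    mean = statistics.mean(column)
    std = statistics.pstdev(column)
    return [(value - mean) / std for value in column]

def euclidean(p, q):
    """Plain Euclidean distance between two points of equal dimension."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

# Hypothetical data: income in dollars and age in years for three people.
incomes = [30000.0, 32000.0, 90000.0]
ages = [25.0, 60.0, 27.0]

raw = list(zip(incomes, ages))
scaled = list(zip(standardize(incomes), standardize(ages)))

# Persons 0 and 1 differ by 2000 dollars and 35 years.
raw_distance = euclidean(raw[0], raw[1])
scaled_distance = euclidean(scaled[0], scaled[1])

print(f"raw distance:    {raw_distance:.2f}")
print(f"scaled distance: {scaled_distance:.2f}")
```

In the raw data the 35-year age gap barely changes the distance, which stays within one unit of the 2000-dollar income gap. After standardization the age gap is the main contributor. Whether that is desirable depends on whether the original units carry meaning for the problem.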

36

votes

4 answers

What is a good way to transform Cyclic Ordinal attributes?

My attribute is an 'hour' field, but it takes cyclic values. How can I transform the feature to preserve the fact that hours '23' and '0' are close together, not far apart? One way I can think of is the transformation: min(h, 23-h) Input: [0 1…

feature-extraction feature-scaling featurization

asked Jun 03 '15 at 05:56

Mangat Rai Modi

569
1
5
10
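To make the question above concrete, here is a minimal sketch comparing the folding transform from the excerpt, min(h, 23-h), with a sine/cosine encoding. The sine/cosine encoding is a commonly used alternative and is not taken from the excerpt. The function names are hypothetical.

```python
import math

def fold_hour(hour):
    """The transform proposed in the question: min(h, 23 - h)."""
    return min(hour, 23 - hour)

def encode_hour(hour, period=24):
    """Map an hour onto the unit circle as a (sin, cos) pair."""
    angle = 2.0 * math.pi * (hour % period) / period
    return math.sin(angle), math.cos(angle)

def circular_distance(a, b):
    """Euclidean distance between the circle encodings of two hours."""
    sin_a, cos_a = encode_hour(a)
    sin_b, cos_b = encode_hour(b)
    return math.hypot(sin_a - sin_b, cos_a - cos_b)

# Folding makes 23 and 0 identical, but it also merges 1 with 22,
# so distinct hours can no longer be told apart.
print(fold_hour(0), fold_hour(23), fold_hour(1), fold_hour(22))

# On the circle, 23 -> 0 is exactly as far as 0 -> 1, and noon is farthest from midnight.
print(circular_distance(23, 0), circular_distance(0, 1), circular_distance(0, 12))
```

The trade-off is that one feature becomes two, and tree-based models that split on a single feature at a time may not exploit the pair as directly as distance-based or linear models do.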

36

votes

3 answers

How to disable GPU with TensorFlow?

Using tensorflow-gpu 2.0.0rc0. I want to choose whether it uses the GPU or the CPU.

tensorflow gpu

asked Sep 07 '19 at 21:14

Florin Andrei

1,080
1
9
13

36

votes

7 answers

Organized processes to clean data

From my limited dabbling with data science using R, I realized that cleaning bad data is a very important part of preparing data for analysis. Are there any best practices or processes for cleaning data before processing it? If so, are there any…

r data-cleaning

asked May 14 '14 at 15:25

Jay Godse

461
5
7

36

votes

6 answers

Sentence similarity prediction

I'm looking to solve the following problem: I have a set of sentences as my dataset, and I want to be able to type a new sentence, and find the sentence that the new one is the most similar to in the dataset. An example would look like: New…

python nlp scikit-learn similarity text

asked Oct 22 '17 at 07:36

lte__

1,310
5
18
26

36

votes

1 answer

Paper: What's the difference between Layer Normalization, Recurrent Batch Normalization (2016), and Batch Normalized RNN (2015)?

So, a Layer Normalization paper came out recently. There's also an implementation of it in Keras. But I remember there are papers titled Recurrent Batch Normalization (Cooijmans, 2016) and Batch Normalized Recurrent Neural Networks (Laurent, 2015).…

deep-learning rnn normalization batch-normalization

asked Jul 23 '16 at 09:46

Rizky Luthfianto

2,176
2
19
22

36

votes

6 answers

How to do SVD and PCA with big data?

I have a large set of data (about 8GB). I would like to use machine learning to analyze it. So, I think that I should use SVD then PCA to reduce the data dimensionality for efficiency. However, MATLAB and Octave cannot load such a large…

bigdata data-mining dimensionality-reduction

asked Sep 25 '14 at 08:40

David S.

547
2
6
8

36

votes

1 answer

Why is xgboost so much faster than sklearn GradientBoostingClassifier?

I'm trying to train a gradient boosting model over 50k examples with 100 numeric features. XGBClassifier handles 500 trees within 43 seconds on my machine, while GradientBoostingClassifier handles only 10 trees(!) in 1 minute and 2 seconds :( I…

scikit-learn xgboost gbm

asked Mar 29 '16 at 14:14

ihadanny

1,357
2
11
19

35

votes

3 answers

xgboost: give more importance to recent samples

Is there a way to add more importance to points which are more recent when analyzing data with xgboost?

xgboost weighted-data

asked Dec 22 '15 at 17:19

kilojoules

453
1
4
6

35

votes

9 answers

Why is it wrong to train and test a model on the same dataset?

What are the pitfalls of doing so and why is it a bad practice? Is it possible that the model starts to learn the images "by heart" instead of understanding the underlying logic?

machine-learning neural-network dataset data training

asked Dec 13 '20 at 14:11

karalis1

461
1
5
8

35

votes

3 answers

Does modeling with Random Forests require cross-validation?

As far as I've seen, opinions tend to differ about this. Best practice would certainly dictate using cross-validation (especially if comparing RFs with other algorithms on the same dataset). On the other hand, the original source states that the…

random-forest cross-validation

asked Jul 20 '15 at 13:42

neuron

664
1
6
9

35

votes

5 answers

What to set in steps_per_epoch in Keras' fit_generator?

I am replicating, in Keras, the work of a paper where I know the values of epoch and batch_size. Since the dataset is quite large, I am using fit_generator. I would like to know what to set in steps_per_epoch given epoch value and batch_size. Is…

keras epochs

asked Mar 16 '19 at 10:25

yamini goel

711
3
7
14
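For the steps_per_epoch question above, a common convention (assumed here, not stated in the excerpt) is that one epoch passes over the whole dataset once, so the number of steps is the dataset size divided by the batch size, rounded up. The sketch below is plain arithmetic. `num_samples` and the function name are hypothetical and do not come from Keras.

```python
import math

def steps_for_one_pass(num_samples, batch_size):
    """Number of batches needed to see every sample once (last batch may be partial)."""
    if num_samples <= 0 or batch_size <= 0:
        raise ValueError("num_samples and batch_size must be positive")
    return math.ceil(num_samples / batch_size)

# Hypothetical dataset of 1000 samples with a batch size of 32:
# 31 full batches plus one partial batch of 8 samples.
steps = steps_for_one_pass(1000, 32)
print(steps)
```

Rounding up assumes the generator can yield a smaller final batch. If it only produces full batches, flooring the division (`num_samples // batch_size`) avoids drawing samples from the next pass.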
