Most Popular
1500 questions
37
votes
2 answers
How to use the output of GridSearch?
I'm currently working with Python and scikit-learn for classification purposes, and after doing some reading around GridSearch I thought this was a great way to optimise my estimator parameters to get the best results.
My methodology is this:
Split my…
Dan Carter
37
votes
1 answer
RNNs with multiple features
I have a bit of self-taught knowledge working with Machine Learning algorithms (the basic Random Forest and Linear Regression type stuff). I decided to branch out and begin learning RNNs with Keras. When looking at most of the examples, which…
Rjay155
37
votes
4 answers
Do Random Forests overfit?
I have been reading about Random Forests but I cannot find a definitive answer about the problem of overfitting. According to Breiman's original paper, they should not overfit when increasing the number of trees in the forest, but…
papafe
36
votes
5 answers
Is it necessary to standardize your data before clustering?
Is it necessary to standardize your data before clustering? In the scikit-learn example about DBSCAN, they do this in the line:
X = StandardScaler().fit_transform(X)
But I do not understand why it is necessary. After all, clustering does…
makansij
36
votes
4 answers
What is a good way to transform Cyclic Ordinal attributes?
I have an 'hour' field as my attribute, but it takes cyclic values. How could I transform the feature to preserve the information that hours like '23' and '0' are close, not far apart?
One way I can think of is the transformation: min(h, 23-h)
Input: [0 1…
Mangat Rai Modi
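The transformation proposed in the question above, min(h, 23-h), can be sketched in a few lines. This only illustrates the asker's own formula and is not a recommended answer; the function name `fold_hour` and the sample hours are hypothetical and chosen purely for illustration.

```python
def fold_hour(h: int) -> int:
    """Apply the proposed transformation min(h, 23 - h) to an hour in 0..23."""
    if not 0 <= h <= 23:
        raise ValueError("hour must be in the range 0..23")
    return min(h, 23 - h)


# Illustrative sample of hours (an assumption, not the asker's truncated input).
hours = [0, 1, 11, 12, 22, 23]
folded = [fold_hour(h) for h in hours]
print(folded)
```

With this formula, hours 0 and 23 map to the same value, so they do end up close. Note that distinct hours such as 1 and 22 also collapse to the same value, which may or may not be acceptable depending on the task.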
36
votes
3 answers
How to disable GPU with TensorFlow?
Using tensorflow-gpu 2.0.0rc0. I want to choose whether it uses the GPU or the CPU.
Florin Andrei
36
votes
7 answers
Organized processes to clean data
From my limited dabbling with data science using R, I realized that cleaning bad data is a very important part of preparing data for analysis.
Are there any best practices or processes for cleaning data before processing it? If so, are there any…
Jay Godse
36
votes
6 answers
Sentence similarity prediction
I'm looking to solve the following problem: I have a set of sentences as my dataset, and I want to be able to type a new sentence and find the sentence in the dataset that it is most similar to. An example would look like:
New…
lte__
36
votes
1 answer
Paper: What's the difference between Layer Normalization, Recurrent Batch Normalization (2016), and Batch Normalized RNN (2015)?
So, recently there's a Layer Normalization paper. There's also an implementation of it in Keras.
But I remember there are papers titled Recurrent Batch Normalization (Cooijmans, 2016) and Batch Normalized Recurrent Neural Networks (Laurent, 2015).…
Rizky Luthfianto
36
votes
6 answers
How to do SVD and PCA with big data?
I have a large set of data (about 8GB). I would like to use machine learning to analyze it. So, I think that I should use SVD then PCA to reduce the data dimensionality for efficiency. However, MATLAB and Octave cannot load such a large…
David S.
36
votes
1 answer
Why is xgboost so much faster than sklearn GradientBoostingClassifier?
I'm trying to train a gradient boosting model over 50k examples with 100 numeric features. XGBClassifier handles 500 trees within 43 seconds on my machine, while GradientBoostingClassifier handles only 10 trees(!) in 1 minute and 2 seconds :( I…
ihadanny
35
votes
3 answers
xgboost: give more importance to recent samples
Is there a way to add more importance to points which are more recent when analyzing data with xgboost?
kilojoules
35
votes
9 answers
Why is it wrong to train and test a model on the same dataset?
What are the pitfalls of doing so and why is it a bad practice? Is it possible that the model starts to learn the images "by heart" instead of understanding the underlying logic?
karalis1
35
votes
3 answers
Does modeling with Random Forests require cross-validation?
As far as I've seen, opinions tend to differ about this. Best practice would certainly dictate using cross-validation (especially if comparing RFs with other algorithms on the same dataset). On the other hand, the original source states that the…
neuron
35
votes
5 answers
What to set in steps_per_epoch in Keras' fit_generator?
I am replicating, in Keras, the work of a paper where I know the values of epoch and batch_size. Since the dataset is quite large, I am using fit_generator. I would like to know what to set in steps_per_epoch given epoch value and batch_size. Is…
yamini goel