Most Popular
1500 questions
55
votes
9 answers
Is the R language suitable for Big Data
R has many libraries aimed at data analysis (e.g. JAGS, BUGS, ARULES, etc.), and it is mentioned in popular textbooks such as J. Kruschke, "Doing Bayesian Data Analysis" and B. Lantz, "Machine Learning with R".
I've seen a guideline of 5TB for a…
akellyirl
55
votes
5 answers
Is it always better to use the whole dataset to train the final model?
A common technique after training, validating and testing the preferred Machine Learning model is to use the complete dataset, including the testing subset, to train a final model for deployment, e.g. in a product.
My question is: Is it always…
pcko1
55
votes
2 answers
How to interpret the output of XGBoost importance?
I ran an XGBoost model. I don't exactly know how to interpret the output of xgb.importance.
What is the meaning of Gain, Cover, and Frequency and how do we interpret them?
Also, what do Split, RealCover, and RealCover% mean? I have some extra…
user14204
54
votes
5 answers
How do subsequent convolution layers work?
This question boils down to "how exactly do convolution layers work?"
Suppose I have an $n \times m$ greyscale image. So the image has one channel.
In the first layer, I apply a $3\times 3$ convolution with $k_1$ filters and padding. Then I have…
Martin Thoma
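To make the setup in this question concrete, here is a minimal sketch of the shape and weight bookkeeping for stacked convolutions. It assumes stride 1 and one pixel of zero padding (so a $3\times 3$ kernel preserves the image size), which the excerpt does not state; the helper names and the values of `n`, `m`, `k1` and `k2` are hypothetical.

```python
def conv2d_output_shape(height, width, kernel=3, padding=1, stride=1, filters=1):
    """Return (height, width, channels) after one convolution layer."""
    out_h = (height + 2 * padding - kernel) // stride + 1
    out_w = (width + 2 * padding - kernel) // stride + 1
    return out_h, out_w, filters


def conv2d_param_count(in_channels, filters, kernel=3):
    """Each filter spans all input channels and has one bias term."""
    return filters * (kernel * kernel * in_channels + 1)


# Hypothetical sizes: a 28x28 greyscale image, k1 = 8 and k2 = 16 filters.
n, m, k1, k2 = 28, 28, 8, 16

first = conv2d_output_shape(n, m, filters=k1)                  # one channel in, k1 out
second = conv2d_output_shape(first[0], first[1], filters=k2)   # k1 channels in, k2 out

params_first = conv2d_param_count(in_channels=1, filters=k1)
params_second = conv2d_param_count(in_channels=k1, filters=k2)
```

Under these assumptions the second layer still produces only `k2` feature maps, because each of its filters has depth `k1` and sums over all channels of the first layer's output.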
54
votes
4 answers
What is the advantage of keeping batch size a power of 2?
While training models in machine learning, why is it sometimes advantageous to keep the batch size to a power of 2? I thought it would be best to use a size that is the largest fit in your GPU memory / RAM.
This answer claims that for some packages,…
James Bond
53
votes
9 answers
How do I compare columns in different data frames?
I would like to compare one column of a df with other df's. The columns are names and last names. I'd like to check if a person in one data frame is in another one.
a_a_a
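One possible way to do the check described in this question, sketched with pandas: a left merge on both name columns with `indicator=True`. The frames, the column names `name` and `last_name`, and the result column `in_df2` are made up for illustration.

```python
import pandas as pd

# Hypothetical data: the question only says the columns hold names and last names.
df1 = pd.DataFrame({"name": ["Ann", "Bob", "Cy"], "last_name": ["Lee", "Ray", "Fox"]})
df2 = pd.DataFrame({"name": ["Cy", "Ann", "Bob"], "last_name": ["Fox", "Lee", "Kim"]})

keys = ["name", "last_name"]

# Left merge keeps every row of df1 in order; dropping duplicates on the right
# side prevents rows of df1 from being repeated.
merged = df1.merge(df2[keys].drop_duplicates(), on=keys, how="left", indicator=True)

# "_merge" is "both" when the (name, last_name) pair also appears in df2.
df1["in_df2"] = (merged["_merge"] == "both").to_numpy()
```

For a single column, `df1["name"].isin(df2["name"])` is a shorter alternative, but it would match "Bob Ray" against "Bob Kim", so the two-column merge is used here.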
53
votes
2 answers
train_test_split() error: Found input variables with inconsistent numbers of samples
Fairly new to Python but building out my first RF model based on some classification data. I've converted all of the labels into int64 numerical data and loaded into X and Y as a numpy array, but I am hitting an error when I am trying to train the…
josh_gray
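The error in this question's title can be reproduced with a small sketch. The arrays below are placeholders, not the asker's data; the assumption is only that `X` and `y` ended up with different numbers of rows.

```python
import numpy as np
from sklearn.model_selection import train_test_split

X = np.zeros((10, 3))   # 10 samples, 3 features (placeholder data)
y_bad = np.zeros(8)     # 8 labels: does not match the 10 rows of X

try:
    train_test_split(X, y_bad)
    raised = False
except ValueError:
    # scikit-learn refuses to split arrays whose first dimensions differ.
    raised = True

# With matching lengths the split goes through.
y = np.zeros(10)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)
```

Printing `X.shape` and `y.shape` before the split is usually the quickest way to see which array has the unexpected length.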
52
votes
3 answers
What is the difference between bootstrapping and cross-validation?
I used to apply K-fold cross-validation for robust evaluation of my machine learning models. But I'm aware of the existence of the bootstrapping method for this purpose as well. However, I cannot see the main difference between them in terms of…
Fredrik
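As a rough illustration of how the two resampling schemes differ mechanically, the sketch below generates index splits for each using only the standard library. The function names are hypothetical, and the out-of-bag test set is one common bootstrap variant, not the only one.

```python
import random


def kfold_indices(n, k):
    """K-fold: every sample lands in exactly one test fold, no repeats."""
    idx = list(range(n))
    folds = [idx[i::k] for i in range(k)]
    for i in range(k):
        test = folds[i]
        train = [j for f in folds[:i] + folds[i + 1:] for j in f]
        yield train, test


def bootstrap_indices(n, rounds, seed=0):
    """Bootstrap: draw n training indices with replacement; evaluate on the
    samples that were never drawn (the out-of-bag set)."""
    rng = random.Random(seed)
    for _ in range(rounds):
        train = [rng.randrange(n) for _ in range(n)]
        chosen = set(train)
        test = [j for j in range(n) if j not in chosen]
        yield train, test


kfold_splits = list(kfold_indices(n=10, k=5))
bootstrap_splits = list(bootstrap_indices(n=10, rounds=5))
```

In the k-fold splits the training sets contain no duplicates and the test folds partition the data; in the bootstrap splits the training set has the full size `n` but usually contains repeated samples.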
51
votes
7 answers
Deep Learning vs gradient boosting: When to use what?
I have a big data problem with a large dataset (take for example 50 million rows and 200 columns). The dataset consists of about 100 numerical columns and 100 categorical columns and a response column that represents a binary class problem. The…
Nitesh
50
votes
2 answers
Why not always use the ADAM optimization technique?
It seems the Adaptive Moment Estimation (Adam) optimizer nearly always works better (faster and more reliably reaching a global minimum) when minimising the cost function in training neural nets.
Why not always use Adam? Why even bother using…
PyRsquared
50
votes
3 answers
Understanding predict_proba from MultiOutputClassifier
I'm following this example on the scikit-learn website to perform a multioutput classification with a Random Forest model.
from sklearn.datasets import make_classification
from sklearn.multioutput import MultiOutputClassifier
from sklearn.ensemble…
Harpal
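For orientation, a small sketch of what `predict_proba` returns for a `MultiOutputClassifier`: a list with one probability array per output, not a single array. The data is synthetic and the second target `y2` is invented for the example; it is not the scikit-learn example the question refers to.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.multioutput import MultiOutputClassifier

# Synthetic 3-class problem; y2 is a made-up second target derived from y1.
X, y1 = make_classification(
    n_samples=60, n_features=8, n_informative=4, n_classes=3, random_state=0
)
y2 = (y1 + 1) % 3
Y = np.column_stack([y1, y2])

forest = RandomForestClassifier(n_estimators=10, random_state=0)
clf = MultiOutputClassifier(forest).fit(X, Y)

# One array per output column of Y, each of shape (n_samples, n_classes).
probas = clf.predict_proba(X[:5])
shapes = [p.shape for p in probas]
```

Each row of each array sums to 1, since it holds the class probabilities for one sample and one output.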
49
votes
2 answers
What loss function to use for imbalanced classes (using PyTorch)?
I have a dataset with 3 classes with the following items:
Class 1: 900 elements
Class 2: 15000 elements
Class 3: 800 elements
I need to predict class 1 and class 3, which signal important deviations from the norm. Class 2 is the default “normal”…
Muppet
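One common heuristic for class counts like these is to weight each class inversely to its frequency. The sketch below only computes such weights in plain Python; it is an assumption that this suits the asker's problem, and other approaches (resampling, focal loss) exist.

```python
# Class counts taken from the question.
counts = {"class_1": 900, "class_2": 15000, "class_3": 800}

total = sum(counts.values())
num_classes = len(counts)

# "Balanced" inverse-frequency weights: total / (num_classes * count).
# Rare classes get weights above 1, the dominant class a weight below 1.
weights = [total / (num_classes * c) for c in counts.values()]
```

In PyTorch, a list like this could be converted to a tensor and passed as the `weight` argument of `torch.nn.CrossEntropyLoss`, so that mistakes on classes 1 and 3 cost more than mistakes on class 2.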
49
votes
3 answers
What is "experience replay" and what are its benefits?
I've been reading Google's DeepMind Atari paper and I'm trying to understand the concept of "experience replay". Experience replay comes up in a lot of other reinforcement learning papers (particularly, the AlphaGo paper), so I want to understand…
Ryan Zotti
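As a rough picture of the data structure behind the term, here is a minimal replay buffer sketch using only the standard library. The class name, the transition layout `(state, action, reward, next_state, done)` and the capacity are illustrative assumptions, not details from the paper the question mentions.

```python
import random
from collections import deque


class ReplayBuffer:
    """Fixed-size store of past transitions, sampled uniformly at random."""

    def __init__(self, capacity, seed=None):
        # deque with maxlen silently drops the oldest transition when full.
        self.buffer = deque(maxlen=capacity)
        self.rng = random.Random(seed)

    def add(self, state, action, reward, next_state, done):
        self.buffer.append((state, action, reward, next_state, done))

    def sample(self, batch_size):
        # Uniform sampling without replacement breaks up the temporal
        # correlation between consecutive transitions.
        return self.rng.sample(list(self.buffer), batch_size)

    def __len__(self):
        return len(self.buffer)


# Toy usage: integers stand in for states.
buffer = ReplayBuffer(capacity=100, seed=0)
for t in range(250):
    buffer.add(t, 0, 1.0, t + 1, False)

batch = buffer.sample(32)
```

The agent would train on `batch` instead of on the most recent transition alone, which lets each experience be reused several times.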
49
votes
7 answers
What is the difference between model hyperparameters and model parameters?
I have noticed that such terms as model hyperparameter and model parameter have been used interchangeably on the web without prior clarification. I think this is incorrect and needs explanation. Consider a machine learning model, an SVM/NN/NB based…
minerals
48
votes
3 answers
StandardScaler before or after splitting data - which is better?
When I was reading about using StandardScaler, most of the recommendations said that you should use StandardScaler before splitting the data into train/test, but when I was checking some of the code posted online (using sklearn) there were…
tsumaranaina
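For reference, this is a sketch of one of the two orderings the question contrasts: split first, then fit the scaler on the training portion only. The data is random and the split ratio is arbitrary. Fitting after the split keeps the test set's mean and variance out of the scaler.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Placeholder data: 100 samples, 3 features, not centred or unit-scaled.
rng = np.random.default_rng(0)
X = rng.normal(loc=5.0, scale=2.0, size=(100, 3))

X_train, X_test = train_test_split(X, test_size=0.25, random_state=0)

# Statistics come from the training portion only.
scaler = StandardScaler().fit(X_train)

X_train_scaled = scaler.transform(X_train)
# The test set is transformed with the training mean and standard deviation.
X_test_scaled = scaler.transform(X_test)
```

The scaled training features have mean 0 and unit variance by construction; the scaled test features will only be approximately so, which is expected.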