Questions mostly concerned with managing data, without focus on pre-processing or modelling.
Questions tagged [data]
865 questions
46 votes, 2 answers
How does the validation_split parameter of Keras' fit function work?
Validation-split in the Keras Sequential model fit function is documented as follows on https://keras.io/models/sequential/:
validation_split: Float between 0 and 1. Fraction of the training data
to be used as validation data. The model will set…
rnso
35 votes, 9 answers
Why is it wrong to train and test a model on the same dataset?
What are the pitfalls of doing so and why is it a bad practice? Is it possible that the model starts to learn the images "by heart" instead of understanding the underlying logic?
karalis1
30 votes, 1 answer
How is a splitting point chosen for continuous variables in decision trees?
I have two questions related to decision trees:
If we have a continuous attribute, how do we choose the splitting value?
Example: Age=(20,29,50,40....)
Imagine that we have a continuous attribute $f$ that has values in $R$. How can I write an…
WALID BELRHALMIA
30 votes, 4 answers
Is pandas now faster than data.table?
Here is the GitHub link to the most recent data.table benchmark.
The data.table benchmarks have not been updated since 2014. I heard somewhere that Pandas is now faster than data.table. Is this true? Has anyone done any benchmarks? I have never used…
xiaodai
20 votes, 5 answers
Do modern R and/or Python libraries make SQL obsolete?
I work in an office where SQL Server is the backbone of everything we do, from data processing to cleaning to munging. My colleague specializes in writing complex functions and stored procedures to methodically process incoming data so that it can…
AffableAmbler
18 votes, 7 answers
Interactive labeling/annotating of time series data
I have a data set of time series data. I'm looking for an annotation (or labeling) tool to visualize it and to be able to interactively add labels on it, in order to get annotated data that I can use for supervised ML.
E.g. the input data is a…
mibrl12
16 votes, 2 answers
How much data are sufficient to train my machine learning model?
I've been working on machine learning and bioinformatics for a while, and today I had a conversation with a colleague about the main general issues of data mining.
My colleague (who is a machine learning expert) said that, in his opinion, the…
DavideChicco.it
13 votes, 3 answers
How to create US state choropleth map
I have a value associated with each US state (let's pretend it's the average temperature in January for each state). I want to display this data as a heat map of the United States. To be clear, it would be a map of the US with each state having a…
user15180
13 votes, 1 answer
Do I have to standardize my new polynomial features?
I have a vector X with n features previously standardized.
If I want to generate new polynomial features (let's say by adding squared features), do I need to standardize these new features again after computing them?
Because knowing that my…
jmvllt
13 votes, 4 answers
Interpreting Decision Tree in context of feature importances
I'm trying to fully understand the decision process of a decision tree classification model built with sklearn. The 2 main aspects I'm looking at are a graphviz representation of the tree and the list of feature importances. What I…
Tim Lindsey
11 votes, 2 answers
Oversampling/Undersampling only the train set or both the train and validation sets
I am working on a dataset with a class imbalance problem. Now, I know one needs to oversample or undersample only the train set and not the test set. But my issue is: whether to oversample the train set and then split it into train and validation sets or…
yamini goel
11 votes, 2 answers
How to perform Logistic Regression with a large number of features?
I have a dataset with 330 samples and 27 features for each sample, with a binary class problem for Logistic Regression.
According to the "rule of ten" I need at least 10 events for each feature to be included. Though, I have an imbalanced dataset,…
LucasRamos
10 votes, 6 answers
What are some of the best practices for sharing data and models with colleagues?
As a data scientist who recently joined a new team, I wanted to ask the community how they share data and models among their colleagues. Currently I have to resort to storing data in some central server or location that all of us can access (which…
asampat3090
9 votes, 1 answer
What are the most suitable machine learning algorithms according to type of data?
I am a beginner in data science. I found that some machine learning algorithms perform better when given a particular kind of data (i.e., numerical, categorical, text, graphical).
I searched for this topic on the web, but had no luck.
I would like to know…
user158
8 votes, 5 answers
Tool to Generate 2D Data via Mouse Clicking
Often when I am learning new machine learning methods or experimenting with a data analysis algorithm I need to generate a series of 2D points. Teachers also do this often when making a lesson or tutorial.
In some cases I just create a function, add…
MD004