Highest Voted 'data' Questions - Data Science Stack Exchange

46

votes

2 answers

How does the validation_split parameter of Keras' fit function work?

The validation_split parameter of the Keras Sequential model's fit function is documented as follows at https://keras.io/models/sequential/: validation_split: Float between 0 and 1. Fraction of the training data to be used as validation data. The model will set…

keras data cross-validation

asked Sep 30 '18 at 06:30

rnso

1,558
3
16
34
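
The documented behaviour quoted in this question can be illustrated with a small sketch. The Keras documentation says the validation data is taken from the last samples of the provided arrays, before any shuffling. The helper name `split_like_validation_split` and the toy data below are assumptions made for illustration; this is plain Python mimicking that behaviour, not the Keras implementation itself.

```python
def split_like_validation_split(x, y, validation_split):
    """Hold out the trailing fraction of the samples, without shuffling first.

    Hypothetical helper that mimics the documented behaviour of the
    validation_split argument; it is not part of Keras.
    """
    if not 0.0 <= validation_split < 1.0:
        raise ValueError("validation_split must be in [0, 1)")
    # Index of the first validation sample; everything before it is training data.
    split_at = int(len(x) * (1.0 - validation_split))
    return (x[:split_at], y[:split_at]), (x[split_at:], y[split_at:])


# Toy data: 8 samples, with the last 25% held out for validation.
x = list(range(8))
y = [value % 2 for value in x]
(train_x, train_y), (val_x, val_y) = split_like_validation_split(x, y, 0.25)

print(train_x)  # first six samples
print(val_x)    # last two samples
```

Because the held-out samples are simply the tail of the data, an ordered dataset (for example, one sorted by class) would likely need shuffling before being passed to fit.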

35

votes

9 answers

Why is it wrong to train and test a model on the same dataset?

What are the pitfalls of doing so and why is it a bad practice? Is it possible that the model starts to learn the images "by heart" instead of understanding the underlying logic?

machine-learning neural-network dataset data training

asked Dec 13 '20 at 14:11

karalis1

461
1
5
8
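
The "learning by heart" concern raised in this question can be made concrete with a deliberately extreme sketch. The `Memorizer` class and the toy labelling rule below are assumptions chosen for illustration: the model is only a lookup table, so it scores perfectly on the data it was fitted on while failing to capture the underlying rule on unseen inputs.

```python
class Memorizer:
    """Hypothetical 'model' that stores training pairs in a lookup table."""

    def __init__(self, default=0):
        self.table = {}
        self.default = default  # prediction for inputs never seen in training

    def fit(self, xs, ys):
        self.table = dict(zip(xs, ys))
        return self

    def predict(self, xs):
        return [self.table.get(x, self.default) for x in xs]


def accuracy(y_true, y_pred):
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


# Toy rule the model should ideally discover: label is 1 for multiples of 3.
xs = list(range(20))
ys = [1 if x % 3 == 0 else 0 for x in xs]

train_x, train_y = xs[:12], ys[:12]
test_x, test_y = xs[12:], ys[12:]

model = Memorizer().fit(train_x, train_y)
train_acc = accuracy(train_y, model.predict(train_x))
test_acc = accuracy(test_y, model.predict(test_x))

print(train_acc)  # evaluated on the training data itself
print(test_acc)   # evaluated on held-out data
```

Evaluating on the training data alone would report a perfect score here and hide the fact that nothing general was learned; only the held-out score reveals it.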

30

votes

1 answer

How is a splitting point chosen for continuous variables in decision trees?

I have two questions related to decision trees: If we have a continuous attribute, how do we choose the splitting value? Example: Age=(20,29,50,40....) Imagine that we have a continuous attribute $f$ that takes values in $\mathbb{R}$. How can I write an…

classification data decision-trees

asked Nov 03 '17 at 21:45

WALID BELRHALMIA

411
1
4
5
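
One common approach to the splitting question above (used in CART-style trees) is to sort the distinct values, try the midpoint between each pair of neighbours as a candidate threshold, and keep the one with the lowest weighted Gini impurity. The sketch below assumes that approach; the function names and the `bought` labels attached to the ages from the question are hypothetical.

```python
def gini(labels):
    """Gini impurity of a list of class labels."""
    n = len(labels)
    if n == 0:
        return 0.0
    counts = {}
    for label in labels:
        counts[label] = counts.get(label, 0) + 1
    return 1.0 - sum((c / n) ** 2 for c in counts.values())


def best_threshold(values, labels):
    """Return (threshold, weighted impurity) for the best binary split."""
    distinct = sorted(set(values))
    best = (None, float("inf"))
    # Candidate thresholds are midpoints between consecutive distinct values.
    for lo, hi in zip(distinct, distinct[1:]):
        threshold = (lo + hi) / 2
        left = [y for v, y in zip(values, labels) if v <= threshold]
        right = [y for v, y in zip(values, labels) if v > threshold]
        score = (len(left) * gini(left) + len(right) * gini(right)) / len(values)
        if score < best[1]:
            best = (threshold, score)
    return best


ages = [20, 29, 50, 40]   # values from the question's example
bought = [0, 0, 1, 1]     # hypothetical class labels, assumed for illustration
threshold, impurity = best_threshold(ages, bought)
print(threshold, impurity)
```

Other impurity measures (entropy, or variance for regression trees) can be swapped in for `gini` without changing the candidate-threshold search.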

30

votes

4 answers

Is pandas now faster than data.table?

Here is the GitHub link to the most recent data.table benchmark. The data.table benchmarks have not been updated since 2014. I heard somewhere that Pandas is now faster than data.table. Is this true? Has anyone done any benchmarks? I have never used…

python r pandas data data-table

asked Oct 25 '17 at 02:43

xiaodai

620
1
5
12

20

votes

5 answers

Do modern R and/or Python libraries make SQL obsolete?

I work in an office where SQL Server is the backbone of everything we do, from data processing to cleaning to munging. My colleague specializes in writing complex functions and stored procedures to methodically process incoming data so that it can…

python r data-cleaning data sql

asked Feb 24 '17 at 19:33

AffableAmbler

363
1
2
10

18

votes

7 answers

Interactive labeling/annotating of time series data

I have a data set of time series data. I'm looking for an annotation (or labeling) tool to visualize it and interactively add labels to it, in order to get annotated data that I can use for supervised ML. E.g. the input data is a…

machine-learning python data labels

asked Sep 11 '18 at 06:19

mibrl12

283
1
2
5

16

votes

2 answers

How much data are sufficient to train my machine learning model?

I've been working on machine learning and bioinformatics for a while, and today I had a conversation with a colleague about the main general issues of data mining. My colleague (who is a machine learning expert) said that, in his opinion, the…

machine-learning data-mining dataset data-cleaning data

asked Jun 26 '17 at 21:26

DavideChicco.it

281
1
3
7

13

votes

3 answers

How to create US state choropleth map

I have a value associated with each US state (let's pretend it's the average temperature in January for each state). I want to display this data as a heat map of the United States. To be clear, it would be a map of the US with each state having a…

data

asked Jan 04 '16 at 18:35

user15180

131
1
1
3

13

votes

1 answer

Do I have to standardize my new polynomial features?

I have a vector X with n previously standardized features. If I want to generate new polynomial features (let's say by adding squared features), do I need to standardize these new features again after computing them? Because knowing that my…

machine-learning dataset data-cleaning data

asked Nov 25 '15 at 11:11

jmvllt

619
1
8
15
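
A quick numeric check helps frame the question above: squaring a standardized feature does not yield another standardized feature. The sketch below uses only the standard library; the sample values and the `standardize` helper are assumptions for illustration. Whether re-standardizing is actually required depends on the model being used.

```python
from statistics import fmean, pstdev


def standardize(values):
    """Rescale to zero mean and unit (population) standard deviation."""
    mu, sigma = fmean(values), pstdev(values)
    return [(v - mu) / sigma for v in values]


raw = [1.0, 2.0, 3.0, 4.0, 5.0]   # hypothetical feature values
z = standardize(raw)               # mean 0, std 1
z_squared = [v ** 2 for v in z]    # new polynomial feature

# The squared feature has mean equal to the variance of z (that is, 1)
# and a standard deviation that is generally not 1.
print(fmean(z_squared), pstdev(z_squared))

# Standardizing the new feature again restores zero mean and unit variance.
z_squared_std = standardize(z_squared)
print(fmean(z_squared_std), pstdev(z_squared_std))
```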

13

votes

4 answers

Interpreting Decision Tree in context of feature importances

I'm trying to fully understand the decision process of a decision tree classification model built with sklearn. The two main aspects I'm looking at are a graphviz representation of the tree and the list of feature importances. What I…

machine-learning visualization scikit-learn data decision-trees

asked Feb 02 '17 at 00:29

Tim Lindsey

245
1
2
4

11

votes

2 answers

Oversampling/undersampling: only the train set, or both the train and validation sets?

I am working on a dataset with a class imbalance problem. Now, I know one needs to oversample or undersample only the train set and not the test set. But my issue is: whether to oversample the train set and then split it into train and validation sets or…

data training smote

asked Oct 17 '19 at 08:21

yamini goel

711
3
7
14
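
One commonly recommended ordering for the situation described above is to split first and then resample only the training portion, so that the validation set keeps the original class balance and contains no copies of training rows. The sketch below assumes that ordering and uses plain random oversampling (duplicating minority rows) as a stand-in for SMOTE; the helper name and the toy data are hypothetical.

```python
import random


def random_oversample(rows, label_of, rng):
    """Duplicate minority-class rows until every class matches the largest one."""
    by_class = {}
    for row in rows:
        by_class.setdefault(label_of(row), []).append(row)
    target = max(len(group) for group in by_class.values())
    balanced = []
    for group in by_class.values():
        balanced.extend(group)
        # Draw extra copies with replacement; k is 0 for the majority class.
        balanced.extend(rng.choices(group, k=target - len(group)))
    return balanced


rng = random.Random(0)

# Toy imbalanced data: (sample id, label), with one positive in every five rows.
# Real data would normally be shuffled or split in a stratified way first.
data = [(i, 1 if i % 5 == 0 else 0) for i in range(50)]

# Step 1: split BEFORE any resampling.
split_at = int(len(data) * 0.8)
train, validation = data[:split_at], data[split_at:]

# Step 2: oversample the training portion only; validation stays untouched.
train_balanced = random_oversample(train, lambda row: row[1], rng)

positives = sum(1 for _, label in train_balanced if label == 1)
negatives = sum(1 for _, label in train_balanced if label == 0)
print(len(train_balanced), positives, negatives, len(validation))
```

Resampling before the split would let duplicated (or synthetic) minority rows land on both sides of the split, which tends to make validation scores look better than they should.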

11

votes

2 answers

How to perform Logistic Regression with a large number of features?

I have a dataset with 330 samples and 27 features for each sample, with a binary class problem for Logistic Regression. According to the "rule of ten" I need at least 10 events for each feature to be included. However, I have an imbalanced dataset,…

machine-learning python predictive-modeling logistic-regression data

asked Jul 28 '17 at 09:32

LucasRamos

111
1
1
3

10

votes

6 answers

What are some of the best practices for sharing data and models with colleagues?

As a data scientist who recently joined a new team, I wanted to ask the community how they share data and models among their colleagues. Currently I have to resort to storing data in some central server or location that all of us can access (which…

machine-learning predictive-modeling dataset data model-selection

asked Mar 17 '17 at 18:45

asampat3090

81
1
6

9

votes

1 answer

What are the most suitable machine learning algorithms according to type of data?

I am a beginner in data science. I found that some machine learning algorithms perform better when given a particular kind of data (i.e. numerical, categorical, text, graphical). I searched for this topic on the web, but had no luck. I would like to know…

machine-learning algorithms data

asked Jun 23 '17 at 02:09

user158

211
1
2
4

8

votes

5 answers

Tool to Generate 2D Data via Mouse Clicking

Often when I am learning new machine learning methods or experimenting with a data analysis algorithm I need to generate a series of 2D points. Teachers also do this often when making a lesson or tutorial. In some cases I just create a function, add…

data tools

asked Oct 27 '15 at 17:16

MD004

290
1
3
9
