Highest Voted Questions - Data Science Stack Exchange

8

votes

2 answers

What are the best practices to anonymize user names in data?

I'm working on a project which asks fellow students to share their original text data for further analysis using data mining techniques, and I think it would be appropriate to anonymize student names with their submissions. Setting aside the…

machine-learning data-cleaning

asked Dec 12 '14 at 03:00

xtian

193
1
7
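
One common way to approach the anonymization question above is keyed hashing, sketched here under stated assumptions. The helper name `make_pseudonymizer`, the `student_` prefix, and the example names are hypothetical and not taken from the question; the approach assumes the secret key is stored separately from the shared data. Strictly speaking this gives pseudonyms rather than full anonymity, since anyone holding the key can recompute the mapping.

```python
import hashlib
import hmac
import secrets


def make_pseudonymizer(secret_key: bytes, length: int = 12):
    """Return a function mapping a name to a stable pseudonym (hypothetical helper)."""

    def pseudonymize(name: str) -> str:
        # Normalize so "Alice Smith" and " alice smith " map to the same pseudonym.
        normalized = name.strip().lower()
        # HMAC-SHA256 with a secret key: without the key, guessing a name
        # and hashing it does not reveal which pseudonym it corresponds to.
        digest = hmac.new(secret_key, normalized.encode("utf-8"), hashlib.sha256).hexdigest()
        return "student_" + digest[:length]

    return pseudonymize


# Assumption: the key is generated once and kept out of the shared dataset.
key = secrets.token_bytes(32)
pseudonymize = make_pseudonymizer(key)

names = ["Alice Smith", "Bob Jones", " alice smith "]  # made-up example names
pseudonyms = [pseudonymize(n) for n in names]
```

The same student always receives the same pseudonym, so submissions can still be grouped per author. Names that appear inside the text itself are not handled by this sketch.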

8

votes

1 answer

Does MLPClassifier (sklearn) support different activations for different layers?

According to the documentation, the 'activation' argument specifies the "Activation function for the hidden layer". Does that mean that you cannot use different activation functions in different layers?

machine-learning neural-network scikit-learn

asked Aug 09 '18 at 17:22

DeLorean88

215
2
4

8

votes

3 answers

Chi-square as an evaluation metric for nonlinear machine learning regression models

I am using machine learning models to predict an ordinal variable (values: 1,2,3,4, and 5) using 7 different features. I posed this as a regression problem, so the final outputs of a model are continuous variables. So an evaluation box plot looks…

machine-learning-model model-evaluations metric

asked Aug 06 '18 at 18:08

Alex

181
2

8

votes

1 answer

Resume Parsing - extracting skills from resume using Machine Learning

I am trying to extract the skill set of an employee from his/her resume. I have resumes stored as plain text in a database. I do not have predefined skills in this case. How should I approach this problem? I can think of two ways: Using unsupervised…

machine-learning python text-mining topic-model

asked Aug 04 '18 at 05:27

Sociopath

1,223
2
11
27

8

votes

1 answer

How to learn 3D orientations reliably?

I am working on neural network models for 3D skeletal character animation, where I learn joint positions and orientations. The problem comes with the orientations. There are several ways I can choose to represent a 3D rotation, but all of them have…

machine-learning neural-network

asked Aug 02 '18 at 13:14

jdehesa

274
1
8

8

votes

1 answer

What is the best performance metric to use when balancing a dataset with the SMOTE technique?

I used the SMOTE technique to oversample my dataset and now I have a balanced dataset. The problem I face is that the performance metrics (precision, recall, F1 measure, accuracy) are better on the imbalanced dataset than with the balanced…

python class-imbalance performance smote

asked Jul 31 '18 at 23:23

Rawia Sammout

199
1
3
16

8

votes

2 answers

Fill missing values AND normalise

I have two columns of training data for a neural net which are missing values. (There are many other columns which aren't missing values.) For example Height | Weight 180 | 70 175 | N/A N/A | N/A I want to fill missing values and…

keras pandas normalization missing-data numpy

asked Jul 26 '18 at 11:54

joel

180
1
5
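
For the "Fill missing values AND normalise" question above, one possible reading can be sketched as follows. The excerpt is truncated, so the exact goal is an assumption: this sketch assumes mean imputation followed by z-score scaling, with the mean and standard deviation computed from the observed (non-missing) values only. The variable names are illustrative; the data is the small Height/Weight example from the excerpt.

```python
import numpy as np
import pandas as pd

# The three example rows from the excerpt, with N/A represented as NaN.
df = pd.DataFrame(
    {"Height": [180.0, 175.0, np.nan], "Weight": [70.0, np.nan, np.nan]}
)
cols = ["Height", "Weight"]

# Statistics from observed values only (pandas skips NaN by default).
means = df[cols].mean()
stds = df[cols].std(ddof=0)

# Assumption: fill each missing value with its column mean.
filled = df[cols].fillna(means)

# Z-score scaling; a zero standard deviation (e.g. a column with a single
# observed value) is replaced by 1 to avoid division by zero.
normalised = (filled - means) / stds.replace(0, 1.0)
```

With this choice, filled-in entries become exactly 0 after scaling, which a network sees as "average". In a real pipeline the statistics would normally be computed on the training split and reused for validation and test data.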

8

votes

1 answer

Is it helpful to normalize target variables for a regression neural network?

It is customary to normalize feature variables, and this normally does increase the performance of a neural network, in particular a CNN. I was wondering if normalizing the target could also help increase performance? I did not notice an increase in…

neural-network convolutional-neural-network normalization

asked Jul 17 '18 at 15:22

Tank

287
1
2
9

8

votes

4 answers

How to download a Jupyter Notebook from GitHub?

This is a fairly basic question. I am working on a data science project inside of a Pandas tutorial. I can access my Jupyter notebooks through my Anaconda installation. The only problem is that the tutorial notebooks (exercise files) are on…

python pandas jupyter ipython

asked Jul 16 '18 at 19:59

Ethan

1,625
8
23
39

8

votes

1 answer

XGBoost: Quantifying Feature Importances

I need to quantify the importance of the features in my model. However, when I use XGBoost to do this, I get completely different results depending on whether I use the variable importance plot or the feature importances. For example, if I use…

python xgboost predictor-importance

asked Jul 09 '18 at 17:30

NLR

181
1
1
2

8

votes

1 answer

Using an autoencoder for anomaly detection on categorical data

Say a dataset has 0.5% of its features continuous and 99.5% categorical (binary) with ~2400 features in total. In this dataset, each observation is 1 of 2 classes - Fraud (1) or Not Fraud (0). Furthermore, there is a large class imbalance with only…

neural-network anomaly-detection autoencoder

asked Jul 09 '18 at 15:53

PyRsquared

1,584
1
10
17

8

votes

3 answers

Where can I find freely available multi-label datasets online?

I'm trying to find multi-label classification datasets, which are available for free online. By "multi-label" I mean that each instance can be labeled with anywhere from a single to $k$ labels, where $k$ is the total number of different labels in…

dataset multilabel-classification

asked Jul 01 '18 at 22:50

Bobson Dugnutt

185
1
8

8

votes

1 answer

Text extraction from documents using NLP or Deep Learning

I am looking for references (papers/GitHub projects) on how to use deep learning in a text extraction task. Recently I was given a task to extract important information from documents of similar type, say for example legal merger documents. I have…

deep-learning nlp text-mining reinforcement-learning named-entity-recognition

asked Jun 19 '18 at 16:09

Phaneeth

95
1
1
3

8

votes

2 answers

Audio Analysis : Segment audio based on speaker recognition

I have audio clips of people being interviewed and am trying to split the audio clips using Python such that all speech segments of the interviewee are output to one audio file (e.g. .wav format) and those of the interviewer to another audio file.…

python data-cleaning audio-recognition

asked Jun 18 '18 at 00:50

aamir23

181
1
4

8

votes

1 answer

Categorization of approaches to deal with imbalanced classes

What is the best way to categorize the approaches which have been developed to deal with the class imbalance problem? This article categorizes them into: Preprocessing: includes oversampling, undersampling and hybrid methods, Cost-sensitive learning:…

machine-learning classification class-imbalance imbalance imbalanced-data

asked Jun 08 '18 at 05:10

ebrahimi

1,277
7
20
39
