Most Popular

1500 questions
8
votes
2 answers

What are the best practices to anonymize user names in data?

I'm working on a project which asks fellow students to share their original text data for further analysis using data mining techniques, and I think it would be appropriate to anonymize the student names attached to their submissions. Setting aside the…
xtian
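One common starting point for this kind of question is keyed pseudonymization: replace each name with a stable token derived from an HMAC, so the same student always maps to the same token but the token cannot be reversed without the secret key. The sketch below is only an illustration under that assumption; the function name `pseudonymize`, the `student_` prefix, and the token length are hypothetical choices, not anything from the question. Note that this only hides the name field; the submitted text itself may still identify its author.

```python
import hashlib
import hmac
import secrets


def pseudonymize(name: str, secret_key: bytes, length: int = 12) -> str:
    """Map a name to a stable pseudonym using HMAC-SHA256 (hypothetical helper)."""
    # Normalise whitespace and case so "Ada Lovelace" and " ada  lovelace" match.
    normalized = " ".join(name.split()).casefold()
    digest = hmac.new(secret_key, normalized.encode("utf-8"), hashlib.sha256).hexdigest()
    # Truncate the hex digest for readability; shorter tokens raise collision risk.
    return "student_" + digest[:length]


# The key must be kept secret and stored separately from the shared data.
key = secrets.token_bytes(32)
print(pseudonymize("Ada Lovelace", key))
```

A keyed hash is used rather than a plain hash because class rosters are small: with an unkeyed hash, anyone could hash every candidate name and match the results against the tokens.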
8
votes
1 answer

Does MLPClassifier (sklearn) support different activations for different layers?

According to the documentation, the 'activation' argument specifies the "Activation function for the hidden layer". Does that mean that you cannot use a different activation function in different layers?
DeLorean88
8
votes
3 answers

Chi-square as an evaluation metric for nonlinear machine learning regression models

I am using machine learning models to predict an ordinal variable (values: 1,2,3,4, and 5) using 7 different features. I posed this as a regression problem, so the final outputs of a model are continuous variables. So an evaluation box plot looks…
Alex
8
votes
1 answer

Resume Parsing - extracting skills from resume using Machine Learning

I am trying to extract the skill set of an employee from his/her resume. I have resumes stored as plain text in a database. I do not have predefined skills in this case. How should I approach this problem? I can think of two ways: Using unsupervised…
Sociopath
8
votes
1 answer

How to learn 3D orientations reliably?

I am working on neural network models for 3D skeletal character animation, where I learn joint positions and orientations. The problem comes with the orientations. There are several ways I can choose to represent a 3D rotation, but all of them have…
jdehesa
8
votes
1 answer

What is the best performance metric to use when balancing a dataset with the SMOTE technique?

I used the SMOTE technique to oversample my dataset, and now I have a balanced dataset. The problem I face is that the performance metrics (precision, recall, F1 measure, accuracy) are better on the imbalanced dataset than on the balanced…
Rawia Sammout
8
votes
2 answers

Fill missing values AND normalise

I have two columns of training data for a neural net that have missing values. (There are many other columns which aren't missing values.) For example:

Height | Weight
180 | 70
175 | N/A
N/A | N/A

I want to fill missing values and…
joel
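As a rough illustration of one way to do both steps on the Height/Weight example, the sketch below mean-imputes each column and then standardises it (z-score). Mean imputation and z-scoring are assumptions made here for illustration; the question excerpt does not say which fill strategy or scaling is wanted, and the helper name `impute_and_standardise` is hypothetical. It uses only the standard library so it runs as-is.

```python
from statistics import fmean, pstdev


def impute_and_standardise(column):
    """Fill None entries with the observed mean, then z-score the column."""
    observed = [v for v in column if v is not None]
    if not observed:
        raise ValueError("column has no observed values to impute from")
    mean = fmean(observed)
    # Filling with the mean leaves the column mean unchanged.
    filled = [mean if v is None else v for v in column]
    std = pstdev(filled)
    if std == 0:
        # A constant column carries no information; map it to zeros.
        return [0.0 for _ in filled]
    return [(v - mean) / std for v in filled]


# The example rows from the question, with N/A represented as None.
height = [180, 175, None]
weight = [70, None, None]

height_scaled = impute_and_standardise(height)
weight_scaled = impute_and_standardise(weight)
print(height_scaled)
print(weight_scaled)
```

With this ordering, every imputed entry ends up at exactly 0 after scaling. In a real train/test split the mean and standard deviation would be computed on the training data only and reused for the test data.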
8
votes
1 answer

Is it helpful to normalize target variables for a regression neural network?

It is customary to normalize feature variables, and this normally increases the performance of a neural network, in particular a CNN. I was wondering if normalizing the target could also help increase performance? I did not notice an increase in…
Tank
8
votes
4 answers

How to download a Jupyter Notebook from GitHub?

This is a fairly basic question. I am working on a data science project inside of a Pandas tutorial. I can access my Jupyter notebooks through my Anaconda installation. The only problem is that the tutorial notebooks (exercise files) are on…
Ethan
8
votes
1 answer

XGBoost: Quantifying Feature Importances

I need to quantify the importance of the features in my model. However, when I use XGBoost to do this, I get completely different results depending on whether I use the variable importance plot or the feature importances. For example, if I use…
NLR
8
votes
1 answer

Using an autoencoder for anomaly detection on categorical data

Say a dataset has 0.5% of its features continuous and 99.5% categorical (binary) with ~2400 features in total. In this dataset, each observation is 1 of 2 classes - Fraud (1) or Not Fraud (0). Furthermore, there is a large class imbalance with only…
PyRsquared
8
votes
3 answers

Where can I find freely available multi-label datasets online?

I'm trying to find multi-label classification datasets, which are available for free online. By "multi-label" I mean that each instance can be labeled with anywhere from a single to $k$ labels, where $k$ is the total number of different labels in…
Bobson Dugnutt
8
votes
1 answer

Text extraction from documents using NLP or Deep Learning

I am looking for references (papers/GitHub projects) on how to use deep learning in a text extraction task. Recently I was given a task to extract important information from documents of a similar type, for example legal merger documents. I have…
8
votes
2 answers

Audio Analysis: Segment audio based on speaker recognition

I have audio clips of people being interviewed and am trying to split the audio clips using Python such that all speech segments of the interviewee are written to one audio file (e.g. .wav format) and those of the interviewer to another audio file.…
aamir23
8
votes
1 answer

Categorization of approaches to deal with imbalanced classes

What is the best way to categorize the approaches which have been developed to deal with the class imbalance problem? This article categorizes them into: Preprocessing: includes oversampling, undersampling and hybrid methods; Cost-sensitive learning:…