Questions tagged [data-cleaning]

Data cleaning is a preliminary step to statistical analysis in which the data-set is edited to correct errors and to put it into a form suitable for processing by statistical software. Exploratory data analysis techniques are often used to identify problems.

757 questions

6 answers

How can I transform names in a confidential data set to make it anonymous, but preserve some of the characteristics of the names?

Motivation: I work with datasets that contain personally identifiable information (PII) and sometimes need to share part of a dataset with third parties, in a way that doesn't expose PII and subject my employer to liability. Our usual approach here…

data-cleaning anonymization

asked Jun 16 '14 at 19:48

Air
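
One possible starting point for the anonymization question above is keyed pseudonymization: each name is replaced by a token derived from an HMAC, so equal names still map to equal tokens (which preserves joins and counts) while the originals are hard to recover without the key. This is a minimal sketch, not the asker's method; the function name `pseudonymize`, the key value, and the choice of which characteristic to preserve are all assumptions made for illustration.

```python
import hashlib
import hmac


def pseudonymize(name: str, key: bytes, length: int = 10) -> str:
    """Map a name to a stable token using HMAC-SHA256 (hex, truncated)."""
    # Normalize so that "Alice " and "alice" map to the same token.
    normalized = name.strip().lower().encode("utf-8")
    digest = hmac.new(key, normalized, hashlib.sha256).hexdigest()
    return digest[:length]


key = b"hypothetical-secret-key"  # assumption: stored separately from the data
names = ["Alice", "Bob", "alice "]
tokens = [pseudonymize(n, key) for n in names]
```

Because equal names produce equal tokens, name frequencies remain visible in the shared data; whether that is acceptable depends on the dataset.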

7 answers

Organized processes to clean data

From my limited dabbling with data science using R, I realized that cleaning bad data is a very important part of preparing data for analysis. Are there any best practices or processes for cleaning data before processing it? If so, are there any…

r data-cleaning

asked May 14 '14 at 15:25

Jay Godse

3 answers

General approach to extract key text from sentence (nlp)

Given a sentence like: Complimentary gym access for two for the length of stay ($12 value per person per day) What general approach can I take to identify the word gym or gym access?

machine-learning nlp text-mining data-cleaning

asked Mar 13 '15 at 16:41

William Falcon

2 answers

Removing strings after a certain character in a given text

I have a dataset like the one below. I would like to remove all characters after the character ©. How can I do that in R? data_clean_phrase <- c("Copyright © The Society of Geomagnetism and Earth", "© 2013 Chinese National Committee…

r data-cleaning

asked Nov 19 '15 at 12:59

Hamideh
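
The question above asks for R, but the idea is language-independent: split each string at the first © and keep only the part before it. A minimal sketch in Python, where the helper name `strip_after` and its `keep_marker` option are assumptions introduced for illustration:

```python
def strip_after(text: str, marker: str = "©", keep_marker: bool = False) -> str:
    """Drop everything after the first occurrence of marker."""
    head, found, _tail = text.partition(marker)
    if not found:
        return text  # marker absent: leave the string unchanged
    # Either keep the marker itself or trim the whitespace left before it.
    return head + marker if keep_marker else head.rstrip()


phrases = [
    "Copyright © The Society of Geomagnetism and Earth",
    "No marker here",  # hypothetical extra example
]
cleaned = [strip_after(p) for p in phrases]
```

Here `cleaned` holds `"Copyright"` and the unchanged second string; passing `keep_marker=True` would instead yield `"Copyright ©"` for the first one.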

4 answers

Is there any data tidying tool for python/pandas similar to R tidyr tool?

I'm working on a Kaggle challenge where some variables are represented by rows instead of columns (Telstra Network Disruption). I am currently searching for the equivalent of gather(), separate() and spread(), which can be found in R tidyr tool.

r python dataset data-cleaning pandas

asked Mar 02 '16 at 08:54

cpumar
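
For the tidyr question above, pandas has rough counterparts: `DataFrame.melt` for `gather()`, `Series.str.split(expand=True)` for `separate()`, and `DataFrame.pivot` for `spread()`. The sketch below uses a small hypothetical table (not the Telstra data), and its column names are assumptions.

```python
import pandas as pd

# Hypothetical wide table: one column per measurement.
wide = pd.DataFrame({
    "id": [1, 2],
    "feature_a": [10, 30],
    "feature_b": [20, 40],
})

# Like gather(): wide -> long, one row per (id, key) pair.
long = wide.melt(id_vars="id", var_name="key", value_name="value")

# Like separate(): split one string column into several columns.
long[["kind", "letter"]] = long["key"].str.split("_", expand=True)

# Like spread(): long -> wide again, one column per key.
back = long.pivot(index="id", columns="key", values="value").reset_index()
```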

5 answers

How to annotate text documents with meta-data?

Having a lot of text documents (in natural language, unstructured), what are the possible ways of annotating them with some semantic meta-data? For example, consider a short document: I saw the company's manager last day. To be able to extract…

nlp metadata data-cleaning text-mining

asked May 29 '14 at 20:11

Amir Ali Akbari

3 answers

When to use Standard Scaler and when Normalizer?

I understand what Standard Scaler does and what Normalizer does, per the scikit-learn documentation: Normalizer, Standard Scaler. I know when Standard Scaler is applied. But in which scenario is Normalizer applied? Are there scenarios where one is…

python scikit-learn data-cleaning normalization

asked Feb 20 '19 at 16:38

Heisenbug
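
The difference behind the question above is the axis each transformer works on: scikit-learn's StandardScaler rescales each feature (column) to zero mean and unit variance, while Normalizer rescales each sample (row) to unit norm. The sketch below reimplements both ideas in plain Python for illustration only; the function names and the sample matrix `X` are assumptions, and it does not call scikit-learn.

```python
import math
from statistics import mean, pstdev

# Hypothetical data: 3 samples (rows), 2 features (columns).
X = [
    [1.0, 200.0],
    [2.0, 400.0],
    [3.0, 600.0],
]


def standardize_columns(rows):
    """Per feature (column): subtract the mean, divide by the standard deviation."""
    cols = list(zip(*rows))
    mus = [mean(c) for c in cols]
    sigmas = [pstdev(c) for c in cols]  # assumes no constant column (sigma > 0)
    return [[(v - m) / s for v, m, s in zip(row, mus, sigmas)] for row in rows]


def normalize_rows(rows):
    """Per sample (row): divide by the row's Euclidean (L2) norm."""
    return [[v / math.sqrt(sum(x * x for x in row)) for v in row] for row in rows]


standardized = standardize_columns(X)
normalized = normalize_rows(X)
```

In this example the three rows are proportional to one another, so row normalization maps them to (nearly) the same unit vector: it discards each sample's magnitude and keeps only its direction, which is why it tends to suit per-sample comparisons such as cosine similarity.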

2 answers

Convert a pandas column of int to timestamp datatype

I have a dataframe that, among other things, contains a column of the number of milliseconds passed since 1970-1-1. I need to convert this column of ints to timestamp data, so I can then ultimately convert it to a column of datetime data by adding…

python time-series data-cleaning pandas

asked Oct 19 '16 at 21:22

Austin Capobianco
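
In pandas, the conversion asked about above is typically a single call such as `pd.to_datetime(column, unit="ms")`. The standard-library sketch below shows the same arithmetic without pandas; the helper name `ms_to_datetime` and the sample values are assumptions, and the timestamps are assumed to be in UTC.

```python
from datetime import datetime, timedelta, timezone

EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)


def ms_to_datetime(ms: int) -> datetime:
    """Convert milliseconds since 1970-01-01 (UTC) to an aware datetime."""
    # Adding a timedelta avoids the rounding that dividing by 1000.0 can introduce.
    return EPOCH + timedelta(milliseconds=ms)


millis = [0, 86_400_000, 1_500]  # hypothetical sample values
stamps = [ms_to_datetime(m) for m in millis]
```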

5 answers

Do modern R and/or Python libraries make SQL obsolete?

I work in an office where SQL Server is the backbone of everything we do, from data processing to cleaning to munging. My colleague specializes in writing complex functions and stored procedures to methodically process incoming data so that it can…

python r data-cleaning data sql

asked Feb 24 '17 at 19:33

AffableAmbler

2 answers

How much data are sufficient to train my machine learning model?

I've been working on machine learning and bioinformatics for a while, and today I had a conversation with a colleague about the main general issues of data mining. My colleague (who is a machine learning expert) said that, in his opinion, the…

machine-learning data-mining dataset data-cleaning data

asked Jun 26 '17 at 21:26

DavideChicco.it

4 answers

How to do postal addresses fuzzy matching?

I would like to know how to match postal addresses when their formats differ or when one of them is misspelled. So far I've found different solutions but I think that they are quite old and not very efficient. I'm sure some better methods exist, so…

text-mining data-cleaning

asked Mar 21 '16 at 12:01

Stéphanie C
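
A simple baseline for the address-matching question above is to normalize both addresses and then compare them with a string-similarity ratio. The sketch below uses Python's `difflib.SequenceMatcher`; the abbreviation table, function names, and sample addresses are assumptions, and a production system would likely need a fuller normalizer or a dedicated address parser.

```python
import re
from difflib import SequenceMatcher

# Assumption: a tiny, hypothetical abbreviation table; real data needs a fuller one.
ABBREVIATIONS = {"st": "street", "ave": "avenue", "rd": "road", "apt": "apartment"}


def normalize_address(address: str) -> str:
    """Lowercase, drop punctuation, and expand a few common abbreviations."""
    tokens = re.sub(r"[^\w\s]", " ", address.lower()).split()
    return " ".join(ABBREVIATIONS.get(t, t) for t in tokens)


def address_similarity(a: str, b: str) -> float:
    """Similarity in [0, 1] between two addresses after normalization."""
    return SequenceMatcher(None, normalize_address(a), normalize_address(b)).ratio()


same = address_similarity("12 Main St.", "12 main street")       # format differs
typo = address_similarity("12 Main Street", "12 Mian Street")    # misspelling
other = address_similarity("12 Main Street", "98 Ocean Avenue")  # unrelated
```

Pairs whose score exceeds a threshold chosen on labeled examples can then be treated as candidate matches; the threshold itself is data-dependent.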

10 answers

How can I appropriately handle cleaning of gender data?

I’m a data science student and I’ve begun working with an open mental health dataset. As part of this, I need to clean the data so that I can perform an analysis of it. In this dataset, the gender field is a string that could have had anything…

machine-learning data-cleaning categorical-data

asked Mar 20 '20 at 04:23

nick012000
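
One cautious way to approach the free-text field described above is to normalize case and punctuation, map only the variants that are unambiguous, and leave everything else for manual review instead of guessing. A minimal sketch; the lookup table, category labels, and sample responses are hypothetical and would need to be built from the values actually present in the dataset.

```python
import re

# Assumption: a small, hypothetical lookup table; extend it from observed values.
CANONICAL = {
    "m": "male", "male": "male", "man": "male",
    "f": "female", "female": "female", "woman": "female",
    "nb": "non-binary", "nonbinary": "non-binary", "non binary": "non-binary",
}


def clean_gender(raw):
    """Return a canonical label, or None when the value needs manual review."""
    if raw is None:
        return None
    # Lowercase, replace anything that is not a letter or space, collapse spaces.
    key = re.sub(r"[^a-z ]", " ", raw.strip().lower())
    key = " ".join(key.split())
    return CANONICAL.get(key)


responses = ["Male", " F ", "non-binary", "Woman", "prefer not to say", ""]
cleaned = [clean_gender(r) for r in responses]
```

Returning `None` for unmatched values keeps the original response available for review, so no respondent is silently assigned a category.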

1 answer

Do I have to standardize my new polynomial features?

I have a vector X with n features previously standardized. If I want to generate new polynomial features (let's say adding squared features), do I need to standardize these new features again after computing them? Because knowing that my…

machine-learning dataset data-cleaning data

asked Nov 25 '15 at 11:11

jmvllt
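
A quick numeric check helps with the question above: squaring a standardized feature does not give another standardized feature. If z has mean 0 and (population) variance 1, then the mean of z² equals 1, and its spread is generally not 1 either. The sketch below demonstrates this on a small hypothetical sample; whether re-standardizing matters in practice depends on the model being used.

```python
from statistics import mean, pstdev

x = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]  # hypothetical raw feature

# Standardize: zero mean, unit (population) standard deviation.
mu, sigma = mean(x), pstdev(x)
z = [(v - mu) / sigma for v in x]

# Polynomial feature built from the already standardized values.
z_squared = [v * v for v in z]

mean_of_square = mean(z_squared)   # equals var(z) + mean(z)**2, i.e. 1
std_of_square = pstdev(z_squared)  # not 1 in general
```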

2 answers

Creating new columns by iterating over rows in pandas dataframe

I have a pandas data frame (X11) like this (in actuality I have 99 columns, up to dx99):

     dx1    dx2    dx3   dx4
0  25041  40391   5856     0
1  25041  40391  25081  5856
2  25041  40391  42822     0
3  25061  40391      0     0
4  25041  …

python data-cleaning pandas anaconda

asked Dec 07 '15 at 21:39

Sanoj

5 answers

Please review my sketch of the Machine Learning process

It's amazingly difficult to find an outline of the end-to-end machine learning process. As a total beginner, this lack of information is frustrating, so I decided to try scraping together my own process by looking at a lot of tutorials that all do…

machine-learning data-cleaning preprocessing data-imputation

asked Apr 06 '20 at 01:10

rocksNwaves
