Questions tagged [data-cleaning]

Data cleaning is a preliminary step to statistical analysis in which the data-set is edited to correct errors and to put it into a form suitable for processing by statistical software. Exploratory data analysis techniques are often used to identify problems.

757 questions
44 votes, 6 answers

How can I transform names in a confidential data set to make it anonymous, but preserve some of the characteristics of the names?

Motivation: I work with datasets that contain personally identifiable information (PII) and sometimes need to share part of a dataset with third parties in a way that doesn't expose PII or subject my employer to liability. Our usual approach here…
Air
  • 822
  • 9
  • 20
36 votes, 7 answers

Organized processes to clean data

From my limited dabbling with data science using R, I realized that cleaning bad data is a very important part of preparing data for analysis. Are there any best practices or processes for cleaning data before processing it? If so, are there any…
Jay Godse
  • 461
  • 5
  • 7
31 votes, 3 answers

General approach to extract key text from sentence (nlp)

Given a sentence like: Complimentary gym access for two for the length of stay ($12 value per person per day) What general approach can I take to identify the word gym or gym access?
William Falcon
  • 421
  • 1
  • 6
  • 7
26 votes, 2 answers

Removing strings after a certain character in a given text

I have a dataset like the one below. I would like to remove all characters after the character ©. How can I do that in R? data_clean_phrase <- c("Copyright © The Society of Geomagnetism and Earth", "© 2013 Chinese National Committee…
Hamideh
  • 920
  • 2
  • 11
  • 22
25 votes, 4 answers

Is there any data tidying tool for python/pandas similar to R tidyr tool?

I'm working on a Kaggle challenge where some variables are represented by rows instead of columns (Telstra Network Disruption). I am currently searching for the equivalent of gather(), separate() and spread(), which can be found in R tidyr tool.
cpumar
  • 807
  • 1
  • 9
  • 14
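For the tidyr question above, pandas has rough counterparts to `gather()`, `spread()` and `separate()`. The following is a minimal sketch, not an answer taken from the thread; the tables and column names (`wide`, `codes`, `y2015`, `letter`, and so on) are hypothetical and chosen only for illustration.

```python
import pandas as pd

# Hypothetical wide table: one row per id, one column per year.
wide = pd.DataFrame({"id": [1, 2], "y2015": [10, 30], "y2016": [20, 40]})

# gather() -> melt(): turn the year columns into key/value rows.
long = wide.melt(id_vars="id", var_name="year", value_name="value")

# spread() -> pivot(): turn the key/value rows back into columns.
back = long.pivot(index="id", columns="year", values="value").reset_index()

# separate() -> str.split(expand=True): split one string column into several.
codes = pd.DataFrame({"code": ["a_1", "b_2"]})
codes[["letter", "number"]] = codes["code"].str.split("_", expand=True)
```

`pivot()` expects each index/column pair to be unique; when it is not, `pivot_table()` with an aggregation function is the usual fallback.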
23 votes, 5 answers

How to annotate text documents with meta-data?

Having a lot of text documents (in natural language, unstructured), what are the possible ways of annotating them with some semantic meta-data? For example, consider a short document: I saw the company's manager last day. To be able to extract…
Amir Ali Akbari
  • 1,393
  • 3
  • 13
  • 25
22 votes, 3 answers

When to use Standard Scaler and when Normalizer?

I understand what Standard Scaler does and what Normalizer does, per the scikit-learn documentation: Normalizer, Standard Scaler. I know when Standard Scaler is applied. But in which scenario is Normalizer applied? Are there scenarios where one is…
Heisenbug
  • 401
  • 1
  • 3
  • 6
22 votes, 2 answers

Convert a pandas column of int to timestamp datatype

I have a dataframe that, among other things, contains a column with the number of milliseconds elapsed since 1970-1-1. I need to convert this column of ints to timestamp data, so I can then ultimately convert it to a column of datetime data by adding…
Austin Capobianco
  • 483
  • 1
  • 4
  • 18
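For the millisecond-epoch question above, one common approach is `pandas.to_datetime` with `unit="ms"`. This is a minimal sketch under assumptions: the column name `ts_ms` and the sample values are hypothetical, since the excerpt does not show the asker's data.

```python
import pandas as pd

# Hypothetical data: milliseconds elapsed since 1970-01-01 (UTC).
df = pd.DataFrame({"ts_ms": [0, 86_400_000, 1_500_000_000_000]})

# unit="ms" tells pandas to read the integers as milliseconds since the epoch.
df["ts"] = pd.to_datetime(df["ts_ms"], unit="ms")
```

The result is a timezone-naive `datetime64` column, so no separate "add to 1970-01-01" step should be needed.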
20 votes, 5 answers

Do modern R and/or Python libraries make SQL obsolete?

I work in an office where SQL Server is the backbone of everything we do, from data processing to cleaning to munging. My colleague specializes in writing complex functions and stored procedures to methodically process incoming data so that it can…
AffableAmbler
  • 363
  • 1
  • 2
  • 10
16 votes, 2 answers

How much data are sufficient to train my machine learning model?

I've been working on machine learning and bioinformatics for a while, and today I had a conversation with a colleague about the main general issues of data mining. My colleague (who is a machine learning expert) said that, in his opinion, the…
16 votes, 4 answers

How to do postal addresses fuzzy matching?

I would like to know how to match postal addresses when their formats differ or when one of them is misspelled. So far I've found different solutions, but I think that they are quite old and not very efficient. I'm sure some better methods exist, so…
Stéphanie C
  • 281
  • 1
  • 2
  • 5
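As a baseline for the address-matching question above, one simple idea is to normalize both strings and then compare them with a character-level similarity ratio. The sketch below uses only Python's standard library (`difflib.SequenceMatcher`); the `ABBREVIATIONS` table and the helper names `normalize` and `similarity` are hypothetical, and this is not claimed to be the method proposed in the thread's answers.

```python
from difflib import SequenceMatcher
import re

# Hypothetical abbreviation table; a real one would be far larger.
ABBREVIATIONS = {"st": "street", "ave": "avenue", "rd": "road"}

def normalize(address: str) -> str:
    """Lowercase, drop punctuation, and expand a few abbreviations."""
    tokens = re.sub(r"[^\w\s]", " ", address.lower()).split()
    return " ".join(ABBREVIATIONS.get(t, t) for t in tokens)

def similarity(a: str, b: str) -> float:
    """Return a ratio in [0, 1]; 1.0 means identical after normalization."""
    return SequenceMatcher(None, normalize(a), normalize(b)).ratio()

score_close = similarity("12 Main St.", "12 main street")       # format differs
score_typo = similarity("12 Main Street", "12 Mian Street")     # misspelling
score_far = similarity("12 Main Street", "99 Ocean Avenue")     # different address
```

A threshold on the score then decides whether two addresses count as a match. Pairwise comparison scales quadratically, so larger datasets usually need a blocking step first (for example, comparing only addresses that share a postal code).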
14 votes, 10 answers

How can I appropriately handle cleaning of gender data?

I’m a data science student and I’ve begun working with an open mental health dataset. As part of this, I need to clean the data so that I can perform an analysis of it. In this dataset, the gender field is a string that could have had anything…
nick012000
  • 263
  • 2
  • 9
13 votes, 1 answer

Do I have to standardize my new polynomial features?

I have a vector X with n features that were previously standardized. If I want to generate new polynomial features (say, by adding squared features), do I need to standardize these new features again after computing them? Because knowing that my…
jmvllt
  • 619
  • 1
  • 8
  • 15
12 votes, 2 answers

Creating new columns by iterating over rows in pandas dataframe

I have a pandas data frame (X11) like this (in actuality I have 99 columns, up to dx99):

         dx1    dx2    dx3   dx4
    0  25041  40391   5856     0
    1  25041  40391  25081  5856
    2  25041  40391  42822     0
    3  25061  40391      0     0
    4  25041  …
Sanoj
  • 251
  • 1
  • 2
  • 6
12 votes, 5 answers

Please review my sketch of the Machine Learning process

It's amazingly difficult to find an outline of the end-to-end machine learning process. As a total beginner, this lack of information is frustrating, so I decided to try scraping together my own process by looking at a lot of tutorials that all do…