Highest Voted 'data-wrangling' Questions - Data Science Stack Exchange

46

votes

9 answers

How much of data wrangling is a data scientist's job?

I'm currently working as a data scientist at a large company (my first job as a DS, so this question may be a result of my lack of experience). They have a huge backlog of really important data science projects that would have a great positive…

data-wrangling

asked Apr 03 '19 at 15:16

Victor Valente

569
4
9

12

votes

4 answers

Export pandas to dictionary by combining multiple row values

I have a pandas dataframe df that looks like this:
name  value1  value2
A     123     1
B     345     5
C     712     4
B     768     2
A     318     9
C     178     6
A     321     3
I want to convert…

python pandas data-wrangling

asked May 29 '18 at 15:48

sfactor

223
1
2
6

4

votes

2 answers

Tools to perform SQL analytics on 350TB of csv data

In short, what would be the best methods/tricks/techniques/tools for performing ad hoc SQL(-style) queries on 350TB of csv data? Would there be other options, tool-wise, that would do it faster if we dropped the "SQL" requirement? Is my best option…

bigdata dataset data-wrangling

asked Jan 07 '16 at 02:33

Kevin Vasko

301
1
5

4

votes

2 answers

Data wrangling for a big set of docx files advice!

I'm looking for some advice on a data wrangling problem I'm trying to solve. I've spent a week solid taking different approaches and nothing seems to be quite perfect. Just FYI, this is my first big (for me anyway) data science project, so I'm…

python similar-documents data-wrangling

asked Jun 29 '19 at 11:16

mess1n

41
1

4

votes

3 answers

How to deal with count data in random forest

I am working on a classification model where my target class is a biased class, with class counts of 0: 20694 and 1: 101. Most of my features are counts of the number of times a certain event was triggered. While exploring these features I…

machine-learning random-forest data-cleaning machine-learning-model data-wrangling

asked Feb 12 '19 at 22:59

Tushar Mehta

143
3

4

votes

2 answers

Mean across every several rows in pandas

I have a table of features and labels where each row has a time stamp. Labels are categorical. They go in a batch where one label repeats several times. Batches with the same label do not have a specific order. The number of repetitions of the same…

python pandas sql data-wrangling data-table

asked Jan 10 '19 at 12:42

Munira

157
2
9

3

votes

0 answers

How to use zero-inflated negative binomial regression for binary classification task?

I am working on a binary classification problem and I am currently employing XGBoost. The dataset consists of several variables which are count variables. The problem is, these features are highly skewed on counts. For example, these are the counts…

machine-learning statistics data-cleaning data-wrangling

asked Jul 26 '19 at 07:18

Rohit Gavval

321
1
9

3

votes

2 answers

Inputting (a lot of) data into a dataframe one row at a time

I'm using python. Some 2D numpy arrays are stored in individual rows of a Series. They are 30x30 images. It looks something like this: pixels 0 [[23,4,54...],[54,6,7...],[........]] 1 …

python pandas numpy dataframe data-wrangling

asked Feb 21 '19 at 06:08

Isu Shrestha

31
4

3

votes

1 answer

Populate column based on previous row with a twist

pandas data-wrangling

asked Feb 14 '18 at 23:13

nidkil

131
1
1
3

3

votes

2 answers

When to choose character instead of factor in R?

I am currently working on a dataset which contains a name attribute, which stands for a person's first name. After reading the csv file with read.csv, the variable is a factor by default (stringsAsFactors=TRUE) with ~10k levels. Since name does not…

r data-wrangling

asked Jun 01 '16 at 16:11

lupi5

45
2
2
5

3

votes

3 answers

Giving each person in order their top choice which is still available in Google Sheets

The problem I want to solve is my residential building's garage choices. There will be a random distribution of parking spaces. I thought that it would be better if each person writes down which spaces they want in order of preference, and then…

classification excel data-wrangling

asked May 23 '22 at 11:33

Heleno Paiva

131
3

2

votes

1 answer

How to preprocess an ordered categorical variable to feed a machine learning algorithm?

I have a categorical variable that measures the income of a family: A: no income; B: Up to $500; C: $500-$700; …; P: $5000-$6000; Q: More than $6000. It seems odd to me that I have to get dummies for this variable, since it's ordered. I wonder if it's…

machine-learning dataset preprocessing data-wrangling

asked Aug 20 '20 at 18:58

marcus

21
2

2

votes

0 answers

Tools for reading data from large, irregular csv files (aka excel file hell)

I have a csv file with 1000 columns and 50 rows that was collected over a few months. Embedded randomly in the file are ~1000 small datasets with semi-standard formats (some columns differ, as does the number of rows). There are also identifying…

python r data-cleaning excel data-wrangling

asked Sep 17 '19 at 22:55

R Greg Stacey

141
4

2

votes

1 answer

What should I do with the NaN values on this stock quote data?

I concatenated 3 stock quote data-frames all with date-time indexes. However, they differ in starting dates so the resulting data-frame contains NaN values for the stock quotes with more recent starting dates. Should I just drop the rows with NaN…

time-series data data-wrangling

asked May 05 '19 at 18:29

Yoyong

37
4

2

votes

1 answer

R Combine Multiple Rows of DataFrame by creating new columns and union values

I have a dataframe in R that looks like this:
ID    APPROVAL_STEP  APPROVAL_STATUS  APPROVAL_DATE  APPROVER
1234  STEP_A         APPROVED         23-Jan-2019    John Smith
1234  STEP_B         APPROVED         21-Jan-2019    Jane…

r data-cleaning data-wrangling

asked Mar 12 '19 at 17:36

user69420
