Questions tagged [data-wrangling]
55 questions
46
votes
9 answers
How much of data wrangling is a data scientist's job?
I'm currently working as a data scientist at a large company (my first job as a DS, so this question may be a result of my lack of experience). They have a huge backlog of really important data science projects that would have a great positive…
Victor Valente
12
votes
4 answers
Export pandas to dictionary by combining multiple row values
I have a pandas dataframe df that looks like this
name value1 value2
A 123 1
B 345 5
C 712 4
B 768 2
A 318 9
C 178 6
A 321 3
I want to convert…
sfactor
4
votes
2 answers
Tools to perform SQL analytics on 350TB of csv data
In short, what would be the best methods, tricks, techniques, or tools for performing ad hoc SQL-style queries on 350TB of CSV data? Would other tools do it faster if we dropped the "SQL" requirement?
Is my best option…
Kevin Vasko
4
votes
2 answers
Data wrangling for a big set of docx files advice!
I'm looking for some advice on a data wrangling problem I'm trying to solve. I've spent a week solid taking different approaches and nothing seems to be quite perfect. Just FYI, this is my first big (for me anyway) data science project, so I'm…
mess1n
4
votes
3 answers
How to deal with count data in random forest
I am working on a classification model where the target class is heavily imbalanced, with the following class counts:
0 1
20694 101
Most of my features are counts of the number of times a certain event was triggered. While exploring these features I…
Tushar Mehta
4
votes
2 answers
Mean across every several rows in pandas
I have a table of features and labels where each row has a time stamp. Labels are categorical. They go in a batch where one label repeats several times. Batches with the same label do not have a specific order. The number of repetitions of the same…
Munira
3
votes
0 answers
How to use zero-inflated negative binomial regression for binary classification task?
I am working on a binary classification problem and I am currently employing XGBoost. The dataset consists of several variables which are count variables. The problem is, these features are highly skewed on counts. For example, these are the counts…
Rohit Gavval
3
votes
2 answers
Inputting (a lot of) data into a dataframe one row at a time
I'm using Python. Some 2D NumPy arrays are stored in individual rows of a Series. They are 30x30 images. It looks something like this:
pixels
0 [[23,4,54...],[54,6,7...],[........]]
1 …
Isu Shrestha
3
votes
1 answer
Populate column based on previous row with a twist
I'm struggling with a Pandas problem. I have the following data.
+--------+------+---------+---------+-------------+-------------+--------------+------------+-------------+------------+----------+
| symbol | side | status | origQty | executedQty | …
nidkil
3
votes
2 answers
When to choose character instead of factor in R?
I am currently working on a dataset which contains a name attribute, which stands for a person's first name. After reading the csv file with read.csv, the variable is a factor by default (stringsAsFactors=TRUE) with ~10k levels. Since name does not…
lupi5
3
votes
3 answers
Giving each person in order their top choice which is still available in Google Sheets
The problem I want to solve is my residential building's garage choices.
There will be a random distribution of parking spaces.
I thought that it would be better if each person writes down which spaces they want in order of preference, and then…
Heleno Paiva
2
votes
1 answer
How to preprocess an ordered categorical variable to feed a machine learning algorithm?
I have a categorical variable that measures the income of a family:
A: no income
B: Up to $500
C: $500-$700
…
P: $5000-$6000
Q: More than $6000
It seems odd to me that I have to get dummies for this variable, since it's ordered. I wonder if it's…
marcus
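For the ordered income brackets in the question above, one option sometimes used instead of dummy variables is an ordinal encoding. The following is a minimal sketch, not the accepted answer: the DataFrame, the column names `income` and `income_code`, and the sample rows are hypothetical, and only the five bracket letters visible in the excerpt are used because the ones between C and P are elided.

```python
import pandas as pd

# Bracket letters shown in the excerpt, in ascending order of income.
# The letters between C and P are elided in the question, so they are omitted here.
levels = ["A", "B", "C", "P", "Q"]

# Hypothetical sample data for illustration only.
df = pd.DataFrame({"income": ["B", "Q", "A", "P", "C"]})

# Declare the column as an ordered categorical, then take its integer codes.
income_type = pd.CategoricalDtype(categories=levels, ordered=True)
df["income_code"] = df["income"].astype(income_type).cat.codes

print(df)
```

Integer codes preserve the ordering but treat the brackets as evenly spaced, which may or may not suit a given model. Tree-based models generally only use the ordering, while linear models would read the codes as a numeric scale.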
2
votes
0 answers
Tools for reading data from large, irregular csv files (aka excel file hell)
I have a csv file with 1000 columns and 50 rows that was collected over a few months. Embedded randomly in the file are ~1000 small datasets with semi-standard formats (some columns differ, so does number of rows). There are also identifying…
R Greg Stacey
2
votes
1 answer
What should I do with the NaN values on this stock quote data?
I concatenated 3 stock quote data-frames all with date-time indexes.
However, they differ in starting dates so the resulting data-frame contains NaN values for the stock quotes with more recent starting dates.
Should I just drop the rows with NaN…
Yoyong
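The situation described in the question above (series with different starting dates producing NaN after concatenation) can be reproduced in a few lines. This is a sketch under assumptions: the ticker names `AAA`, `BBB`, `CCC`, the dates, and the values are made up, and dropping rows is shown only as one of the options the asker mentions, not as a recommendation.

```python
import pandas as pd

# Three hypothetical quote series that start on different dates.
a = pd.Series([1.0, 2.0, 3.0, 4.0],
              index=pd.date_range("2020-01-01", periods=4), name="AAA")
b = pd.Series([10.0, 11.0, 12.0],
              index=pd.date_range("2020-01-02", periods=3), name="BBB")
c = pd.Series([20.0, 21.0],
              index=pd.date_range("2020-01-03", periods=2), name="CCC")

# Column-wise concatenation aligns on the date index; dates missing
# from a series become NaN in that column.
quotes = pd.concat([a, b, c], axis=1)

# Keep only the dates on which every series has a value.
common = quotes.dropna()

print(quotes)
print(common)
```

Dropping rows keeps only the overlapping date range, so the earlier history of the longer series is lost. Whether that trade-off is acceptable depends on the analysis.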
2
votes
1 answer
R Combine Multiple Rows of DataFrame by creating new columns and union values
I have a dataframe in R that looks like this
ID APPROVAL_STEP APPROVAL_STATUS APPROVAL_DATE APPROVER
1234 STEP_A APPROVED 23-Jan-2019 John Smith
1234 STEP_B APPROVED 21-Jan-2019 Jane…
user69420