Questions tagged [data-wrangling]

55 questions
46
votes
9 answers

How much of data wrangling is a data scientist's job?

I'm currently working as a data scientist at a large company (my first job as a DS, so this question may be a result of my lack of experience). They have a huge backlog of really important data science projects that would have a great positive…
Victor Valente
  • 569
  • 4
  • 9
12
votes
4 answers

Export pandas to dictionary by combining multiple row values

I have a pandas dataframe df that looks like this name value1 value2 A 123 1 B 345 5 C 712 4 B 768 2 A 318 9 C 178 6 A 321 3 I want to convert…
sfactor
  • 223
  • 1
  • 2
  • 6
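
The excerpt is cut off before the desired output, so the following is only a minimal sketch of one plausible reading: collect every row's (value1, value2) pair under its name. The helper name `rows_by_name` and the target dictionary shape are assumptions, not the asker's stated requirement.

```python
import pandas as pd

# Data copied from the question excerpt.
df = pd.DataFrame({
    "name":   ["A", "B", "C", "B", "A", "C", "A"],
    "value1": [123, 345, 712, 768, 318, 178, 321],
    "value2": [1, 5, 4, 2, 9, 6, 3],
})

def rows_by_name(frame):
    """Map each name to the list of its (value1, value2) pairs, in row order.

    Hypothetical helper: the output shape is an assumption.
    """
    result = {}
    # sort=False keeps the groups in order of first appearance.
    for name, group in frame.groupby("name", sort=False):
        result[name] = list(zip(group["value1"].tolist(), group["value2"].tolist()))
    return result

combined = rows_by_name(df)
```

If a different shape is wanted (for example one list per column), only the body of the loop would need to change.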
4
votes
2 answers

Tools to perform SQL analytics on 350TB of csv data

In short, what would be the best method/tricks/techniques/tools for performing ad hoc sql (style) queries on 350TB of csv data? Would there be other options, tool wise that would do it faster if we dropped the "sql" requirement? Is my best option…
Kevin Vasko
  • 301
  • 1
  • 5
4
votes
2 answers

Data wrangling for a big set of docx files advice!

I'm looking for some advice on a data wrangling problem I'm trying to solve. I've spent a week solid taking different approaches and nothing seems to be quite perfect. Just FYI, this is my first big (for me anyway) data science project, so I'm…
mess1n
  • 41
  • 1
4
votes
3 answers

How to deal with count data in random forest

I am working on a classification model where my target is imbalanced, with class counts of 20694 for class 0 and 101 for class 1. Most of my features are counts of the number of times a certain event was triggered. While exploring these features I…
4
votes
2 answers

Mean across every several rows in pandas

I have a table of features and labels where each row has a time stamp. Labels are categorical. They go in a batch where one label repeats several times. Batches with the same label do not have a specific order. The number of repetitions of the same…
Munira
  • 157
  • 2
  • 9
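
A minimal sketch of one way to average features within each run of identical labels, assuming that a "batch" means consecutive rows sharing a label. The toy data and the column names `label` and `feature` are hypothetical, since the excerpt does not show the actual table.

```python
import pandas as pd

# Hypothetical toy data standing in for the question's table.
df = pd.DataFrame({
    "label":   ["a", "a", "a", "b", "b", "a", "a"],
    "feature": [1.0, 2.0, 3.0, 10.0, 20.0, 5.0, 7.0],
})

# A new batch starts whenever the label differs from the previous row,
# so the running count of changes gives every batch its own id.
batch_id = (df["label"] != df["label"].shift()).cumsum().rename("batch")

# One output row per batch: its label and the mean of the feature.
batch_means = (
    df.groupby(batch_id)
      .agg(label=("label", "first"), feature_mean=("feature", "mean"))
      .reset_index(drop=True)
)
```

Grouping on the change counter rather than on the label itself keeps two separate batches of the same label apart.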
3
votes
0 answers

How to use zero-inflated negative binomial regression for binary classification task?

I am working on a binary classification problem and I am currently employing XGBoost. The dataset consists of several variables which are count variables. The problem is, these features are highly skewed on counts. For example, these are the counts…
3
votes
2 answers

Inputting (a lot of) data into a dataframe one row at a time

I'm using Python. Some 2D NumPy arrays are stored in individual rows of a Series. They are 30x30 images. It looks something like this: pixels 0 [[23,4,54...],[54,6,7...],[........]] 1 …
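
The excerpt ends before the actual goal is stated, so this is only a sketch under the assumption that each 30x30 image should become one flat row of a dataframe. It builds the frame in a single step instead of appending rows one at a time, which is usually slow in pandas. The random images and the names `pixels` and `wide` are hypothetical.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the question's Series of 30x30 images.
rng = np.random.default_rng(0)
pixels = pd.Series(
    [rng.integers(0, 256, size=(30, 30)) for _ in range(5)],
    name="pixels",
)

# Stack the per-row arrays into one (n_images, 30, 30) array,
# then flatten every image into a row of 900 values.
stacked = np.stack(pixels.tolist())
flat = stacked.reshape(len(pixels), -1)

# Build the dataframe once, reusing the Series index.
wide = pd.DataFrame(flat, index=pixels.index)
```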
3
votes
1 answer

Populate column based on previous row with a twist

I'm struggling with a Pandas problem. I have the following data. +--------+------+---------+---------+-------------+-------------+--------------+------------+-------------+------------+----------+ | symbol | side | status | origQty | executedQty | …
nidkil
  • 131
  • 1
  • 1
  • 3
3
votes
2 answers

When to choose character instead of factor in R?

I am currently working on a dataset which contains a name attribute, which stands for a person's first name. After reading the csv file with read.csv, the variable is a factor by default (stringsAsFactors=TRUE) with ~10k levels. Since name does not…
lupi5
  • 45
  • 2
  • 2
  • 5
3
votes
3 answers

Giving each person in order their top choice which is still available in Google Sheets

The problem I want to solve is my residential building's garage choices. There will be a random distribution of parking spaces. I thought that it would be better if each person writes down which spaces they want in order of preference, and then…
Heleno Paiva
  • 131
  • 3
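
The allocation rule described in this excerpt (each person, in order, gets their highest-ranked space that is still free) can be sketched as below. This is a Python illustration of the logic rather than a Google Sheets formula, and the function name, resident names, and space labels are all hypothetical.

```python
def assign_spaces(order, preferences):
    """Give each person, in picking order, their top choice that is still free.

    order: list of person names in picking order.
    preferences: dict mapping person -> list of spaces, most preferred first.
    Returns a dict person -> space, or None if all their listed spaces are taken.
    """
    taken = set()
    assignment = {}
    for person in order:
        # First listed space that nobody earlier in the order has claimed.
        choice = next(
            (space for space in preferences.get(person, []) if space not in taken),
            None,
        )
        assignment[person] = choice
        if choice is not None:
            taken.add(choice)
    return assignment

# Hypothetical residents and space labels, for illustration only.
order = ["Ana", "Bruno", "Carla"]
preferences = {
    "Ana":   ["P1", "P2", "P3"],
    "Bruno": ["P1", "P3", "P2"],
    "Carla": ["P1", "P3", "P2"],
}
result = assign_spaces(order, preferences)
```

Here Ana takes P1, Bruno falls back to P3, and Carla is left with P2.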
2
votes
1 answer

How to preprocess an ordered categorical variable to feed a machine learning algorithm?

I have a categorical variable that measures the income of a family: A: no income B: Up to $500 C: $500-$700 … P: $5000-$6000 Q: More than $6000 It seems odd to me that I have to get dummies for this variable, since it's ordered. I wonder if it's…
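
One common alternative to dummy variables for an ordered category is to map each level to its rank. The sketch below assumes the bracket letters run consecutively from A to Q (the excerpt elides the middle ones) and uses a made-up sample; whether integer ranks suit a given model is a separate judgment call.

```python
import string
import pandas as pd

# Assumption: bracket codes are the consecutive letters 'A' through 'Q'.
levels = list(string.ascii_uppercase[:17])

# Hypothetical sample of family income brackets.
income = pd.Series(["C", "A", "Q", "B", "P"])

# An ordered categorical keeps the ranking explicit; .codes gives 0-based ranks.
ordered = pd.Categorical(income, categories=levels, ordered=True)
income_rank = pd.Series(ordered.codes, index=income.index)
```

Tree-based models only use the ordering of these ranks, while linear models would also treat the gaps between brackets as equal, which the bracket widths in the excerpt do not support.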
2
votes
0 answers

Tools for reading data from large, irregular csv files (aka excel file hell)

I have a csv file with 1000 columns and 50 rows that was collected over a few months. Embedded randomly in the file are ~1000 small datasets with semi-standard formats (some columns differ, so does number of rows). There are also identifying…
2
votes
1 answer

What should I do with the NaN values on this stock quote data?

I concatenated 3 stock quote data-frames all with date-time indexes. However, they differ in starting dates so the resulting data-frame contains NaN values for the stock quotes with more recent starting dates. Should I just drop the rows with NaN…
Yoyong
  • 37
  • 4
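
A small sketch of the situation described in this excerpt, using made-up prices and the hypothetical tickers AAA, BBB, and CCC: concatenating series with different start dates leaves NaN before each later start, and dropping those rows restricts the data to the common date range. Whether dropping is appropriate depends on the analysis, which the excerpt does not specify.

```python
import pandas as pd

# Hypothetical quotes for three tickers with different start dates.
a = pd.Series([10.0, 11.0, 12.0, 13.0],
              index=pd.date_range("2020-01-01", periods=4), name="AAA")
b = pd.Series([20.0, 21.0, 22.0],
              index=pd.date_range("2020-01-02", periods=3), name="BBB")
c = pd.Series([30.0, 31.0],
              index=pd.date_range("2020-01-03", periods=2), name="CCC")

# Column-wise concat aligns on the date index; missing dates become NaN.
quotes = pd.concat([a, b, c], axis=1)

# Count the gaps per ticker before deciding what to do with them.
missing = quotes.isna().sum()

# One option: keep only the dates on which every ticker has a quote.
common = quotes.dropna()
```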
2
votes
1 answer

R Combine Multiple Rows of DataFrame by creating new columns and union values

I have a dataframe in R that looks like this ID APPROVAL_STEP APPROVAL_STATUS APPROVAL_DATE APPROVER 1234 STEP_A APPROVED 23-Jan-2019 John Smith 1234 STEP_B APPROVED 21-Jan-2019 Jane…
user69420