11

I feel like this is a rudimentary question but I'm very new to this and just haven't been able to crack it / find the answer.

Ultimately what I'm trying to do here is to count unique values on a certain column and then determine which of those unique values have more than one unique value in a matching column.

So for this data, what I am trying to determine is "who" has "more than one receipt" for all purchases, then determine the same information based on each product category.

My approach so far:

We have a dataset like this:

receipt,name,etc,category
1,george,xxx,fish
1,george,xxx,cat
2,george,xxx,fish
3,bill,xxx,fish
3,bill,xxx,dog
4,jill,xxx,cat
5,bill,xxx,cat
5,bill,xxx,cat
5,bill,xxx,dog
6,george,xxx,fish

So then I can do this:

df.set_index(['name','receipt'])

And get the more interesting

                etc category
name   receipt
george 1        xxx     fish
       1        xxx      cat
       2        xxx     fish
bill   3        xxx     fish
       3        xxx      dog
jill   4        xxx      cat
bill   5        xxx      cat
       5        xxx      cat
       5        xxx      dog
george 6        xxx     fish

At this point it feels to me like the data is easy to work with, but I haven't figured it out.

One thing that is interesting to me is that if I sort the data by name before indexing it, the data displays grouped by name. In both cases the index is the same, so I don't know how to play with the representation of the data after indexing.

It is easy to find the data by category using

>>> orders.loc[orders['category'] == 'fish']
                etc category
name   receipt
george 1        xxx     fish
       2        xxx     fish
bill   3        xxx     fish
george 6        xxx     fish

But what I can't figure out is how to tell pandas "Find me the list of names that have more than one receipt".

Smaller questions:

  • What is the "pandas way" to get the length of the names part of the index? I'm supposing I could just turn the name column into a set and get the length of that. But I'm curious about indexes.

Edit / Update

Thanks for those answers! Here is clarification on what I am looking for:

I'm trying to find "repeat customers": people with more than one receipt.

So my set of all customers would be:

names: ['george','bill','jill'], ratio: 1.0

My repeat customers:

names: ['george','bill'], ratio 0.66

All 'fish' customers:

names: ['george','bill'], ratio: 0.666

My repeat 'fish' customers:

names: ['george'], ratio: 0.333

I think the examples given look helpful, but feel free to add anything.

Mr. Hasquestions
  • 113
  • 1
  • 1
  • 6

2 Answers2

7

I think maybe you are looking for:

receipts_by_name_x_cat = df.groupby(['name','category']).count()

Or, if you just want the total across all categories:

receipts_by_name = df.groupby(['name']).count()

Then, you can search those who have more than one:

receipts_by_name[receipts_by_name['receipt']>1]

And, you can find the length of an index by typing:

len(df.index.get_level_values(0))

Assuming the name was the first index column (otherwise substitute 1, 2, etc.)

atkat12
  • 278
  • 2
  • 5
  • This is extremely helpful. I worded my question wrong though. "More than one receipt" should be "More than one transaction, meaning they have more than one **unique** receipt." – Mr. Hasquestions Nov 10 '16 at 19:16
  • Glad to hear it. Re: your question, how are the two different? what defines a unique receipt - is the 'receipt' column the receipt id? and the category is all the items they bought on that receipt? If so you can try `receipts_by_name = df.groupby(['name']).unique()` and then take the answer in the 'receipt' column, I think. You just need to apply a different groupby function. – atkat12 Nov 11 '16 at 17:35
  • Right, so one way of thinking about it is each row in the dataset represents a line item in a purchase, so the receipt is the "receipt ID" of that transaction. So for example, Bill bought two cat items during transaction #5 and he is responsible for transaction #2. So he is a "repeat shopper" but he is not a "repeat cat shopper" since his two cat purchases happened during the same transaction. – Mr. Hasquestions Nov 11 '16 at 21:58
3

It's not quite clear what exactly are you trying to achieve (it would be helpful to understand your goals if you would post desired / expected data sets)...

But i'll try to guess ;)

Data:

In [100]: df
Out[100]:
   receipt    name  etc category
0        1  george  xxx     fish
1        1  george  xxx      cat
2        2  george  xxx     fish
3        3    bill  xxx     fish
4        3    bill  xxx      dog
5        4    jill  xxx      cat
6        5    bill  xxx      cat
7        5    bill  xxx      cat
8        5    bill  xxx      dog
9        6  george  xxx     fish

Setting virtual column count, showing # of rows grouped by name and filtering (querying) it using .query() method:

In [101]: (df.assign(count=df.groupby('name').receipt.transform('size'))
     ...:    .query("category in ['dog','cat'] and count > 1"))
     ...:
Out[101]:
   receipt    name  etc category  count
1        1  george  xxx      cat      4
4        3    bill  xxx      dog      5
6        5    bill  xxx      cat      5
7        5    bill  xxx      cat      5
8        5    bill  xxx      dog      5

or you can group by several columns and filter resulting groups:

In [102]: df.groupby(['name','category']).filter(lambda x: len(x) > 2)
Out[102]:
   receipt    name  etc category
0        1  george  xxx     fish
2        2  george  xxx     fish
9        6  george  xxx     fish