
I have the following problem:

There is a large set of records. Each record in the set has an attribute. For some values of the attribute there is only one record; for other values there are many records with that value. I want to construct the distribution of how many attribute values have a given count of records. For example: a billion values of the attribute have exactly one record, half a billion have two records, ..., and finally one value of the attribute has five million records. Imagine a list of all dogs in the world with the name of the owner as the attribute; I need to know how many people have one dog, how many have two dogs, and so on. For small sets this distribution is easy to compute, but for a large set a lot of memory is needed. I am interested in approximate methods which use some kind of sampling to obtain results with bounded precision. I am sure this must be a well-studied problem, but I do not know what to look for.
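For a small set I would simply do the following (a minimal Python sketch with toy data; for my real data the per-value counter would not fit in memory):

```python
from collections import Counter

# Toy data: one (dog, owner) record per dog.
records = [("Rex", "Alice"), ("Fido", "Alice"),
           ("Spot", "Bob"), ("Lassie", "Carol")]

dogs_per_owner = Counter(owner for _dog, owner in records)  # value -> count
histogram = Counter(dogs_per_owner.values())                # count -> #values

print(histogram)  # Counter({1: 2, 2: 1}): two owners with one dog, one with two
```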

Could you please give me some pointers on where to look, and what this problem is called in the literature?

danatel
  • This would typically be a bar chart with a bar for each category of dog-ownership. The one thing you are going to struggle with here is the vastly different orders of magnitude: categories with 1 or 2 members will not plot well visually next to ones with a billion. – sconfluentus Oct 25 '22 at 15:11
  • Are you looking for [count-min sketch](https://en.wikipedia.org/wiki/Count%E2%80%93min_sketch)? I found it through the Wikipedia page for the related [HyperLogLog](https://en.wikipedia.org/wiki/HyperLogLog) algorithm which pointed to this [nice site](https://highlyscalable.wordpress.com/2012/05/01/probabilistic-structures-web-analytics-data-mining) where your problem was mentioned. – Valentas Oct 27 '22 at 09:36
  • @Valentas - thank you very much for the pointer, but I don't think this is the algorithm I need. The count-min sketch gives the frequency of a particular event (how many dogs are owned by Alice?); see the toy sketch after these comments. I believe my problem is in fact simpler, and that it can be solved by sampling. – danatel Oct 28 '22 at 16:44
  • Indeed, approximate frequencies of the top most frequent elements are not enough for you; you need one more aggregation step. – Valentas Oct 28 '22 at 18:17
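To make the distinction in these comments concrete, here is a toy count-min sketch (an illustrative simplification; it salts Python's built-in `hash`, which is only stable within one process). It answers point queries such as "how many records have this value?", but does not yield the count-of-counts histogram directly:

```python
import random

class CountMinSketch:
    """Toy count-min sketch: estimates per-value record counts."""
    def __init__(self, width=2048, depth=5, seed=42):
        rng = random.Random(seed)
        self.width = width
        self.salts = [rng.getrandbits(64) for _ in range(depth)]
        self.rows = [[0] * width for _ in range(depth)]

    def add(self, value):
        for row, salt in zip(self.rows, self.salts):
            row[hash((salt, value)) % self.width] += 1

    def estimate(self, value):
        # Collisions only inflate cells, so the minimum over rows
        # is an upper-biased estimate of the true count.
        return min(row[hash((salt, value)) % self.width]
                   for row, salt in zip(self.rows, self.salts))

cms = CountMinSketch()
for owner in ["Alice", "Alice", "Bob"]:
    cms.add(owner)
print(cms.estimate("Alice"))  # 2: a point query, not the histogram itself
```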

1 Answer


I think you can use this answer, or the references mentioned there, to sample a certain number $N$ of distinct names, and then do one more scan over the data to compute their counts and build an approximate histogram.
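I have not verified that the linked answer works exactly this way, but one bounded-memory way to sample distinct values is a "bottom-$N$" hash sketch: keep the $N$ distinct values with the smallest hashes, which (for a well-mixing hash) is a uniform sample of the distinct values. A rough Python sketch, with made-up helper names:

```python
import hashlib
import heapq
from collections import Counter

def _h(value):
    # Deterministic 64-bit hash of the attribute value.
    return int.from_bytes(
        hashlib.blake2b(str(value).encode(), digest_size=8).digest(), "big")

def sample_distinct(stream, n):
    """Pass 1: keep the n distinct values with the smallest hashes.
    With a well-mixing hash this is a uniform sample of the distinct
    values, using O(n) memory regardless of data size."""
    heap, members = [], set()  # max-heap via negated hash
    for value in stream:
        if value in members:
            continue
        h = _h(value)
        if len(heap) < n:
            heapq.heappush(heap, (-h, value))
            members.add(value)
        elif h < -heap[0][0]:
            _, evicted = heapq.heapreplace(heap, (-h, value))
            members.discard(evicted)
            members.add(value)
    return members

def approx_histogram(stream_again, sampled):
    """Pass 2: exact counts for the sampled values only. The histogram
    of those counts estimates the shape of the true count-of-counts
    distribution (scale by #distinct / n for absolute numbers)."""
    counts = Counter(v for v in stream_again if v in sampled)
    return Counter(counts.values())

owners = ["Alice", "Alice", "Bob", "Carol", "Bob", "Alice"]
sampled = sample_distinct(owners, n=2)
print(approx_histogram(owners, sampled))
```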

I have not checked whether it amounts to the same thing, but a simple idea might be to first use reservoir sampling to pick $M > N$ random records, and then retain the $i$-th sampled record only if it happens to be the first record with that value in your array (discarding the ones that duplicate an earlier value).
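A minimal sketch of the first step (classic reservoir sampling, "Algorithm R"; the deduplication afterwards is the "first record with that value" filter described above):

```python
import random

def reservoir_sample(stream, m, rng=random.Random(0)):
    """Uniform sample of m records from a stream of unknown
    length, using O(m) memory (classic Algorithm R)."""
    sample = []
    for i, record in enumerate(stream):
        if i < m:
            sample.append(record)
        else:
            j = rng.randrange(i + 1)  # keep record with probability m/(i+1)
            if j < m:
                sample[j] = record
    return sample

# Afterwards, keep each sampled record only if no earlier record in the
# sample has the same attribute value, hoping at least N distinct values
# survive the deduplication.
```

One caveat: this samples records rather than values, so frequent values are over-represented before deduplication, which is part of why I am not sure it gives the same guarantees as the linked approach.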

Valentas