I'm hoping you have some research or experience with determining the completeness of a data set. I'm trying to use a Twitter dataset I scraped myself and want an indication of its completeness. Obviously, I will miss some data, but I am wondering whether there is a formula or method to estimate how complete the data set is.

  • Can you formalize your notion of completeness a little bit? – Emre Feb 25 '16 at 17:16
  • @emre I agree, what are you trying to measure exactly? What is your purpose? This paper discusses different definitions of completeness. http://www.sciencedirect.com/science/article/pii/S1532046413000853 – Juan Leni Feb 26 '16 at 10:48

1 Answer

If you are talking about exploring your data for patterns of missing data, you can try using Self-Organizing Maps (https://en.wikipedia.org/wiki/Self-organizing_map), which are a special flavor of neural network. Here is a small research paper explaining the concept a bit. Here is another [link][3] with historical info and links to some original papers about the topic, specifically by Kohonen et al.

From Wikipedia:

Self-organizing maps are different from other artificial neural networks as they apply competitive learning as opposed to error-correction learning (such as backpropagation with gradient descent), and in the sense that they use a neighborhood function to preserve the topological properties of the input space. This makes SOMs useful for visualizing low-dimensional views of high-dimensional data...

Essentially, this method lets you look for patterns in your data, such as groups of records that are missing the same fields. From there you can judge what level of completeness you are dealing with in your data set.
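To make the idea a bit more concrete, here is a minimal sketch of a self-organizing map written directly in NumPy. It is not taken from the linked papers. The function names (`train_som`, `best_matching_unit`), the grid size, the decay schedules, and the toy "missingness mask" data (1 = field missing, 0 = field present) are all assumptions chosen for illustration.

```python
import numpy as np

def best_matching_unit(weights, x):
    """Return the (row, col) of the unit whose weight vector is closest to x."""
    dist2 = np.sum((weights - x) ** 2, axis=-1)
    return np.unravel_index(np.argmin(dist2), dist2.shape)

def train_som(data, rows=4, cols=4, epochs=20, lr0=0.5, sigma0=1.5, seed=0):
    """Train a small rectangular SOM with competitive learning.

    Returns a weight array of shape (rows, cols, n_features).
    """
    rng = np.random.default_rng(seed)
    n_samples, n_features = data.shape
    weights = rng.random((rows, cols, n_features))
    # Grid coordinates of every unit, used by the neighborhood function.
    grid = np.stack(
        np.meshgrid(np.arange(rows), np.arange(cols), indexing="ij"), axis=-1
    )
    total_steps = epochs * n_samples
    step = 0
    for _ in range(epochs):
        for idx in rng.permutation(n_samples):
            x = data[idx]
            frac = step / total_steps
            lr = lr0 * (1.0 - frac)              # learning rate decays to 0
            sigma = sigma0 * (1.0 - frac) + 0.3  # neighborhood width stays > 0
            bmu = best_matching_unit(weights, x)
            # Gaussian neighborhood on the grid, centered on the winning unit.
            grid_dist2 = np.sum((grid - np.array(bmu)) ** 2, axis=-1)
            h = np.exp(-grid_dist2 / (2.0 * sigma ** 2))
            # Pull the winner and its grid neighbors toward the sample.
            weights += lr * h[..., None] * (x - weights)
            step += 1
    return weights

# Toy missingness masks over 4 hypothetical fields (1 = missing, 0 = present).
pattern_a = np.array([1.0, 1.0, 0.0, 0.0])  # e.g. first two fields both missing
pattern_b = np.array([0.0, 0.0, 0.0, 1.0])  # e.g. only the last field missing
masks = np.vstack([np.tile(pattern_a, (20, 1)), np.tile(pattern_b, (20, 1))])

som = train_som(masks)
bmu_a = best_matching_unit(som, pattern_a)
bmu_b = best_matching_unit(som, pattern_b)
print("pattern A maps to unit", bmu_a, "and pattern B maps to unit", bmu_b)
```

Records that map to the same or neighboring units share a similar missingness pattern, so counting how many records land on each unit gives a rough picture of which kinds of gaps dominate. For real work a maintained SOM library is probably a better choice than this hand-rolled version.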

Specifically for your Twitter example, I would imagine there are many fields in the JSON data with missing values. Maybe some users choose not to fill in their gender or age, etc. Visualizing your data, as well as calculating summary statistics, will help you paint an overall picture of it. And when you have high-dimensional data, tools that visualize it in a lower-dimensional space can always be handy. Hope this helps you!
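As a starting point for those summary statistics, here is a small sketch that computes the fraction of records in which each field is missing. The field names (`text`, `user.location`, `coordinates`), the sample records, and the helper names `get_path` and `missing_rates` are hypothetical, and "missing" is assumed to mean an absent key, `None`, or an empty string, list, or dict. Note that this only describes the records you did collect; it says nothing about tweets the scraper never saw.

```python
def get_path(record, path):
    """Follow a dotted path such as 'user.location'; return None if any key is absent."""
    current = record
    for key in path.split("."):
        if not isinstance(current, dict) or key not in current:
            return None
        current = current[key]
    return current

def missing_rates(records, fields):
    """Return {field: fraction of records where the field is missing or empty}."""
    n = len(records)
    rates = {}
    for field in fields:
        missing = sum(
            1 for record in records if get_path(record, field) in (None, "", [], {})
        )
        rates[field] = missing / n if n else float("nan")
    return rates

# Hypothetical records standing in for parsed tweet JSON.
sample = [
    {"text": "hello", "user": {"location": "Berlin"}, "coordinates": None},
    {"text": "world", "user": {"location": ""}},
    {"text": "foo", "user": {}, "coordinates": {"type": "Point"}},
    {"text": "bar", "user": {"location": "Lima"}, "coordinates": None},
]

rates = missing_rates(sample, ["text", "user.location", "coordinates"])
for field, rate in rates.items():
    print(f"{field}: {rate:.0%} missing")
```

Sorting the fields by missing rate is usually enough to see which ones are too sparse to rely on before moving to anything more elaborate.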

kmshannon