I'm trying to measure the correlation between the number of Wikipedia language editions an entity (an article) has been translated into and the number of links that point to that page (both can serve as indicators of a page's popularity).

For instance, I have:

Work, links, wikipediaTranslatedPages
The name of the rose, 500, 53

I used a scatterplot but it's weird. Is it wrong?

  • It's kind of the same idea as removing outliers: you could plot only the points which are in the bottom left corner by specifying max X=100 for instance. You could also use transparency to show where there are many points vs. isolated points. And of course you can calculate Pearson correlation. – Erwan Mar 29 '22 at 15:29
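As a rough sketch of the Pearson suggestion in the comment above: only the first (links, wikipediaTranslatedPages) pair comes from the question, the other rows are invented for illustration, and the helper name `pearson` is hypothetical. The log1p variant is an extra assumption on my part, since count data like this tends to be heavily skewed.

```python
import numpy as np

# Hypothetical rows (links, wikipediaTranslatedPages); only the first pair
# (500, 53) comes from the question, the rest are made up for illustration.
links = np.array([500, 120, 40, 15, 1800, 60, 8, 300])
langs = np.array([53, 20, 8, 3, 53, 12, 2, 30])


def pearson(x, y):
    """Pearson correlation coefficient between two 1-D sequences."""
    return float(np.corrcoef(x, y)[0, 1])


# Correlation on the raw counts.
r_raw = pearson(links, langs)

# Correlation on log1p-transformed counts, which keeps a few very large
# pages from dominating the result (an assumption, not part of the question).
r_log = pearson(np.log1p(links), np.log1p(langs))

print(f"Pearson r (raw counts):   {r_raw:.3f}")
print(f"Pearson r (log1p counts): {r_log:.3f}")
```

If the two numbers differ a lot, that is a hint that a handful of extreme pages are driving the raw correlation.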

1 Answer

I can't say whether your scatterplot is correct, because I don't know your dataset. The point with total = 1,800 and numWikipediaLanguages = 53 looks like an outlier, so you could try removing it and replotting the graph.
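A minimal sketch of dropping that point before replotting, assuming the data sits in two NumPy arrays. The values are invented apart from the suspected outlier, and both the helper `keep_below` and the cutoff of 1000 are hypothetical choices you would adapt to your own data.

```python
import numpy as np

# Hypothetical data; the (1800, 53) point mimics the suspected outlier.
total = np.array([500, 120, 40, 15, 1800, 60, 8, 300])
num_langs = np.array([53, 20, 8, 3, 53, 12, 2, 30])


def keep_below(x, y, max_x):
    """Return only the (x, y) pairs whose x value is at most max_x."""
    mask = x <= max_x
    return x[mask], y[mask]


# max_x=1000 is an arbitrary illustrative cutoff, not a recommended value.
x_kept, y_kept = keep_below(total, num_langs, max_x=1000)

# Compare the correlation with and without the extreme point.
r_all = float(np.corrcoef(total, num_langs)[0, 1])
r_kept = float(np.corrcoef(x_kept, y_kept)[0, 1])

print(f"points kept: {len(x_kept)} of {len(total)}")
print(f"Pearson r with all points:   {r_all:.3f}")
print(f"Pearson r without outlier:   {r_kept:.3f}")
```

The filtered `x_kept` and `y_kept` can then be passed to whatever plotting call produced the original scatterplot.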

Another thing you could try is to add a feature called "subject" and split your data by it (e.g. subject -> "history", "math", "science" and so on). Following your example:

Work, links, wikipediaTranslatedPages, Subject
The name of the rose, 500, 53, Literature

This way you can see whether a particular class of items (subject) stands out from the others. However, I don't know your data or your problem, or whether you are able to add such a feature.
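To make the idea concrete, here is a small sketch that groups rows by subject and summarises each group. Everything except the first row is invented, the function name `summarize_by_subject` is hypothetical, and the minimum of three items before computing a correlation is just an assumption to avoid meaningless values.

```python
from collections import defaultdict

import numpy as np

# (Work, links, wikipediaTranslatedPages, Subject); only the first row is
# from the example above, the others are hypothetical.
rows = [
    ("The name of the rose", 500, 53, "Literature"),
    ("Work B", 120, 20, "Literature"),
    ("Work C", 40, 8, "Literature"),
    ("Work D", 300, 30, "History"),
    ("Work E", 60, 12, "History"),
    ("Work F", 15, 3, "History"),
    ("Work G", 1800, 53, "Science"),
    ("Work H", 8, 2, "Science"),
]


def summarize_by_subject(rows, min_items=3):
    """Per-subject item count, medians, and Pearson r (None if too few items)."""
    groups = defaultdict(list)
    for _work, links, langs, subject in rows:
        groups[subject].append((links, langs))

    summary = {}
    for subject, pairs in groups.items():
        x, y = np.array(pairs).T  # split the pairs into two 1-D arrays
        r = float(np.corrcoef(x, y)[0, 1]) if len(pairs) >= min_items else None
        summary[subject] = {
            "n": len(pairs),
            "median_links": float(np.median(x)),
            "median_langs": float(np.median(y)),
            "pearson_r": r,
        }
    return summary


summary = summarize_by_subject(rows)
for subject, stats in summary.items():
    print(subject, stats)
```

A subject whose medians or correlation differ sharply from the rest is the kind of class that "stands out".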

Inuraghe
  • The data is basically that. I want to see whether, on Wikipedia in general, there is a correlation between inlinks and the number of languages a page is available in – Idkwhatnomeis Mar 29 '22 at 13:11
  • The graph is OK, but are you sure about the reliability of your data? If not, try deleting the outlier – Inuraghe Mar 29 '22 at 13:13
  • Yes. There are basically items with not so many wiki languages but with hundreds of inlinks – Idkwhatnomeis Mar 29 '22 at 13:15