1

I've tried stemming and lemmatization on this but nothing has quite worked so far.

How can I resolve a country name and its nationality to a single entity?

For example:

Canada and Canadian should just be one entity: Canada. Uganda and Ugandan should just be Uganda.

Stemming does seem like one approach here, but I have found that it misses a fair number of countries.

2 Answers

1

I believe lemmatisation is the right approach. Another way to do it is to use WordNet. For a word whose part of speech is noun, you can query whether it has a member holonym; this relation gives the country it belongs to, for instance Canadian -> Canada. You have to be careful, though, because the member holonym of Canada is the British Commonwealth. I guess you could apply a Levenshtein distance threshold to ignore such results.

You can have a look at the WordNet online web app.
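The lookup described above could be sketched roughly as follows. This is only a sketch: it assumes NLTK with the WordNet corpus downloaded (`nltk.download("wordnet")`), and the helper names `member_holonym_names` and `resolve_country` and the threshold of 3 are made up for illustration, not tuned values.

```python
from __future__ import annotations


def levenshtein(a: str, b: str) -> int:
    """Edit distance between two strings (row-by-row dynamic programming)."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def member_holonym_names(word: str) -> list[str]:
    """Lemma names of every member holonym of the noun senses of `word`."""
    # Imported lazily so the pure helpers work without NLTK installed.
    from nltk.corpus import wordnet as wn

    names = []
    for synset in wn.synsets(word, pos=wn.NOUN):
        for holonym in synset.member_holonyms():
            names.extend(lemma.name().replace("_", " ") for lemma in holonym.lemmas())
    return names


def resolve_country(word: str, candidates: list[str], max_distance: int = 3) -> str:
    """Return the closest candidate within `max_distance` edits, else `word` itself.

    The threshold (an assumed value) is what rejects distant holonyms such as
    Canada -> British Commonwealth.
    """
    scored = [(levenshtein(word.lower(), c.lower()), c) for c in candidates]
    close = [pair for pair in scored if pair[0] <= max_distance]
    return min(close)[1] if close else word
```

With WordNet available, a call would look like `resolve_country("Canadian", member_holonym_names("Canadian"))`. Short nationality forms that differ a lot from the country name would still need a larger threshold or a manual mapping.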

20-roso
0

I see the tag is already present, but you should look into Named Entity Recognition (NER). It is used to find and tag proper nouns in text and can handle people and locations (countries can be thought of as locations). This page says the popular spaCy library specifically handles countries:

https://spacy.io/api/annotation#named-entities

It would easily catch Uganda, though you would have to experiment with the adjectives ("Ugandan").
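A minimal sketch of that experiment, assuming spaCy and the `en_core_web_sm` model are installed. In spaCy's English models, `GPE` is the label for countries, cities and states, and `NORP` is the label for nationalities and religious or political groups; whether a given adjective is actually tagged `NORP` depends on the model and the sentence, so treat the helper names and the label choice here as illustrative.

```python
from __future__ import annotations

from collections import defaultdict


def extract_entities(text: str, model: str = "en_core_web_sm") -> list[tuple[str, str]]:
    """Run spaCy NER and return (entity text, entity label) pairs."""
    # Imported lazily so the grouping helper works without spaCy installed.
    import spacy

    nlp = spacy.load(model)
    return [(ent.text, ent.label_) for ent in nlp(text).ents]


def group_by_label(entities: list[tuple[str, str]],
                   labels: tuple[str, ...] = ("GPE", "NORP")) -> dict[str, list[str]]:
    """Keep only the labels of interest and group entity texts under them."""
    grouped = defaultdict(list)
    for text, label in entities:
        if label in labels:
            grouped[label].append(text)
    return dict(grouped)
```

Calling `group_by_label(extract_entities("The Ugandan team flew from Uganda to Canada."))` shows which mentions the model caught as countries and which as nationalities; the two groups still have to be linked to each other afterwards.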

Edit: based on your comment that resolving the entities is the real problem, I would try the following:

  1. Use something like word2vec and cosine similarity to analyze all the "caught" words. If you load a pre-trained word2vec model and pass it a word, you get back a vector that represents that word. If you then compare the vectors pairwise, I suspect Uganda and Ugandan would have very high similarity. See this answer for someone using word2vec on country names: https://stackoverflow.com/questions/21979970/how-to-use-word2vec-to-calculate-the-similarity-distance-by-giving-2-words
  2. Use a string comparison function to identify caught "locations" that are highly similar to other caught locations. This will work well for some countries (Uganda/Ugandan) but not others (France/French, Switzerland/Swiss).
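Both ideas could be sketched like this. The string comparison uses `difflib` from the standard library; the word2vec part assumes gensim and a pre-trained model in word2vec binary format at a path you supply. The function names and the 0.8 threshold are assumptions for illustration, not tuned values.

```python
from __future__ import annotations

from difflib import SequenceMatcher


def string_similarity(first: str, second: str) -> float:
    """Case-insensitive similarity ratio in [0, 1] based on matching blocks."""
    return SequenceMatcher(None, first.lower(), second.lower()).ratio()


def link_similar(names: list[str], threshold: float = 0.8) -> list[tuple[str, str]]:
    """Return every pair of caught names whose string similarity reaches the threshold."""
    pairs = []
    for i, first in enumerate(names):
        for second in names[i + 1:]:
            if string_similarity(first, second) >= threshold:
                pairs.append((first, second))
    return pairs


def load_vectors(model_path: str):
    """Load pre-trained vectors (hypothetical path, word2vec binary format)."""
    # Imported lazily so the string-based helpers work without gensim installed.
    from gensim.models import KeyedVectors

    return KeyedVectors.load_word2vec_format(model_path, binary=True)


def embedding_similarity(vectors, first: str, second: str) -> float:
    """Cosine similarity between two words; raises KeyError for unknown words."""
    return float(vectors.similarity(first, second))
```

A reasonable design is to run `link_similar` first, since it is cheap, and fall back to `embedding_similarity` only for pairs the string comparison misses, such as Switzerland and Swiss.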
CalZ