6

I am currently working on processing Georgian texts. Does anybody know of any stemmers/lemmatizers (or other NLP tools) for Georgian that I could use with Python?

Thanks in advance!

2 Answers

7

I don't know of any Georgian stemmer or lemmatizer. However, I think you have another option: use an unsupervised approach to segment words into morphemes, then use your linguistic knowledge of Georgian to devise heuristic rules that identify the stem among them.

This kind of approach relies on a model trained to identify morphemes without any labels (i.e. in an unsupervised manner). The most relevant Python package for this is Morfessor. Its theoretical foundations are described in these publications: Unsupervised discovery of morphemes; Semi-supervised learning of concatenative morphology.

Also, there is a Python package called Polyglot that offers pre-trained Morfessor models, including one for Georgian. Therefore, my recommendation is for you to use Polyglot's Georgian model to segment words into morphemes and then write some rules by hand to pick the stem among them.

You should be able to evaluate the feasibility of this idea by adapting this example from Polyglot's documentation from English to Georgian (by changing the language code en to Georgian's ISO 639-1 code, ka, and replacing the list of words):

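from polyglot.text import Word

# Segment each word into morphemes with Polyglot's pre-trained Morfessor model.
words = ["preprocessing", "processor", "invaluable", "thankful", "crossed"]
for w in words:
    word = Word(w, language="en")
    # Print the word left-aligned in a 20-character column, then its morphemes.
    print("{:<20}{}".format(word, word.morphemes))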
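Once you have morpheme lists, the hand-written rules for picking the stem could start out as simple as the sketch below. It assumes that anything in a small affix list is not the stem and that the longest remaining morpheme is. The function name pick_stem and the affix sets are my own illustration, not part of Polyglot or Morfessor; the affix list is a tiny, incomplete sample of Georgian case and plural endings, and the segmentation in the usage note is written by hand rather than taken from a model.

```python
from typing import AbstractSet, Sequence

# Hypothetical, deliberately incomplete affix inventories (assumption):
# a few Georgian case/plural endings, for illustration only.
SUFFIXES = frozenset({"ებ", "ი", "მა", "ს", "ის", "ით", "ად"})
PREFIXES: AbstractSet[str] = frozenset()


def pick_stem(
    morphemes: Sequence[str],
    prefixes: AbstractSet[str] = PREFIXES,
    suffixes: AbstractSet[str] = SUFFIXES,
) -> str:
    """Pick a stem candidate from a list of morphemes.

    Heuristic: drop morphemes that appear in the affix lists and return
    the longest one left. If every morpheme is a known affix, fall back
    to the longest morpheme overall.
    """
    if not morphemes:
        raise ValueError("morphemes must not be empty")
    candidates = [m for m in morphemes if m not in prefixes and m not in suffixes]
    pool = candidates if candidates else list(morphemes)
    # max() returns the first of several equally long morphemes.
    return max(pool, key=len)
```

For a hand-segmented plural such as ["სახლ", "ებ", "ი"] ("houses"), pick_stem returns "სახლ". Real Morfessor output may split words differently, so expect to tune the affix lists against actual segmentations.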
noe
3

If absolutely necessary, you could build your own stemmer. The programming is fairly simple, but it takes some study of the Georgian language along the way. There are, however, plenty of tutorials on the web for building a stemmer.
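To give an idea of how little code the core needs, here is a minimal suffix-stripping sketch. It is not a real Georgian stemmer: the suffix tuple is a small sample of noun case and plural endings that I picked for illustration, the name strip_suffixes and the min_stem_len parameter are hypothetical, and verbs (which take prefixes and circumfixes) are not handled at all.

```python
from typing import Sequence

# Assumption: a toy sample of Georgian noun endings (plural and a few
# case markers). A usable stemmer would need a far more careful list.
NOUN_SUFFIXES = ("ები", "ებ", "ის", "ით", "ად", "მა", "ი", "ს")


def strip_suffixes(
    word: str,
    suffixes: Sequence[str] = NOUN_SUFFIXES,
    min_stem_len: int = 2,
) -> str:
    """Repeatedly strip the longest matching suffix from the word.

    A suffix is only removed if at least `min_stem_len` characters
    remain, which guards against stripping short words down to nothing.
    """
    ordered = sorted(suffixes, key=len, reverse=True)  # longest first
    stem = word
    changed = True
    while changed:
        changed = False
        for suffix in ordered:
            if stem.endswith(suffix) and len(stem) - len(suffix) >= min_stem_len:
                stem = stem[: -len(suffix)]
                changed = True
                break  # restart with the shortened stem
    return stem
```

With this list, "სახლები" ("houses") and "სახლის" ("of the house") both reduce to "სახლ". Blind suffix stripping will also over-stem words that merely happen to end in one of these letters, which is where the study of the language comes in.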

CB Madsen