I have found a number of parsers for automatically extracting institution names from text (e.g. this one). My task is, in a sense, the inverse: I want to automatically generate realistic institution names, with the option to differentiate them by type (privately held, public, educational, etc.) and by branch.

Are there any algorithms, applications, or papers on this? Alternatively, is there a freely accessible database with such data?

Alex Konnen
  • Do you mean generating fictional entities? Anyway, it's not really the inverse of NER, because NER relies on the context words in a sentence to detect entity names. In theory, NER should work as well with fictional entities as with real ones (except that it knows some real ones from training). – Erwan Jun 21 '20 at 15:35
  • Important question: do you have training data, i.e. institution names differentiated by type? – Guillermo Mosse Jun 21 '20 at 16:38
  • Thanks Erwan, thanks Guillermo. I do understand that what I want is not exactly the inverse of NER, hence the question. And yes, I do mean "fictional" names. I therefore do not have any training data, just a vague idea of the classification (Private Company vs. Public Institution vs. Educational etc.) – Alex Konnen Jun 22 '20 at 06:10

2 Answers

Conditional on you having data, yes, you can. Check out Generative Adversarial Networks and/or Reinforcement Learning for text generation. This paper is a good starting point: https://openreview.net/forum?id=rJedV3R5tm.

Also, here's a tool that might help you. What you can do is generate these institution names without differentiating by type, and then build another model to classify them.

Guillermo Mosse
  • Guillermo, many thanks for the link. Highly interesting; the material is very new to me, so I will probably need more time to digest it (I wrote a random text generator myself once, but with almost no science behind it). If I am not mistaken, however, the second link does not lead to a tool but to the paper itself. Am I wrong? – Alex Konnen Jun 22 '20 at 06:25
  • Thanks Alex! You are right, my mistake. The link points to the tool now. – Guillermo Mosse Jun 22 '20 at 07:52

If you want to build your own dataset, you could look at packages such as Faker and Mimesis.

They both provide features to generate company/institution names based on certain locales as well.

If your goal is to generate training data for a NER task, this should be a good start. If it's to generate company names, this will already cover quite a bit.
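To differentiate names by type and branch, as the question asks, one option is a small template-based generator. The sketch below is only an illustration and is independent of any package: the word lists, the templates, and the function name `generate_name` are all hypothetical placeholders that you would replace with your own vocabulary.

```python
import random

# Hypothetical word lists; replace them with your own vocabulary.
SURNAMES = ["Hartmann", "Okafor", "Lindqvist", "Moreau", "Tanaka"]
PLACES = ["Northfield", "Eastbrook", "Lakeshore", "Ashford"]
SUFFIXES = ["Ltd", "Inc", "GmbH", "LLC"]

# Words that signal the branch (sector) of the institution.
BRANCH_WORDS = {
    "finance": ["Finance", "Insurance", "Banking"],
    "health": ["Health", "Medicine", "Pharmacy"],
    "technology": ["Technology", "Software", "Robotics"],
}

# One list of name templates per institution type.
TEMPLATES = {
    "private": ["{surname} {branch} {suffix}", "{surname} & {surname2} {branch}"],
    "public": ["{place} Office of {branch}", "{place} {branch} Authority"],
    "educational": ["{place} Institute of {branch}", "University of {place}"],
}


def generate_name(kind, branch, rng):
    """Return one fictional institution name of the given type and branch."""
    template = rng.choice(TEMPLATES[kind])
    # Draw two distinct surnames so "X & Y" never repeats the same name.
    surname, surname2 = rng.sample(SURNAMES, 2)
    # str.format ignores keyword arguments that a template does not use.
    return template.format(
        surname=surname,
        surname2=surname2,
        place=rng.choice(PLACES),
        branch=rng.choice(BRANCH_WORDS[branch]),
        suffix=rng.choice(SUFFIXES),
    )


# Seeded generator so that runs are reproducible.
rng = random.Random(0)
for kind in TEMPLATES:
    print(kind, "->", generate_name(kind, "health", rng))
```

Because every name is produced from a known type and branch, the labels come for free, which is convenient if the names are later used as training data. The same idea ports directly to C#, since it needs nothing beyond random choice and string formatting.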

Valentin Calomme
  • Thanks Valentin, I have already had a look at Mimesis; it seems very attractive. Faker is a bit more complicated for me, as I am not a native Pythonese speaker (rather C#). But I am going to take some time to look at it, too. – Alex Konnen Jun 22 '20 at 10:02