1) I have just started working on NLP. The basic idea is to extract meaningful information from text, and for this I am using spaCy.

As far as I can tell, spaCy recognizes the following entity labels:

  • ORG
  • PERSON
  • DATE
  • MONEY
  • CARDINAL

etc. But I want to add custom entities like:

Nokia-3310 should be labeled as Mobile, and XBOX should be labeled as Games.

2) Are there any already-trained models in spaCy that I can work with?

AddyProg

1 Answer

spaCy provides pretrained models for several languages. They are listed in the official documentation: https://spacy.io/models

The available models are:

  1. English
  2. German
  3. French
  4. Spanish
  5. Portuguese
  6. Italian
  7. Dutch
  8. Greek
  9. Multi-language
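As a rough sketch of how one of these models is typically used, the helper below runs a loaded pipeline over a text and collects the entities it finds. The function name `extract_entities` is hypothetical (it is not part of spaCy), and the usage comment assumes that spaCy and the `en_core_web_sm` model are installed.

```python
def extract_entities(nlp, text):
    """Run a loaded spaCy pipeline on `text` and return (entity text, label) pairs.

    `nlp` is expected to be a loaded pipeline, e.g. spacy.load("en_core_web_sm").
    """
    doc = nlp(text)  # tokenizes the text and runs every pipeline component, including NER
    return [(ent.text, ent.label_) for ent in doc.ents]


# Usage (assumes spaCy and the model are installed):
#   import spacy
#   nlp = spacy.load("en_core_web_sm")
#   print(extract_entities(nlp, "Apple is looking at buying a U.K. startup"))
```

The labels returned depend on the model; a pretrained English model will not know custom labels such as Mobile or Games until it has been trained on them.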

If you want support for extra labels in NER, you can train a model on your own dataset. spaCy supports this as well; the following example is taken from the official documentation at https://spacy.io/usage/training#ner:

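# Written against the spaCy v2 training API
import random

import spacy
from spacy.util import minibatch, compounding

LABEL = "ANIMAL"
n_iter = 30  # number of training iterations (assumed value; tune for your data)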

TRAIN_DATA = [
    (
        "Horses are too tall and they pretend to care about your feelings",
        {"entities": [(0, 6, LABEL)]},
    ),
    ("Do they bite?", {"entities": []}),
    (
        "horses are too tall and they pretend to care about your feelings",
        {"entities": [(0, 6, LABEL)]},
    ),
    ("horses pretend to care about your feelings", {"entities": [(0, 6, LABEL)]}),
    (
        "they pretend to care about your feelings, those horses",
        {"entities": [(48, 54, LABEL)]},
    ),
    ("horses?", {"entities": [(0, 6, LABEL)]}),
]


nlp = spacy.blank("en")  # create blank Language class
ner = nlp.create_pipe("ner")
nlp.add_pipe(ner)

ner.add_label(LABEL)  # add new entity label to entity recognizer

optimizer = nlp.begin_training()

move_names = list(ner.move_names)
# get names of other pipes to disable them during training
other_pipes = [pipe for pipe in nlp.pipe_names if pipe != "ner"]

with nlp.disable_pipes(*other_pipes):  # only train NER
    sizes = compounding(1.0, 4.0, 1.001)
    # batch up the examples using spaCy's minibatch
    for itn in range(n_iter):
        random.shuffle(TRAIN_DATA)
        batches = minibatch(TRAIN_DATA, size=sizes)
        losses = {}
        for batch in batches:
            texts, annotations = zip(*batch)
            nlp.update(texts, annotations, sgd=optimizer, drop=0.35, losses=losses)
        print("Losses", losses)
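The entity offsets in TRAIN_DATA are character positions (start, end) within each text. For the labels in the question, a minimal sketch of how such annotations could be generated from a small term list is shown here. The helper `annotate`, the `TERMS` dictionary, and the label names MOBILE and GAMES are illustrative assumptions, and plain substring matching is a simplification that only works when the terms line up with token boundaries.

```python
def annotate(text, terms):
    """Build one training example in the (text, {"entities": [...]}) format.

    `terms` maps a surface string to its label. Matching is a case-sensitive
    substring search, so this is only a starting point for real data.
    """
    entities = []
    for term, label in terms.items():
        start = text.find(term)
        while start != -1:
            end = start + len(term)
            entities.append((start, end, label))
            start = text.find(term, end)  # continue searching after this match
    entities.sort()  # order spans by start offset
    return text, {"entities": entities}


# Hypothetical term list based on the labels asked about in the question
TERMS = {"Nokia-3310": "MOBILE", "XBOX": "GAMES"}

example = annotate("I traded my Nokia-3310 for an XBOX", TERMS)
print(example)
```

Examples built this way can be collected into a list and used in place of the ANIMAL examples, with each new label registered through `ner.add_label`.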

If you want to use an existing model and add a new custom label to it, read the linked article in the documentation, which describes the process in detail. It is quite similar to the code above.
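A hedged sketch of the part that differs when starting from a pretrained model: instead of creating a blank pipeline and a new "ner" component, you fetch the existing one, add the labels, and continue training from the current weights. The helper name `prepare_for_new_labels` is hypothetical, and the sketch assumes a spaCy version that provides `nlp.resume_training()` (v2.1 or later).

```python
def prepare_for_new_labels(nlp, labels):
    """Add new entity labels to an already-loaded pipeline's NER component.

    Returns the optimizer to pass to the training loop. Assumes the pipeline
    already contains a component named "ner".
    """
    ner = nlp.get_pipe("ner")  # reuse the pretrained entity recognizer
    for label in labels:
        ner.add_label(label)
    # resume_training keeps the existing weights instead of re-initializing them
    return nlp.resume_training()


# Usage (assumes spaCy and the model are installed):
#   import spacy
#   nlp = spacy.load("en_core_web_sm")
#   optimizer = prepare_for_new_labels(nlp, ["MOBILE", "GAMES"])
#   ...then run the same update loop as above with sgd=optimizer
```

When fine-tuning this way, it is usually worth mixing in examples of the entities the model already knows, so that it does not forget them while learning the new labels.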

Tasos
  • Thanks for the reply. A quick question: the code creates a blank 'en' class, but for entity recognition I am using "en_core_web_sm". Does this piece of code train "en_core_web_sm"? – AddyProg Aug 19 '19 at 07:19
  • No. As I mentioned, this creates an empty model that you then train. If you want to take the `en_core_web_sm` model and add your own entities on top of it, that is also quite easy; you just need to add a few extra lines to the code above. It is described in the documentation I linked in the answer. – Tasos Aug 19 '19 at 07:40