NLP: What are some popular packages for phrase tokenization?

Question

I'm trying to tokenize some sentences into phrases. For instance, given

I think you're cute and I want to know more about you

The tokens can be something like

I think you're cute

and

I want to know more about you

Similarly, given input

Today was great, but the weather could have been better.

Tokens:

Today was great

and

the weather could have been better

Can NLTK or similar packages achieve this?

Any advice appreciated.

score 0 · Accepted Answer · answered Jan 20 '19 at 15:26


spaCy can do this. Its parser, a syntactic dependency parser, is based on statistical models trained on a large corpus of text.

The parser can break a sentence into lower-level components such as words and phrases.

More details and examples:


Shamit Verma

Thanks. I've had a look at semantic parsing, but it's not clear how to identify phrases. For instance, in my first sentence, the parser tags 'and' as `CCONJ` and links 'think' and 'want' with a `conj` relation, but it's not clear where the phrase boundaries actually are. Is there a way to interpret the components to extract the boundaries? – John M. Jan 21 '19 at 04:17
This is an example of parsing sub-trees: https://github.com/explosion/spacy/blob/master/examples/information_extraction/parse_subtrees.py I suggest you try it on a few sentences and see whether it works for your corpus. – Shamit Verma Jan 21 '19 at 14:27
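
To make the question about boundaries concrete, here is a minimal sketch of one way to read clause boundaries off a dependency parse: treat the root and every verb or auxiliary carrying a `conj` relation as a clause head, assign each other token to the nearest clause head above it, and drop any `cc` connector or punctuation hanging directly off a clause head. `Tok`, `make_doc`, and `split_clauses` are hypothetical helpers, and the parse in `ROWS` is written by hand to resemble parser output rather than taken from an actual spaCy run. `Tok` only mimics the spaCy `Token` attributes the function reads (`i`, `text_with_ws`, `pos_`, `dep_`, `head`), so a real `Doc` should work in its place, although a trained model may label some sentences differently.

```python
CLAUSE_POS = {"VERB", "AUX"}


class Tok:
    """Hypothetical stand-in with the spaCy Token attributes used below."""

    def __init__(self, i, text, ws, pos, dep):
        self.i = i
        self.text_with_ws = text + ws
        self.pos_ = pos
        self.dep_ = dep
        self.head = self  # replaced by make_doc; a root points at itself


def make_doc(rows):
    """Build tokens from (text, trailing_space, pos, dep, head_index) rows."""
    toks = [Tok(i, *row[:4]) for i, row in enumerate(rows)]
    for tok, row in zip(toks, rows):
        tok.head = toks[row[4]]
    return toks


def split_clauses(doc):
    """Group tokens under the root or the nearest verbal `conj` above them."""
    heads = {t.i for t in doc
             if t.dep_ == "ROOT" or (t.dep_ == "conj" and t.pos_ in CLAUSE_POS)}
    groups = {}
    for t in doc:
        # Skip connectors and punctuation attached directly to a clause head.
        if t.dep_ in ("cc", "punct") and t.head.i in heads:
            continue
        top = t
        # Climb towards the root until a clause head is reached.
        while top.i not in heads and top.head.i != top.i:
            top = top.head
        groups.setdefault(top.i, []).append(t.text_with_ws)
    return ["".join(parts).strip() for _, parts in sorted(groups.items())]


# Hand-written parse (an assumption, not real parser output).
ROWS = [
    ("I", " ", "PRON", "nsubj", 1),
    ("think", " ", "VERB", "ROOT", 1),
    ("you", "", "PRON", "nsubj", 3),
    ("'re", " ", "AUX", "ccomp", 1),
    ("cute", " ", "ADJ", "acomp", 3),
    ("and", " ", "CCONJ", "cc", 1),
    ("I", " ", "PRON", "nsubj", 7),
    ("want", " ", "VERB", "conj", 1),
    ("to", " ", "PART", "aux", 9),
    ("know", " ", "VERB", "xcomp", 7),
    ("more", " ", "ADJ", "dobj", 9),
    ("about", " ", "ADP", "prep", 10),
    ("you", "", "PRON", "pobj", 11),
]

print(split_clauses(make_doc(ROWS)))
# prints ["I think you're cute", 'I want to know more about you']

# With spaCy and a model installed, a real Doc can be passed instead:
#   import spacy
#   nlp = spacy.load("en_core_web_sm")
#   split_clauses(nlp("Today was great, but the weather could have been better."))
```

This is only a heuristic: coordinated verbs nested inside a subordinate clause are split off as well, and coordinated nouns ("apples and oranges") are deliberately left alone, so check the output on a sample of your own corpus.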
