I'm trying to tokenize some sentences into phrases. For instance, given

I think you're cute and I want to know more about you

The tokens can be something like

I think you're cute

and

I want to know more about you

Similarly, given input

Today was great, but the weather could have been better.

Tokens:

Today was great

and

the weather could have been better

Can NLTK or similar packages achieve this?

Any advice appreciated.

John M.

1 Answer

spaCy can do this. Its dependency parser is a statistical model trained on a large corpus of text.

The parser can break a sentence into lower-level components such as words and phrases.

More details and examples:

https://spacy.io/usage/linguistic-features

Example with the first sentence from the question: https://explosion.ai/demos/displacy?text=I%20think%20you%27re%20cute%20and%20I%20want%20to%20know%20more%20about%20you&model=en_core_web_sm&cpu=0&cph=0

Shamit Verma
  • Thanks. I've had a look at semantic parsing, but it's not clear how to identify phrases. For instance, in my first sentence, a semantic parser can identify 'and' as `CCONJ` and 'think' and 'want' are connected by `conj`. It's not clear where the phrase boundaries actually are though. Is there a way to interpret the components to extract the boundaries? – John M. Jan 21 '19 at 04:17
  • This is an example of parsing subtrees: https://github.com/explosion/spacy/blob/master/examples/information_extraction/parse_subtrees.py I suggest that you try this example on a few sentences and see if it works for your corpus. – Shamit Verma Jan 21 '19 at 14:27
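
On the question of where the phrase boundaries are: one possible reading of the parse is to treat the root verb and every verb attached to it by `conj` as the head of its own clause, then collect each head's subtree while cutting at the other clause heads and at the conjunction (`cc`) or punctuation hanging off them. The sketch below only illustrates that idea. To stay runnable without a model download it uses a hand-written parse of the first sentence, which is an assumption modelled on the labels mentioned in the comment above (`CCONJ`, `conj`); a real parser may label things differently. `Tok`, `subtree` and `split_clauses` are hypothetical helper names, not part of any library. With spaCy, the same fields would presumably come from `token.text`, `token.head`, `token.dep_`, `token.pos_` and `token.whitespace_`.

```python
from dataclasses import dataclass


@dataclass
class Tok:
    text: str
    head: int           # index of the syntactic head; the root points to itself
    dep: str            # dependency label
    pos: str            # coarse part-of-speech tag
    space: bool = True  # whether a space follows the token in the original text


def subtree(tokens, i, stop=frozenset()):
    """Sorted indices of token i and its descendants, skipping anything in `stop`."""
    out = [i]
    for j, t in enumerate(tokens):
        if t.head == i and j != i and j not in stop:
            out.extend(subtree(tokens, j, stop))
    return sorted(out)


def split_clauses(tokens):
    # Clause heads: the root plus every verb coordinated via `conj`.
    heads = [i for i, t in enumerate(tokens)
             if t.dep == "ROOT" or (t.dep == "conj" and t.pos in {"VERB", "AUX"})]
    # Cut points: the clause heads themselves, plus the conjunctions and
    # punctuation attached directly to a clause head.
    cut = set(heads) | {j for j, t in enumerate(tokens)
                        if t.dep in {"cc", "punct"} and t.head in heads}
    clauses = []
    for h in heads:
        idx = subtree(tokens, h, cut)
        text = "".join(tokens[k].text + (" " if tokens[k].space else "") for k in idx)
        clauses.append(text.strip())
    return clauses


# Hand-annotated parse (an assumption, not real parser output) of:
# "I think you're cute and I want to know more about you"
tokens = [
    Tok("I", 1, "nsubj", "PRON"),
    Tok("think", 1, "ROOT", "VERB"),
    Tok("you", 3, "nsubj", "PRON", space=False),
    Tok("'re", 1, "ccomp", "AUX"),
    Tok("cute", 3, "acomp", "ADJ"),
    Tok("and", 1, "cc", "CCONJ"),
    Tok("I", 7, "nsubj", "PRON"),
    Tok("want", 1, "conj", "VERB"),
    Tok("to", 9, "aux", "PART"),
    Tok("know", 7, "xcomp", "VERB"),
    Tok("more", 9, "dobj", "ADJ"),
    Tok("about", 10, "prep", "ADP"),
    Tok("you", 11, "pobj", "PRON", space=False),
]

for clause in split_clauses(tokens):
    print(clause)
```

On this hand-written parse the sketch yields the two clauses from the question, with "and" dropped as the separator. Noun-level coordination such as "cats and dogs" stays inside its clause because only verbal `conj` tokens count as clause heads. Whether this heuristic holds up on real parser output is something to check on your own corpus.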