Questions tagged [tokenization]

66 questions
9
votes
6 answers

NLP: What are some popular packages for multi-word tokenization?

I intend to tokenize a number of job description texts. I have tried the standard tokenization using whitespace as the delimiter. However, I noticed that there are some multi-word expressions that are split by whitespace, which may well cause…
CyberPlayerOne
  • 392
  • 1
  • 4
  • 14
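A frequently suggested option for this kind of task is NLTK's MWETokenizer, which merges known multi-word expressions after an initial whitespace split. A minimal sketch; the expression list below is made up for illustration:

from nltk.tokenize import MWETokenizer

# Merge known multi-word expressions after a plain whitespace split.
mwe = MWETokenizer([("machine", "learning"), ("data", "scientist")], separator=" ")
print(mwe.tokenize("senior data scientist with machine learning experience".split()))
# ['senior', 'data scientist', 'with', 'machine learning', 'experience']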
8
votes
1 answer

What tokenizer does OpenAI's GPT3 API use?

I'm building an application for the API, but I would like to be able to count the number of tokens my prompt will use, before I submit an API call. Currently I often submit prompts that yield a 'too-many-tokens' error. The closest I got to an answer…
Herman Autore
  • 83
  • 1
  • 3
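For counting tokens locally before calling the API, OpenAI's tiktoken library is the usual route. A minimal sketch, assuming a text-davinci-style GPT-3 model; other models use different encodings:

import tiktoken

# p50k_base is the encoding used by the text-davinci-00x GPT-3 models;
# pick the encoding that matches your target model.
enc = tiktoken.get_encoding("p50k_base")
prompt = "How many tokens will this prompt use?"
print(len(enc.encode(prompt)))  # number of tokens the prompt will consume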
7
votes
1 answer

Understanding the effect of num_words of Tokenizer in Keras

Consider the following code:
from keras.preprocessing.text import Tokenizer
tokenizer = Tokenizer(num_words = 5000)
tokenizer.fit_on_texts(texts)
print('Found %d unique words.' % len(tokenizer.word_index))
When I run this, it prints: Found 88582…
Mehran
  • 267
  • 1
  • 2
  • 12
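In Keras, num_words does not truncate word_index; it is only applied when texts are converted with texts_to_sequences or texts_to_matrix. A toy sketch illustrating that behaviour:

from keras.preprocessing.text import Tokenizer

texts = ["the cat sat", "the dog sat", "the bird flew away"]
tokenizer = Tokenizer(num_words=3)
tokenizer.fit_on_texts(texts)

# word_index still holds every unique word seen during fitting...
print(len(tokenizer.word_index))            # 7
# ...but only the top num_words-1 indices survive conversion.
print(tokenizer.texts_to_sequences(texts))  # e.g. [[1, 2], [1, 2], [1]]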
5
votes
1 answer

Unigram tokenizer: how does it work?

I have been trying to understand how the unigram tokenizer works since it is used in the SentencePiece tokenizer that I am planning on using, but I cannot wrap my head around it. I tried to read the original paper, which contains so few details…
Johncowk
  • 195
  • 1
  • 6
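The core idea, briefly: each vocabulary piece has a probability, a word's segmentation is the split that maximises the product of piece probabilities (found with a Viterbi pass), and training alternates between estimating those probabilities with EM and pruning low-impact pieces. A toy sketch of the segmentation step only, with made-up log-probabilities:

import math

# Made-up per-piece log-probabilities; a real model learns these with EM.
piece_logp = {"h": -5.0, "u": -5.0, "g": -5.0, "s": -3.0,
              "hu": -4.0, "ug": -3.5, "gs": -4.5, "hug": -2.0}

def segment(word):
    # best[i] = (log-prob of the best segmentation of word[:i], split point)
    best = [(0.0, 0)] + [(-math.inf, 0)] * len(word)
    for end in range(1, len(word) + 1):
        for start in range(end):
            piece = word[start:end]
            if piece in piece_logp:
                score = best[start][0] + piece_logp[piece]
                if score > best[end][0]:
                    best[end] = (score, start)
    pieces, i = [], len(word)   # backtrack the best path
    while i > 0:
        start = best[i][1]
        pieces.append(word[start:i])
        i = start
    return pieces[::-1]

print(segment("hugs"))  # ['hug', 's']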
5
votes
2 answers

Converting paragraphs into sentences

I'm looking for ways to extract sentences from paragraphs of text containing different types of punctuation and all. I used SpaCy's Sentencizer to begin with. Sample input Python list abstracts: ["A total of 2337 articles were found, and, according…
Van Peer
  • 285
  • 1
  • 3
  • 12
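For reference, a minimal sketch of the rule-based route the question starts from (spaCy v3 API assumed; the abstract text below is a shortened placeholder):

import spacy

nlp = spacy.blank("en")
nlp.add_pipe("sentencizer")          # rule-based splitter; no model download needed

abstracts = ["A total of 2337 articles were found. After screening, 42 were included."]
for text in abstracts:
    print([sent.text for sent in nlp(text).sents])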
5
votes
1 answer

Accuracy of word and sent tokenize versus custom tokenizers in nltk

The Natural Language Processing with Python book is a really good resource for understanding the basics of NLP. One of the chapters introduces training 'sentence segmentation' using a Naive Bayes Classifier and provides a method to perform sentence…
MrKickass
  • 111
  • 8
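For anyone comparing the stock tokenizers against a custom one, a small sketch of NLTK's pretrained Punkt model next to one trained on your own text (training on a single short string here only to show the API):

import nltk
from nltk.tokenize.punkt import PunktSentenceTokenizer

nltk.download("punkt", quiet=True)   # data for the pretrained tokenizer

text = "Dr. Smith went to Washington. He arrived at 3 p.m. and left soon after."

print(nltk.sent_tokenize(text))       # pretrained, general-purpose Punkt model
custom = PunktSentenceTokenizer(text) # unsupervised training on your own raw text
print(custom.tokenize(text))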
4
votes
2 answers

ChatGPT: How to use long texts in prompt?

I like the website chatpdf.com a lot. You can upload a PDF file and then discuss the textual content of the file with the file "itself". It uses ChatGPT. I would like to program something similar. But I wonder how to use the content of long PDF…
meyer_mit_ai
  • 63
  • 1
  • 1
  • 5
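The usual pattern behind such tools (not necessarily what chatpdf.com does internally) is retrieval: split the PDF text into chunks, embed them, and put only the chunks most relevant to the user's question into the prompt. A sketch of just the chunking step; embedding and similarity search are omitted:

def chunk_text(text, chunk_size=1000, overlap=200):
    """Split text into overlapping character chunks that fit a prompt budget."""
    chunks, start = [], 0
    while start < len(text):
        chunks.append(text[start:start + chunk_size])
        start += chunk_size - overlap
    return chunks

pdf_text = "...extracted PDF text..."   # placeholder
print(len(chunk_text(pdf_text)))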
3
votes
1 answer

NLP: what are the advantages of using a subword tokenizer as opposed to the standard word tokenizer?

I'm looking at this Tensorflow colab tutorial about language translation with Transformers, https://www.tensorflow.org/tutorials/text/transformer, and they tokenize the words with a subword text tokenizer. I have never seen a subword tokenizer…
zipline86
  • 349
  • 4
  • 12
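The usual argument is that a fixed word-level vocabulary maps unseen words to an <UNK> token, while subword pieces can compose any surface form. A quick way to see it with a WordPiece vocabulary (Hugging Face transformers assumed):

from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("bert-base-uncased")
# A rare word is built from known pieces instead of collapsing to <UNK>.
print(tok.tokenize("tokenization"))   # e.g. ['token', '##ization']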
3
votes
1 answer

What is the difference between TextVectorization and Tokenizer?

What is the difference between layers.TextVectorization() and the combination of from tensorflow.keras.preprocessing.text import Tokenizer and from tensorflow.keras.preprocessing.sequence import pad_sequences? And when should each be used?
Pritam Sinha
  • 193
  • 8
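A side-by-side sketch of the two APIs (TF 2.x assumed): TextVectorization is a Keras layer you adapt to data and can place inside the model, while Tokenizer plus pad_sequences is the older utility that runs as a separate preprocessing step:

import tensorflow as tf
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences

texts = ["the cat sat on the mat", "the dog barked"]

# Newer: a layer that can be part of the model itself
vectorizer = tf.keras.layers.TextVectorization(max_tokens=100, output_sequence_length=6)
vectorizer.adapt(tf.constant(texts))
print(vectorizer(tf.constant(texts)).numpy())

# Older: standalone preprocessing before training
tok = Tokenizer(num_words=100)
tok.fit_on_texts(texts)
print(pad_sequences(tok.texts_to_sequences(texts), maxlen=6))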
2
votes
1 answer

How do I get word embeddings for out-of-vocabulary words using a transformer model?

When I tried to get word embeddings of a sentence using Bio_ClinicalBERT, for a sentence of 8 words I am getting 11 token IDs (plus start and end) because "embeddings" is an out-of-vocabulary word/token that is being split into em, bed, ding, s. I would…
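A common workaround (not the only one) is to pool the hidden states of each word's sub-tokens back into one vector per word; fast tokenizers expose word_ids() for that mapping. A sketch below, with the Hugging Face model id for Bio_ClinicalBERT assumed from the question:

import torch
from transformers import AutoTokenizer, AutoModel

name = "emilyalsentzer/Bio_ClinicalBERT"   # assumed HF id for the model in the question
tokenizer = AutoTokenizer.from_pretrained(name)
model = AutoModel.from_pretrained(name)

enc = tokenizer("the patient embeddings were reviewed", return_tensors="pt")
with torch.no_grad():
    hidden = model(**enc).last_hidden_state[0]      # (seq_len, hidden_dim)

# Group sub-token vectors by the original word they came from and average them.
by_word = {}
for pos, word_id in enumerate(enc.word_ids()):
    if word_id is None:                             # skip [CLS] / [SEP]
        continue
    by_word.setdefault(word_id, []).append(hidden[pos])
word_vectors = {w: torch.stack(v).mean(dim=0) for w, v in by_word.items()}
print(len(word_vectors))                            # one vector per input word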
2
votes
1 answer

From where does BERT get the tokens it predicts?

When BERT is used for masked language modeling, it masks a token and then tries to predict it. What are the candidate tokens BERT can choose from? Does it just predict an integer (like a regression problem) and then use that token? Or does it do a…
Nick Koprowicz
  • 213
  • 1
  • 3
  • 10
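In short: the candidates are BERT's own WordPiece vocabulary. The MLM head scores every vocabulary entry at the masked position, and the prediction is a softmax/argmax over those scores, not a regressed integer. A small sketch with Hugging Face transformers:

import torch
from transformers import AutoTokenizer, AutoModelForMaskedLM

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForMaskedLM.from_pretrained("bert-base-uncased")

enc = tokenizer("The capital of France is [MASK].", return_tensors="pt")
with torch.no_grad():
    logits = model(**enc).logits                    # (1, seq_len, vocab_size)

mask_pos = (enc.input_ids == tokenizer.mask_token_id).nonzero()[0, 1]
top_ids = logits[0, mask_pos].topk(3).indices.tolist()
print(tokenizer.convert_ids_to_tokens(top_ids))     # e.g. ['paris', ...]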
2
votes
1 answer

Tokenization of data in dataframe in python

I am performing tokenization on each row in my dataframe but the tokenization is being done for only the first row. Can someone please help me? Thank you. Below is my code: import pandas as pd import json import…
Nedisha
  • 45
  • 1
  • 2
  • 7
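Without the full code it is hard to say where it stops after the first row, but the idiomatic pattern is to apply the tokenizer per row rather than to a single cell. A sketch with a placeholder column name:

import pandas as pd
from nltk.tokenize import word_tokenize   # requires nltk.download('punkt')

df = pd.DataFrame({"text": ["first row here.", "second row, too!"]})
df["tokens"] = df["text"].apply(word_tokenize)   # tokenize every row, not just row 0
print(df)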
2
votes
1 answer

NLP: What are some popular packages for phrase tokenization?

I'm trying to tokenize some sentences into phrases. For instance, given "I think you're cute and I want to know more about you", the tokens can be something like "I think you're cute" and "I want to know more about you". Similarly, given input Today…
John M.
  • 293
  • 1
  • 3
  • 8
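One lightweight option is spaCy's noun chunks, though clause-level splits like the example above usually need a constituency parser (e.g. benepar) instead. A minimal noun-chunk sketch, which requires the small English model:

import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("I think you're cute and I want to know more about you")
print([chunk.text for chunk in doc.noun_chunks])   # noun-phrase spans only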
2
votes
1 answer

How to customize word division in CountVectorizer?

>>> from sklearn.feature_extraction.text import CountVectorizer
>>> import numpy
>>> import pandas
>>> vectorizer = CountVectorizer()
>>> corpus1 = ['abc-@@-123','cde-@@-true','jhg-@@-hud']
>>> xtrain = vectorizer.fit_transform(corpus1)
>>>…
helloworld
  • 23
  • 1
  • 3
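The splitting is governed by CountVectorizer's token_pattern regex (or a custom tokenizer callable); a whitespace-only pattern keeps strings like 'abc-@@-123' together as single features:

from sklearn.feature_extraction.text import CountVectorizer

corpus1 = ['abc-@@-123', 'cde-@@-true', 'jhg-@@-hud']

vectorizer = CountVectorizer(token_pattern=r'\S+')   # split on whitespace only
xtrain = vectorizer.fit_transform(corpus1)
print(vectorizer.get_feature_names_out())
# ['abc-@@-123' 'cde-@@-true' 'jhg-@@-hud']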
2
votes
2 answers

Why can't we use normalised position encodings instead of the sine and cosine encodings used in the Transformer paper?

I'm working with Transformer models for sequence-to-sequence tasks and I'm trying to fully understand the use of positional encodings in these models. In the original "Attention is All You Need" paper by Vaswani et al., positional encodings are…
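For context, the sinusoidal encodings from the paper depend only on the absolute position, whereas a position index normalised by sequence length changes its value whenever the sequence gets longer. A small NumPy sketch of the original formulation:

import numpy as np

def sinusoidal_pe(max_len, d_model):
    pos = np.arange(max_len)[:, None]             # (max_len, 1)
    i = np.arange(d_model // 2)[None, :]          # (1, d_model/2)
    angles = pos / np.power(10000.0, 2 * i / d_model)
    pe = np.zeros((max_len, d_model))
    pe[:, 0::2] = np.sin(angles)                  # even dimensions
    pe[:, 1::2] = np.cos(angles)                  # odd dimensions
    return pe

print(sinusoidal_pe(50, 16).shape)                # (50, 16)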