Questions tagged [tokenization]
66 questions
9 votes, 6 answers
NLP: What are some popular packages for multi-word tokenization?
I intend to tokenize a number of job description texts. I have tried standard tokenization using whitespace as the delimiter. However, I noticed that there are some multi-word expressions that are split by whitespace, which may well cause…
CyberPlayerOne
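One common approach (this is what NLTK's `MWETokenizer` does, sketched here in plain Python with an invented expression list): split on whitespace first, then re-merge known multi-word expressions into single tokens.

```python
# Minimal sketch of multi-word tokenization: split on whitespace, then
# re-merge known multi-word expressions (MWEs) into single tokens.
# The MWE list below is a made-up example.
def mwe_tokenize(text, mwes):
    tokens = text.split()
    merged = []
    i = 0
    while i < len(tokens):
        for mwe in mwes:
            n = len(mwe)
            if tuple(tokens[i:i + n]) == mwe:
                merged.append("_".join(mwe))  # join the expression with underscores
                i += n
                break
        else:
            merged.append(tokens[i])
            i += 1
    return merged

mwes = [("machine", "learning"), ("data", "scientist")]
print(mwe_tokenize("senior data scientist with machine learning experience", mwes))
# ['senior', 'data_scientist', 'with', 'machine_learning', 'experience']
```

Libraries such as gensim's `Phrases` can learn the expression list from a corpus instead of requiring it up front.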
8 votes, 1 answer
What tokenizer does OpenAI's GPT3 API use?
I'm building an application for the API, but I would like to be able to count the number of tokens my prompt will use, before I submit an API call. Currently I often submit prompts that yield a 'too-many-tokens' error.
The closest I got to an answer…
Herman Autore
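GPT-3 uses a byte-pair-encoding (BPE) tokenizer, and OpenAI's `tiktoken` library exposes it for offline token counting. The toy function below only illustrates the core BPE idea, not the real GPT-3 merge table: repeatedly merge the highest-priority adjacent pair. The merge ranks here are invented.

```python
# Toy BPE: repeatedly apply the best-ranked merge from a (made-up) merge
# table until no adjacent pair is mergeable. Real GPT tokenizers work the
# same way over bytes, with ~50k learned merges.
def bpe(word, merges):
    tokens = list(word)
    while True:
        # find the best-ranked adjacent pair present in the merge table
        pairs = [(merges[p], i) for i, p in enumerate(zip(tokens, tokens[1:])) if p in merges]
        if not pairs:
            return tokens
        _, i = min(pairs)
        tokens[i:i + 2] = [tokens[i] + tokens[i + 1]]

merges = {("l", "o"): 0, ("lo", "w"): 1, ("e", "r"): 2}
print(bpe("lower", merges))  # ['low', 'er']
```

For actual pre-call counting, `len(tiktoken.encoding_for_model(model_name).encode(prompt))` is the usual recipe.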
7 votes, 1 answer
Understanding the effect of num_words of Tokenizer in Keras
Consider the following code:
from keras.preprocessing.text import Tokenizer
tokenizer = Tokenizer(num_words = 5000)
tokenizer.fit_on_texts(texts)
print('Found %d unique words.' % len(tokenizer.word_index))
When I run this, it prints:
Found 88582…
Mehran
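The behavior behind that output: `fit_on_texts` builds `word_index` over the *full* vocabulary, while `num_words` only caps which ranks `texts_to_sequences` will emit. A plain-Python sketch of that contract (not the Keras implementation itself):

```python
from collections import Counter

# Mimic of Keras Tokenizer semantics: the fitted word index covers every
# word seen, but sequence conversion drops words ranked >= num_words.
def fit_word_index(texts):
    counts = Counter(w for t in texts for w in t.split())
    # rank by frequency; index 1 is the most frequent word (0 is reserved)
    return {w: i + 1 for i, (w, _) in enumerate(counts.most_common())}

def texts_to_sequences(texts, word_index, num_words):
    return [[word_index[w] for w in t.split() if word_index[w] < num_words]
            for t in texts]

texts = ["a a a b b c d"]
wi = fit_word_index(texts)
print(len(wi))                           # 4 -- the full vocabulary, like the 88582
print(texts_to_sequences(texts, wi, 3))  # [[1, 1, 1, 2, 2]] -- only top ranks survive
```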
5 votes, 1 answer
Unigram tokenizer: how does it work?
I have been trying to understand how the unigram tokenizer works, since it is used in the SentencePiece tokenizer that I am planning to use, but I cannot wrap my head around it.
I tried to read the original paper, which contains very few details…
Johncowk
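At inference time, the unigram model (Kudo, 2018) picks the segmentation whose pieces maximize the summed log-probabilities, found with a Viterbi pass over character positions. A rough sketch with an invented vocabulary (training, i.e. EM plus vocabulary pruning, is omitted):

```python
import math

# Viterbi segmentation under a unigram language model: best[end] holds the
# best log-probability of tokenizing text[:end] plus a backpointer.
def unigram_tokenize(text, logp):
    n = len(text)
    best = [(-math.inf, None)] * (n + 1)
    best[0] = (0.0, None)
    for end in range(1, n + 1):
        for start in range(end):
            piece = text[start:end]
            if piece in logp and best[start][0] + logp[piece] > best[end][0]:
                best[end] = (best[start][0] + logp[piece], start)
    # backtrack through the pointers to recover the pieces
    pieces, pos = [], n
    while pos > 0:
        start = best[pos][1]
        pieces.append(text[start:pos])
        pos = start
    return pieces[::-1]

logp = {"un": -2.0, "i": -3.0, "gram": -2.5, "u": -4.0, "n": -4.0, "unigram": -8.0}
print(unigram_tokenize("unigram", logp))  # ['un', 'i', 'gram'] beats 'unigram' here
```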
5 votes, 2 answers
Converting paragraphs into sentences
I'm looking for ways to extract sentences from paragraphs of text containing different types of punctuation. I used spaCy's Sentencizer to begin with.
Sample input python list abstracts:
["A total of 2337 articles were found, and, according…
Van Peer
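A naive regex baseline for this task, as a rough stand-in for the Sentencizer: split after `.`, `!`, or `?` followed by whitespace and a capital letter. Abbreviations like "et al." will still trip it up, which is exactly why rule-based and statistical sentencizers exist.

```python
import re

# Split after sentence-final punctuation when followed by whitespace and
# an uppercase letter. Lookbehind/lookahead keep the punctuation attached.
def split_sentences(paragraph):
    return re.split(r"(?<=[.!?])\s+(?=[A-Z])", paragraph.strip())

abstract = "A total of 2337 articles were found. After screening, 45 remained. Results were mixed."
print(split_sentences(abstract))
```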
5 votes, 1 answer
Accuracy of word and sent tokenize versus custom tokenizers in nltk
The Natural Language Processing with Python book is a really good resource for understanding the basics of NLP. One of the chapters introduces training 'sentence segmentation' using a Naive Bayes Classifier and provides a method to perform sentence…
MrKickass
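The NLTK book's approach describes each candidate punctuation position by features of its neighbours and lets a Naive Bayes classifier decide whether it is a real boundary. A sketch of that feature extraction step (feature names here are illustrative, not the book's exact ones):

```python
# Features for deciding whether tokens[i] (a punctuation token) ends a
# sentence: capitalization of the next word, the previous word, and
# whether the previous word is a single character (often an abbreviation).
def punct_features(tokens, i):
    return {
        "next_word_capitalized": tokens[i + 1][:1].isupper(),
        "prev_word": tokens[i - 1].lower(),
        "punct": tokens[i],
        "prev_word_is_one_char": len(tokens[i - 1]) == 1,
    }

tokens = ["Mr", ".", "Smith", "arrived", ".", "He", "sat", "down", "!"]
print(punct_features(tokens, 1))  # the "." after "Mr" -- likely NOT a boundary
print(punct_features(tokens, 4))  # the "." after "arrived" -- likely a boundary
```

These feature dicts are what you would feed to `nltk.NaiveBayesClassifier.train` together with gold boundary labels.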
4 votes, 2 answers
ChatGPT: How to use long texts in prompt?
I like the website chatpdf.com a lot. You can upload a PDF file and then discuss the textual content of the file with the file "itself". It uses ChatGPT.
I would like to program something similar. But I wonder how to use the content of long PDF…
meyer_mit_ai
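The usual pattern behind "chat with a PDF" tools: split the document into overlapping chunks that each fit the model's context window, embed the chunks, and at question time send only the most relevant chunks plus the question. Below is just the chunking step, with a crude words-as-tokens approximation of the token limit (real systems count tokens with the model's tokenizer):

```python
# Fixed-size chunking with overlap so sentences cut at a boundary still
# appear whole in the neighbouring chunk. Limits are illustrative.
def chunk_text(text, max_tokens=200, overlap=20):
    words = text.split()
    chunks = []
    step = max_tokens - overlap
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + max_tokens]))
        if start + max_tokens >= len(words):
            break
    return chunks

doc = " ".join(f"word{i}" for i in range(500))
chunks = chunk_text(doc)
print(len(chunks), len(chunks[0].split()))  # 3 200
```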
3 votes, 1 answer
NLP: what are the advantages of using a subword tokenizer as opposed to the standard word tokenizer?
I'm looking at this Tensorflow colab tutorial about language translation with Transformers, https://www.tensorflow.org/tutorials/text/transformer, and they tokenize the words with a subword text tokenizer. I have never seen a subword tokenizer…
zipline86
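The headline advantage of subword tokenization is that there are effectively no out-of-vocabulary words: a word-level tokenizer must map unseen words to `<UNK>`, while a subword tokenizer decomposes them into known pieces. A greedy longest-match sketch (WordPiece-style, with an invented vocabulary):

```python
# Greedy longest-match subword tokenization: at each position, take the
# longest vocabulary piece; non-initial pieces carry the "##" prefix.
def subword_tokenize(word, vocab):
    pieces, start = [], 0
    while start < len(word):
        for end in range(len(word), start, -1):
            piece = word[start:end] if start == 0 else "##" + word[start:end]
            if piece in vocab:
                pieces.append(piece)
                start = end
                break
        else:
            return ["<UNK>"]  # no piece matched: truly unrepresentable
    return pieces

vocab = {"token", "##ization", "##s", "un", "##related"}
print(subword_tokenize("tokenizations", vocab))  # ['token', '##ization', '##s']
print(subword_tokenize("xyz", vocab))            # ['<UNK>']
```

A real subword vocabulary also contains every single character, so the `<UNK>` branch is almost never reached in practice.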
3 votes, 1 answer
What is the difference between TextVectorization and Tokenizer?
What is the difference between the layers.TextVectorization() and
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences
And when should each be used?
Pritam Sinha
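In short: `Tokenizer` + `pad_sequences` is the older out-of-graph preprocessing pair, while `layers.TextVectorization` bundles both steps (plus text standardization) into a single Keras layer that runs inside the model. What the older pair produces, mimicked in plain Python so the contract is visible without TensorFlow installed:

```python
# Integer ids per word, then zero pre-padding to equal length -- the
# combined effect of Tokenizer.texts_to_sequences + pad_sequences
# (pre-padding is the pad_sequences default).
def tokenize_and_pad(texts, maxlen):
    vocab = {}
    seqs = []
    for t in texts:
        seq = []
        for w in t.lower().split():
            vocab.setdefault(w, len(vocab) + 1)  # ids start at 1; 0 = padding
            seq.append(vocab[w])
        seqs.append([0] * (maxlen - len(seq)) + seq[:maxlen])
    return seqs, vocab

seqs, vocab = tokenize_and_pad(["the cat sat", "the cat sat on the mat"], maxlen=6)
print(seqs)  # [[0, 0, 0, 1, 2, 3], [1, 2, 3, 4, 1, 5]]
```

`TextVectorization` yields the same kind of padded integer tensor, but because it is a layer, the vocabulary ships inside the saved model.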
2 votes, 1 answer
How do I get word embeddings for out-of-vocabulary words using a transformer model?
When I try to get word embeddings for a sentence using bio_clinical BERT, a sentence of 8 words gives me 11 token ids (plus start and end) because "embeddings" is an out-of-vocabulary word/token that is split into em, bed, ding, s.
I would…
cerofrais
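A common workaround for word-level vectors from a subword model: run the tokenizer, then pool (for example, average) the vectors of the word's pieces. Plain-Python sketch with made-up 4-dimensional piece vectors for "em", "##bed", "##ding", "##s" (real models give 768-dim hidden states):

```python
# Mean-pool the subword vectors belonging to one word to get a single
# word-level vector. All numbers below are invented for illustration.
def word_vector(piece_vectors):
    dim = len(piece_vectors[0])
    return [sum(v[i] for v in piece_vectors) / len(piece_vectors) for i in range(dim)]

pieces = {
    "em":     [0.1, 0.2, 0.3, 0.4],
    "##bed":  [0.5, 0.6, 0.7, 0.8],
    "##ding": [0.9, 1.0, 1.1, 1.2],
    "##s":    [0.1, 0.1, 0.1, 0.1],
}
vec = word_vector([pieces[p] for p in ["em", "##bed", "##ding", "##s"]])
print([round(x, 3) for x in vec])  # [0.4, 0.475, 0.55, 0.625]
```

Summing or taking the first piece's vector are the other popular pooling choices; which works best is task-dependent.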
2 votes, 1 answer
From where does BERT get the tokens it predicts?
When BERT is used for masked language modeling, it masks a token and then tries to predict it.
What are the candidate tokens BERT can choose from? Does it just predict an integer (like a regression problem) and then use that token? Or does it do a…
Nick Koprowicz
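The answer in miniature: BERT's masked-token head projects the final hidden state at the masked position to one score per vocabulary entry and softmaxes, so it is a classification over the whole ~30k-token vocabulary, not a regression to an integer id. Toy scale below, with invented logits:

```python
import math

# Softmax over per-vocabulary-token scores, then argmax: the candidate
# set is exactly the tokenizer's vocabulary.
def predict_masked(scores, vocab):
    exp = [math.exp(s) for s in scores]
    total = sum(exp)
    probs = [e / total for e in exp]
    best = max(range(len(vocab)), key=probs.__getitem__)
    return vocab[best], probs[best]

vocab = ["cat", "dog", "sat", "the"]       # stand-in for BERT's 30k vocabulary
scores = [1.2, 0.3, 4.1, -0.5]             # made-up logits at the [MASK] position
print(predict_masked(scores, vocab))
```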
2 votes, 1 answer
Tokenization of data in dataframe in python
I am tokenizing each row in my dataframe, but the tokenization is only being done for the first row. Can someone please help me? Thank you.
Below is my code:
import pandas as pd
import json
import…
Nedisha
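A likely cause of "only the first row is tokenized" is calling the tokenizer on a single cell (e.g. `df['text'][0]`) instead of mapping it over the column. A minimal sketch of the column-wise fix, with invented column and frame names:

```python
import pandas as pd

# Apply the tokenizer to every row of the column, not to one cell.
df = pd.DataFrame({"text": ["first job posting", "second job posting"]})
df["tokens"] = df["text"].apply(str.split)
print(df["tokens"].tolist())
# [['first', 'job', 'posting'], ['second', 'job', 'posting']]
```

Any callable works in place of `str.split`, e.g. `nltk.word_tokenize`.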
2 votes, 1 answer
NLP: What are some popular packages for phrase tokenization?
I'm trying to tokenize some sentences into phrases. For instance, given
I think you're cute and I want to know more about you
The tokens can be something like
I think you're cute
and
I want to know more about you
Similarly, given input
Today…
John M.
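A crude baseline for this clause-level splitting: break on coordinating conjunctions and clause punctuation. Real solutions usually use a constituency or dependency parse (e.g. spaCy or benepar) rather than a regex like this, which will wrongly split phrases such as "bread and butter".

```python
import re

# Split a sentence into clause-like phrases at "and"/"but"/commas/semicolons.
def phrase_split(sentence):
    parts = re.split(r"\s*(?:\band\b|\bbut\b|,|;)\s*", sentence)
    return [p for p in parts if p]

print(phrase_split("I think you're cute and I want to know more about you"))
# ["I think you're cute", 'I want to know more about you']
```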
2 votes, 1 answer
How to customize word division in CountVectorizer?
>>> from sklearn.feature_extraction.text import CountVectorizer
>>> import numpy
>>> import pandas
>>> vectorizer = CountVectorizer()
>>> corpus1 = ['abc-@@-123','cde-@@-true','jhg-@@-hud']
>>> xtrain = vectorizer.fit_transform(corpus1)
>>>…
helloworld
2 votes, 2 answers
Why can't we use normalized position encodings instead of the sine and cosine encodings used in the Transformer paper?
I'm working with Transformer models for sequence-to-sequence tasks and I'm trying to fully understand the use of positional encodings in these models.
In the original "Attention is All You Need" paper by Vaswani et al., positional encodings are…
mutli-arm-bandit
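The sinusoidal encodings from "Attention Is All You Need" address exactly the weakness of a normalized scalar: `pos / N` changes meaning with sequence length `N`, whereas fixed-frequency sin/cos features give each position a length-independent code in which a relative offset corresponds to a fixed linear transformation. A minimal sketch of the formula:

```python
import math

# PE(pos, 2i) = sin(pos / 10000^(2i/d_model)),
# PE(pos, 2i+1) = cos(pos / 10000^(2i/d_model))
def positional_encoding(pos, d_model):
    pe = []
    for i in range(0, d_model, 2):
        angle = pos / (10000 ** (i / d_model))
        pe.append(math.sin(angle))
        pe.append(math.cos(angle))
    return pe[:d_model]

# Position 0 encodes to [0, 1, 0, 1, ...] no matter how long the sequence
# is -- a normalized pos/N encoding cannot offer that invariance.
print(positional_encoding(0, 4))
print([round(x, 3) for x in positional_encoding(1, 4)])
```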