Questions tagged [tokenization]

66 questions
9
votes
6 answers

NLP: What are some popular packages for multi-word tokenization?

I intend to tokenize a number of job description texts. I have tried the standard tokenization using whitespace as the delimiter. However, I noticed that there are some multi-word expressions that are split by whitespace, which may well cause…
CyberPlayerOne
  • 392
  • 1
  • 4
  • 14
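A frequently suggested option for this kind of task is NLTK's MWETokenizer, which merges known multi-word expressions after an initial whitespace split. A minimal sketch; the expression list below is made up for illustration:

from nltk.tokenize import MWETokenizer

# Merge known multi-word expressions after a plain whitespace split.
mwe = MWETokenizer([("machine", "learning"), ("data", "scientist")], separator=" ")
print(mwe.tokenize("senior data scientist with machine learning experience".split()))
# ['senior', 'data scientist', 'with', 'machine learning', 'experience']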
8
votes
1 answer

What tokenizer does OpenAI's GPT3 API use?

I'm building an application for the API, but I would like to be able to count the number of tokens my prompt will use, before I submit an API call. Currently I often submit prompts that yield a 'too-many-tokens' error. The closest I got to an answer…
Herman Autore
  • 83
  • 1
  • 3
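For counting tokens locally before calling the API, OpenAI's tiktoken library is the usual route. A minimal sketch, assuming a text-davinci-style GPT-3 model; other models use different encodings:

import tiktoken

# p50k_base is the encoding used by the text-davinci-00x GPT-3 models;
# pick the encoding that matches your target model.
enc = tiktoken.get_encoding("p50k_base")
prompt = "How many tokens will this prompt use?"
print(len(enc.encode(prompt)))  # number of tokens the prompt will consume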
7
votes
1 answer

Understanding the effect of num_words of Tokenizer in Keras

Consider the following code:
from keras.preprocessing.text import Tokenizer
tokenizer = Tokenizer(num_words = 5000)
tokenizer.fit_on_texts(texts)
print('Found %d unique words.' % len(tokenizer.word_index))
When I run this, it prints: Found 88582…
Mehran
  • 267
  • 1
  • 2
  • 12
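In Keras, num_words does not truncate word_index; it is only applied when texts are converted with texts_to_sequences or texts_to_matrix. A toy sketch illustrating that behaviour:

from keras.preprocessing.text import Tokenizer

texts = ["the cat sat", "the dog sat", "the bird flew away"]
tokenizer = Tokenizer(num_words=3)
tokenizer.fit_on_texts(texts)

# word_index still holds every unique word seen during fitting...
print(len(tokenizer.word_index))            # 7
# ...but only the top num_words-1 indices survive conversion.
print(tokenizer.texts_to_sequences(texts))  # e.g. [[1, 2], [1, 2], [1]]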
5
votes
1 answer

Unigram tokenizer: how does it work?

I have been trying to understand how the unigram tokenizer works since it is used in the SentencePiece tokenizer that I am planning on using, but I cannot wrap my head around it. I tried to read the original paper, which contains so few details…
Johncowk
  • 195
  • 1
  • 6
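The core idea, briefly: each vocabulary piece has a probability, a word's segmentation is the split that maximises the product of piece probabilities (found with a Viterbi pass), and training alternates between estimating those probabilities with EM and pruning low-impact pieces. A toy sketch of the segmentation step only, with made-up log-probabilities:

import math

# Made-up per-piece log-probabilities; a real model learns these with EM.
piece_logp = {"h": -5.0, "u": -5.0, "g": -5.0, "s": -3.0,
              "hu": -4.0, "ug": -3.5, "gs": -4.5, "hug": -2.0}

def segment(word):
    # best[i] = (log-prob of the best segmentation of word[:i], split point)
    best = [(0.0, 0)] + [(-math.inf, 0)] * len(word)
    for end in range(1, len(word) + 1):
        for start in range(end):
            piece = word[start:end]
            if piece in piece_logp:
                score = best[start][0] + piece_logp[piece]
                if score > best[end][0]:
                    best[end] = (score, start)
    pieces, i = [], len(word)   # backtrack the best path
    while i > 0:
        start = best[i][1]
        pieces.append(word[start:i])
        i = start
    return pieces[::-1]

print(segment("hugs"))  # ['hug', 's']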
5
votes
2 answers

Converting paragraphs into sentences

I'm looking for ways to extract sentences from paragraphs of text containing different types of punctuation and all. I used SpaCy's Sentencizer to begin with. Sample input Python list abstracts: ["A total of 2337 articles were found, and, according…
Van Peer
  • 285
  • 1
  • 3
  • 12
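For reference, a minimal sketch of the rule-based route the question starts from (spaCy v3 API assumed; the abstract text below is a shortened placeholder):

import spacy

nlp = spacy.blank("en")
nlp.add_pipe("sentencizer")          # rule-based splitter; no model download needed

abstracts = ["A total of 2337 articles were found. After screening, 42 were included."]
for text in abstracts:
    print([sent.text for sent in nlp(text).sents])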
5
votes
1 answer

Accuracy of word and sent tokenize versus custom tokenizers in nltk

The Natural Language Processing with Python book is a really good resource for understanding the basics of NLP. One of the chapters introduces training 'sentence segmentation' using a Naive Bayes Classifier and provides a method to perform sentence…
MrKickass
  • 111
  • 8
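For anyone comparing the stock tokenizers against a custom one, a small sketch of NLTK's pretrained Punkt model next to one trained on your own text (training on a single short string here only to show the API):

import nltk
from nltk.tokenize.punkt import PunktSentenceTokenizer

nltk.download("punkt", quiet=True)   # data for the pretrained tokenizer

text = "Dr. Smith went to Washington. He arrived at 3 p.m. and left soon after."

print(nltk.sent_tokenize(text))       # pretrained, general-purpose Punkt model
custom = PunktSentenceTokenizer(text) # unsupervised training on your own raw text
print(custom.tokenize(text))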
4
votes
2 answers

ChatGPT: How to use long texts in prompt?

I like the website chatpdf.com a lot. You can upload a PDF file and then discuss the textual content of the file with the file "itself". It uses ChatGPT. I would like to program something similar. But I wonder how to use the content of long PDF…
meyer_mit_ai
  • 63
  • 1
  • 1
  • 5
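The usual pattern behind such tools (not necessarily what chatpdf.com does internally) is retrieval: split the PDF text into chunks, embed them, and put only the chunks most relevant to the user's question into the prompt. A sketch of just the chunking step; embedding and similarity search are omitted:

def chunk_text(text, chunk_size=1000, overlap=200):
    """Split text into overlapping character chunks that fit a prompt budget."""
    chunks, start = [], 0
    while start < len(text):
        chunks.append(text[start:start + chunk_size])
        start += chunk_size - overlap
    return chunks

pdf_text = "...extracted PDF text..."   # placeholder
print(len(chunk_text(pdf_text)))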
3
votes
1 answer

NLP: what are the advantages of using a subword tokenizer as opposed to the standard word tokenizer?

I'm looking at this Tensorflow colab tutorial about language translation with Transformers, https://www.tensorflow.org/tutorials/text/transformer, and they tokenize the words with a subword text tokenizer. I have never seen a subword tokenizer…
zipline86
  • 349
  • 4
  • 12
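The usual argument is that a fixed word-level vocabulary maps unseen words to an <UNK> token, while subword pieces can compose any surface form. A quick way to see it with a WordPiece vocabulary (Hugging Face transformers assumed):

from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("bert-base-uncased")
# A rare word is built from known pieces instead of collapsing to <UNK>.
print(tok.tokenize("tokenization"))   # e.g. ['token', '##ization']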
3
votes
1 answer

What is the difference between TextVectorization and Tokenizer?

What is the difference between layers.TextVectorization() and the combination of from tensorflow.keras.preprocessing.text import Tokenizer and from tensorflow.keras.preprocessing.sequence import pad_sequences? And when should each be used?
Pritam Sinha
  • 193
  • 8
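A side-by-side sketch of the two APIs (TF 2.x assumed): TextVectorization is a Keras layer you adapt to data and can place inside the model, while Tokenizer plus pad_sequences is the older utility that runs as a separate preprocessing step:

import tensorflow as tf
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences

texts = ["the cat sat on the mat", "the dog barked"]

# Newer: a layer that can be part of the model itself
vectorizer = tf.keras.layers.TextVectorization(max_tokens=100, output_sequence_length=6)
vectorizer.adapt(tf.constant(texts))
print(vectorizer(tf.constant(texts)).numpy())

# Older: standalone preprocessing before training
tok = Tokenizer(num_words=100)
tok.fit_on_texts(texts)
print(pad_sequences(tok.texts_to_sequences(texts), maxlen=6))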
2
votes
1 answer

How do I get word embeddings for out-of-vocabulary words using a transformer model?

When I tried to get word embeddings of a sentence using Bio_ClinicalBERT, for a sentence of 8 words I am getting 11 token IDs (plus start and end) because "embeddings" is an out-of-vocabulary word/token that is being split into em, bed, ding, s. I would…
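A common workaround (not the only one) is to pool the hidden states of each word's sub-tokens back into one vector per word; fast tokenizers expose word_ids() for that mapping. A sketch below, with the Hugging Face model id for Bio_ClinicalBERT assumed from the question:

import torch
from transformers import AutoTokenizer, AutoModel

name = "emilyalsentzer/Bio_ClinicalBERT"   # assumed HF id for the model in the question
tokenizer = AutoTokenizer.from_pretrained(name)
model = AutoModel.from_pretrained(name)

enc = tokenizer("the patient embeddings were reviewed", return_tensors="pt")
with torch.no_grad():
    hidden = model(**enc).last_hidden_state[0]      # (seq_len, hidden_dim)

# Group sub-token vectors by the original word they came from and average them.
by_word = {}
for pos, word_id in enumerate(enc.word_ids()):
    if word_id is None:                             # skip [CLS] / [SEP]
        continue
    by_word.setdefault(word_id, []).append(hidden[pos])
word_vectors = {w: torch.stack(v).mean(dim=0) for w, v in by_word.items()}
print(len(word_vectors))                            # one vector per input word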
2
votes
1 answer

From where does BERT get the tokens it predicts?

When BERT is used for masked language modeling, it masks a token and then tries to predict it. What are the candidate tokens BERT can choose from? Does it just predict an integer (like a regression problem) and then use that token? Or does it do a…
Nick Koprowicz
  • 213
  • 1
  • 3
  • 10
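In short: the candidates are BERT's own WordPiece vocabulary. The MLM head scores every vocabulary entry at the masked position, and the prediction is a softmax/argmax over those scores, not a regressed integer. A small sketch with Hugging Face transformers:

import torch
from transformers import AutoTokenizer, AutoModelForMaskedLM

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForMaskedLM.from_pretrained("bert-base-uncased")

enc = tokenizer("The capital of France is [MASK].", return_tensors="pt")
with torch.no_grad():
    logits = model(**enc).logits                    # (1, seq_len, vocab_size)

mask_pos = (enc.input_ids == tokenizer.mask_token_id).nonzero()[0, 1]
top_ids = logits[0, mask_pos].topk(3).indices.tolist()
print(tokenizer.convert_ids_to_tokens(top_ids))     # e.g. ['paris', ...]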
2
votes
1 answer

Tokenization of data in dataframe in python

I am performing tokenization on each row in my dataframe but the tokenization is being done for only the first row. Can someone please help me? Thank you. Below is my code: import pandas as pd import json import…
Nedisha
  • 45
  • 1
  • 2
  • 7
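Without the full code it is hard to say where it stops after the first row, but the idiomatic pattern is to apply the tokenizer per row rather than to a single cell. A sketch with a placeholder column name:

import pandas as pd
from nltk.tokenize import word_tokenize   # requires nltk.download('punkt')

df = pd.DataFrame({"text": ["first row here.", "second row, too!"]})
df["tokens"] = df["text"].apply(word_tokenize)   # tokenize every row, not just row 0
print(df)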
2
votes
1 answer

NLP: What are some popular packages for phrase tokenization?

I'm trying to tokenize some sentences into phrases. For instance, given "I think you're cute and I want to know more about you", the tokens can be something like "I think you're cute" and "I want to know more about you". Similarly, given input Today…
John M.
  • 293
  • 1
  • 3
  • 8
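One lightweight option is spaCy's noun chunks, though clause-level splits like the example above usually need a constituency parser (e.g. benepar) instead. A minimal noun-chunk sketch, which requires the small English model:

import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("I think you're cute and I want to know more about you")
print([chunk.text for chunk in doc.noun_chunks])   # noun-phrase spans only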
2
votes
1 answer

How to customize word division in CountVectorizer?

>>> from sklearn.feature_extraction.text import CountVectorizer
>>> import numpy
>>> import pandas
>>> vectorizer = CountVectorizer()
>>> corpus1 = ['abc-@@-123','cde-@@-true','jhg-@@-hud']
>>> xtrain = vectorizer.fit_transform(corpus1)
>>>…
helloworld
  • 23
  • 1
  • 3
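The splitting is governed by CountVectorizer's token_pattern regex (or a custom tokenizer callable); a whitespace-only pattern keeps strings like 'abc-@@-123' together as single features:

from sklearn.feature_extraction.text import CountVectorizer

corpus1 = ['abc-@@-123', 'cde-@@-true', 'jhg-@@-hud']

vectorizer = CountVectorizer(token_pattern=r'\S+')   # split on whitespace only
xtrain = vectorizer.fit_transform(corpus1)
print(vectorizer.get_feature_names_out())
# ['abc-@@-123' 'cde-@@-true' 'jhg-@@-hud']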
2
votes
2 answers

Why can't we use normalised position encodings instead of the sine and cosine encodings used in the Transformer paper?

I'm working with Transformer models for sequence-to-sequence tasks and I'm trying to fully understand the use of positional encodings in these models. In the original "Attention is All You Need" paper by Vaswani et al., positional encodings are…
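For context, the sinusoidal encodings from the paper depend only on the absolute position, whereas a position index normalised by sequence length changes its value whenever the sequence gets longer. A small NumPy sketch of the original formulation:

import numpy as np

def sinusoidal_pe(max_len, d_model):
    pos = np.arange(max_len)[:, None]             # (max_len, 1)
    i = np.arange(d_model // 2)[None, :]          # (1, d_model/2)
    angles = pos / np.power(10000.0, 2 * i / d_model)
    pe = np.zeros((max_len, d_model))
    pe[:, 0::2] = np.sin(angles)                  # even dimensions
    pe[:, 1::2] = np.cos(angles)                  # odd dimensions
    return pe

print(sinusoidal_pe(50, 16).shape)                # (50, 16)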