BERT stands for Bidirectional Encoder Representations from Transformers and is designed to pre-train deep bidirectional representations by jointly conditioning on both left and right context in all layers.
Questions tagged [bert]
354 questions
67 votes · 4 answers
What is purpose of the [CLS] token and why is its encoding output important?
I am reading this article on how to use BERT by Jay Alammar and I understand things up until:
For sentence classification, we’re only interested in BERT’s output for the [CLS] token, so we select that slice of the cube and discard everything…
user3768495
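A minimal sketch of the slice in question, assuming the Hugging Face transformers and torch packages (the example sentence is illustrative):

import torch
from transformers import BertTokenizer, BertModel

tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
model = BertModel.from_pretrained('bert-base-uncased')

inputs = tokenizer('a visually stunning film', return_tensors='pt')
with torch.no_grad():
    outputs = model(**inputs)

# last_hidden_state has shape (batch, seq_len, hidden); [CLS] sits at position 0,
# so for classification we keep only that slice and discard the rest.
cls_embedding = outputs.last_hidden_state[:, 0, :]   # shape (1, 768)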
42 votes · 2 answers
What is GELU activation?
I was going through the BERT paper, which uses the GELU (Gaussian Error Linear Unit) activation, stated as
$$\mathrm{GELU}(x) = x\,P(X \le x) = x\,\Phi(x),$$
which in turn is approximated by
$$0.5x\left(1 + \tanh\left[\sqrt{2/\pi}\,(x + 0.044715x^3)\right]\right).$$
Could you simplify the equation…
thanatoz
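Both forms of the activation are easy to check numerically; a small sketch (function names are illustrative):

import math

def gelu_exact(x):
    # GELU(x) = x * Phi(x), with Phi the standard normal CDF written via erf
    return x * 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def gelu_tanh(x):
    # the tanh approximation quoted in the question
    return 0.5 * x * (1.0 + math.tanh(math.sqrt(2.0 / math.pi) * (x + 0.044715 * x ** 3)))

for x in (-2.0, -0.5, 0.0, 0.5, 2.0):
    print(f'{x:+.1f}  exact={gelu_exact(x):+.6f}  approx={gelu_tanh(x):+.6f}')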
37 votes · 7 answers
How to get sentence embedding using BERT?
from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
sentence = 'I really enjoyed this movie a lot.'
# 1. Tokenize the…
star
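One common completion of the snippet above is to mean-pool the token embeddings, masking out padding; a sketch assuming transformers and torch (the [CLS] vector or the pooler output are alternatives):

import torch
from transformers import BertTokenizer, BertModel

tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
model = BertModel.from_pretrained('bert-base-uncased')

sentence = 'I really enjoyed this movie a lot.'
inputs = tokenizer(sentence, return_tensors='pt')
with torch.no_grad():
    hidden = model(**inputs).last_hidden_state      # (1, seq_len, 768)

# average the token vectors, ignoring padding positions
mask = inputs['attention_mask'].unsqueeze(-1)       # (1, seq_len, 1)
sentence_embedding = (hidden * mask).sum(1) / mask.sum(1)   # (1, 768)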
27 votes · 5 answers
BERT vs Word2Vec: Is BERT disambiguating the meaning of the word vector?
Word2vec:
Word2vec provides a vector for each token/word, and those vectors encode the meaning of the word. Although those vectors are not human-interpretable, the meanings of the vectors are understandable/interpretable by comparing them with other…
sovon
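The disambiguation is easy to observe directly: the same surface word gets different BERT vectors in different contexts. A sketch, assuming transformers and torch:

import torch
from transformers import BertTokenizer, BertModel

tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
model = BertModel.from_pretrained('bert-base-uncased')

def bank_vector(sentence):
    inputs = tokenizer(sentence, return_tensors='pt')
    tokens = tokenizer.convert_ids_to_tokens(inputs['input_ids'][0])
    with torch.no_grad():
        hidden = model(**inputs).last_hidden_state[0]
    return hidden[tokens.index('bank')]

v1 = bank_vector('He deposited cash at the bank.')
v2 = bank_vector('They had a picnic on the river bank.')
# well below 1.0, whereas word2vec would give "bank" a single fixed vector
print(torch.cosine_similarity(v1, v2, dim=0))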
23 votes · 7 answers
Why is the decoder not a part of the BERT architecture?
I can't see how BERT makes predictions without using a decoder unit, which was a part of all models before it, including transformers and standard RNNs. How are output predictions made in the BERT architecture without a decoder? How does it do…
Hrishikesh Athalye
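The gist is visible in code: BERT's masked-LM head is just a feed-forward layer over the encoder output, so predictions never pass through a decoder stack. A sketch, assuming transformers and torch:

import torch
from transformers import BertTokenizer, BertForMaskedLM

tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
model = BertForMaskedLM.from_pretrained('bert-base-uncased')

inputs = tokenizer('The capital of France is [MASK].', return_tensors='pt')
with torch.no_grad():
    logits = model(**inputs).logits                  # (1, seq_len, vocab_size)

# read the prediction straight off the masked position's encoder output
mask_pos = (inputs['input_ids'][0] == tokenizer.mask_token_id).nonzero()[0]
print(tokenizer.decode(logits[0, mask_pos].argmax(-1)))   # likely "paris"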
20 votes · 1 answer
Can BERT do the next-word-prediction task?
As BERT is bidirectional (it uses a bidirectional transformer), is it possible to use it for the next-word-prediction task? If yes, what needs to be tweaked?
DunkOnly
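One common workaround, sketched below under the same transformers/torch assumptions: append a [MASK] to the prefix and let the masked-LM head fill it in. Note this is not a true left-to-right language model, since BERT also sees the tokens to the right of the mask (here only [SEP]):

import torch
from transformers import BertTokenizer, BertForMaskedLM

tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
model = BertForMaskedLM.from_pretrained('bert-base-uncased')

inputs = tokenizer('I went to the ' + tokenizer.mask_token, return_tensors='pt')
with torch.no_grad():
    logits = model(**inputs).logits

mask_pos = (inputs['input_ids'][0] == tokenizer.mask_token_id).nonzero()[0]
top5 = logits[0, mask_pos].topk(5).indices[0]
print(tokenizer.convert_ids_to_tokens(top5.tolist()))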
15 votes · 2 answers
What is the use of [SEP] in the BERT paper?
I know that [CLS] marks the start of a sentence and [SEP] lets BERT know that the second sentence has begun.
However, I have a question.
If I have two sentences, s1 and s2, and our fine-tuning task is the same.
In one way, I add special tokens…
xiangqing shen
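For concreteness, this is what the tokenizer produces for a pair (a sketch; s1 and s2 stand in for the asker's sentences):

from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
s1, s2 = 'How old are you?', 'I am twenty.'

enc = tokenizer(s1, s2)
print(tokenizer.convert_ids_to_tokens(enc['input_ids']))
# ['[CLS]', 'how', 'old', 'are', 'you', '?', '[SEP]', 'i', 'am', 'twenty', '.', '[SEP]']
# the segment (token_type) ids flip from 0 to 1 after the first [SEP]:
print(enc['token_type_ids'])   # [0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1]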
14 votes · 2 answers
Preprocessing for Text Classification in Transformer Models (BERT variants)
This might be a silly question, but I am wondering whether one should carry out the conventional text-preprocessing steps when training one of the transformer models.
I remember that for training Word2Vec or GloVe, we needed to perform extensive text cleaning…
TwinPenguins
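A quick way to see why heavy cleaning is usually unnecessary: the WordPiece tokenizer already lower-cases (for the uncased model), splits off punctuation, and decomposes rare words into subwords, and BERT was pre-trained on text handled the same way. A sketch:

from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
print(tokenizer.tokenize("Don't remove stopwords; BERT uses them for context!"))
# roughly: ['don', "'", 't', 'remove', 'stop', '##words', ';', 'bert', 'uses', ...]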
12 votes · 2 answers
What are good parameter ranges for BERT hyperparameters when fine-tuning on a very small dataset?
I need to fine-tune a BERT model (from the Hugging Face repository) on a sentence-classification task. However, my dataset is really small: I have 12K sentences and only 10% of them are from the positive class. Does anyone here have any experience on…
zwlayer
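For reference, the BERT paper's fine-tuning appendix suggests grid-searching a small set of values; expressed as a Hugging Face TrainingArguments sketch (output_dir and the regularization values are illustrative):

from transformers import TrainingArguments

args = TrainingArguments(
    output_dir='out',                 # illustrative
    per_device_train_batch_size=16,   # paper tries 16 or 32
    learning_rate=2e-5,               # paper tries 5e-5, 3e-5, 2e-5
    num_train_epochs=3,               # paper tries 2, 3, or 4
    weight_decay=0.01,
    warmup_ratio=0.1,
)

With only ~1.2K positive examples, it may also help to weight the loss or oversample the positive class, and to evaluate with F1 rather than accuracy.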
12 votes · 1 answer
What is whole word masking in the recent BERT model?
I was checking the BERT GitHub page and noticed that there are new models built with a new training technique called "whole word masking". Here is a snippet describing it:
In the original pre-processing code, we randomly select WordPiece tokens to…
kee
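The idea is small enough to sketch in a few lines: when any WordPiece of a word is selected, every piece of that word is masked together (illustrative toy code, not the official implementation):

import random

def whole_word_mask(tokens, p=0.15):
    # group WordPiece indices belonging to the same whole word ('##' = continuation)
    words = []
    for i, t in enumerate(tokens):
        if t.startswith('##') and words:
            words[-1].append(i)
        else:
            words.append([i])
    out = list(tokens)
    for word in words:
        if random.random() < p:
            for i in word:            # mask all pieces of the word at once
                out[i] = '[MASK]'
    return out

print(whole_word_mask(['the', 'phil', '##ammon', 'sang'], p=0.5))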
12 votes · 1 answer
What is the first input to the decoder in a transformer model?
The image is from Jay Alammar's post on transformers.
K_encdec and V_encdec are calculated by matrix multiplication with the encoder outputs and sent to the encoder-decoder attention layer of each decoder layer.
The previous output is…
mLstudent33
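In short: at training time the decoder's first input is a start-of-sequence token, with the target sequence shifted right behind it (teacher forcing); at inference only the start token is fed, and each prediction is appended and fed back. A toy sketch:

bos, eos = '<s>', '</s>'
target = ['ich', 'bin', 'müde']

decoder_input = [bos] + target      # what the decoder sees at each step
decoder_labels = target + [eos]     # what it is trained to predict

for step, (inp, lab) in enumerate(zip(decoder_input, decoder_labels)):
    print(f'step {step}: input={inp!r} -> predict {lab!r}')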
10 votes · 2 answers
Does BERT have any advantage over GPT-3?
I have read a couple of documents that explain in detail the greater edge that GPT-3 (Generative Pre-trained Transformer 3) has over BERT (Bidirectional Encoder Representations from Transformers). So I am curious to know whether BERT scores better…
Bipin
10 votes · 2 answers
Why should I understand AI architectures?
Why should I understand what is happening deep down in some AI architecture?
For example, LSTM, BERT, partial convolution... architectures like these. Why should I understand what is going on when I can find any model on the Internet, or any implementations…
CanP
9 votes · 2 answers
Is BERT a language model?
Is BERT a language model in the sense of a function that takes a sentence and returns a probability?
I know its main usage is sentence embedding, but can it also provide this functionality?
Amit Keinan
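BERT is not a left-to-right language model, but a pseudo-log-likelihood can be computed by masking each position in turn and summing the log-probabilities of the true tokens. A sketch, assuming transformers and torch (this yields a score, not a normalized probability):

import torch
from transformers import BertTokenizer, BertForMaskedLM

tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
model = BertForMaskedLM.from_pretrained('bert-base-uncased')

def pseudo_log_likelihood(sentence):
    ids = tokenizer(sentence, return_tensors='pt')['input_ids'][0]
    total = 0.0
    for i in range(1, len(ids) - 1):        # skip [CLS] and [SEP]
        masked = ids.clone()
        masked[i] = tokenizer.mask_token_id
        with torch.no_grad():
            logits = model(masked.unsqueeze(0)).logits[0, i]
        total += torch.log_softmax(logits, -1)[ids[i]].item()
    return total

print(pseudo_log_likelihood('The cat sat on the mat.'))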
9 votes · 1 answer
BERT fine-tuning with additional features
I want to use BERT for an NLP task, but I also have additional features that I would like to include.
From what I have seen, with fine-tuning one only changes the labels and retrains the classification layer.
Is there a way to use pre-trained…
Jeff
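One common pattern, sketched under the usual transformers/torch assumptions (num_extra and the layer sizes are illustrative, not from the question): run BERT as usual, concatenate the [CLS] embedding with the extra features, and train a classifier head on top:

import torch
import torch.nn as nn
from transformers import BertModel

class BertWithFeatures(nn.Module):
    def __init__(self, num_extra, num_labels):
        super().__init__()
        self.bert = BertModel.from_pretrained('bert-base-uncased')
        self.classifier = nn.Linear(768 + num_extra, num_labels)

    def forward(self, input_ids, attention_mask, extra_features):
        # [CLS] embedding concatenated with the hand-crafted features
        cls = self.bert(input_ids, attention_mask=attention_mask).last_hidden_state[:, 0]
        return self.classifier(torch.cat([cls, extra_features], dim=-1))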