Questions tagged [bert]

BERT stands for Bidirectional Encoder Representations from Transformers; it is designed to pre-train deep bidirectional representations by jointly conditioning on both left and right context in all layers.

354 questions
67
votes
4 answers

What is purpose of the [CLS] token and why is its encoding output important?

I am reading this article on how to use BERT by Jay Alammar and I understand things up until: For sentence classification, we’re only interested in BERT’s output for the [CLS] token, so we select that slice of the cube and discard everything…
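A minimal sketch of the slice the excerpt refers to, assuming the Hugging Face transformers library and the standard bert-base-uncased checkpoint (the example sentence is illustrative):

```python
import torch
from transformers import BertTokenizer, BertModel

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertModel.from_pretrained("bert-base-uncased")

inputs = tokenizer("a visually stunning rumination on love", return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

# last_hidden_state has shape (batch, seq_len, hidden); position 0 is [CLS],
# so this selects one 768-dimensional vector per input sequence.
cls_embedding = outputs.last_hidden_state[:, 0, :]
```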
42
votes
2 answers

What is GELU activation?

I was going through the BERT paper, which uses GELU (Gaussian Error Linear Unit) and states the equation as $$\mathrm{GELU}(x) = x\,P(X \le x) = x\,\Phi(x),$$ which in turn is approximated by $$0.5x\left(1 + \tanh\!\left[\sqrt{2/\pi}\,(x + 0.044715x^3)\right]\right).$$ Could you simplify the equation…
thanatoz
  • 2,365
  • 4
  • 15
  • 39
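For reference, a small sketch comparing the exact definition $x\,\Phi(x)$ with the tanh approximation quoted above, written in PyTorch:

```python
import math
import torch

def gelu_exact(x):
    # x * Phi(x), with Phi the standard normal CDF expressed via erf
    return x * 0.5 * (1.0 + torch.erf(x / math.sqrt(2.0)))

def gelu_tanh_approx(x):
    # the tanh approximation quoted in the BERT paper
    return 0.5 * x * (1.0 + torch.tanh(math.sqrt(2.0 / math.pi) * (x + 0.044715 * x ** 3)))

x = torch.linspace(-3.0, 3.0, steps=101)
print(torch.max(torch.abs(gelu_exact(x) - gelu_tanh_approx(x))))  # difference is very small
```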
37
votes
7 answers

How to get sentence embedding using BERT?

How to get sentence embedding using BERT?
from transformers import BertTokenizer
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
sentence = 'I really enjoyed this movie a lot.'
# 1. Tokenize the…
star
  • 1,411
  • 7
  • 18
  • 29
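One common recipe (not the only one) is mean pooling over the token embeddings while masking out padding; a hedged sketch with transformers and PyTorch:

```python
import torch
from transformers import BertTokenizer, BertModel

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertModel.from_pretrained("bert-base-uncased")

sentence = "I really enjoyed this movie a lot."
inputs = tokenizer(sentence, return_tensors="pt", padding=True, truncation=True)
with torch.no_grad():
    output = model(**inputs)

token_embeddings = output.last_hidden_state                          # (1, seq_len, 768)
mask = inputs["attention_mask"].unsqueeze(-1).float()                # (1, seq_len, 1)
sentence_embedding = (token_embeddings * mask).sum(1) / mask.sum(1)  # (1, 768)
```

Common alternatives include taking the [CLS] vector directly or using the sentence-transformers library.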
27
votes
5 answers

BERT vs Word2Vec: Is BERT disambiguating the meaning of the word vector?

Word2vec: Word2vec provides a vector for each token/word and those vectors encode the meaning of the word. Although those vectors are not human interpretable, the meanings of the vectors are understandable/interpretable by comparing them with other…
sovon
  • 521
  • 1
  • 5
  • 7
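A quick way to see the difference: the same surface word gets different BERT vectors in different contexts, whereas a static Word2Vec model returns one vector per word. A sketch, assuming bert-base-uncased and a probe word that is a single WordPiece:

```python
import torch
from transformers import BertTokenizer, BertModel

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertModel.from_pretrained("bert-base-uncased")

def word_vector(sentence, word):
    # return the contextual vector of the first occurrence of `word`
    inputs = tokenizer(sentence, return_tensors="pt")
    with torch.no_grad():
        hidden = model(**inputs).last_hidden_state[0]
    tokens = tokenizer.convert_ids_to_tokens(inputs["input_ids"][0].tolist())
    return hidden[tokens.index(word)]

v1 = word_vector("he sat on the river bank", "bank")
v2 = word_vector("she deposited cash at the bank", "bank")
print(torch.cosine_similarity(v1, v2, dim=0))  # noticeably below 1.0
```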
23
votes
7 answers

Why is the decoder not a part of BERT architecture?

I can't see how BERT makes predictions without using a decoder unit, which was a part of all models before it, including the original Transformer and standard RNNs. How are output predictions made in the BERT architecture without using a decoder? How does it do…
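The short version in code: at pre-training time BERT's predictions come from a masked-language-modelling head (a linear layer plus softmax over the vocabulary) on top of the encoder, not from a decoder stack. A minimal sketch using the transformers fill-mask pipeline:

```python
from transformers import pipeline

# each candidate is just a softmax over the vocabulary at the masked position
fill = pipeline("fill-mask", model="bert-base-uncased")
print(fill("The capital of France is [MASK]."))
```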
20
votes
1 answer

Can BERT do the next-word-predict task?

As BERT is bidirectional (uses bi-directional transformer), is it possible to use it for the next-word-predict task? If yes, what needs to be tweaked?
15
votes
2 answers

What is the use of [SEP] in the BERT paper?

I know that [CLS] means the start of a sentence and [SEP] makes BERT know the second sentence has begun. However, I have a question. If I have 2 sentences, which are s1 and s2, and our fine-tuning task is the same. In one way, I add special tokens…
xiangqing shen
  • 151
  • 1
  • 1
  • 3
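A small sketch showing where the tokenizer places [CLS] and [SEP] for a sentence pair, and how the segment ids change after the first [SEP] (assuming bert-base-uncased):

```python
from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
enc = tokenizer("the cat sat", "it was tired")

print(tokenizer.convert_ids_to_tokens(enc["input_ids"]))
# expected: ['[CLS]', 'the', 'cat', 'sat', '[SEP]', 'it', 'was', 'tired', '[SEP]']
print(enc["token_type_ids"])
# expected: [0, 0, 0, 0, 0, 1, 1, 1, 1]
```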
14
votes
2 answers

Preprocessing for Text Classification in Transformer Models (BERT variants)

This might be silly to ask, but I am wondering whether one should carry out the conventional text preprocessing steps for training one of the transformer models. I remember that for training Word2Vec or GloVe we needed to perform extensive text cleaning…
TwinPenguins
  • 4,157
  • 3
  • 17
  • 53
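A small illustration of why heavy cleaning is often unnecessary: the WordPiece tokenizer of the uncased model already lower-cases and splits unseen or noisy tokens into subword pieces rather than dropping them, so aggressive normalisation can conflict with what the pre-trained model saw. (The input string is just an example.)

```python
from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
# noisy or unknown tokens are broken into subword pieces rather than dropped
print(tokenizer.tokenize("Running faster!!! #NLP http://example.com"))
```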
12
votes
2 answers

What are good ranges for BERT hyperparameters when fine-tuning it on a very small dataset?

I need to fine-tune a BERT model (from the Hugging Face repository) on a sentence classification task. However, my dataset is really small. I have 12K sentences and only 10% of them are from the positive class. Does anyone here have any experience on…
zwlayer
  • 239
  • 1
  • 2
  • 8
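For reference, the fine-tuning appendix of the original BERT paper suggests searching batch sizes of 16 and 32, learning rates of 5e-5, 3e-5 and 2e-5, and 2 to 4 epochs. A hedged sketch of those ranges expressed as Hugging Face TrainingArguments:

```python
from transformers import TrainingArguments

args = TrainingArguments(
    output_dir="bert-finetune",
    per_device_train_batch_size=16,   # try 16 or 32
    learning_rate=2e-5,               # try 5e-5, 3e-5, 2e-5
    num_train_epochs=3,               # try 2-4
    weight_decay=0.01,
    warmup_ratio=0.1,                 # a common choice; the paper used a warmup schedule
)
```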
12
votes
1 answer

What is whole word masking in the recent BERT model?

I was checking the BERT GitHub page and noticed that there are new models built with a new training technique called "whole word masking". Here is a snippet describing it: In the original pre-processing code, we randomly select WordPiece tokens to…
kee
  • 223
  • 2
  • 6
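An illustrative sketch (not the actual pre-processing code) of the difference: WordPiece tokens belonging to the same word are grouped, and the masking decision applies to the whole group at once instead of to each piece independently. The word split used here ("philammon" -> phil ##am ##mon) is the example from the BERT README:

```python
import random

wordpieces = ["the", "man", "jumped", "up", ",", "put", "his", "basket",
              "on", "phil", "##am", "##mon", "'", "s", "head"]

def whole_word_spans(pieces):
    # group indices so that '##' continuation pieces stay with their head piece
    spans, current = [], []
    for i, p in enumerate(pieces):
        if p.startswith("##") and current:
            current.append(i)
        else:
            if current:
                spans.append(current)
            current = [i]
    if current:
        spans.append(current)
    return spans

masked = list(wordpieces)
span = random.choice(whole_word_spans(masked))   # pick one whole word to mask
for i in span:
    masked[i] = "[MASK]"                         # mask every piece of that word
print(masked)
```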
12
votes
1 answer

What is the first input to the decoder in a transformer model?

The image is from Jay Alammar's post on transformers. K_encdec and V_encdec are calculated in a matrix multiplication with the encoder outputs and sent to the encoder-decoder attention layer of each decoder layer in the decoder. The previous output is…
mLstudent33
  • 574
  • 1
  • 4
  • 17
10
votes
2 answers

Does BERT have any advantage over GPT-3?

I have read a couple of documents that explain in detail the greater edge that GPT-3 (Generative Pre-trained Transformer 3) has over BERT (Bidirectional Encoder Representations from Transformers). So I am curious to know whether BERT scores better…
Bipin
  • 203
  • 1
  • 2
  • 8
10
votes
2 answers

Why should I understand AI architectures?

Why should I understand what is happening deep down in some AI architecture, for example LSTM, BERT, partial convolutions, and architectures like these? Why should I understand what is going on when I can find any model on the Internet, or any implementation…
9
votes
2 answers

Is BERT a language model?

Is BERT a language model in the sense of a function that gets a sentence and returns a probability? I know its main usage is sentence embedding, but can it also provide this functionality?
Amit Keinan
  • 776
  • 6
  • 19
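BERT is not a left-to-right language model, but one can get a sentence score out of it via "pseudo log-likelihood": mask each position in turn and sum the log-probability the masked-LM head assigns to the true token. A hedged sketch (this is a scoring heuristic, not a normalised probability):

```python
import torch
from transformers import BertTokenizer, BertForMaskedLM

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertForMaskedLM.from_pretrained("bert-base-uncased")
model.eval()

def pseudo_log_likelihood(sentence):
    ids = tokenizer(sentence, return_tensors="pt")["input_ids"][0]
    total = 0.0
    for i in range(1, len(ids) - 1):              # skip [CLS] and [SEP]
        masked = ids.clone()
        masked[i] = tokenizer.mask_token_id
        with torch.no_grad():
            logits = model(masked.unsqueeze(0)).logits[0, i]
        total += torch.log_softmax(logits, dim=-1)[ids[i]].item()
    return total

print(pseudo_log_likelihood("The cat sat on the mat."))
```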
9
votes
1 answer

BERT fine-tuning with additional features

I want to use BERT for an NLP task, but I also have additional features that I would like to include. From what I have seen, with fine-tuning, one only changes the labels and retrains the classification layer. Is there a way to use pre-trained…
Jeff
  • 193
  • 1
  • 3
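One common pattern (not the only one, and hypothetical in its details): take BERT's [CLS] vector, concatenate the extra hand-crafted features, and feed the result into a small classification head trained on top, optionally fine-tuning BERT at the same time. A sketch in PyTorch:

```python
import torch
import torch.nn as nn
from transformers import BertModel

class BertWithExtraFeatures(nn.Module):
    def __init__(self, n_extra, n_labels):
        super().__init__()
        self.bert = BertModel.from_pretrained("bert-base-uncased")
        hidden = self.bert.config.hidden_size
        self.classifier = nn.Sequential(
            nn.Linear(hidden + n_extra, 256),
            nn.ReLU(),
            nn.Linear(256, n_labels),
        )

    def forward(self, input_ids, attention_mask, extra_features):
        out = self.bert(input_ids=input_ids, attention_mask=attention_mask)
        cls = out.last_hidden_state[:, 0, :]                      # [CLS] vector
        return self.classifier(torch.cat([cls, extra_features], dim=-1))
```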