BERT stands for Bidirectional Encoder Representations from Transformers and is designed to pre-train deep bidirectional representations by jointly conditioning on both left and right context in all layers.
Questions tagged [bert]
354 questions
67 votes · 4 answers
What is purpose of the [CLS] token and why is its encoding output important?
I am reading this article on how to use BERT by Jay Alammar and I understand things up until:
For sentence classification, we’re only interested in BERT’s output for the [CLS] token, so we select that slice of the cube and discard everything…
user3768495
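A minimal sketch of the slice in question, assuming the Hugging Face transformers and torch packages (the example sentence is illustrative):

import torch
from transformers import BertTokenizer, BertModel

tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
model = BertModel.from_pretrained('bert-base-uncased')

inputs = tokenizer('a visually stunning film', return_tensors='pt')
with torch.no_grad():
    outputs = model(**inputs)

# last_hidden_state has shape (batch, seq_len, hidden); [CLS] sits at position 0,
# so for classification we keep only that slice and discard the rest.
cls_embedding = outputs.last_hidden_state[:, 0, :]   # shape (1, 768)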
42 votes · 2 answers
What is GELU activation?
I was going through the BERT paper, which uses the GELU (Gaussian Error Linear Unit) activation, stated as
$$\mathrm{GELU}(x) = x\,P(X \le x) = x\,\Phi(x),$$
which in turn is approximated by
$$0.5x\left(1 + \tanh\left[\sqrt{2/\pi}\,(x + 0.044715x^3)\right]\right).$$
Could you simplify the equation…
thanatoz
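Both forms of the activation are easy to check numerically; a small sketch (function names are illustrative):

import math

def gelu_exact(x):
    # GELU(x) = x * Phi(x), with Phi the standard normal CDF written via erf
    return x * 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def gelu_tanh(x):
    # the tanh approximation quoted in the question
    return 0.5 * x * (1.0 + math.tanh(math.sqrt(2.0 / math.pi) * (x + 0.044715 * x ** 3)))

for x in (-2.0, -0.5, 0.0, 0.5, 2.0):
    print(f'{x:+.1f}  exact={gelu_exact(x):+.6f}  approx={gelu_tanh(x):+.6f}')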
37 votes · 7 answers
How to get sentence embedding using BERT?
from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
sentence = 'I really enjoyed this movie a lot.'
# 1. Tokenize the…
star
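One common completion of the snippet above is to mean-pool the token embeddings, masking out padding; a sketch assuming transformers and torch (the [CLS] vector or the pooler output are alternatives):

import torch
from transformers import BertTokenizer, BertModel

tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
model = BertModel.from_pretrained('bert-base-uncased')

sentence = 'I really enjoyed this movie a lot.'
inputs = tokenizer(sentence, return_tensors='pt')
with torch.no_grad():
    hidden = model(**inputs).last_hidden_state      # (1, seq_len, 768)

# average the token vectors, ignoring padding positions
mask = inputs['attention_mask'].unsqueeze(-1)       # (1, seq_len, 1)
sentence_embedding = (hidden * mask).sum(1) / mask.sum(1)   # (1, 768)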
27 votes · 5 answers
BERT vs Word2Vec: Is BERT disambiguating the meaning of the word vector?
Word2vec:
Word2vec provides a vector for each token/word, and those vectors encode the meaning of the word. Although those vectors are not human-interpretable, the meanings of the vectors are understandable/interpretable by comparing them with other…
sovon
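The disambiguation is easy to observe directly: the same surface word gets different BERT vectors in different contexts. A sketch, assuming transformers and torch:

import torch
from transformers import BertTokenizer, BertModel

tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
model = BertModel.from_pretrained('bert-base-uncased')

def bank_vector(sentence):
    inputs = tokenizer(sentence, return_tensors='pt')
    tokens = tokenizer.convert_ids_to_tokens(inputs['input_ids'][0])
    with torch.no_grad():
        hidden = model(**inputs).last_hidden_state[0]
    return hidden[tokens.index('bank')]

v1 = bank_vector('He deposited cash at the bank.')
v2 = bank_vector('They had a picnic on the river bank.')
# well below 1.0, whereas word2vec would give "bank" a single fixed vector
print(torch.cosine_similarity(v1, v2, dim=0))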
23 votes · 7 answers
Why is the decoder not a part of the BERT architecture?
I can't see how BERT makes predictions without using a decoder unit, which was a part of all models before it, including transformers and standard RNNs. How are output predictions made in the BERT architecture without a decoder? How does it do…
Hrishikesh Athalye
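The gist is visible in code: BERT's masked-LM head is just a feed-forward layer over the encoder output, so predictions never pass through a decoder stack. A sketch, assuming transformers and torch:

import torch
from transformers import BertTokenizer, BertForMaskedLM

tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
model = BertForMaskedLM.from_pretrained('bert-base-uncased')

inputs = tokenizer('The capital of France is [MASK].', return_tensors='pt')
with torch.no_grad():
    logits = model(**inputs).logits                  # (1, seq_len, vocab_size)

# read the prediction straight off the masked position's encoder output
mask_pos = (inputs['input_ids'][0] == tokenizer.mask_token_id).nonzero()[0]
print(tokenizer.decode(logits[0, mask_pos].argmax(-1)))   # likely "paris"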
20 votes · 1 answer
Can BERT do the next-word-prediction task?
As BERT is bidirectional (it uses a bidirectional transformer), is it possible to use it for the next-word-prediction task? If yes, what needs to be tweaked?
DunkOnly
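One common workaround, sketched below under the same transformers/torch assumptions: append a [MASK] to the prefix and let the masked-LM head fill it in. Note this is not a true left-to-right language model, since BERT also sees the tokens to the right of the mask (here only [SEP]):

import torch
from transformers import BertTokenizer, BertForMaskedLM

tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
model = BertForMaskedLM.from_pretrained('bert-base-uncased')

inputs = tokenizer('I went to the ' + tokenizer.mask_token, return_tensors='pt')
with torch.no_grad():
    logits = model(**inputs).logits

mask_pos = (inputs['input_ids'][0] == tokenizer.mask_token_id).nonzero()[0]
top5 = logits[0, mask_pos].topk(5).indices[0]
print(tokenizer.convert_ids_to_tokens(top5.tolist()))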
15 votes · 2 answers
What is the use of [SEP] in the BERT paper?
I know that [CLS] marks the start of a sentence and [SEP] lets BERT know that the second sentence has begun.
However, I have a question.
If I have two sentences, s1 and s2, and our fine-tuning task is the same.
In one way, I add special tokens…
xiangqing shen
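For concreteness, this is what the tokenizer produces for a pair (a sketch; s1 and s2 stand in for the asker's sentences):

from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
s1, s2 = 'How old are you?', 'I am twenty.'

enc = tokenizer(s1, s2)
print(tokenizer.convert_ids_to_tokens(enc['input_ids']))
# ['[CLS]', 'how', 'old', 'are', 'you', '?', '[SEP]', 'i', 'am', 'twenty', '.', '[SEP]']
# the segment (token_type) ids flip from 0 to 1 after the first [SEP]:
print(enc['token_type_ids'])   # [0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1]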
14 votes · 2 answers
Preprocessing for Text Classification in Transformer Models (BERT variants)
This might be a silly question, but I am wondering whether one should carry out the conventional text-preprocessing steps when training one of the transformer models.
I remember that for training Word2Vec or GloVe, we needed to perform extensive text cleaning…
TwinPenguins
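A quick way to see why heavy cleaning is usually unnecessary: the WordPiece tokenizer already lower-cases (for the uncased model), splits off punctuation, and decomposes rare words into subwords, and BERT was pre-trained on text handled the same way. A sketch:

from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
print(tokenizer.tokenize("Don't remove stopwords; BERT uses them for context!"))
# roughly: ['don', "'", 't', 'remove', 'stop', '##words', ';', 'bert', 'uses', ...]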
12 votes · 2 answers
What are good parameter ranges for BERT hyperparameters when fine-tuning on a very small dataset?
I need to fine-tune a BERT model (from the Hugging Face repository) on a sentence-classification task. However, my dataset is really small: I have 12K sentences and only 10% of them are from the positive class. Does anyone here have any experience on…
zwlayer
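For reference, the BERT paper's fine-tuning appendix suggests grid-searching a small set of values; expressed as a Hugging Face TrainingArguments sketch (output_dir and the regularization values are illustrative):

from transformers import TrainingArguments

args = TrainingArguments(
    output_dir='out',                 # illustrative
    per_device_train_batch_size=16,   # paper tries 16 or 32
    learning_rate=2e-5,               # paper tries 5e-5, 3e-5, 2e-5
    num_train_epochs=3,               # paper tries 2, 3, or 4
    weight_decay=0.01,
    warmup_ratio=0.1,
)

With only ~1.2K positive examples, it may also help to weight the loss or oversample the positive class, and to evaluate with F1 rather than accuracy.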
12 votes · 1 answer
What is whole word masking in the recent BERT model?
I was checking the BERT GitHub page and noticed that there are new models built with a new training technique called "whole word masking". Here is a snippet describing it:
In the original pre-processing code, we randomly select WordPiece tokens to…
kee
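The idea is small enough to sketch in a few lines: when any WordPiece of a word is selected, every piece of that word is masked together (illustrative toy code, not the official implementation):

import random

def whole_word_mask(tokens, p=0.15):
    # group WordPiece indices belonging to the same whole word ('##' = continuation)
    words = []
    for i, t in enumerate(tokens):
        if t.startswith('##') and words:
            words[-1].append(i)
        else:
            words.append([i])
    out = list(tokens)
    for word in words:
        if random.random() < p:
            for i in word:            # mask all pieces of the word at once
                out[i] = '[MASK]'
    return out

print(whole_word_mask(['the', 'phil', '##ammon', 'sang'], p=0.5))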
12 votes · 1 answer
What is the first input to the decoder in a transformer model?
The image is from Jay Alammar's post on transformers.
K_encdec and V_encdec are calculated by matrix multiplication with the encoder outputs and sent to the encoder-decoder attention layer of each decoder layer.
The previous output is…
mLstudent33
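In short: at training time the decoder's first input is a start-of-sequence token, with the target sequence shifted right behind it (teacher forcing); at inference only the start token is fed, and each prediction is appended and fed back. A toy sketch:

bos, eos = '<s>', '</s>'
target = ['ich', 'bin', 'müde']

decoder_input = [bos] + target      # what the decoder sees at each step
decoder_labels = target + [eos]     # what it is trained to predict

for step, (inp, lab) in enumerate(zip(decoder_input, decoder_labels)):
    print(f'step {step}: input={inp!r} -> predict {lab!r}')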
10 votes · 2 answers
Does BERT have any advantage over GPT-3?
I have read a couple of documents that explain in detail the greater edge that GPT-3 (Generative Pre-trained Transformer 3) has over BERT (Bidirectional Encoder Representations from Transformers). So I am curious to know whether BERT scores better…
Bipin
10 votes · 2 answers
Why should I understand AI architectures?
Why should I understand what is happening deep down in some AI architecture?
For example, LSTM, BERT, partial convolution... architectures like these. Why should I understand what is going on when I can find any model on the Internet, or any implementations…
CanP
9 votes · 2 answers
Is BERT a language model?
Is BERT a language model in the sense of a function that takes a sentence and returns a probability?
I know its main usage is sentence embedding, but can it also provide this functionality?
Amit Keinan
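BERT is not a left-to-right language model, but a pseudo-log-likelihood can be computed by masking each position in turn and summing the log-probabilities of the true tokens. A sketch, assuming transformers and torch (this yields a score, not a normalized probability):

import torch
from transformers import BertTokenizer, BertForMaskedLM

tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
model = BertForMaskedLM.from_pretrained('bert-base-uncased')

def pseudo_log_likelihood(sentence):
    ids = tokenizer(sentence, return_tensors='pt')['input_ids'][0]
    total = 0.0
    for i in range(1, len(ids) - 1):        # skip [CLS] and [SEP]
        masked = ids.clone()
        masked[i] = tokenizer.mask_token_id
        with torch.no_grad():
            logits = model(masked.unsqueeze(0)).logits[0, i]
        total += torch.log_softmax(logits, -1)[ids[i]].item()
    return total

print(pseudo_log_likelihood('The cat sat on the mat.'))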
9 votes · 1 answer
BERT fine-tuning with additional features
I want to use BERT for an NLP task, but I also have additional features that I would like to include.
From what I have seen, with fine-tuning one only changes the labels and retrains the classification layer.
Is there a way to use pre-trained…
Jeff
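One common pattern, sketched under the usual transformers/torch assumptions (num_extra and the layer sizes are illustrative, not from the question): run BERT as usual, concatenate the [CLS] embedding with the extra features, and train a classifier head on top:

import torch
import torch.nn as nn
from transformers import BertModel

class BertWithFeatures(nn.Module):
    def __init__(self, num_extra, num_labels):
        super().__init__()
        self.bert = BertModel.from_pretrained('bert-base-uncased')
        self.classifier = nn.Linear(768 + num_extra, num_labels)

    def forward(self, input_ids, attention_mask, extra_features):
        # [CLS] embedding concatenated with the hand-crafted features
        cls = self.bert(input_ids, attention_mask=attention_mask).last_hidden_state[:, 0]
        return self.classifier(torch.cat([cls, extra_features], dim=-1))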