
The typical default for neural networks in natural language processing has been to take words as tokens.

OpenAI Codex is based on GPT-3, but it also deals with source code. For source code in general, there is no correspondingly obvious choice of tokens, because each programming language has its own tokenization rules. I don't get the impression that Codex uses a separate tokenizer for each language.

What does it take as tokens?

rwallace
Have you checked [the link on the website of OpenAI regarding their tokenizer](https://platform.openai.com/tokenizer)? It allows you to select their Codex model and paste in text to see how it gets tokenized. It seems to be using byte-pair encoding (BPE) to tokenize the text. – Oxbowerce Mar 04 '23 at 18:36

1 Answer


NLP neural networks don't use word tokens any more. For a while now, the norm has been subword tokens. The usual approaches for defining the subword vocabulary are byte-pair encoding (BPE), WordPiece, and unigram tokenization.
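
To make the idea concrete, here is a minimal, illustrative sketch of the BPE training loop in Python. The tiny corpus and the number of merge steps are made up purely for demonstration and have nothing to do with any real model's vocabulary.

```python
# Minimal BPE sketch: repeatedly merge the most frequent adjacent symbol pair.
from collections import Counter

def get_pair_counts(corpus):
    """Count adjacent symbol pairs, weighted by word frequency."""
    pairs = Counter()
    for symbols, freq in corpus.items():
        for a, b in zip(symbols, symbols[1:]):
            pairs[(a, b)] += freq
    return pairs

def merge_pair(corpus, pair):
    """Replace every occurrence of `pair` with the merged symbol."""
    merged = {}
    for symbols, freq in corpus.items():
        out, i = [], 0
        while i < len(symbols):
            if i < len(symbols) - 1 and (symbols[i], symbols[i + 1]) == pair:
                out.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        merged[tuple(out)] = freq
    return merged

# Toy corpus: words as tuples of characters, with made-up frequencies.
corpus = {tuple("lower"): 5, tuple("lowest"): 2,
          tuple("newer"): 6, tuple("wider"): 3}

for step in range(5):
    best = get_pair_counts(corpus).most_common(1)[0][0]
    corpus = merge_pair(corpus, best)
    print(f"merge {step + 1}: {best}")
```

Each merge adds one new symbol to the vocabulary; real vocabularies such as GPT-3's (around 50k tokens) are built the same way, just with tens of thousands of merges learned from a much larger corpus.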

GPT-3 uses BPE tokenization. According to OpenAI's tokenizer tool website:

> Codex models use a different set of encodings that handle whitespace more efficiently

From this, I understand that they use BPE but with a different vocabulary. This is supported by this JavaScript tokenizer, which was created by extracting the BPE vocabulary from OpenAI's own online tokenizer tool.
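
You can check this yourself. A sketch along the following lines should work with OpenAI's tiktoken library; note that the encoding names I use, "r50k_base" for the original GPT-3 models and "p50k_base" for the Codex models, reflect my understanding of tiktoken's naming and are worth double-checking.

```python
# Compare how the GPT-3 and Codex vocabularies tokenize indented code.
# Encoding names are my assumption: r50k_base (GPT-3), p50k_base (Codex).
import tiktoken

gpt3_enc = tiktoken.get_encoding("r50k_base")
codex_enc = tiktoken.get_encoding("p50k_base")

snippet = "def add(a, b):\n        return a + b\n"

# Decode each token id back to its string so the split is visible.
print("GPT-3 tokens:", [gpt3_enc.decode([t]) for t in gpt3_enc.encode(snippet)])
print("Codex tokens:", [codex_enc.decode([t]) for t in codex_enc.encode(snippet)])
```

On a snippet like this, the Codex encoding should collapse the run of leading spaces into far fewer tokens, which is exactly the "handle whitespace more efficiently" behaviour the quote describes.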

noe