4

I have read that character-level models need more computational power than word-embedding models, and that this is one of the major reasons they are less effective. But I got curious: word embeddings need a huge vocabulary while character-level models need a very small one, so why is that not considered an advantage?

yashdk

1 Answer

4

You are absolutely right about vocabulary size. I am actually conducting research on making character-level models more effective.

Here is why word-level tokens are often favoured, despite characters requiring a much smaller vocabulary.

  • Bag-of-words

In a bag-of-words scenario, it is pretty obvious. First, the name. Second, if you receive a word cloud of the most common words, you may be able to tell what the document is about. If you receive a character cloud, you would likely be completely lost. So would your computer.

Character tokens contain much less information than word tokens.
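A minimal sketch of that intuition, using Python's `collections.Counter` on a made-up sentence (the example text is purely illustrative):

```python
from collections import Counter

text = "the cat sat on the mat because the cat was tired"

# Word-level "bag of words": the most frequent tokens already hint at the topic.
word_counts = Counter(text.split())
print(word_counts.most_common(3))   # e.g. [('the', 3), ('cat', 2), ...]

# Character-level "bag of characters": the most frequent tokens tell you
# almost nothing about what the document is about.
char_counts = Counter(text.replace(" ", ""))
print(char_counts.most_common(3))   # e.g. [('t', ...), ('a', ...), ...]
```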

  • Sequential approaches

Whether you use RNNs or Transformers, dealing with text as a sequence is often where the difference between words and characters is less obvious.

Words will create sequences containing larger vectors, since each vector needs to encode more information. However, it is quite rare to use a one-hot encoding approach, which would require each vector to be the size of the vocabulary (usually around 30,000). Instead, word embeddings are used, which are usually less than 1024-dimensional, often between 100 and 300. So, in practice, those vectors are not huge.
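A rough sketch of that size difference, using the typical figures quoted above (the vocabulary size, embedding dimension, and word index are illustrative assumptions, not from any specific model):

```python
import numpy as np

vocab_size = 30_000
embedding_dim = 300                 # typical dense word-embedding size

# One-hot: every token would be a 30,000-dimensional, almost entirely empty vector.
one_hot = np.zeros(vocab_size, dtype=np.float32)
one_hot[1234] = 1.0                 # hypothetical index of some word

# Dense embedding table: a (vocab_size x embedding_dim) lookup matrix.
embedding_table = np.random.randn(vocab_size, embedding_dim).astype(np.float32)
word_vector = embedding_table[1234]  # just a 300-dimensional vector per token

print(one_hot.shape, word_vector.shape)   # (30000,) (300,)
```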

Characters can require much smaller embeddings. Based on my research, embeddings can be as low as 32-dimensional. But, on average, at least for Western languages, each word contains about 7 characters, meaning that your model will need to deal with a sequence roughly 7 times longer than if you were dealing with words.

So in terms of actual size, using 32-dimensional character embeddings takes about as much "space" as using 224-dimensional word embeddings (32 × 7 = 224).
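A back-of-the-envelope check of that equivalence (the sentence length of 20 words is an arbitrary example; the 7-characters-per-word figure is the average quoted above):

```python
sentence_words = 20                 # hypothetical 20-word sentence
chars_per_word = 7                  # average quoted above

word_dim = 224
char_dim = 32

word_floats = sentence_words * word_dim                    # 20 * 224 = 4480
char_floats = sentence_words * chars_per_word * char_dim   # 140 * 32 = 4480

# Same amount of "space", but the character sequence is 7x longer.
print(word_floats == char_floats)   # True
```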

Now, where is the difference? Ultimately it comes down to this:

  • Characters allow you to have virtually no out-of-vocabulary tokens (~99.99% coverage using a vocabulary of 300 tokens), whereas 30,000 word tokens usually only cover about 50% of a language (see the sketch after this list). Preprocessing techniques (stemming, lemmatization) help reduce vocabulary size, but one can argue that they remove valuable information from the input.

  • Words are more informative than characters, and word embeddings have been shown to encode information really well. Basically, performance is often good enough, and the computational constraints are not big enough for people to worry about them.
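Here is a toy illustration of the out-of-vocabulary point from the first bullet. The vocabularies are tiny, made-up stand-ins for a 30,000-word / 300-character vocabulary, and the rare word "chaise" is just a hypothetical example:

```python
word_vocab = {"the", "cat", "sat", "on", "mat"}
char_vocab = set("abcdefghijklmnopqrstuvwxyz ")

sentence = "the cat sat on the chaise"

# Word-level: the rare word "chaise" falls out of the vocabulary and becomes <UNK>.
word_tokens = [w if w in word_vocab else "<UNK>" for w in sentence.split()]
print(word_tokens)               # ['the', 'cat', 'sat', 'on', 'the', '<UNK>']

# Character-level: every character is covered, so nothing is lost.
char_tokens = [c if c in char_vocab else "<UNK>" for c in sentence]
print("<UNK>" in char_tokens)    # False
```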

In summary, when using character-level tokens, your model has to do more work. The input itself is rawer and contains less information per token. Plus, sequential models have been shown to struggle with long-range dependencies.

Valentin Calomme