In the paper "Bag of Tricks for Efficient Text Classification" the authors talk about creating word n-gram features, and in their experiments they report results for both unigrams and bigrams.
As far as I understand FastText, it is simply word embedding based on character n-grams instead of whole words, as in e.g. word2vec. And as far as I understand the paper, they simply represent each document as the average of its embedded words and use that as the feature vector for a logistic regression classifier.
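Concretely, this is what I picture the averaging step looking like (just a rough sketch with made-up toy vectors to show my understanding, not the actual FastText code):

    import numpy as np

    embedding_dim = 100
    # toy word embeddings (random, just for illustration)
    word_vectors = {
        w: np.random.randn(embedding_dim)
        for w in ["the", "movie", "was", "great"]
    }

    def document_vector(tokens):
        """Represent a document as the average of its word embeddings."""
        vectors = [word_vectors[t] for t in tokens if t in word_vectors]
        return np.mean(vectors, axis=0)

    features = document_vector(["the", "movie", "was", "great"])  # shape: (embedding_dim,)
    # `features` would then be the input to a logistic regression classifier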
My question is: where do the word n-grams enter, and how are they created and used? How would you build n-gram features out of word embeddings when each sentence is averaged into a single vector, or am I misunderstanding something?
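To make my confusion concrete, the only way I can imagine it working is something like the sketch below, where each bigram also gets its own embedding (e.g. by hashing it into a fixed number of buckets) and is averaged in together with the unigrams. This is purely my guess; the hashing scheme and the bucket count are made up, not something I found in the paper.

    import numpy as np

    embedding_dim = 100
    n_buckets = 10_000  # made-up, tiny bucket count just for the toy example

    # toy unigram embeddings (random, just for illustration)
    word_vectors = {
        w: np.random.randn(embedding_dim)
        for w in ["the", "movie", "was", "great"]
    }
    # hypothetical table of bigram embeddings, indexed by a hash bucket
    bigram_vectors = np.random.randn(n_buckets, embedding_dim)

    def bigram_vector(w1, w2):
        # my guess: hash the bigram into a bucket and look up its embedding
        return bigram_vectors[hash((w1, w2)) % n_buckets]

    def document_vector(tokens):
        """Average the unigram embeddings together with the bigram embeddings."""
        vectors = [word_vectors[t] for t in tokens if t in word_vectors]
        vectors += [bigram_vector(a, b) for a, b in zip(tokens, tokens[1:])]
        return np.mean(vectors, axis=0)

    features = document_vector(["the", "movie", "was", "great"])

Is it something along these lines, or is it done completely differently?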