The N-gram is a fundamental concept in NLP and is used in many language models. In simple terms, an N-gram is nothing but a sequence of N words - e.g. "San Francisco" is a 2-gram (bigram) and "The Three Musketeers" is a 3-gram (trigram).
N-grams are very useful because they can be used for next-word prediction and for correcting spelling or grammar. Ever wondered how Gmail is able to suggest auto-completions of sentences? This is possible because Google has built a language model that can predict the next word.
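Here is a minimal sketch of how next-word prediction can work with a bigram model. The toy corpus, the `predict_next` helper, and its name are my own illustration, not any particular production system:

```python
from collections import Counter, defaultdict

# Toy corpus; a real model would be trained on a very large text collection.
corpus = "I love coffee . I love tea . I drink coffee every morning .".split()

# For each word, count how often each other word follows it.
next_word_counts = defaultdict(Counter)
for w1, w2 in zip(corpus, corpus[1:]):
    next_word_counts[w1][w2] += 1

def predict_next(word, k=3):
    """Return the k most frequent words seen after `word` in the corpus."""
    return [w for w, _ in next_word_counts[word].most_common(k)]

print(predict_next("I"))      # ['love', 'drink'] - 'love' follows 'I' twice
print(predict_next("drink"))  # ['coffee']
```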
N-grams are also used for correcting spelling errors - e.g. "drink cofee" could be corrected to "drink coffee" because the language model knows that 'drink' and 'coffee' occur together with high probability. Additionally, the 'edit distance' between 'cofee' and 'coffee' is 1 (a single missing letter), which suggests that 'cofee' is a typo.
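For the curious, edit distance here means Levenshtein distance: the minimum number of single-character insertions, deletions, and substitutions needed to turn one string into another. A standard dynamic-programming implementation (my own sketch, not from any library) looks like this:

```python
def edit_distance(a, b):
    """Levenshtein distance between strings a and b."""
    prev = list(range(len(b) + 1))  # distances from "" to prefixes of b
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution (or match)
        prev = curr
    return prev[-1]

print(edit_distance("cofee", "coffee"))  # 1 - insert the missing 'f'
```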
Thus N-grams are used to create probabilistic language models called n-gram models. An n-gram model predicts the occurrence of a word based only on its previous N - 1 words, i.e. it approximates P(word | full history) by P(word | previous N - 1 words).
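The simplest way to estimate these probabilities is by counting. For a bigram model (N = 2), the maximum-likelihood estimate is P(w2 | w1) = count(w1, w2) / count(w1). A sketch with a toy corpus of my own (real models also need smoothing for unseen pairs, which is omitted here):

```python
from collections import Counter

corpus = "I drink coffee . you drink tea . we drink coffee .".split()

unigrams = Counter(corpus)
bigrams = Counter(zip(corpus, corpus[1:]))

def bigram_prob(w1, w2):
    """MLE of P(w2 | w1) = count(w1, w2) / count(w1).
    Note: raises ZeroDivisionError for unseen w1; smoothing is omitted."""
    return bigrams[(w1, w2)] / unigrams[w1]

print(bigram_prob("drink", "coffee"))  # 2/3: 'drink' appears 3 times, followed by 'coffee' twice
```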
The choice of N depends on the type of analysis we want to do - e.g. research has shown that trigrams and 4-grams tend to work best for spam filtering.