A Quick Introduction to NLTK

Natural Language Toolkit (NLTK)

An intro to a few NLTK concepts:

  1. Tokenization
  2. Stemming & Lemmatization
  3. POS Tagging
  4. Stop Words

Tokenization

Tokenization splits text into words or sentences.

import nltk

# the tokenizers need the punkt models: nltk.download('punkt')
text = "NLTK makes tokenization easy. Let's try it." # any sample text

nltk.word_tokenize(text) # splits the text into words
nltk.sent_tokenize(text) # splits the text into sentences

The word tokenizer uses the default Treebank tokenizer.
Other tokenizers include the whitespace-based and punctuation-based tokenizers, the Tweet tokenizer and the MWE (multi-word expression) tokenizer; a quick comparison is sketched below.
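
A rough sketch of how a few of these differ (the sample tweet here is made up for illustration):

from nltk.tokenize import WhitespaceTokenizer, WordPunctTokenizer, TweetTokenizer

tweet = "Don't miss #NLPTutorial tomorrow :)" # made-up sample text

WhitespaceTokenizer().tokenize(tweet) # whitespace only: ["Don't", 'miss', '#NLPTutorial', 'tomorrow', ':)']
WordPunctTokenizer().tokenize(tweet)  # splits on punctuation: ['Don', "'", 't', 'miss', '#', 'NLPTutorial', 'tomorrow', ':)']
TweetTokenizer().tokenize(tweet)      # keeps hashtags and emoticons intact: ["Don't", 'miss', '#NLPTutorial', 'tomorrow', ':)']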

Frequency of Words

The FreqDist class from nltk lets us look at the frequency of words in a tokenized text.

import nltk
from nltk.probability import FreqDist

tokenized_words = nltk.word_tokenize(text) # split the text into words

freq = FreqDist(tokenized_words)
freq.most_common(5) # the 5 most common words as (word, count) pairs
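
FreqDist can also draw a quick frequency plot (this needs matplotlib installed):

freq.plot(5) # plot the counts of the 5 most frequent words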

Stemming & Lemmatization

Stemming - Suffix Stripping

Advantages: generalises words and reduces vocabulary size.
Disadvantages: too simplistic, loses information and may become obsolete as language models improve.

A few well-known stemmers are:

  1. PorterStemmer
  2. LancasterStemmer
  3. SnowballStemmer

import nltk
from nltk.stem import PorterStemmer, LancasterStemmer, SnowballStemmer

porter = PorterStemmer()
lancaster = LancasterStemmer() # note the parentheses: we need an instance
snowball = SnowballStemmer('english') # using English as the language

word = 'demonstration'

print(porter.stem(word))
# demonstr

print(lancaster.stem(word))
# demonst

print(snowball.stem(word))
# demonstr

Lemmatization - reducing a word to its root form (the lemma)

The WordNet lemmatizer uses the word's POS tag to find its lemma; by default it treats every word as a noun.

import nltk
from nltk.stem import WordNetLemmatizer

lemmatizer = WordNetLemmatizer() # needs the WordNet data: nltk.download('wordnet')

lemmatizer.lemmatize("better", pos="a") # pos="a" marks the word as an adjective
# 'good'
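
Passing the POS tag matters; without it the lemmatizer assumes a noun and often returns the word unchanged:

lemmatizer.lemmatize("better") # no POS given, treated as a noun
# 'better'

lemmatizer.lemmatize("running", pos="v") # "v" marks it as a verb
# 'run'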

POS Tagging

Categorising words into nouns, verbs, adjectives, etc.

To find the POS tags of the words in a text:

import nltk

text = "Some sample text goes here. It has two sentences." # any text works

tokenized_sents = nltk.tokenize.sent_tokenize(text)
tokenized_words = nltk.word_tokenize(tokenized_sents[0]) # tag the first sentence

nltk.pos_tag(tokenized_words) # a list of (word, tag) pairs; needs nltk.download('averaged_perceptron_tagger')

Training our own POS taggers:

Unigram - tags each word based on the word alone
Bigram - also uses the previous word's tag as context
Trigram - also uses the previous two tags as context

from nltk.tag import UnigramTagger
from nltk.tag import BigramTagger
from nltk.tag import TrigramTagger

from nltk.corpus import brown # data from the nltk corpus

brown_data = brown.tagged_sents(categories='mystery') # pre-tagged sentences; needs nltk.download('brown')

# data is generally split into train and test sets
split = int(len(brown_data) * 0.8)
train_data, test_data = brown_data[:split], brown_data[split:]

unigram_tagger = UnigramTagger(train_data)
bigram_tagger = BigramTagger(train_data)
trigram_tagger = TrigramTagger(train_data)
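
On their own, the bigram and trigram taggers return None for any context they never saw in training, so they are usually chained with backoff taggers. A minimal sketch of that pattern, scored on the held-out test split (accuracy() is named evaluate() in older NLTK releases):

# chain the taggers: try the trigram first, back off to bigram, then unigram
bigram_tagger = BigramTagger(train_data, backoff=unigram_tagger)
trigram_tagger = TrigramTagger(train_data, backoff=bigram_tagger)

trigram_tagger.accuracy(test_data) # fraction of test tokens tagged correctly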

Stop Words

Filler words like 'me', 'my', 'we', 'our', etc., which carry little meaning and are often removed before analysis.

import nltk
from nltk.corpus import stopwords

stop_words = stopwords.words('english') # language names are lowercase; needs nltk.download('stopwords')
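
A typical use is filtering them out of tokenized text; a quick sketch using the tokenized_words from earlier (NLTK's stop word list is lowercase, hence the lower() call):

filtered_words = [w for w in tokenized_words if w.lower() not in stop_words]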

N-Grams

Sequences of n consecutive tokens: bigrams (n = 2), trigrams (n = 3) or n-grams in general.

The point is to consider each word together with its surrounding context.

from nltk import bigrams, trigrams, everygrams

list(bigrams(tokenized_words)) # pairs of consecutive tokens
list(trigrams(tokenized_words)) # triples of consecutive tokens
list(everygrams(tokenized_words, 2, 5)) # all n-grams for n = 2 to 5

A very rapid introduction to NLTK. It's not perfect, but I'll update this as I understand it better. :)