Vectorizing Text Data
Tags: machine learning, sklearn, nlp
# import necessary modules
from IPython.core.interactiveshell import InteractiveShell
InteractiveShell.ast_node_interactivity = "all"  # display every expression's result, not just the last
%matplotlib inline
%config InlineBackend.figure_format = 'retina'
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
from evaluation import test  # local helper module
from utils import load_data  # local helper module
from sklearn.naive_bayes import MultinomialNB
message0 = 'hello world hello hello world play'
message1 = 'test test test test one hello'
# Convert a collection of text documents to a matrix of token counts
from sklearn.feature_extraction.text import CountVectorizer
vectorizer = CountVectorizer()
bow = vectorizer.fit_transform([message0, message1])
print(bow)
print(type(bow))
(0, 0) 3
(0, 4) 2
(0, 2) 1
(1, 0) 1
(1, 3) 4
(1, 1) 1
<class 'scipy.sparse.csr.csr_matrix'>
As you can see, CountVectorizer returns a sparse matrix that encodes each text by the number of times each word in the vocabulary occurs.
# invert vocabulary_ (word -> column index) to map column index -> word
vocabulary = {v: k for k, v in vectorizer.vocabulary_.items()}
# list the words in column order
[vocabulary[i] for i in sorted(vectorizer.vocabulary_.values())]
bow.toarray()
['hello', 'one', 'play', 'test', 'world']
array([[3, 0, 1, 0, 2],
       [1, 1, 0, 4, 0]])
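To make the column-to-word mapping explicit, we can label the dense array with the vocabulary. A quick sketch using the pandas import from above (the column order is derived from vectorizer.vocabulary_, the word-to-index map):
# a small sketch: view the counts as a word-labeled DataFrame
columns = sorted(vectorizer.vocabulary_, key=vectorizer.vocabulary_.get)
pd.DataFrame(bow.toarray(), columns=columns, index=['message0', 'message1'])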
Note how the encoding is stored in the sparse matrix: for message0, the non-zero entries sit at column indices [0, 4, 2] with values [3, 2, 1], i.e. three occurrences of 'hello', two of 'world', and one of 'play'.
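You can inspect this storage directly; .indices and .data are the standard column-index and value arrays of a scipy CSR matrix (a quick sketch):
# peek at the raw CSR storage for the first row
row0 = bow.getrow(0)
print(row0.indices)  # column indices of the non-zero entries
print(row0.data)     # the corresponding counts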
If we set binary=True when encoding messages, the encoder only records whether a word is present or not, ignoring the number of occurrences. For our first implementation of NaiveBayes, we will simply encode the presence of each word.
vectorizer_b = CountVectorizer(binary=True)
bow_b = vectorizer_b.fit_transform([message0, message1])
print(bow_b)
bow_b.toarray()
(0, 0) 1
(0, 4) 1
(0, 2) 1
(1, 0) 1
(1, 3) 1
(1, 1) 1
array([[1, 0, 1, 0, 1],
       [1, 1, 0, 1, 0]])
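With the messages encoded, training a classifier is a single call. Here is a minimal sketch using the MultinomialNB imported above; the labels y are made up purely for illustration (say 0 = ham, 1 = spam):
# hypothetical labels, just to show the API
y = [0, 1]
clf = MultinomialNB()
clf.fit(bow_b, y)  # train on the binary bag-of-words
clf.predict(vectorizer_b.transform(['hello test']))  # encode new text with the same fitted vectorizer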
Read more: