>>> from sklearn.feature_extraction.text import CountVectorizer
>>> import numpy as np
>>> import pandas as pd

>>> vectorizer = CountVectorizer()
>>> corpus1 = ['abc-@@-123','cde-@@-true','jhg-@@-hud']
>>> xtrain = vectorizer.fit_transform(corpus1)
>>> xtrain
<3x6 sparse matrix of type '<class 'numpy.int64'>'
    with 6 stored elements in Compressed Sparse Row format>

>>> xtraindf = pd.DataFrame(xtrain.toarray())
>>> xtraindf.columns = vectorizer.get_feature_names()
>>> xtraindf.columns
Index(['123', 'abc', 'cde', 'hud', 'jhg', 'true'], dtype='object')

I see that the special characters (-@@-) are omitted and that "abc" and "123" are treated as separate tokens. But I want "abc-@@-123" to be treated as a single word. Is it possible to achieve this? If yes, how?

Any help would be much appreciated.

1 Answer

It's possible if you define CountVectorizer's token_pattern argument.

If you're new to regular expressions, the documentation for Python's re module explains how Python handles them (scikit-learn uses re under the hood), and I recommend an online regex tester, which gives you immediate feedback on whether your pattern captures precisely what you want.

token_pattern expects a regular expression that defines what you want the vectorizer to consider a word. A pattern matching the strings in your example, modified from token_pattern's default regular expression ((?u)\b\w\w+\b), would be:

(?u)\b\w\w+\-\@\@\-\w+\b
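
Before plugging it into the vectorizer, you can sanity-check the pattern with the re module directly. A quick sketch ('plain' is just a dummy token that should not match):

import re

# 'abc-@@-123' contains -@@- and matches; 'plain' does not, so it is skipped
print(re.findall(r'(?u)\b\w\w+\-\@\@\-\w+\b', 'abc-@@-123 plain'))
# ['abc-@@-123']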

Applied to your example, you would do this:

vectorizer = CountVectorizer(token_pattern=r'(?u)\b\w\w+\-\@\@\-\w+\b')
corpus1 = ['abc-@@-123','cde-@@-true','jhg-@@-hud']
xtrain = vectorizer.fit_transform(corpus1)
xtraindf = pd.DataFrame(xtrain.toarray())
xtraindf.columns = vectorizer.get_feature_names()

Which returns

Index(['abc-@@-123', 'cde-@@-true', 'jhg-@@-hud'], dtype='object')
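
For completeness, the counts themselves come out as one matched token per document. (Note that on scikit-learn 1.0 and later, get_feature_names() is deprecated, and removed in 1.2, in favor of get_feature_names_out().)

print(xtraindf)
#    abc-@@-123  cde-@@-true  jhg-@@-hud
# 0           1            0           0
# 1           0            1           0
# 2           0            0           1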

An important note here is that this pattern only matches tokens that contain -@@-; any other token is silently dropped. For instance:

corpus2 = ['abc-@@-123','cde-@@-true','jhg-@@-hud', 'unexpected']
xtrain = vectorizer.fit_transform(corpus2)
xtraindf = pd.DataFrame(xtrain.toarray())
xtraindf.columns = vectorizer.get_feature_names()
print(xtraindf.columns)

Would give you

Index(['abc-@@-123', 'cde-@@-true', 'jhg-@@-hud'], dtype='object')
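
Because non-matching tokens are dropped silently, a document made up entirely of such tokens becomes an all-zero row. A quick way to spot that with the matrix fitted above:

# Sum the token counts per document; the 4th document ('unexpected')
# matched nothing, so its row is all zeros
print(xtrain.toarray().sum(axis=1))
# [1 1 1 0]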

If you also need to match words that don't have that exact special-character structure, you can wrap the string of special characters in a non-capturing group ((?:...)) and make it optional with ?. One subtlety: since \w\w+ and \w+ must both match, plain words now need at least three characters, whereas the default pattern accepts two:

more_robust_vec = CountVectorizer(token_pattern=r'(?u)\b\w\w+(?:\-\@\@\-)?\w+\b')
xtrain = more_robust_vec.fit_transform(corpus2)
xtraindf = pd.DataFrame(xtrain.toarray())
xtraindf.columns = more_robust_vec.get_feature_names()
print(xtraindf.columns)

Which prints

Index(['abc-@@-123', 'cde-@@-true', 'jhg-@@-hud', 'unexpected'], dtype='object')
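
As an aside (not something your example strictly needs): if your tokens are always whitespace-delimited, you can sidestep token_pattern entirely by passing a custom tokenizer, which keeps every whitespace-separated token verbatim:

# Alternative sketch: tokenize by whitespace instead of by regex.
# token_pattern is ignored whenever a tokenizer is supplied.
split_vec = CountVectorizer(tokenizer=str.split)
xtrain = split_vec.fit_transform(corpus2)
print(split_vec.get_feature_names())
# ['abc-@@-123', 'cde-@@-true', 'jhg-@@-hud', 'unexpected']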

I hope this helps!
