
I am applying tokenization to each row in my dataframe, but tokenization is only being done for the first row. Can someone please help me? Thank you. Below is my code:


import pandas as pd
import json
import nltk

nltk.download('punkt')
nltk.download('wordnet')
from nltk import sent_tokenize, word_tokenize


with open(r"C:\Users\User\Desktop\Coding\results.json" , encoding="utf8") as f:
     data = json.load(f)
df=pd.DataFrame(data['part'][0]['comment'])
split_data = df["comment"].str.split(" ")
data = split_data

print(data)

def tokenization_s(data):  # the same can be done for word tokens
    s_new = []
    for sent in (data[:][0]):  # for NumPy: sentences[:]
        s_token = sent_tokenize(sent)
        if s_token != '':
            s_new.append(s_token)
    return s_new

print(tokenization_s(data))

My output is:

[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\User\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package wordnet to
[nltk_data]     C:\Users\User\AppData\Roaming\nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
0                             [enjoy, a, lovely, moment]
1      [I, was, there, for, my, honeymoon., The, hote...
2      [Had, an, amazing, stay, for, 2, nights.\nThe,...
3                 [Had, a, good, time., Food, is, good.]
4      [A, highly, recommendable, hotel., Value, for,...
                             ...                        
131    [Wonderful, experience,, a, quite, different, ...
132                            [Was, a, paradise, stay.]
133    [It, was, really, a, place, to, be, for, relax...
134    [It, was, just, perfect, with, an, excellent, ...
135                               [It's, was, excellent]
Name: comment, Length: 136, dtype: object
[['enjoy'], ['a'], ['lovely'], ['moment']]

Process finished with exit code 0

What should I do to get the system to tokenize each row in the dataframe?


1 Answer


Your loop only processes the first row because data[:][0] selects element 0 of the Series, so you end up iterating over the words of the first comment rather than over the rows. (Splitting with str.split(" ") beforehand also means each row is already a list of words, not a sentence string.) To tokenize every row, apply the tokenizer row by row instead. You can try this:

import pandas as pd
import nltk

nltk.download('punkt')  # word_tokenize needs the punkt models

df = pd.DataFrame({'frases': [
    'Do not let the day end without having grown a little,',
    'without having been happy, without having increased your dreams',
    'Do not let yourself be overcomed by discouragement.',
    'We are passion-full beings.'
]})

# tokenize every row of the column, not just the first one
df['tokenized'] = df.apply(lambda row: nltk.word_tokenize(row['frases']), axis=1)
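Applied to your own dataframe, the same pattern would look something like the sketch below. It assumes df['comment'] still holds the raw comment strings, i.e. you skip the str.split(" ") step, since sent_tokenize expects a full string rather than a list of words:

from nltk import sent_tokenize, word_tokenize

# sentence tokens per row (replaces the tokenization_s loop)
df['sentences'] = df['comment'].apply(sent_tokenize)

# word tokens per row
df['words'] = df['comment'].apply(word_tokenize)

print(df[['sentences', 'words']].head())

Note that df['frases'].apply(nltk.word_tokenize) is equivalent to the axis=1 lambda above and avoids building a row object for each call.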