
I know that TF-IDF is an NLP method for feature extraction.

I also know that there are libraries that calculate TF-IDF directly from raw text.

This is not what I want, though.

In my case, my text dataset has already been converted into a bag of words.

The original dataset, which I do NOT have access to, looks like this:

RepID     RepText
------------------
1         Doctor says patient has diabetes and needs rest for ...
2         Patient's history: broken arm, and ...
3         A dose of Metformin 2 times a day ...
4         Xray needed for the chest...
5         Covid-19 expectation and patient should have a rest ...

But my dataset looks like this:

RepID   Word         BOW
-------------------------
1       Doctor       3
1       diabetes     4
1       patient      1
.       .            .
.       .            .
2       patient      2
2       arm          7
.       .            .
.       .            .
5684    cough        9
5684    Xray         3
5684    Covid        5
.       .            .
.       .            .

What I want is to find the TF-IDF for each word in my dataset.

I was thinking of converting my dataset back into an unstructured format,

so it would look like this:

RepID     RepText
------------------
1         Doctor Doctor Doctor diabetes diabetes diabetes diabetes ...
2         Patients patients arm arm arm arm arm arm arm ...
.
.
5684      cough cough cough cough cough cough cough cough cough Xray Xray

so that each word is repeated as many times as its BOW count.

But I do not think this is the best approach, as it converts a structured dataset back into an unstructured one.
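(For reference, that repetition step could be done with pandas along these lines; this is just a sketch, with the column names assumed from the table above.)

import pandas as pd

# hypothetical long-format data, mirroring the structure shown above
df = pd.DataFrame({
    'RepID': [1, 1, 1, 2],
    'Word': ['Doctor', 'diabetes', 'patient', 'arm'],
    'BOW': [3, 4, 1, 7],
})

# repeat each word BOW times, then join all words belonging to the same report
rep_text = (
    df.assign(RepText=[' '.join([w] * n) for w, n in zip(df['Word'], df['BOW'])])
      .groupby('RepID')['RepText']
      .apply(' '.join)
)
print(rep_text)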

How can I find the TF-IDF from this structured dataset? Is there a library or algorithm for that?

Note:

The dataset is stored in MS SQL Server, and I am using Python.

asmgx
  • You could also calculate the TF and IDF values directly from the data but it's probably a bit more work than the proposed answer: (1) collect all the unique words and for each word store in a map in how many documents they appear (that's the doc frequency DF), (2) for each doc create a vocabulary-length vector where each position represents a word and the value = TF * log(1/DF), where TF is the count of this word divided by the total count in the document. – Erwan May 22 '21 at 21:54
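A minimal sketch of that manual calculation, assuming the same long-format layout as above and the conventional idf = log(N / df) (sklearn's TfidfTransformer uses a smoothed variant, so its values will differ slightly):

import numpy as np
import pandas as pd

# hypothetical long-format data: one row per (report, word) with its count
df = pd.DataFrame({
    'RepID': [1, 1, 1, 2, 2, 5684, 5684, 5684],
    'Word': ['Doctor', 'diabetes', 'patient', 'patient', 'arm', 'cough', 'Xray', 'Covid'],
    'BOW': [3, 4, 1, 2, 7, 9, 3, 5],
})

n_docs = df['RepID'].nunique()

# TF: count of the word divided by the total word count of the document
df['TF'] = df['BOW'] / df.groupby('RepID')['BOW'].transform('sum')

# DF: in how many documents each word appears
doc_freq = df.groupby('Word')['RepID'].nunique()

# IDF and the final TF-IDF per (RepID, Word) row
df['IDF'] = df['Word'].map(np.log(n_docs / doc_freq))
df['TFIDF'] = df['TF'] * df['IDF']
print(df[['RepID', 'Word', 'TFIDF']])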

1 Answer


You could use pandas pivot_table() to transform your data frame into a count matrix, and then apply sklearn TfidfTransformer() to the count matrix in order to obtain the tf-idfs.

import pandas as pd
from sklearn.feature_extraction.text import TfidfTransformer

# input data
df = pd.DataFrame({
    'RepID': [1, 1, 1, 2, 2, 5684, 5684, 5684],
    'Word': ['Doctor', 'diabetes', 'patient', 'patient', 'arm', 'cough', 'Xray', 'Covid'],
    'BOW': [3, 4, 1, 2, 7, 9, 3, 5]
})

# count matrix
df = pd.pivot_table(df, index='RepID', columns='Word', values='BOW', aggfunc='sum')
df = df.fillna(value=0)
print(df)
# Word   Covid  Doctor  Xray  arm  cough  diabetes  patient
# RepID
# 1        0.0     3.0   0.0  0.0    0.0       4.0      1.0
# 2        0.0     0.0   0.0  7.0    0.0       0.0      2.0
# 5684     5.0     0.0   3.0  0.0    9.0       0.0      0.0

# fit the tf-idf transformer on the count matrix
X = TfidfTransformer().fit(df.values)
print(X.idf_)
# [1.69314718 1.69314718 1.69314718 1.69314718 1.69314718 1.69314718 1.28768207]
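
To get the tf-idf values themselves (the snippet above only prints the fitted idf vector), you could additionally transform the count matrix and stack it back into the long RepID / Word format. Continuing with df and X from above, a sketch:

# apply the fitted transformer to obtain the tf-idf matrix
tfidf = pd.DataFrame(
    X.transform(df.values).toarray(),
    index=df.index,
    columns=df.columns,
)

# back to one row per (RepID, Word), dropping words absent from a report
tfidf_long = tfidf.stack().rename('TFIDF').reset_index()
tfidf_long = tfidf_long[tfidf_long['TFIDF'] > 0]
print(tfidf_long)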