Imputing Missing Text Categories in Python Using Word Embeddings/Machine Learning/LLM

Question

I'm working on a dataset where each row represents an entity with several attributes. The dataset includes fields such as 'id', 'category_name', 'text_content', 'created_at', 'last_updated_at', 'title', 'num_highlights', 'url', and 'external_url'. I've noticed that a significant number of rows have missing values in the 'category_name' column, and my goal is to impute these missing values.

Here's a brief snapshot of the data:

	id	category_name	text_content	created_at	last_updated_at	title	num_highlights	url	external_url
0	id_1	None	Text content 1...	timestamp_1	timestamp_1	Title 1	6	URL	URL
1	id_2	None	Text content 2...	timestamp_2	timestamp_2	Title 2	4	URL	URL

Here's the approach I'm currently considering:

1. Create embeddings for all documents: I plan to represent the 'text_content' of each document as a dense vector that captures its semantic meaning.
2. Topic modeling: I plan to identify the topics in each document by prompting an OpenAI model. For each category, I'll aggregate the topics of its documents to form a topic profile for that category.
3. Assign new documents to categories: When a new document comes in, I'll use the topic model to determine its topics and pick the category with the most similar topic profile. If the assignment is not confident enough, I'll ask the user to assign the category and update the topic profiles accordingly.
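
To make the idea concrete, here is a minimal sketch of the profile-matching part of the plan: build one profile per category from the labelled rows, score each unlabelled row against every profile with cosine similarity, and fall back to manual review when the best score is too low. Everything here is an assumption for illustration only. The `embed` function is a toy bag-of-words stand-in for a real embedding model or LLM-derived topics, the helper names (`build_profiles`, `impute`) are hypothetical, the example rows are invented, and the threshold of 0.3 is arbitrary.

```python
import math
import re
from collections import Counter, defaultdict


def embed(text):
    """Toy stand-in for a real embedding model: a bag-of-words count vector."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def build_profiles(rows):
    """Sum the vectors of all labelled rows to get one profile per category."""
    profiles = defaultdict(Counter)
    for row in rows:
        if row["category_name"] is not None:
            profiles[row["category_name"]].update(embed(row["text_content"]))
    return profiles


def impute(rows, threshold=0.3):
    """Fill missing categories in place; return ids that need manual review."""
    profiles = build_profiles(rows)
    needs_review = []
    for row in rows:
        if row["category_name"] is not None:
            continue
        vec = embed(row["text_content"])
        scores = {cat: cosine(vec, prof) for cat, prof in profiles.items()}
        best = max(scores, key=scores.get, default=None)
        if best is not None and scores[best] >= threshold:
            row["category_name"] = best
        else:
            # Not confident enough: leave the category empty for a human.
            needs_review.append(row["id"])
    return needs_review


# Invented example rows (not taken from my real dataset).
rows = [
    {"id": "id_1", "category_name": "cooking",
     "text_content": "Bake the bread dough in a hot oven"},
    {"id": "id_2", "category_name": "finance",
     "text_content": "Stock market index funds and bond yields"},
    {"id": "id_3", "category_name": None,
     "text_content": "Knead the dough, then bake the bread"},
    {"id": "id_4", "category_name": None,
     "text_content": "Weather forecast for tomorrow"},
]
needs_review = impute(rows)
print(rows[2]["category_name"], needs_review)
```

In this sketch the profiles are built once from the originally labelled rows and are not updated as rows get imputed. In the real workflow I would rebuild or update a category's profile after each manual assignment, as described in step 3.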

I'm curious to know if anyone has suggestions or improvements for this approach. Additionally, recommendations for specific libraries or models that would be particularly suitable for this task would be highly appreciated.

0 Answers