I'm working on a dataset where each row represents an entity with several attributes. The dataset includes fields such as 'id', 'category_name', 'text_content', 'created_at', 'last_updated_at', 'title', 'num_highlights', 'url', and 'external_url'. I've noticed that a significant number of rows have missing values in the 'category_name' column, and my goal is to impute these missing values.
Here's a brief snapshot of the data:
| | id | category_name | text_content | created_at | last_updated_at | title | num_highlights | url | external_url |
|---|---|---|---|---|---|---|---|---|---|
| 0 | id_1 | None | Text content 1... | timestamp_1 | timestamp_1 | Title 1 | 6 | URL | URL |
| 1 | id_2 | None | Text content 2... | timestamp_2 | timestamp_2 | Title 2 | 4 | URL | URL |
Here's the approach I'm currently considering:
- Create word embeddings for all documents: I plan to represent the 'text_content' of each document as a dense vector, which captures the semantic meaning of the document.
- Topic Modeling: I plan to do topic modeling using OpenAI + Prompting to identify the topics in the docs. For each category, I'll aggregate the topics of its documents to form a topic profile for the category.
- Assign New Documents to Categories: When a new document comes in, I'll use the topic model to determine its topics and find the category with the most similar topic profile. If the assignment is not confident enough, I'll ask the user to assign the category and update the topic profiles accordingly.
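To make the topic-profile and assignment steps concrete, here is a minimal sketch of what I have in mind. It assumes the topics for each document have already been extracted (for example by the prompting step) and arrive as plain lists of strings. The function names `build_profiles`, `cosine`, and `assign_category`, the use of cosine similarity over topic counts, and the `min_score` / `min_margin` thresholds are all placeholders of mine rather than anything settled.

```python
from collections import Counter
from math import sqrt


def build_profiles(labeled_docs):
    """Aggregate topics per category.

    labeled_docs: iterable of (category_name, topics) pairs, where topics
    is a list of topic strings for one document (assumed input format).
    Returns {category_name: Counter of topic counts}.
    """
    profiles = {}
    for category, topics in labeled_docs:
        profiles.setdefault(category, Counter()).update(topics)
    return profiles


def cosine(a, b):
    """Cosine similarity between two sparse topic-count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = sqrt(sum(v * v for v in a.values()))
    norm_b = sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def assign_category(topics, profiles, min_score=0.5, min_margin=0.1):
    """Return (category, score); category is None when not confident.

    "Not confident" here means the best score is below min_score, or the
    best score beats the runner-up by less than min_margin. Both
    thresholds are arbitrary starting values to be tuned on labeled rows.
    """
    doc = Counter(topics)
    ranked = sorted(
        ((cosine(doc, profile), category) for category, profile in profiles.items()),
        reverse=True,
    )
    if not ranked:
        return None, 0.0
    best_score, best_category = ranked[0]
    runner_up = ranked[1][0] if len(ranked) > 1 else 0.0
    if best_score < min_score or best_score - runner_up < min_margin:
        return None, best_score  # fall back to asking the user
    return best_category, best_score


# Toy data, made up purely for illustration.
profiles = build_profiles([
    ("sports", ["football", "league"]),
    ("sports", ["football", "transfer"]),
    ("finance", ["stocks", "inflation"]),
])
category, score = assign_category(["football", "league"], profiles)
unknown, _ = assign_category(["weather"], profiles)
```

When `assign_category` returns `None`, the user-supplied label could be folded back in with `profiles.setdefault(label, Counter()).update(topics)`, which is the "update the topic profiles accordingly" step. One thing I'm unsure about is whether free-text topics from prompting will be consistent enough to match exactly like this, or whether they would need to be normalized or embedded first.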
I'm curious to know if anyone has suggestions or improvements for this approach. Additionally, recommendations for specific libraries or models that would be particularly suitable for this task would be highly appreciated.