Text classification with multiple documents per labeled datapoint

Question

I have a dataset with a label TRUE or FALSE for each person, but each person has multiple documents associated with them (emails and documents).

Right now I use a Random Forest Classifier on a bag of words consisting of all words in all documents put together per person (so that I have one row with all words and a label). It performs reasonably well, but I was wondering if you guys have some suggestions about how I can use the information of separate documents.

When I try to find information about this I only encounter multi-label classification, which is the exact opposite problem: multiple labels per document, instead of multiple documents per label.

Have you tried classifying each document independently and then averaging the predictions over each person's documents? – Juan Esteban de la Calle, May 14 '19 at 20:06
So you mean using each document as a unique datapoint? I was hesitant to do that, as the number of documents differs widely for each person (some people have 25 documents associated with them, and some just 3), but I can try it! — Tom, May 15 '19 at 08:29
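
The per-document idea from these comments could be sketched roughly as below. This is a minimal sketch, assuming some classifier has already produced a probability of TRUE for every single document; the helper `aggregate_by_person`, the person ids, the scores, and the 0.5 threshold are all hypothetical. Taking the mean rather than the sum keeps a person with 25 documents on the same scale as a person with 3.

```python
from collections import defaultdict


def aggregate_by_person(person_ids, doc_probs, threshold=0.5):
    """Average per-document P(TRUE) within each person, then threshold.

    person_ids and doc_probs are parallel sequences, one entry per document.
    """
    sums = defaultdict(float)
    counts = defaultdict(int)
    for pid, prob in zip(person_ids, doc_probs):
        sums[pid] += prob
        counts[pid] += 1
    # The mean normalises for how many documents each person has.
    means = {pid: sums[pid] / counts[pid] for pid in sums}
    labels = {pid: mean >= threshold for pid, mean in means.items()}
    return means, labels


# Hypothetical per-document scores: person "a" has 3 documents, "b" has 5.
person_ids = ["a", "a", "a", "b", "b", "b", "b", "b"]
doc_probs = [0.9, 0.8, 0.7, 0.2, 0.4, 0.1, 0.3, 0.5]

means, labels = aggregate_by_person(person_ids, doc_probs)
```

Other aggregations (median, maximum, or a vote over thresholded documents) would slot into the same place; which one works best is an empirical question.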

score 0 · Answer 1 · answered Aug 30 '19 at 15:22


Why don't you make a person id and add this to your model?

If I understand you correctly, you currently fit:

$$y = \beta X,$$

where each row of $X$ holds the combined documents of one person and $y$ is a vector of true/false labels, right?

You could try:

$$y = \beta X + \gamma z,$$

where each row of $X$ is now a single document, $y$ repeats the person's label for each of their documents, and $z$ is a vector of person ids (so a factor).

Might be worth a try.
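
To make the layout concrete, here is a minimal sketch of one row per document with the person id encoded as a factor. It only builds the feature rows and labels, not the model; the helper `build_rows`, the toy documents, the vocabulary, and the whitespace tokenisation are all assumptions for illustration.

```python
from collections import Counter


def build_rows(docs, vocab, persons):
    """Build one feature row per document.

    docs: list of (person_id, text) pairs.
    Each row is the word counts over vocab (the X part) followed by a
    one-hot indicator of the person id (the z part, i.e. the factor).
    """
    rows = []
    for pid, text in docs:
        counts = Counter(text.lower().split())
        x = [counts[word] for word in vocab]
        z = [1 if pid == person else 0 for person in persons]
        rows.append(x + z)
    return rows


# Hypothetical toy data: two documents for p1, one for p2.
docs = [
    ("p1", "budget meeting budget"),
    ("p1", "lunch plans"),
    ("p2", "meeting notes"),
]
vocab = ["budget", "lunch", "meeting", "notes", "plans"]
persons = ["p1", "p2"]
person_label = {"p1": True, "p2": False}

rows = build_rows(docs, vocab, persons)
# Every document inherits the label of the person it belongs to.
y = [person_label[pid] for pid, _ in docs]
```

In this sketch a person id that is not listed in `persons` gets an all-zero indicator block, so the id part of the row carries no information for that person.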

Peter

How could this generalise to unseen persons? – timleathart Jan 28 '20 at 08:06
