
I have a dataset with a label TRUE or FALSE for each person, but each person has multiple documents associated with them (emails and documents).

Right now I use a Random Forest classifier on a bag of words built from all words in all documents combined per person (so I have one row per person, containing all their words and a label). It performs reasonably well, but I was wondering whether you have suggestions for how I could use the information in the separate documents.

When I try to find information about this I only encounter multi-label classification, which is the exact opposite problem: multiple labels per document, instead of multiple documents per label.

Ben Reiniger
Tom
  • Have you tried classifying each document independently and then averaging the predictions over each person's documents? – Juan Esteban de la Calle May 14 '19 at 20:06
  • So you mean using each document as a unique datapoint? I was hesitant to do that, as the number of documents differs widely for each person (some people have 25 documents associated with them, and some just 3), but I can try it! – Tom May 15 '19 at 08:29
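
The averaging idea from the comments could be sketched as below. This is only a minimal illustration, assuming you already have one predicted probability of TRUE per document from some per-document classifier; the names `aggregate_by_person`, `person_ids`, `doc_scores`, and the 0.5 threshold are hypothetical and not from the original post.

```python
from collections import defaultdict
from statistics import mean


def aggregate_by_person(person_ids, doc_scores, threshold=0.5):
    """Average per-document scores for each person and threshold the mean.

    person_ids: one id per document (hypothetical example data)
    doc_scores: one predicted probability of TRUE per document
    """
    scores = defaultdict(list)
    for pid, score in zip(person_ids, doc_scores):
        scores[pid].append(score)
    # Taking the mean yields one score per person, whether that person
    # has 3 documents or 25.
    return {pid: mean(vals) >= threshold for pid, vals in scores.items()}


# Toy data: person "a" has three documents, person "b" has two.
person_ids = ["a", "a", "a", "b", "b"]
doc_scores = [0.9, 0.6, 0.3, 0.2, 0.4]
person_labels = aggregate_by_person(person_ids, doc_scores)
```

In practice the scores might come from something like scikit-learn's `RandomForestClassifier.predict_proba` fitted on one row per document, with each document inheriting its person's label. If you go that route, it is probably worth keeping all documents of a person in the same train or test split.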

1 Answer


Why don't you make a person id and add this to your model?

If I understand you correctly, you do:

$$y=\beta X$$,

where each row in $X$ holds the combined docs of one person and $y$ is a vector of true/false, right?

You could try:

$$ y= \beta X + \gamma z$$,

where each row in $X$ is now only one doc and $z$ is a vector giving the person id for each doc (so a factor).
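
Written out for a single document, this would look roughly as follows. The indices are my own notation, not from the question: $d$ indexes documents, $x_d$ is the bag-of-words row for document $d$, $p(d)$ is the person that document belongs to, and $y_d$ is that person's label.

$$ y_d = \beta x_d + \gamma_{p(d)} $$

So the factor $z$ contributes one coefficient $\gamma_p$ per person, shared by all of that person's documents.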

Might be worth a try.

Peter