I am working on a binary classification model (healthy/diseased) based on gene expression data from different patients. As a second task, I would like to stratify these patients and find subgroups. I expect that the joint expression pattern of the genes within an experiment will be the strongest predictor of the outcome (differential coexpression analysis). How do I capture this per-experiment grouping in my ML model, given the rule that IDs (in my case experiment IDs) should not be included as features?
Also, I have repeated measures of the same patients, and I hope to find significant differences between some patient groups. Does that mean I should include the patient IDs as features as well, pre-define some groups, or use all patient characteristics that could be interesting as features?
This is how my data is currently organized:
| experiment ID | gene | expression | patient ID | label |
|---|---|---|---|---|
| 1 | A | 11 | 1234 | healthy |
| 1 | B | 5 | 1234 | healthy |
| 2 | A | 3 | 4356 | diseased |
| 2 | B | 9 | 4356 | diseased |
| 3 | A | 13 | 1234 | healthy |
| 3 | B | 6 | 1234 | healthy |
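
To make the layout concrete, here is a minimal sketch of one way the long table above could be reshaped so that each experiment becomes a single row with one column per gene. It assumes pandas and scikit-learn; the column names (`experiment_id`, `patient_id`, and so on) and the choice of `GroupKFold` are my own assumptions for illustration, not an established recipe. In this sketch the experiment ID only defines the rows and the patient ID only defines the cross-validation groups, so neither is used as a feature.

```python
import pandas as pd
from sklearn.model_selection import GroupKFold

# The example rows from the table above, in long format
# (column names are hypothetical, chosen for this sketch).
long_df = pd.DataFrame({
    "experiment_id": [1, 1, 2, 2, 3, 3],
    "gene": ["A", "B", "A", "B", "A", "B"],
    "expression": [11, 5, 3, 9, 13, 6],
    "patient_id": [1234, 1234, 4356, 4356, 1234, 1234],
    "label": ["healthy", "healthy", "diseased",
              "diseased", "healthy", "healthy"],
})

# One row per experiment, one column per gene.
wide = long_df.pivot_table(
    index=["experiment_id", "patient_id", "label"],
    columns="gene",
    values="expression",
).reset_index()
wide.columns.name = None

# Features are the gene columns only; IDs are kept out of X.
X = wide[["A", "B"]]
y = wide["label"]
groups = wide["patient_id"]  # used for splitting, not as a feature

# Keep all repeated measures of a patient on the same side of each split.
cv = GroupKFold(n_splits=2)
for train_idx, test_idx in cv.split(X, y, groups):
    train_patients = set(groups.iloc[train_idx])
    test_patients = set(groups.iloc[test_idx])
    assert train_patients.isdisjoint(test_patients)
```

With only two patients this toy split is degenerate (each fold trains on a single class), so it only illustrates the mechanics of keeping repeated measures together.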