I realize that this could be considered a duplicate of this question, Is using samples from the same person in both trainset and testset considers being a data leakage?, where it is stated that "The testing data should not be linked to the training data in any way" in order to prevent data leakage. However, how should I proceed if it is not possible to split the dataset into a training set and a test set that are not linked in any way?
My dataset consists of 839 time series samples which I want to classify into 14 classes. However, the dataset is imbalanced, as can be seen in Figure 1. The dataset is composed of recordings from 16 volunteers. Unfortunately, for several of these classes, data is only available from a small subset of the volunteers. Figure 2 shows that for some classes only data from one specific volunteer is available. In the figure, the number gives the number of volunteers from whom samples are available for that specific class.
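For reference, the counts shown in Figure 2 can be reproduced with a simple group-by over the sample metadata; this is only a minimal sketch, assuming the samples are described by a table with hypothetical column names "volunteer" and "label":

```python
import pandas as pd

# Hypothetical per-sample metadata; the column names "volunteer" and "label"
# and the values are placeholders for illustration, not the actual dataset.
meta = pd.DataFrame({
    "volunteer": ["v01", "v01", "v02", "v03", "v03"],
    "label":     ["A",   "B",   "A",   "B",   "C"],
})

# Number of distinct volunteers contributing samples to each class
# (this is what the numbers in Figure 2 represent).
volunteers_per_class = meta.groupby("label")["volunteer"].nunique()
print(volunteers_per_class)
```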
I want to validate my machine learning models using a separate test set. My initial idea was to leave the data of one volunteer out for the test set. But as stated above, this is not possible, since for two classes there is only data available from one volunteer. The second best approach that I could come up with is leaving out the data from one volunteer (who has samples available for 12 of the 14 classes), alongside a selected few samples of the two minority classes that are not present in the samples of this specific volunteer (see the sketch below). This means that my test set would consist of samples which are completely separated from the training set for 12 of the 14 classes. The samples of the test set for the 2 other classes would come from the same volunteer on whose data the models are trained (since there is only data available from one volunteer for these classes).
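To make the proposed split concrete, here is a minimal sketch of how I would construct it, assuming a per-sample metadata table with the hypothetical columns "volunteer" and "label", a hypothetical held-out volunteer ID, and placeholder names for the two single-volunteer classes:

```python
import numpy as np
import pandas as pd

def make_test_index(meta: pd.DataFrame,
                    held_out_volunteer: str,
                    single_volunteer_classes: list,
                    n_per_class: int = 3,
                    seed: int = 0) -> pd.Index:
    """Test set = all samples of one held-out volunteer, plus a few samples
    from each class that only a single (other) volunteer recorded.
    Column names "volunteer" and "label" are assumptions for illustration."""
    rng = np.random.default_rng(seed)
    test_mask = meta["volunteer"] == held_out_volunteer
    for cls in single_volunteer_classes:
        # Candidate samples of this class not yet in the test set; they come
        # from a volunteer who also appears in the training set, which is
        # exactly the leakage concern described above.
        candidates = meta.index[(meta["label"] == cls) & ~test_mask]
        chosen = rng.choice(candidates,
                            size=min(n_per_class, len(candidates)),
                            replace=False)
        test_mask.loc[chosen] = True
    return meta.index[test_mask]

# Example call with made-up identifiers:
# test_idx = make_test_index(meta, held_out_volunteer="v07",
#                            single_volunteer_classes=["class_13", "class_14"])
# train_idx = meta.index.difference(test_idx)
```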
As stated above, I know that this will introduce some form of data leakage. Is there a good alternative that won't introduce data leakage (besides adding more data)? If not, would the StratifiedKFold method from sklearn be a good alternative for validating my model? (With StratifiedKFold I would use a pipeline so that preprocessing is performed only on the training part of each fold; see the sketch below.)
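For the cross-validation option, this is the kind of setup I have in mind; a minimal sketch in which the data, the preprocessing step, and the classifier are all placeholders (make_classification, StandardScaler and RandomForestClassifier stand in for features extracted from the 839 time series and whatever model is actually used):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Placeholder data standing in for features extracted from the 839 time series;
# in the real setup X and y would come from the actual dataset.
X, y = make_classification(n_samples=839, n_classes=14, n_features=20,
                           n_informative=10, n_clusters_per_class=1,
                           random_state=0)

# Because preprocessing lives inside the pipeline, the scaler is fit only on
# the training part of each fold, so no statistics leak from the held-out fold.
pipe = Pipeline([
    ("scale", StandardScaler()),
    ("clf", RandomForestClassifier(random_state=0)),
])

# n_splits must not exceed the number of samples in the smallest class.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
scores = cross_val_score(pipe, X, y, cv=cv, scoring="f1_macro")
print(scores.mean(), scores.std())
```

Note that StratifiedKFold stratifies only by class, so samples from the same volunteer would still end up in both the training and validation parts of a fold, which is exactly the kind of link I am asking about.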
I would really appreciate an answer and some more explanation of data leakage for this specific case.