Lets specify the question with the help of the figure below: We know that one part of the behaviour (our target Y) will depend on common parameters (for the group). It is represented by the grey zone on the figure. And that one part is parameters specific to each individual. ( It is represented by the pink and blue zone )
The precise question is:
Knowing that we have data from the entire group. How to use all the data of the groups, to create a specific model for an individual of this group? The idea is to get a solid model/result because it is based on all the available data, but still specific to an individual. I imagine an answer in the form of a short list of techniques to achieve that.
Let’s illustrate the question with an example:
We do a study of 100 people. Let’s name the people [0, 1, 2, 3, 4, 5 … 99] For each person, we do a study of 30 days. Every day, we do 5 measurements of the emotion of the person (X). And then a measure of the quantify of calories in the evening meal (Y).
In this example, the goal, using machine learning, is knowing X on day 55 (after the tests), for person number 3, to be able to predict Y (on day 55) for a given person. I often came across this problem. With my experiments, tests and research, I see two possibilities:
The first option is to take all the samples. 100 people * 30 measurements (of 5 samples x and y) = 3000 points. We create a model that connects x to y. Then we take the new x (for person number 3), and ‘predict’ y. We then have an answer that takes into account all the observations but is not the specificity of the person number 3. The model and the answer are kind of an ‘average behavioural response’ for the population. This model corresponds more to the grey zone of the figure you see above.
The second option is to take only information from person 3 to create the model. We have 30 points. We create a model that connects x to y. The model is really specific to person 3. Since the model uses fewer points, it is less accurate. We did not use our total knowledge/data of the grey zone.
I have the deep intuition that there is a way to do better than these two options. I tried a lot but without satisfying success.
I tried, with random forest algorithms, to use all data at the same time and add an id in X vectors. This id represents the individual. It does not seem to have worked. The internet searches I have done on this topic give me unrelated results. Any help or keywords on this topic are welcome.
Is it useful or necessary that I share an example dataset?
