This is more of an infrastructure question about data science: how would you manage data when merging into a GitHub repository?
For example, as a data scientist I might be working on my own branch, developing code, analyses, etc. Merging the code back into master is not a problem; that is standard software engineering work.
However, how would you manage the data? How would you manage the output of the analysis or the model I built? How would you resolve conflicts, and how would you guarantee that the code and the generated data stay aligned?
One simple solution I thought of is a CI pipeline that is triggered whenever someone merges into master and re-runs all the code: for example, it runs the data extraction pipeline, trains the model, stores the model on S3, and so on (a rough sketch of such a job is shown below).
This way the data output is reproduced on master, code and data are guaranteed to be aligned, and it is all automatic. However, for long pipelines it would mean waiting (for example) 10 hours for the data to be collected and the model to be fitted.
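To make the idea concrete, here is a minimal sketch of the job such a CI pipeline might invoke on every merge to master. The script names (`extract_data.py`, `train_model.py`), the bucket name, and the artifact path are all placeholders I made up for illustration, not part of any real project; the only real point is that tagging the uploaded artifact with the git commit SHA is what ties the generated data back to the exact code that produced it.

```python
# Hypothetical CI entry point, run by the pipeline after a merge into master.
# Assumes project-specific extract_data.py / train_model.py scripts and an S3
# bucket; all names below are placeholders, not a real project layout.
import subprocess

import boto3

BUCKET = "my-models-bucket"          # assumed bucket name
MODEL_PATH = "artifacts/model.pkl"   # assumed local output of the training step


def current_commit() -> str:
    """Return the git SHA the pipeline ran against, so artifacts trace back to code."""
    return subprocess.check_output(["git", "rev-parse", "HEAD"], text=True).strip()


def main() -> None:
    sha = current_commit()

    # 1. Re-run the data extraction and training steps (placeholders for the real pipeline).
    subprocess.run(["python", "extract_data.py"], check=True)
    subprocess.run(["python", "train_model.py", "--output", MODEL_PATH], check=True)

    # 2. Upload the trained model to S3, keyed by the commit that produced it.
    #    Keying artifacts by SHA is what gives the code/data alignment on master.
    s3 = boto3.client("s3")
    s3.upload_file(MODEL_PATH, BUCKET, f"models/{sha}/model.pkl")
    print(f"Uploaded {MODEL_PATH} to s3://{BUCKET}/models/{sha}/model.pkl")


if __name__ == "__main__":
    main()
```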
I have been looking for resources or possible solutions online without success; it seems like a very important yet not widely discussed problem.