This is more of an infrastructure question about data science: how would you manage data when merging into a GitHub repository?
For example, as a data scientist I might be working on my own branch, developing code, analyses, etc. Merging the code back into master is not a problem; that is standard software engineering work.
However, how would you manage the data? How would you manage the output of the analysis or the model I built? How would you resolve conflicts, and how would you guarantee that the code and the generated data stay aligned?
One simple solution I thought of is a CI pipeline that is triggered whenever someone merges into master and re-runs all the code: for example, it runs the data extraction pipeline, trains the model, stores the model on S3, and so on (a rough sketch of such a job is shown below).
This way the data output is reproduced on master, code and data are guaranteed to be aligned, and it is all automatic. However, for long pipelines it would mean waiting (for example) 10 hours for the data to be collected and the model to be fitted.
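To make the idea concrete, here is a minimal sketch of the job such a CI pipeline might invoke on every merge to master. The script names (`extract_data.py`, `train_model.py`), the bucket name, and the artifact path are all placeholders I made up for illustration, not part of any real project; the only real point is that tagging the uploaded artifact with the git commit SHA is what ties the generated data back to the exact code that produced it.

```python
# Hypothetical CI entry point, run by the pipeline after a merge into master.
# Assumes project-specific extract_data.py / train_model.py scripts and an S3
# bucket; all names below are placeholders, not a real project layout.
import subprocess

import boto3

BUCKET = "my-models-bucket"          # assumed bucket name
MODEL_PATH = "artifacts/model.pkl"   # assumed local output of the training step


def current_commit() -> str:
    """Return the git SHA the pipeline ran against, so artifacts trace back to code."""
    return subprocess.check_output(["git", "rev-parse", "HEAD"], text=True).strip()


def main() -> None:
    sha = current_commit()

    # 1. Re-run the data extraction and training steps (placeholders for the real pipeline).
    subprocess.run(["python", "extract_data.py"], check=True)
    subprocess.run(["python", "train_model.py", "--output", MODEL_PATH], check=True)

    # 2. Upload the trained model to S3, keyed by the commit that produced it.
    #    Keying artifacts by SHA is what gives the code/data alignment on master.
    s3 = boto3.client("s3")
    s3.upload_file(MODEL_PATH, BUCKET, f"models/{sha}/model.pkl")
    print(f"Uploaded {MODEL_PATH} to s3://{BUCKET}/models/{sha}/model.pkl")


if __name__ == "__main__":
    main()
```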
I have been looking for resources or possible solutions online without success; it seems like a very important yet not widely discussed problem.