Questions tagged [version-control]
14 questions
63
votes
9 answers
Tools and protocol for reproducible data science using Python
I am working on a data science project using Python.
The project has several stages.
Each stage consists of taking a data set, using Python scripts, auxiliary data, configuration and parameters, and creating another data set.
I store the code in…
Yuval F
59
votes
9 answers
How to deal with version control of large amounts of (binary) data
I am a PhD student in Geophysics and work with large amounts of image data (hundreds of GB, tens of thousands of files). I know svn and git fairly well and have come to value a project history, combined with the ability to easily work together and have…
Johann
7
votes
1 answer
At the end of a big DS project, should I make trained models available on GitHub?
I have almost completed two big personal Data Science projects based on Deep Learning. They are the fanciest models I've implemented up to now, and I'm pushing all my code to GitHub.
Do you advise uploading the trained models too? Or should I let other users…
Leevo
4
votes
2 answers
Merging data approach in Data Science projects
This is more of an infrastructural question about data science. How would you manage data merging in your GitHub repository?
As an example, as a data scientist I might be working on my branch and developing code, analysis, etc. Merging code…
Mattia Surricchio
2
votes
1 answer
What is the difference between Pachyderm and Git?
I learned that tools like Pachyderm version-control data, but I cannot see any difference between that tool and Git. I learned from this post that:
It holds all your data in a central accessible location
It updates all dependent data sets when…
Lerner Zhang
2
votes
1 answer
What is the right way to store datasets for a CNN project
Our image classification project has thousands of raw photos, masks and reshaped images. We store source code in git. But datasets don't belong in source code version control. How should we store these sets of images?
sixtytrees
2
votes
1 answer
Dataset management: What are some strategies/solutions for efficiently storing datasets with their versions?
The problem: I have N independent classification models, and each of these N models has different versions (e.g. V0, V1, ..., Vfinal_production, Vexperimental). I'm looking for a way to store my datasets efficiently on the cloud (for…
ngub05
1
vote
0 answers
Embedding git commit into the resulting data
Our pipeline works something like this:
Collect a bunch of raw data (10-100 GB) from a microscope
Process the data using MATLAB scripts
Change a few parameters based on the raw data, and add new features to the scripts
Commit the scripts with new features…
1
vote
0 answers
Keras trained model exported with an older version of Keras (< 2.2.0)
Is it possible to update a trained model saved in a file without retraining it?
I found the model on the web and I would like to use it, but it uses Merge layers, which are not supported by newer versions of Keras, making it impossible to load with…
Nicolas Scotto Di Perto
1
vote
0 answers
Suggestions on practices for model and dataset version documentation
I want to steer my question towards the practical side of ML. As a practitioner, I feel keeping different versions of models and datasets is difficult. From time to time I need to revisit my data and model code to verify if certain assumptions are…
Student
0
votes
1 answer
How to version data science projects with large files
I am working on a project with large data files (~300MB). I want to version my work along with the data files so that it is always available online. I tried using git-lfs but it has a 1GB/month bandwidth limit, beyond which you're blocked for a…
fireball.1
0
votes
1 answer
I cannot run MNIST MWE (hello world for DL)
I have installed Anaconda and want to run an MWE for MNIST, but I'm getting this error:
D:\STAZENE_last\Anaconda2\Lib\site-packages\torch\cuda\__init__.py:107:
UserWarning: CUDA initialization: The NVIDIA driver on your system is
too old (found…
user2925716
0
votes
0 answers
Version control for code and output models
I have a question about version control for both code and the models it generates. We are developing ML models that often involve hyperparameters and so we might do many runs with different hyperparameter settings. We currently store the output…
jmuller
-3
votes
1 answer
Extract all releases from a Git repository
I would like to examine an existing Git repository and extract all defined releases into a subfolder.
For example, if application A had 26 releases, my bash script would extract all 26 versions into subfolders such as:
A/(folder) for each of the…
user1928436