What are the best practices or design patterns for structuring data science projects and MLOps architecture in small teams?
1. Context and Background: I work in a small data science team (<5). We exclusively develop predictive analytics solutions. We use Python, Docker, git/github, and Airflow, with everything hosted on a local server. We are aware of the limitations of our current system but have limited experience with cloud services and wish to keep overhead low.
2. Current Approach: Our current workflow looks as follows:
- We mainly develop in Python and use conda/pipenv to manage our project dependencies, as well as git/github for versioning.
- Deployment in Docker containers on a local server.
- Scheduling of all tasks using Airflow (which also runs on this server).
I/we are aware that deployment on a local server is not optimal, but we also have relatively little expertise with other tools. A pattern we often use looks like this (sketched below):
- The project code is collected in a library ("project_a_lib").
- If a model is to be trained regularly, we create a script for it ("train_model.py") that mainly contains a "main" function, which imports functions from "project_a_lib" (e.g., "load_data", "prep_data", "train_model") and arranges them in the correct order.
- If the model training is to be put into production (e.g., weekly training), we build a Docker image ("Image_project_a") containing all necessary dependencies, plus an Airflow DAG ("project_A_model_training.py") that creates and runs a Docker container from that image using Airflow's "DockerOperator".
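To make the pattern concrete, here is a minimal sketch of what such a training script looks like in our setup. The library and function names ("project_a_lib", "load_data", "prep_data", "train_model") are the ones mentioned above; the arguments and the model-persistence step are only illustrative.

```python
"""train_model.py: entry point for the (weekly) training job of project A."""
from project_a_lib import load_data, prep_data, train_model  # our project library


def main() -> None:
    raw = load_data()          # pull the latest raw data from the source system
    features = prep_data(raw)  # cleaning / feature engineering
    model = train_model(features)
    # Persisting the trained model (e.g., joblib.dump or a model registry) would go here.


if __name__ == "__main__":
    main()
```

The corresponding Airflow DAG looks roughly like the following. This is a sketch assuming Airflow 2.x with the Docker provider installed; the image name, schedule parameter, and DockerOperator arguments (e.g., "docker_url") may need adjusting for your Airflow/provider versions.

```python
"""project_A_model_training.py: runs the training script inside the project image."""
from datetime import datetime

from airflow import DAG
from airflow.providers.docker.operators.docker import DockerOperator

with DAG(
    dag_id="project_a_model_training",
    start_date=datetime(2024, 1, 1),
    schedule="@weekly",  # weekly training; older Airflow versions use schedule_interval
    catchup=False,
) as dag:
    train = DockerOperator(
        task_id="train_model",
        image="image_project_a:latest",           # image built with all project dependencies
        command="python train_model.py",          # the script shown above
        docker_url="unix://var/run/docker.sock",  # local Docker daemon on the server
    )
```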
3. Advantages and Disadvantages of the Current Approach:
What I like about this setup:
- the isolation of projects via Docker containers, while still having an overview of all jobs (across all projects) at the Airflow level.
- the separation between the scheduling environment (Airflow) and the project code, which the containers provide.
Disadvantages:
- On the other hand, I feel that this setup gives up many of the possibilities that Airflow (and alternatives like Prefect or Dagster) offers, especially because we run whole scripts instead of individual functions. For example, I could build the Airflow DAG by importing the needed functions and annotating them with a Docker decorator (see the sketch after this list). However, that would require the project library to be installed in the same environment as Airflow; consequently, I would have to create an Airflow environment for each project, since otherwise code from various projects would end up installed in the same environment.
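For illustration, here is a minimal sketch of the decorator-based variant I have in mind, assuming Airflow >= 2.2 with apache-airflow-providers-docker installed (which provides a "@task.docker" decorator). I have pushed the imports inside the task function, since, as far as I understand, only the function body is executed inside the container; if the functions were imported and decorated at the top of the DAG file, as described above, "project_a_lib" would have to be installed in the Airflow environment just for the DAG to be parsed.

```python
"""TaskFlow-style sketch: run a decorated function inside the project image."""
from datetime import datetime

from airflow.decorators import dag, task


@dag(schedule="@weekly", start_date=datetime(2024, 1, 1), catchup=False)
def project_a_model_training():

    @task.docker(image="image_project_a:latest", docker_url="unix://var/run/docker.sock")
    def train():
        # These imports run inside the container, where project_a_lib is installed.
        from project_a_lib import load_data, prep_data, train_model

        train_model(prep_data(load_data()))
        # Persisting the trained model would go here as well.

    train()


project_a_model_training()
```

This would keep one shared Airflow environment while still isolating project dependencies in the image, but it is exactly the kind of trade-off I would like feedback on.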
4. Specific Questions:
How do other small teams organize their work to manage deployment and ML operations?
Are there best practices, recommendations, or resources to learn from?