I have N (~50) sentiment models for different languages, each fine-tuned from Hugging Face transformer models, and each roughly 2-3 GB in size. How can I deploy all of these sentiment models as a scalable service on a cloud platform like GCP, so that the bill is optimized and service performance (low inference time, i.e. low latency) is maximized?
Currently we deploy each model as its own separate service, following the steps below for each one.
- Develop the service using Flask: We write the code for our service, including routes and logic for handling requests (a minimal sketch of what this looks like is included after this list).
- Create a Dockerfile: We write a Dockerfile to build a Docker image of our service (example after the list).
- Build the Docker image: We build the Docker image of our service (the build and push commands are shown after the list).
- Push the Docker image to GCR: We create a new repository in GCR and push the Docker image to it.
- Create a GKE Cluster: We create a new cluster in the Kubernetes Engine console, selecting an appropriate number of nodes and configuring the desired resources (an equivalent gcloud command is shown after the list).
- Create a GKE Deployment: We create a new deployment associated with the image from our GCR repository and configure the desired number of replicas (an example manifest is shown after the list).
- Create a Cloud Load Balancer: In the Google Cloud Console, we create a new Cloud Load Balancer and select the GKE deployment we created in step 6 as its target (a LoadBalancer-type Service manifest is shown after the list).
- Update the DNS to point to the Load Balancer: We update our DNS settings to point to the IP address of the Load Balancer created in step 7.
- Monitor the service: We use Stackdriver (Cloud Monitoring) to monitor the service and confirm that it runs smoothly with the desired number of replicas.
- Scale the service: When necessary, we use the autoscaling features of GKE to automatically scale the number of replicas running our microservice based on incoming traffic or other metrics (an example autoscale command is shown after the list).
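For reference, here is a minimal sketch of what one of our per-model Flask services looks like. The model path, route name, and request shape are illustrative, not our exact code:

```python
# Minimal per-model Flask service (sketch). Assumes the fine-tuned
# Hugging Face checkpoint is saved locally under ./model.
from flask import Flask, jsonify, request
from transformers import pipeline

app = Flask(__name__)

# Load the model once at startup so every request reuses it in memory;
# with 2-3 GB checkpoints this dominates the container's RAM footprint.
sentiment = pipeline("sentiment-analysis", model="./model", tokenizer="./model")

@app.route("/predict", methods=["POST"])
def predict():
    # Expects a JSON body like {"texts": ["I loved it", "Terrible"]}.
    texts = request.get_json()["texts"]
    return jsonify(sentiment(texts))

if __name__ == "__main__":
    app.run(host="0.0.0.0", port=8080)
```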
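The Dockerfile is along these lines (the base image, file names, and gunicorn entrypoint are assumptions for illustration):

```dockerfile
# Illustrative Dockerfile for the Flask service above.
FROM python:3.10-slim
WORKDIR /app
# requirements.txt is assumed to pin flask, gunicorn, transformers, torch.
COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt
# Copies the application code and the ./model checkpoint into the image,
# which is why each image ends up several GB in size.
COPY . .
CMD ["gunicorn", "--bind", "0.0.0.0:8080", "--workers", "1", "app:app"]
```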
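Building and pushing looks like this (PROJECT_ID and the image name are placeholders):

```sh
# One-time: let Docker authenticate against GCR.
gcloud auth configure-docker

# Build and push one image per model/language.
docker build -t gcr.io/PROJECT_ID/sentiment-en:v1 .
docker push gcr.io/PROJECT_ID/sentiment-en:v1
```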
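Cluster creation, roughly (the zone, node count, and machine type here are examples, not our exact settings):

```sh
gcloud container clusters create sentiment-cluster \
  --zone us-central1-a \
  --num-nodes 3 \
  --machine-type e2-standard-4
```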
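A representative Deployment manifest (names, replica count, and resource requests are illustrative):

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: sentiment-en
spec:
  replicas: 2
  selector:
    matchLabels:
      app: sentiment-en
  template:
    metadata:
      labels:
        app: sentiment-en
    spec:
      containers:
        - name: sentiment-en
          image: gcr.io/PROJECT_ID/sentiment-en:v1
          ports:
            - containerPort: 8080
          resources:
            requests:
              # Sized for a 2-3 GB model held in memory.
              memory: "4Gi"
              cpu: "1"
```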
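Exposing the deployment behind a load balancer amounts to a Service of type LoadBalancer (again a sketch, with placeholder names):

```yaml
apiVersion: v1
kind: Service
metadata:
  name: sentiment-en-lb
spec:
  type: LoadBalancer
  selector:
    app: sentiment-en
  ports:
    - port: 80
      targetPort: 8080
```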
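And autoscaling is a one-liner against the deployment (the thresholds are examples):

```sh
# Horizontal Pod Autoscaler: scale between 2 and 10 replicas,
# targeting 70% average CPU utilization.
kubectl autoscale deployment sentiment-en --cpu-percent=70 --min=2 --max=10
```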
We repeat these steps for every model, so each model runs as its own dedicated service. However, this approach costs us a lot of money at the end of the month.
So, could you suggest a better way to deploy multiple models like these as a scalable service, so that the cloud bill is optimized while performance is maximized?