
I have N (~50) sentiment models for different languages, each fine-tuned from Hugging Face transformer models. Each model is roughly 2-3 GB in size. How can I deploy all of these sentiment models as a scalable service on a cloud platform like GCP, so that the bill is optimized and the service performance (low inference time, i.e. latency) is maximized?

Currently we deploy each of our models as a separate service. For each model we follow the steps below.

  1. Develop the service using Flask: We write the code for our service, including routes and logic for handling requests.
  2. Create a Dockerfile: A Dockerfile is created to build a Docker image of our service.
  3. Build the Docker image: We build the Docker image of our service.
  4. Push the Docker image to GCR: We create a new repository in GCR and push the Docker image to it.
  5. Create a GKE cluster: We go to the Kubernetes Engine console and create a new cluster, selecting the appropriate number of nodes and configuring the desired resources.
  6. Create a GKE Deployment: We create a new deployment and associate it with the image from our GCR repository and configure the desired number of replicas.
  7. Create a Cloud Load Balancer: We go to the Google Cloud Console and create a new Cloud Load Balancer, selecting the GKE deployment from step 6 as its target.
  8. Update the DNS to point to the Load Balancer: We update our DNS settings to point to the IP address of the Load Balancer created in step 7.
  9. Monitor the service: We use Stackdriver to monitor the service and ensure that it is running smoothly and that the desired number of replicas are running.
  10. Scale the service: When necessary, we use GKE's autoscaling to automatically scale the number of replicas running our micro-service based on incoming traffic or other metrics.
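For context, step 1 above amounts to a small Flask app per model. A minimal sketch is shown below; the `predict` function is stubbed out here so the example runs without downloading a multi-gigabyte model, but in the real service it would wrap a fine-tuned Hugging Face pipeline (e.g. `transformers.pipeline("sentiment-analysis", model=...)`). The route names and port are illustrative assumptions, not part of the original setup.

```python
from flask import Flask, jsonify, request

app = Flask(__name__)

# Stub standing in for a fine-tuned Hugging Face sentiment pipeline.
# In production this function would run the 2-3 GB transformer model.
def predict(text: str) -> dict:
    return {"label": "POSITIVE", "score": 0.99}

@app.route("/health")
def health():
    # Liveness/readiness endpoint for the GKE deployment's probes.
    return jsonify(status="ok")

@app.route("/predict", methods=["POST"])
def predict_route():
    # Expects a JSON body like {"text": "..."} and returns the prediction.
    payload = request.get_json(force=True)
    return jsonify(predict(payload["text"]))

if __name__ == "__main__":
    # Port 8080 is a common choice for containerized services on GCP.
    app.run(host="0.0.0.0", port=8080)
```

The Dockerfile in step 2 then just copies this app, installs its dependencies, and sets the `CMD` to launch it.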

We follow the same steps for each of our models, deploying every model as a dedicated service. However, this approach costs us a lot of money at the end of the month.

So, can anyone suggest a better way to deploy multiple models like these as a scalable service, so that the cloud bill is optimized but performance is maximized?

1 Answer


A couple of ideas:

  • Reduce the number of models.
  • Reduce the size of the models through distillation, quantization, and pruning.
  • Reduce the size of the machine type within the cluster.
  • Confirm the system downscales to zero when not being used.
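The second idea can be sketched concretely. The example below uses PyTorch dynamic quantization, which converts `nn.Linear` weights to int8, typically shrinking the model several-fold and often speeding up CPU inference with little accuracy loss. A tiny stand-in network is used here so the sketch runs quickly; for a real deployment you would instead quantize the fine-tuned transformer loaded via `AutoModelForSequenceClassification.from_pretrained(...)`.

```python
import io

import torch
import torch.nn as nn

# Hypothetical stand-in for a fine-tuned transformer classifier head;
# a real model would be loaded from the fine-tuned checkpoint instead.
model = nn.Sequential(nn.Linear(768, 256), nn.ReLU(), nn.Linear(256, 2))

# Dynamic quantization: weights of the listed module types are stored
# as int8 and dequantized on the fly during inference.
quantized = torch.quantization.quantize_dynamic(
    model, {nn.Linear}, dtype=torch.qint8
)

def serialized_size(m: nn.Module) -> int:
    """Size in bytes of the model's serialized state_dict."""
    buf = io.BytesIO()
    torch.save(m.state_dict(), buf)
    return buf.tell()

print(serialized_size(model), serialized_size(quantized))
```

With ~50 models of 2-3 GB each, even a modest size reduction per model compounds into substantially cheaper node pools, since smaller models allow smaller machine types and denser packing of replicas.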
Brian Spiering