
I have N (~50) sentiment models for different languages, each fine-tuned from Hugging Face transformer models. Each model is roughly 2-3 GB in size. How can I deploy all of these sentiment models as a scalable service on a cloud platform like GCP, so that the bill is optimized and the service performance (low inference time, i.e. latency) is maximized?

Currently we deploy each of our models as a separate service. For each model we follow the steps below.

  1. Develop the service using Flask: We write the code for our service, including routes and logic for handling requests.
  2. Create a Dockerfile: A Dockerfile is created to build a Docker image of our service.
  3. Build the Docker image: We build the Docker image of our service.
  4. Push the Docker image to GCR: We create a new repository in GCR and push the Docker image to it.
  5. Create a GKE cluster: We go to the Kubernetes Engine console and create a new cluster, selecting the appropriate number of nodes and configuring the desired resources.
  6. Create a GKE Deployment: We create a new deployment and associate it with the image from our GCR repository and configure the desired number of replicas.
  7. Create a Cloud Load Balancer: We go to the Google Cloud Console and create a new Cloud Load Balancer, selecting the GKE deployment from step 6 as its target.
  8. Update the DNS to point to the Load Balancer: We update our DNS settings to point to the IP address of the Load Balancer created in step 7.
  9. Monitor the service: We use Stackdriver to monitor the service and ensure that it is running smoothly and that the desired number of replicas are running.
  10. Scale the service: When necessary, we use GKE's autoscaling to automatically scale the number of replicas running our micro-service based on incoming traffic or other metrics.
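For context, step 1 above amounts to a small Flask app per model. A minimal sketch is shown below; the `predict` function is stubbed out here so the example runs without downloading a multi-gigabyte model, but in the real service it would wrap a fine-tuned Hugging Face pipeline (e.g. `transformers.pipeline("sentiment-analysis", model=...)`). The route names and port are illustrative assumptions, not part of the original setup.

```python
from flask import Flask, jsonify, request

app = Flask(__name__)

# Stub standing in for a fine-tuned Hugging Face sentiment pipeline.
# In production this function would run the 2-3 GB transformer model.
def predict(text: str) -> dict:
    return {"label": "POSITIVE", "score": 0.99}

@app.route("/health")
def health():
    # Liveness/readiness endpoint for the GKE deployment's probes.
    return jsonify(status="ok")

@app.route("/predict", methods=["POST"])
def predict_route():
    # Expects a JSON body like {"text": "..."} and returns the prediction.
    payload = request.get_json(force=True)
    return jsonify(predict(payload["text"]))

if __name__ == "__main__":
    # Port 8080 is a common choice for containerized services on GCP.
    app.run(host="0.0.0.0", port=8080)
```

The Dockerfile in step 2 then just copies this app, installs its dependencies, and sets the `CMD` to launch it.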

We follow the same steps for each of our models, deploying every model as a dedicated service. However, this approach costs us a lot of money at the end of the month.

So, can anyone suggest a better way to deploy multiple models like these as a scalable service, so that the cloud bill is optimized but performance is maximized?

1 Answer


A couple of ideas:

  • Reduce the number of models.
  • Reduce the size of the models through distillation, quantization, and pruning.
  • Reduce the size of the machine type within the cluster.
  • Confirm the system downscales to zero when not being used.
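The second idea can be sketched concretely. The example below uses PyTorch dynamic quantization, which converts `nn.Linear` weights to int8, typically shrinking the model several-fold and often speeding up CPU inference with little accuracy loss. A tiny stand-in network is used here so the sketch runs quickly; for a real deployment you would instead quantize the fine-tuned transformer loaded via `AutoModelForSequenceClassification.from_pretrained(...)`.

```python
import io

import torch
import torch.nn as nn

# Hypothetical stand-in for a fine-tuned transformer classifier head;
# a real model would be loaded from the fine-tuned checkpoint instead.
model = nn.Sequential(nn.Linear(768, 256), nn.ReLU(), nn.Linear(256, 2))

# Dynamic quantization: weights of the listed module types are stored
# as int8 and dequantized on the fly during inference.
quantized = torch.quantization.quantize_dynamic(
    model, {nn.Linear}, dtype=torch.qint8
)

def serialized_size(m: nn.Module) -> int:
    """Size in bytes of the model's serialized state_dict."""
    buf = io.BytesIO()
    torch.save(m.state_dict(), buf)
    return buf.tell()

print(serialized_size(model), serialized_size(quantized))
```

With ~50 models of 2-3 GB each, even a modest size reduction per model compounds into substantially cheaper node pools, since smaller models allow smaller machine types and denser packing of replicas.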
Brian Spiering