
I may have the wrong Stack Exchange site. If that's the case, could someone point me to one that could help with this? Anyway...

My backend employs a sentence transformer model from HuggingFace. Since the number of requests per day is small, deploying a dedicated instance for serving inference requests is not cost-effective. Hence, I was thinking of AWS Lambda, where I would pay per inference.

However, I need to serve each request quickly, which in practice means using a GPU, and AWS Lambda does not offer GPUs.

Is there a deployment solution (not necessarily on AWS) that would let me use a GPU and pay per inference?
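
For context, here is a minimal sketch of the kind of per-request handler I have in mind (the model name is just a placeholder and I'm assuming the `sentence-transformers` library; on Lambda this would run on CPU only, which is exactly the latency problem):

```python
# lambda_handler.py -- minimal sketch of a pay-per-request handler.
# Assumes the sentence-transformers package; the model name is a placeholder.
import json

from sentence_transformers import SentenceTransformer

# Load the model once at module import so warm invocations reuse it;
# on AWS Lambda this runs on CPU only.
model = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")


def handler(event, context):
    body = json.loads(event["body"])
    # Encode the incoming sentences into embeddings.
    embeddings = model.encode(body["sentences"])
    return {
        "statusCode": 200,
        "body": json.dumps({"embeddings": [e.tolist() for e in embeddings]}),
    }
```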

  • Have you considered doing inference on CPU with a lightweight framework like [ggml](https://github.com/ggerganov/ggml)? – noe Aug 30 '23 at 13:45
  • @noe This is a good idea, but translating a ready model into ggml [is not trivial](https://github.com/ggerganov/ggml/issues/149#issuecomment-1556231377). There is [this](https://github.com/skeskinen/bert.cpp) project, but it does not take care of [this](https://huggingface.co/1-800-BAD-CODE/punctuation_fullstop_truecase_english), which is one of the models I am using. – AlwaysLearning Sep 02 '23 at 17:21
