I may have the wrong Stack Exchange site. If that's the case, could someone point me to one that could help with this? Anyway...
My backend employs a sentence transformer model from HuggingFace. Since the number of requests per day is small, deploying a dedicated instance for serving inference is not cost-effective, so I was considering AWS Lambda, where I would only pay per inference. Each request boils down to roughly the sketch below.
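For context, here is a minimal sketch of the per-request work in a Lambda-style handler; the model name and the event shape are just assumptions for illustration, not my actual setup:

```python
# Minimal sketch of the inference workload (model name and event shape are assumptions).
from sentence_transformers import SentenceTransformer

# Loaded once per container / cold start, reused across invocations.
model = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")

def handler(event, context):
    # Each request embeds a small batch of sentences and returns the vectors.
    sentences = event["sentences"]
    embeddings = model.encode(sentences)
    return {"embeddings": [e.tolist() for e in embeddings]}
```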
However, each request needs to be served quickly, which effectively requires a GPU, and AWS Lambda does not offer GPUs.
Is there a deployment solution (not necessarily on AWS) where I could use a GPU and still pay per inference?