First, excuse any naive statements below; I'm a newcomer to the field.
How do web applications that integrate fine-tuning of large machine learning/deep learning models handle the storage and retrieval of these models for inference?
I'm trying to implement a web app that lets users fine-tune a Stable Diffusion model on their own images with DreamBooth. The fine-tuned model is quite large, reaching several gigabytes. After the model is trained and saved, the app should retrieve it and use it for inference each time a user visits the site and requests a generation.
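For context, after DreamBooth training finishes I'd persist the whole pipeline to disk, roughly like this (a minimal sketch assuming the Hugging Face diffusers library; the model ID and output path are placeholders, not my actual setup):

```python
from diffusers import StableDiffusionPipeline

# In practice this would be the user's freshly fine-tuned pipeline;
# the base model ID here is just a stand-in.
pipeline = StableDiffusionPipeline.from_pretrained("runwayml/stable-diffusion-v1-5")

# Saving the full pipeline produces a directory several gigabytes in size,
# which is the artifact I then need to store and serve per user.
pipeline.save_pretrained("/tmp/user-123-model")  # placeholder path
```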
The current approach I am considering is to store the fine-tuned model in a compressed format in an S3 or R2 bucket. Each time a user visits the web app and requests an inference, I would retrieve the model from the bucket, decompress it, and run the inference.
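Concretely, each request would do something like the following (a minimal sketch of my idea, not a working implementation; the bucket name, key layout, and local paths are placeholder assumptions, using boto3 and diffusers):

```python
import tarfile

import boto3
from diffusers import StableDiffusionPipeline

s3 = boto3.client("s3")

def load_finetuned_model(user_id: str) -> StableDiffusionPipeline:
    """Fetch a user's fine-tuned model from the bucket, decompress, and load it."""
    archive_path = f"/tmp/{user_id}.tar.gz"
    model_dir = f"/tmp/{user_id}"

    # Download the compressed model archive (several GB) from the bucket.
    s3.download_file("my-models-bucket", f"models/{user_id}.tar.gz", archive_path)

    # Decompress to a local directory before loading.
    with tarfile.open(archive_path, "r:gz") as tar:
        tar.extractall(model_dir)

    # Loading the pipeline itself adds further latency on top of the
    # download and decompression above.
    return StableDiffusionPipeline.from_pretrained(model_dir)
```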
That said, adding the overhead of fetching and decompression to every inference is obviously not a good idea.
I'm fairly sure there's a standard approach the machine learning community follows for handling such scenarios. What is it, if it exists? How are these scenarios typically handled?