First, excuse any naive statements below; I'm a newcomer to the field.
How do web applications that integrate fine-tuning of large machine learning/deep learning models handle the storage and retrieval of these models for inference?
I'm trying to implement a web app that lets users fine-tune a Stable Diffusion model on their own images with DreamBooth. The fine-tuned model is quite large, reaching several gigabytes. After the model is trained and saved, the app should retrieve it and use it for inference each time a user visits the site and requests a generation.
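For context, after DreamBooth training finishes I'd persist the whole pipeline to disk, roughly like this (a minimal sketch assuming the Hugging Face diffusers library; the model ID and output path are placeholders, not my actual setup):

```python
from diffusers import StableDiffusionPipeline

# In practice this would be the user's freshly fine-tuned pipeline;
# the base model ID here is just a stand-in.
pipeline = StableDiffusionPipeline.from_pretrained("runwayml/stable-diffusion-v1-5")

# Saving the full pipeline produces a directory several gigabytes in size,
# which is the artifact I then need to store and serve per user.
pipeline.save_pretrained("/tmp/user-123-model")  # placeholder path
```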
The current approach I am considering is to store the fine-tuned model in a compressed format in an S3 or R2 bucket. Each time a user visits the web app and requests an inference, I would retrieve the model from the bucket, decompress it, and run the inference.
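Concretely, each request would do something like the following (a minimal sketch of my idea, not a working implementation; the bucket name, key layout, and local paths are placeholder assumptions, using boto3 and diffusers):

```python
import tarfile

import boto3
from diffusers import StableDiffusionPipeline

s3 = boto3.client("s3")

def load_finetuned_model(user_id: str) -> StableDiffusionPipeline:
    """Fetch a user's fine-tuned model from the bucket, decompress, and load it."""
    archive_path = f"/tmp/{user_id}.tar.gz"
    model_dir = f"/tmp/{user_id}"

    # Download the compressed model archive (several GB) from the bucket.
    s3.download_file("my-models-bucket", f"models/{user_id}.tar.gz", archive_path)

    # Decompress to a local directory before loading.
    with tarfile.open(archive_path, "r:gz") as tar:
        tar.extractall(model_dir)

    # Loading the pipeline itself adds further latency on top of the
    # download and decompression above.
    return StableDiffusionPipeline.from_pretrained(model_dir)
```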
That said, adding the overhead of fetching and decompression to every inference is obviously not a good idea.
I'm fairly sure there's a standard approach the machine learning community follows for handling such scenarios. What is it, if it exists? How are these scenarios typically handled?