We have a FastAPI-based web server serving an ML model. There are two endpoints in this app: /client and /internal.
We use the /internal endpoint to encode millions of documents in batches for our own internal usage. The /client endpoint is exposed to clients so they can request encoding of their documents, and it is always just one document per request.
A situation can arise where we hit the internal endpoint with a request to encode millions of documents, and then a client hits the client endpoint to encode their single document.
Naturally, we want to prioritize client requests so that the client does not have to wait until our millions of documents are encoded.
What are the best strategies to accomplish this within the given framework of FastAPI + Python? How can I prioritize client requests to serve them faster?
I have come up with the following ideas:
- Just spin up another instance of the web server on a different port to serve client requests - this is the easiest solution to implement;
- Isolate one gunicorn worker for the /client endpoint - I'm not even sure if this is possible?;
- Refactor the web server with a Producer/Consumer pattern implementing a priority queue. This is doable, but it will probably take a lot of time and research to figure out how to implement it.
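For what it's worth, the third idea can be sketched with asyncio's built-in PriorityQueue: a single consumer task always pops the lowest-priority tuple first, so a client document enqueued behind a large internal batch still gets encoded first. Everything here is a hypothetical placeholder (the encode stub, the priority constants, the submit helper), a minimal sketch of the pattern rather than a production implementation:

```python
import asyncio

# Hypothetical convention: lower number = served first.
CLIENT_PRIORITY = 0
INTERNAL_PRIORITY = 1

async def encode(doc: str) -> str:
    # Stand-in for the real model call.
    await asyncio.sleep(0)
    return f"encoded:{doc}"

async def worker(queue: asyncio.PriorityQueue, processed: list) -> None:
    # Single consumer: always pops the smallest (priority, seq) tuple first.
    while True:
        _priority, _seq, doc, fut = await queue.get()
        fut.set_result(await encode(doc))
        processed.append(doc)
        queue.task_done()

async def submit(queue: asyncio.PriorityQueue, priority: int, seq: int, doc: str) -> str:
    # seq breaks ties so the heap never has to compare futures.
    fut = asyncio.get_running_loop().create_future()
    await queue.put((priority, seq, doc, fut))
    return await fut

async def main() -> list:
    queue: asyncio.PriorityQueue = asyncio.PriorityQueue()
    processed: list = []
    consumer = asyncio.create_task(worker(queue, processed))
    # Enqueue a small "internal batch" first, then one client document.
    internal = [submit(queue, INTERNAL_PRIORITY, i, f"internal-{i}") for i in range(5)]
    client = submit(queue, CLIENT_PRIORITY, 0, "client-doc")
    await asyncio.gather(*internal, client)
    consumer.cancel()
    return processed

processed = asyncio.run(main())
print(processed)  # the client document is processed first despite being enqueued last
```

In a real app, the /client handler would await submit(..., CLIENT_PRIORITY, ...) and the /internal handler would use INTERNAL_PRIORITY, with the worker started on application startup. Note that a single asyncio consumer only helps if the encoding step releases the event loop (e.g. by dispatching to a thread or process pool); otherwise one large batch item still blocks everything behind it.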