When you scale to multiple runners, fal normally treats them as interchangeable and routes requests to any available runner. This works well when every runner is identical, but falls short when runners hold different state. Consider an app that serves multiple diffusion models but can only keep a few loaded in GPU memory at once. Routing a request for FLUX.1 to a runner that already has FLUX.1 loaded avoids a costly model swap, while routing it to a runner with a different model loaded means waiting for a full model load.
Routing hints solve this with a two-sided mechanism. On the caller side, you pass a hint parameter (sent as the X-Fal-Runner-Hint header) that describes what the request needs. On the server side, your app implements a provide_hints() method that tells fal what each runner is currently specialized for. When both are present, fal’s router tries to match requests to runners with compatible hints. If no matching runner is available or all matching runners are busy, the request goes to any available runner without waiting. Hints are best-effort: they improve cache hit rates but never block a request from being processed.
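On the caller side, the hint ultimately travels as a single request header. The snippet below is a minimal sketch of attaching it to a set of headers. The helper name with_runner_hint is hypothetical and not part of any fal client; a client that exposes a hint parameter would normally set this header for you.

```python
def with_runner_hint(headers: dict[str, str], hint: str) -> dict[str, str]:
    """Return a copy of `headers` with the routing hint attached.

    `with_runner_hint` is an illustrative helper, not a fal API.
    """
    merged = dict(headers)  # copy so the caller's dict is not mutated
    merged["X-Fal-Runner-Hint"] = hint
    return merged


base_headers = {"Content-Type": "application/json"}
request_headers = with_runner_hint(base_headers, "flux-schnell")
```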
How It Works
The router matches the hint string from the caller against the list of strings each runner reports via provide_hints(). The matching is exact: if the caller sends hint="flux-schnell" and a runner’s provide_hints() returns ["flux-schnell", "sd-xl"], that runner is preferred. If no runner has a matching hint, the request goes to any available runner.
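The matching rule can be illustrated with a toy model in plain Python. This is only a sketch of the behavior described above, not fal's router: the Runner class, the pick_runner function, and the choice to return None when nothing is available are all assumptions made for illustration.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Runner:
    """Toy stand-in for a runner (hypothetical, not a fal type)."""
    name: str
    hints: list[str] = field(default_factory=list)  # last provide_hints() result
    busy: bool = False


def pick_runner(runners: list[Runner], hint: Optional[str]) -> Optional[Runner]:
    """Prefer an available runner whose hints contain `hint` exactly.

    Falls back to any available runner, so a hint never blocks a request.
    """
    available = [r for r in runners if not r.busy]
    if not available:
        return None  # assumption: the real platform's behavior here is not modeled
    if hint is not None:
        for runner in available:
            if hint in runner.hints:  # exact string match, no prefix matching
                return runner
    return available[0]


runners = [
    Runner("runner-a", ["sd-xl"]),
    Runner("runner-b", ["flux-schnell", "sd-xl"]),
]
preferred = pick_runner(runners, "flux-schnell")  # runner-b holds the model
```

Because the fallback always returns some available runner, a missing or unmatched hint costs at most a model load, never a stalled request.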
provide_hints() is called after every response and the result is sent back to the platform as a response header. This means hints update dynamically as the runner loads and unloads models. A runner that starts empty will initially match any request, and as it loads models, it becomes specialized for those models.
Example
This app serves any Hugging Face diffusion model by name. Each runner maintains a cache of loaded models. The hint is the model name, which matches what provide_hints() reports.
Application
When a runner first starts, it has no models loaded, so provide_hints() returns an empty list and the runner matches any request. As it loads models, the hints update to include the loaded model names. Over time, the router naturally specializes runners by directing repeat requests for the same model to the same runner.
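How the reported hints evolve can be sketched with a plain-Python stand-in for the runner's model cache. This is not the fal application code: ModelCacheRunner, load_model, generate, and the least-recently-used eviction policy are assumptions for illustration, and only the provide_hints() name comes from the text above.

```python
from collections import OrderedDict


class ModelCacheRunner:
    """Toy runner that keeps at most `max_models` models loaded (hypothetical)."""

    def __init__(self, max_models: int = 2) -> None:
        self.max_models = max_models
        self.models: OrderedDict[str, str] = OrderedDict()

    def load_model(self, name: str) -> str:
        # Placeholder for an expensive load of a diffusion pipeline.
        return f"<pipeline {name}>"

    def generate(self, model_name: str) -> str:
        if model_name in self.models:
            self.models.move_to_end(model_name)  # mark as most recently used
        else:
            if len(self.models) >= self.max_models:
                self.models.popitem(last=False)  # evict least recently used
            self.models[model_name] = self.load_model(model_name)
        return self.models[model_name]

    def provide_hints(self) -> list[str]:
        # Report the currently loaded model names as hints.
        return list(self.models)


runner = ModelCacheRunner(max_models=2)
hints_at_start = runner.provide_hints()  # empty: nothing loaded yet
runner.generate("model-a")
runner.generate("model-b")
runner.generate("model-a")  # cache hit, model-a becomes most recently used
runner.generate("model-c")  # cache full, model-b is evicted
hints_after = runner.provide_hints()
```

In this sketch the hint list shrinks as well as grows: once a model is evicted, the runner stops advertising it, so later requests for that model are no longer steered to this runner.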