Every model on fal can be called through the same set of inference methods, whether it is a pre-trained model from the gallery or your own app running on Serverless. This page walks through each method, explains when to reach for it, and links to the deeper reference pages. Before calling any model you will need an API key and the fal client installed in your project.

fal provides five calling patterns that cover the spectrum from quick prototyping to high-throughput production pipelines. All of them benefit from fal's autoscaling infrastructure, where runners spin up on demand to handle your requests. The key decision is whether you need the queue (recommended for reliability), a direct call (simplest path), streaming (progressive output), or a real-time WebSocket connection (lowest latency). Each method is covered below with a code example and guidance on when it is the right fit.

Documentation Index
Fetch the complete documentation index at: https://fal.ai/docs/llms.txt
Use this file to discover all available pages before exploring further.
| Method | How it works |
|---|---|
| run() | Direct HTTP call, no queue |
| subscribe() | Queue-based, blocks until result |
| submit() | Queue-based, returns immediately (recommended) |
| stream() | Progressive output via SSE |
| realtime() | WebSocket, persistent connection |
Direct (run)
The simplest way to call a model. Sends a direct HTTP request to fal.run and returns the result. No queue, no retries, no polling.
Use run for quick scripts, prototyping, or any model with fast response times where you want the lowest overhead. Because there is no queue involved, the call goes straight to a runner and returns the response directly. The tradeoff is that direct calls do not retry on failure. If the runner returns an error or times out, you get the error immediately.
Learn more
Direct and queue-backed synchronous calls
Subscribe (Queue-backed synchronous)
Like run, but uses the queue under the hood. It submits a request, polls automatically, and blocks until the result is ready. You get automatic retries and reliability with a simple interface.
subscribe is a good choice when you want the simplicity of a blocking call combined with queue-backed reliability. It handles polling for you, so the code looks almost identical to run, but your request is durable and will be retried if a runner fails. Reach for it in simple integrations, backend scripts, or anywhere you do not need to manage the request lifecycle yourself.
Learn more
Direct and queue-backed synchronous calls
Asynchronous (submit)
The recommended approach for production. Submit a request to the queue and return immediately, then poll for status or receive results via webhook.
Polling:
Status types
The handler.status() method returns one of three types. Pass with_logs=True to include runner logs.
| Type | Fields | Meaning |
|---|---|---|
| Queued | position (int) | Waiting in queue. position is how many requests are ahead. |
| InProgress | logs (list or None) | A runner is processing the request. logs contains messages if with_logs=True. |
| Completed | logs (list or None), metrics (dict) | Result is ready. metrics includes inference_time in seconds. |
In the JavaScript client, the corresponding types are InQueueQueueStatus, InProgressQueueStatus, and CompletedQueueStatus. See the full Python SDK reference and JavaScript SDK reference for details.
Webhook (no polling needed):
Learn Async Inference
The recommended way to call models at scale
Streaming (stream)
For models that produce output progressively. Each event arrives as it is generated, so you can display partial results without waiting for the full response. This is useful for showing image generation previews or streaming LLM tokens.
The stream() method connects to the /stream path on the model endpoint. Not all models support streaming. Check the model's API documentation for availability.

Learn Streaming
Receive output as it’s generated
Real-time (realtime)
For interactive applications that need the lowest possible latency. Opens a persistent WebSocket connection to a warm runner, enabling back-to-back requests without reconnection overhead. Only available for models with an explicit real-time endpoint.
Learn Real-time
WebSocket connections for interactive apps
Getting Started
Before calling any model, install and configure the fal client for your language. If you are building a browser-based app, you will also need a server-side proxy to keep your API key out of client-side code.

Client Setup
Install and configure the fal client
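For the Python client, setup typically looks like this; the package name and FAL_KEY environment variable are the standard ones, but check the setup page for your language:

```shell
# Install the Python client and export your API key.
pip install fal-client
export FAL_KEY="your-api-key"  # placeholder; create a key in the fal dashboard
```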
Proxy Setup
Keep your API key secure in client-side apps