Real-time inference uses WebSockets for persistent connections, enabling sub-100ms image generation. This is ideal for interactive applications like real-time creativity tools and camera-based inputs. Unlike queue-based inference, real-time connections bypass the queue entirely and route inputs directly to a runner. This eliminates queue wait time, and because the WebSocket maintains a persistent connection, the runner stays warm for all subsequent messages after the initial connection. The first connection may still incur a cold start if no runner is already available. Only models with an explicit real-time endpoint are supported.
Supported Models
- `fast-lcm-diffusion`: SDXL with Latent Consistency Models
- `fast-turbo-diffusion`: Optimized SDXL Turbo
Quick Start
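A minimal sketch of a real-time connection, assuming the `@fal-ai/client` JavaScript package; the input field names are illustrative:

```ts
import { fal } from "@fal-ai/client";

// Open a persistent WebSocket connection to a real-time endpoint.
// After the first message, the runner stays warm for subsequent sends.
const connection = fal.realtime.connect("fal-ai/fast-lcm-diffusion", {
  onResult: (result) => {
    console.log(result); // each send() below yields a result here
  },
  onError: (error) => {
    console.error(error);
  },
});

// Send an input over the open connection.
connection.send({ prompt: "a moonlit forest, highly detailed" });
```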
Performance Tips
For the fastest inference (see the payload sketch below):
- Use 512x512 input dimensions (fastest)
- Provide images as base64 encoded data URLs
- Set `sync_mode: true` to receive base64 encoded responses
- 768x768 and 1024x1024 also work well, but 512x512 is optimal
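Continuing from the Quick Start connection above, a sketch of a payload that applies these tips; the field names are illustrative:

```ts
// 512x512 input, a base64 data-URL image, and sync_mode for base64
// responses, per the tips above. Field names are illustrative.
connection.send({
  prompt: "a watercolor portrait",
  image_url: "data:image/png;base64,<BASE64_DATA>", // base64 data URL
  image_size: { width: 512, height: 512 },
  sync_mode: true,
});
```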
Keeping API Keys Secure
WebSocket connections from browsers cannot safely embed API keys. There are two approaches for client-side authentication: a proxy URL or a token provider.
Proxy URL
The simplest approach is to point the client at a server-side proxy that adds your API key:
Proxy Setup
Learn how to set up a server-side proxy
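A minimal sketch of the proxy approach, assuming the `@fal-ai/client` package's `config()` with a `proxyUrl` option:

```ts
import { fal } from "@fal-ai/client";

// All requests go through your backend, which attaches the API key;
// the key never reaches the browser.
fal.config({
  proxyUrl: "/api/fal/proxy", // path served by your proxy handler
});
```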
Token Provider
For more control, use a `tokenProvider` function that fetches short-lived JWT tokens from your backend. This is useful when you need per-user authentication or want to restrict which apps a token can access.
Client-side example:
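A sketch of wiring up a `tokenProvider`; `tokenProvider` and `tokenExpirationSeconds` are the names this page uses, but their placement among the connection options is an assumption:

```ts
import { fal } from "@fal-ai/client";

const connection = fal.realtime.connect("fal-ai/fast-lcm-diffusion", {
  // Fetch a short-lived JWT from your backend (route sketched below).
  tokenProvider: async () => {
    const res = await fetch("/api/fal/token", { method: "POST" });
    const { token } = await res.json();
    return token;
  },
  // Match the duration requested by your backend (here, 120 seconds).
  tokenExpirationSeconds: 120,
  onResult: (result) => console.log(result),
  onError: (error) => console.error(error),
});
```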
Pass `tokenExpirationSeconds` to enable automatic token refresh before expiry. Set it to the same value as the duration in your backend's token request. If omitted, auto-refresh is disabled and your `tokenProvider` is called once at connection time.
Next.js API Route example (app/api/fal/token/route.ts):
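A sketch of the backend route, assuming fal's temporary-token REST endpoint at `https://rest.alpha.fal.ai/tokens/` and a server-side `FAL_KEY` environment variable; the endpoint URL and body fields are assumptions:

```ts
// app/api/fal/token/route.ts
import { NextResponse } from "next/server";

export async function POST() {
  // Mint a short-lived JWT scoped to specific apps.
  const res = await fetch("https://rest.alpha.fal.ai/tokens/", {
    method: "POST",
    headers: {
      Authorization: `Key ${process.env.FAL_KEY}`, // server-side key only
      "Content-Type": "application/json",
    },
    body: JSON.stringify({
      allowed_apps: ["fast-lcm-diffusion"], // restrict the token's scope
      token_expiration: 120, // seconds; match tokenExpirationSeconds
    }),
  });
  const token = await res.json();
  return NextResponse.json({ token });
}
```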
The `tokenProvider` also works for streaming with `connectionMode: "client"`:
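A sketch under the same assumptions, using the client's streaming API; the `fal.stream()` option placement is an assumption, and `connectionMode` is the name this page uses:

```ts
import { fal } from "@fal-ai/client";

const stream = await fal.stream("fal-ai/fast-lcm-diffusion", {
  connectionMode: "client", // stream directly from the browser
  tokenProvider: async () => {
    const res = await fetch("/api/fal/token", { method: "POST" });
    const { token } = await res.json();
    return token;
  },
  input: { prompt: "a sunset over the mountains" },
});

// Consume events as they arrive over SSE.
for await (const event of stream) {
  console.log(event);
}
```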
Differences from Queue-Based Inference
Real-time WebSocket connections bypass the queue and connect directly to a runner. Several request parameters that work with queue-based inference do not apply:

| Parameter | Behavior with Real-Time |
|---|---|
| `start_timeout` | No effect. There is no queue wait |
| `priority` | No effect. No queue ordering |
| `webhook_url` | Not supported. Results stream back over the WebSocket |
| Automatic retries | Not available. Failed messages return errors on the connection |
| `X-Fal-No-Retry` | No effect. No retry mechanism to disable |
Custom WebSocket Path
By default, the realtime client connects to the `/realtime` path on the app (e.g., `wss://fal.run/fal-ai/my-app/realtime`). If your app exposes a realtime endpoint at a different path, use the `path` option in JavaScript, or pass the `path` parameter to `realtime()` in Python:
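A sketch of overriding the path in the JavaScript client; the `path` option name is taken from this page, and its placement among the connect options is an assumption:

```ts
import { fal } from "@fal-ai/client";

// Connect to wss://fal.run/fal-ai/my-app/custom-realtime instead of
// the default /realtime path.
const connection = fal.realtime.connect("fal-ai/my-app", {
  path: "/custom-realtime",
  onResult: (result) => console.log(result),
  onError: (error) => console.error(error),
});
```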
Realtime vs Streaming
Both realtime and streaming give you faster feedback than polling, but they serve different use cases.

| Feature | Realtime (WebSocket) | Streaming (SSE) |
|---|---|---|
| Direction | Bidirectional (client and server) | One-way (server to client) |
| Connection | Persistent, reusable | New connection per request |
| Latency | Lower (connection reuse) | Higher (new connection each time) |
| Best for | Interactive apps, back-to-back requests | Progressive output, previews |
| Protocol | Binary msgpack (default, customizable) | JSON over SSE |
Protocol Details
The realtime client uses msgpack for binary serialization by default across all SDKs, which is more efficient than JSON for transmitting image data. In Python, `realtime()` and `realtime_async()` provide a `RealtimeConnection` with `send()` and `recv()` methods. In JavaScript, `fal.realtime.connect()` uses callback-based `onResult` and `onError` handlers.
In the JavaScript client, you can customize the message encoding by passing `encodeMessage` and `decodeMessage` options. For example, to use JSON instead of msgpack:
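A sketch of swapping msgpack for JSON; `encodeMessage` and `decodeMessage` are named on this page, but the exact signatures shown here are assumptions:

```ts
import { fal } from "@fal-ai/client";

const connection = fal.realtime.connect("fal-ai/fast-lcm-diffusion", {
  // Encode outgoing messages as UTF-8 JSON bytes instead of msgpack.
  encodeMessage: (message: unknown) =>
    new TextEncoder().encode(JSON.stringify(message)),
  // Decode incoming binary frames back into plain objects.
  decodeMessage: (data: Uint8Array) =>
    JSON.parse(new TextDecoder().decode(data)),
  onResult: (result) => console.log(result),
  onError: (error) => console.error(error),
});
```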