Adjust Scaling Parameters

The most effective way to reduce cold starts is maintaining warm runners using scaling parameters. These control how many runners stay alive and how quickly new ones spin up.

`keep_alive`

Default: 60 seconds Keep idle runners alive before shutdown. This applies after a runner finishes a request and when a newly started runner gets no request.

class MyApp(fal.App):
    keep_alive = 300  # Keep alive for 5 minutes

Benefits: Runners stay warm between requests, reduces cold starts for sporadic traffic Trade-offs: Longer keep_alive = higher costs; shorter = more cold starts

`min_concurrency`

Default: 0 Maintain minimum runners alive at all times, regardless of traffic.

class MyApp(fal.App):
    min_concurrency = 2  # Always keep 2 runners warm

Benefits: Guarantees warm runners always available, eliminates cold starts for baseline capacity Trade-offs: Costs money even with zero traffic

`concurrency_buffer`

Default: 0 Maintain extra runners beyond current demand.

class MyApp(fal.App):
    concurrency_buffer = 2  # Keep 2 extra runners ready

Benefits: Cushion for sudden traffic increases, reduces cold starts during bursts Trade-offs: Higher cost during all traffic levels

Takes precedence over min_concurrency when higher.

`concurrency_buffer_perc`

Default: 0 Set buffer as a percentage of current request volume.

class MyApp(fal.App):
    concurrency_buffer_perc = 20  # 20% buffer

Benefits: Scales buffer with traffic automatically Trade-offs: No buffer during zero traffic, expensive during high traffic

Actual buffer is the maximum of concurrency_buffer and concurrency_buffer_perc / 100 * request volume.

`max_multiplexing`

Default: 1 Number of concurrent requests each runner handles simultaneously.

class MyApp(fal.App):
    max_multiplexing = 4  # Each runner handles up to 4 requests

Benefits: Fewer runners needed, fewer cold starts, better resource utilization Trade-offs: Must use async handlers, each request gets fewer resources, not suitable for all workloads

`scaling_delay`

Default: 0 seconds Wait time before scaling up when a request is queued.

class MyApp(fal.App):
    scaling_delay = 30  # Wait 30 seconds before scaling

Benefits: Prevents premature scaling for brief spikes, reduces unnecessary cold starts Trade-offs: Requests wait longer during genuine traffic increases

`startup_timeout`

Default: Varies Maximum time allowed for setup() to complete.

class MyApp(fal.App):
    startup_timeout = 600  # 10 minutes for setup

Benefits: Prevents runners from being killed during long setups, accommodates large model loading Trade-offs: Doesn’t reduce cold starts (only prevents failed startups), long timeouts can mask real issues

Persistence Across Deploys

Scaling parameters set via CLI or dashboard (keep_alive, min_concurrency, concurrency_buffer, etc.) persist across deployments by default. You don’t lose your tuning when you deploy a code change. To reset all parameters back to code values, deploy with --reset-scale:

fal deploy --reset-scale

Deploy Behavior & Priority

Full explanation of how code, CLI, and dashboard settings interact

Cost Considerations

More warm runners = lower latency but higher cost. Balance based on your needs:

Latency-critical apps: Accept higher cost for warm runners (min_concurrency, keep_alive)
Cost-sensitive apps: Optimize cold start duration instead (container images, caching)
Variable traffic: Use buffers and scaling delays

Full Scaling Reference

Complete guide to scaling configuration including CLI and dashboard methods

​keep_alive

​min_concurrency

​concurrency_buffer

​concurrency_buffer_perc

​max_multiplexing

​scaling_delay

​startup_timeout

​Persistence Across Deploys

Deploy Behavior & Priority

​Cost Considerations

Full Scaling Reference

`keep_alive`

`min_concurrency`

`concurrency_buffer`

`concurrency_buffer_perc`

`max_multiplexing`

`scaling_delay`

`startup_timeout`

Persistence Across Deploys

Cost Considerations