GPU hardware can degrade or fail during operation. Thermal throttling, memory errors, driver issues, and hardware faults can all cause a runner to produce incorrect results or hang silently. fal monitors GPU health at the platform level and provides tools for you to add application-level health checks that detect problems specific to your workload. Platform-level monitoring runs continuously across all runners and reacts to hardware problems without any configuration on your part. Application-level health checks are optional but recommended for production apps, especially those that hold state or are sensitive to GPU degradation. Together, these ensure that unhealthy runners are replaced before they affect your users.
Platform GPU Monitoring
fal continuously monitors GPU metrics across all runners, including temperature, clock frequencies, and throttling events. When issues are detected, the operations team is alerted and can cordon the node (preventing new runners from being scheduled), drain existing runners, and perform GPU resets. If the issue persists, the node is escalated for hardware replacement. This monitoring runs automatically and requires no configuration. You benefit from it regardless of whether you have custom health checks enabled.

Application Health Checks
Platform monitoring catches hardware-level failures, but it cannot detect application-level problems like a corrupted model state, a leaked GPU memory allocation, or an external dependency that went down. For these, you can define a health check endpoint that fal calls periodically to verify your runner is functioning correctly. If the check fails failure_threshold consecutive calls (default 3), the runner is terminated and replaced. Health checks run every 15 seconds when call_regularly=True.
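The replacement policy can be sketched in plain Python. This illustrates the documented behavior (consecutive failures trigger replacement, a success resets the count); the class and method names are hypothetical, not part of fal's SDK:

```python
FAILURE_THRESHOLD = 3  # fal's documented default


class RunnerHealthTracker:
    """Mirrors the documented policy: a runner is replaced after
    failure_threshold consecutive failed checks, and any successful
    check resets the count. Illustrative only, not fal's code."""

    def __init__(self, failure_threshold: int = FAILURE_THRESHOLD):
        self.failure_threshold = failure_threshold
        self.consecutive_failures = 0
        self.terminated = False

    def record(self, healthy: bool) -> None:
        if healthy:
            # Any successful check resets the consecutive-failure count.
            self.consecutive_failures = 0
        else:
            self.consecutive_failures += 1
            if self.consecutive_failures >= self.failure_threshold:
                # The gateway would now terminate and replace the runner.
                self.terminated = True
```

Note that two failures followed by a success do not count toward replacement; only an unbroken run of failures does.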
Non-Invasive vs Invasive Checks
Health checks with call_regularly=True run in parallel with request processing. Keep these lightweight, since they share GPU and CPU resources with active requests: check connection status, memory usage, or simple assertions rather than running inference.
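A non-invasive check might look like the sketch below. The specific thresholds and checks are illustrative assumptions, not fal requirements; the point is that everything here is cheap enough to run alongside live requests:

```python
import shutil


def lightweight_health_check() -> bool:
    """Non-invasive check suitable for call_regularly=True: cheap
    assertions only, since it shares resources with live requests.
    Thresholds and checks here are illustrative, not fal APIs."""
    # Disk headroom for temp files and model downloads.
    free_gb = shutil.disk_usage("/").free / 1e9
    if free_gb < 1.0:
        return False
    # With PyTorch you might also compare torch.cuda.memory_allocated()
    # against a budget, or assert that your model object still exists.
    return True
```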
For more thorough checks that need exclusive GPU access (e.g., running a test inference), set call_regularly=False. In this mode, the health check only runs when the gateway sends an x-fal-runner-health-check header, which happens between requests or after specific error conditions.
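An invasive check can be gated on that header, as in this sketch. The header name comes from the text above; the function and helper names are hypothetical, and `run_test_inference` stands in for your model's smallest end-to-end call:

```python
GATEWAY_HEADER = "x-fal-runner-health-check"  # header named in the docs


def run_test_inference() -> bool:
    # Hypothetical stand-in for a real test inference; replace with
    # your model's smallest end-to-end call.
    return True


def invasive_health_check(headers: dict) -> int:
    """With call_regularly=False the gateway only triggers this between
    requests, marking the call with the header above. Returning 503
    signals that the runner should be replaced. Sketch only; these
    names are not fal APIs."""
    if GATEWAY_HEADER not in headers:
        return 200  # not a gateway-initiated check; skip the expensive path
    return 200 if run_test_inference() else 503
```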
Handling GPU Errors in Your Code
If your code detects a GPU-level error during request processing (such as an out-of-memory condition or CUDA error), return a 503 status code to signal that the runner should be terminated and replaced. This is the appropriate response when the runner’s GPU state is corrupted and it cannot reliably serve further requests.
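One way to apply this pattern is sketched below. The 503 semantics come from the text above; everything else (the exception class, the stubbed model call, how your framework maps the exception to a response) is a hypothetical illustration:

```python
class RunnerUnhealthy(Exception):
    """Hypothetical helper your web framework would map to an
    HTTP 503 response (e.g. via an exception handler)."""
    status_code = 503


def run_inference(prompt: str) -> str:
    # Stub standing in for your real model call.
    if prompt == "boom":
        raise RuntimeError("CUDA error: out of memory")
    return f"image for {prompt!r}"


def generate(prompt: str) -> str:
    try:
        return run_inference(prompt)
    except RuntimeError as exc:
        message = str(exc)
        if "CUDA" in message or "out of memory" in message:
            # GPU state may be corrupted: a 503 tells fal to terminate
            # and replace this runner instead of sending it more requests.
            raise RunnerUnhealthy(message) from exc
        raise  # unrelated errors propagate normally
```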