Scaling Apps - Cerebrium

Cerebrium’s scaling system automatically manages computing resources to match app demand. It handles everything from occasional single requests to many simultaneous requests, while optimizing for both performance and cost.

How Autoscaling Works

The scaling system monitors two key metrics to make scaling decisions: the number of requests currently waiting in the queue, which indicates immediate demand, and how long each request has waited in the queue. When either metric exceeds its threshold, new instances start within 3 seconds to handle the increased load.

Scaling can also be configured to match the expected traffic of an application, as described in the sections below.

As traffic decreases, instances enter a cooldown period at reduced concurrency. If reduced concurrency is maintained for the cooldown duration, instances scale down to optimize resource usage. This automatic cycle ensures apps remain responsive while managing costs effectively.
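The scale-out rule described in this section can be sketched in Python as follows. This is a minimal illustration, not Cerebrium's implementation: the `QueueSnapshot` type, the `should_scale_out` function, and both default threshold values are hypothetical, since the actual thresholds are not stated here.

```python
from dataclasses import dataclass


@dataclass
class QueueSnapshot:
    queued_requests: int         # requests currently waiting in the queue
    longest_wait_seconds: float  # longest time any queued request has waited


def should_scale_out(snapshot, max_queued=0, max_wait_seconds=1.0):
    """Return True when either queue metric exceeds its threshold.

    Both thresholds are illustrative assumptions, not documented values.
    """
    return (
        snapshot.queued_requests > max_queued
        or snapshot.longest_wait_seconds > max_wait_seconds
    )


# A short queue can still trigger scale-out if a request has waited too long.
print(should_scale_out(QueueSnapshot(1, 5.0), max_queued=3))  # True
```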

Scaling Configuration

The cerebrium.toml file controls scaling behavior through several key parameters:

[scaling]
min_replicas = 0           # Minimum running instances
max_replicas = 3           # Maximum concurrent instances
cooldown = 60              # Cooldown period in seconds
replica_concurrency = 1    # The maximum number of requests each replica of an app can accept

Minimum Instances

The min_replicas parameter defines how many instances remain active at all times. Setting this to 1 or higher maintains warm instances ready for immediate response, eliminating cold starts but increasing costs. This configuration suits apps that require consistent response times or need to meet specific SLA requirements.

Maximum Instances

The max_replicas parameter sets an upper limit on concurrent instances, controlling costs and protecting backend systems. When traffic increases, new instances start automatically up to this configured maximum.

Cooldown Period

The cooldown parameter specifies the time window (in seconds) that must pass at reduced concurrency before an instance scales down. This prevents premature scale-down during brief traffic dips that might be followed by more requests. A longer cooldown period helps handle bursty traffic patterns but increases instance running time and cost.

Replica Concurrency

The replica_concurrency parameter sets how many requests an app instance can handle concurrently. This is a hard limit: an individual replica never accepts more than this many requests at a time. By default, once an instance reaches this limit and requests are still in flight, the system scales out by the number of new instances required to serve them. For example, if replica_concurrency=1 and there are 3 requests in flight with no replicas currently available, Cerebrium will scale out 3 instances of the application to meet that demand.
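The arithmetic behind the scale-out example can be written as a small helper. This is a sketch under the assumption that scale-out is a simple ceiling division of unserved requests by replica_concurrency; the function name and the `available_slots` parameter are hypothetical.

```python
import math


def replicas_to_add(in_flight, replica_concurrency, available_slots=0):
    """New replicas needed for in-flight requests that have no free slot."""
    unserved = max(0, in_flight - available_slots)
    return math.ceil(unserved / replica_concurrency)


print(replicas_to_add(3, 1))  # 3, matching the example in this section
print(replicas_to_add(5, 2))  # 3, because two replicas would cover only 4
```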

Most GPU applications require replica_concurrency to be set to 1. If the workload requires a GPU but higher throughput is desired, replica_concurrency may be increased as long as access to GPU resources is controlled within the application through batching.

Processing Multiple Requests

Apps can process multiple requests simultaneously using Cerebrium’s batching and concurrency features. The system offers native support for frameworks with built-in batching capabilities and enables custom implementations through the custom runtime feature. For detailed information about handling multiple requests efficiently, see our Batching & Concurrency Guide.

Instance Management

Cerebrium ensures reliability through automatic instance health management. The system restarts instances that encounter issues, quickly starts new instances to maintain processing capacity, and monitors instance health continuously. Apps requiring maximum reliability often combine several scaling features:

[scaling]
min_replicas = 2              # Maintain redundant instances
cooldown = 600                # Extended warm period
max_replicas = 10             # Room for traffic spikes
response_grace_period = 1200  # Maximum request lifespan ensuring graceful exit

The response_grace_period parameter specifies the maximum time, in seconds, that a request may take to finish, and gives instances time to complete active requests during normal operation and shutdown. During normal replica operation, it simply acts as a request timeout. During replica shutdown, the Cerebrium system sends a SIGTERM signal to the replica, waits for the specified grace period, sends SIGKILL if the instance has not stopped, and terminates any active requests with a GatewayTimeout error.

When using the cortex runtime (default), SIGTERM signals are automatically handled to allow graceful termination of requests. For custom runtimes, you’ll need to implement SIGTERM handling yourself to ensure requests complete gracefully before termination. See our Graceful Termination guide for detailed implementation examples, including FastAPI patterns for tracking and completing in-flight requests during shutdown.
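For custom runtimes, the drain-on-SIGTERM pattern can be sketched with the Python standard library alone. The `InFlightTracker` class, its method names, and `install_sigterm_handler` are hypothetical and only illustrate the idea; they are not part of any Cerebrium API, and a real server would wire these calls into its own request lifecycle.

```python
import signal
import threading


class InFlightTracker:
    """Counts active requests and remembers that shutdown was requested."""

    def __init__(self):
        self._lock = threading.Lock()
        self._active = 0
        self._idle = threading.Event()
        self._idle.set()            # no requests are running at startup
        self.shutting_down = False

    def start_request(self):
        """Return False if the replica is draining and must reject new work."""
        with self._lock:
            if self.shutting_down:
                return False
            self._active += 1
            self._idle.clear()
            return True

    def finish_request(self):
        with self._lock:
            self._active -= 1
            if self._active == 0:
                self._idle.set()

    def handle_sigterm(self, signum, frame):
        # Only flip a flag: signal handlers should not block or take locks.
        self.shutting_down = True

    def wait_until_idle(self, timeout):
        """Block until all requests finish or `timeout` seconds elapse."""
        return self._idle.wait(timeout)


def install_sigterm_handler(tracker):
    """Register the handler; must be called from the main thread."""
    signal.signal(signal.SIGTERM, tracker.handle_sigterm)
```

In this sketch, a server would call `start_request()` before handling each request and `finish_request()` afterwards. Once SIGTERM arrives, it would call `wait_until_idle()` with a timeout below response_grace_period and then exit, so that work finishes before SIGKILL is sent.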

Performance metrics available through the dashboard help monitor scaling behavior:

Request processing times
Active instance count
Cold start frequency
Resource usage patterns

The system status and platform-wide metrics remain accessible through our status page, where Cerebrium maintains 99.9% uptime.

Using Scaling Metrics

Cerebrium provides a variety of scaling criteria which may be used to scale apps according to different metrics. As mentioned above, by default this is determined by an application’s replica_concurrency. However, this strategy may be insufficient for some use cases and so Cerebrium currently provides four scaling metrics to choose from:

concurrency_utilization
requests_per_second
cpu_utilization
memory_utilization

To use one, specify the metric and a target in the scaling section:

[scaling]
min_replicas = 0
cooldown = 600
max_replicas = 10
response_grace_period = 120
replica_concurrency = 1
scaling_metric = "concurrency_utilization"
scaling_target = 100

Concurrency Utilization

concurrency_utilization is the default scaling metric, and defaults to a target of 100% if not set explicitly. This metric works by maintaining a maximum percentage of your replica_concurrency, averaged across every instance of the app. For example, if an application has replica_concurrency=1 and scaling_target=70, Cerebrium will attempt to maintain 0.7 requests per instance across your entire deployed service, so an extra 30% of capacity is always provisioned. As a different example, say an app has replica_concurrency=200 and scaling_target=80. In this case, Cerebrium will maintain 160 requests per instance, and will begin to scale out once that target has been exceeded.
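The two examples in this section reduce to a short calculation. The helper names below are hypothetical, and the replica count assumes the autoscaler simply rounds up to the smallest number of replicas that keeps the per-instance average at or below the target; the real autoscaler may smooth or delay this.

```python
import math


def target_requests_per_replica(replica_concurrency, scaling_target=100):
    """Average in-flight requests each replica should carry at the target."""
    return replica_concurrency * scaling_target / 100


def suggested_replicas(in_flight, replica_concurrency, scaling_target=100):
    """Smallest replica count that keeps the average at or below the target."""
    # Integer numerator and denominator avoid rounding drift before the ceiling.
    return math.ceil(in_flight * 100 / (replica_concurrency * scaling_target))


print(target_requests_per_replica(1, 70))    # 0.7
print(target_requests_per_replica(200, 80))  # 160.0
print(suggested_replicas(7, 1, 70))          # 10
print(suggested_replicas(321, 200, 80))      # 3
```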

Requests per Second

requests_per_second is a straightforward criterion that aims to maintain a maximum application throughput, measured in requests per second and averaged over every application instance. It can be a more effective scaling metric than concurrency_utilization if the application's throughput has been determined through appropriate performance evaluation. For example, if scaling_target=5, Cerebrium will attempt to maintain an average of 5 requests/s across all app instances. This criterion is not recommended for most GPU applications, since it does not enforce concurrency limits.
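As a rough illustration of the throughput target, the replica count needed to hold a per-instance average can be estimated as below. The function name is hypothetical, and the ceiling-division rule is an assumption about how the average is maintained.

```python
import math


def replicas_for_throughput(total_requests_per_second, scaling_target):
    """Replicas needed so the per-replica average stays at or below the target."""
    if scaling_target <= 0:
        raise ValueError("scaling_target must be positive")
    return math.ceil(total_requests_per_second / scaling_target)


print(replicas_for_throughput(12, 5))  # 3
print(replicas_for_throughput(10, 5))  # 2
```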

CPU Utilization

cpu_utilization uses a maximum CPU percentage utilization averaged over all instances of an application to scale out, relative to the hardware.cpu value. For example, if an application has cpu=2 and scaling_target=80, Cerebrium will attempt to maintain 80% CPU utilization (1.6 CPUs) per instance across your entire deployed service. Since there is no notion of scaling relative to 0 CPU units, it is required that min_replicas=1 if using this metric.

Memory Utilization

memory_utilization uses a maximum memory percentage utilization averaged over all instances of an application to scale out, relative to the hardware.memory value. Note that this refers to RAM, not GPU VRAM utilization. For example, if an application has memory=10 and scaling_target=80, Cerebrium will attempt to maintain 80% memory utilization (8 GB) per instance across your entire deployed service. Since there is no notion of scaling relative to 0 GB of memory, min_replicas=1 is required when using this metric.
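The per-instance figures quoted for cpu_utilization and memory_utilization follow from a single multiplication, shown below. The second helper is an assumption for illustration only: it applies the proportional rule used by the Kubernetes Horizontal Pod Autoscaler (desired = ceil(current × observed / target)), which this page does not confirm Cerebrium uses. Both function names are hypothetical.

```python
import math


def per_instance_budget(allocated, scaling_target):
    """Resource amount per instance that corresponds to the target percentage."""
    return allocated * scaling_target / 100


def proportional_replicas(current_replicas, average_utilization, scaling_target):
    """HPA-style estimate (an assumption); never drops below one replica."""
    desired = math.ceil(current_replicas * average_utilization / scaling_target)
    return max(1, desired)


print(per_instance_budget(2, 80))         # 1.6 CPUs
print(per_instance_budget(10, 80))        # 8.0 GB
print(proportional_replicas(2, 95, 80))   # 3
```

The floor of one replica in `proportional_replicas` mirrors the min_replicas=1 requirement: with zero replicas there is no utilization to measure.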

Keeping a Scaling Buffer

In use cases where app startup time or total request time is long and traffic is predictable, an app may need consistent excess capacity, added dynamically on top of its current required capacity, to sustain adequate request throughput. Cerebrium provides a replica buffer mechanism for this: the scaling system aims to keep a specific number of excess replicas available at all times above what the scaling metric suggests. This is configured through the scaling_buffer option. Currently, it is only available with the following scaling metrics:

concurrency_utilization
requests_per_second

The buffer can be added to the scaling section as such, by specifying scaling_buffer:

[scaling]
min_replicas = 1
cooldown = 600
max_replicas = 10
response_grace_period = 120
replica_concurrency = 1
scaling_metric = "concurrency_utilization"
scaling_target = 100
scaling_buffer = 3

To illustrate using the above config: if this application receives no traffic, it holds 1 replica of capacity as a baseline, because the buffer scales based on requests. Since the config specifies a target of 100 for concurrency_utilization and replica_concurrency=1, when the application receives 1 request the autoscaler suggests 1 replica. However, because scaling_buffer=3, the application actually scales out to (1+3)=4 replicas. In other words, the scaling buffer adds a static number of replicas to the number the autoscaler suggests from the scaling target. Once this request has completed, the usual cooldown period applies, and the app's replica count scales back down to the baseline of 1 replica.
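The walkthrough above can be condensed into a few lines. This sketch encodes one reading of that example: with no traffic the app sits at min_replicas, and once the autoscaler suggests any replicas the buffer is added on top, clamped to max_replicas. The function name is hypothetical and the clamping is an assumption.

```python
def replicas_with_buffer(suggested, scaling_buffer, min_replicas, max_replicas):
    """Replica count after applying the buffer to the autoscaler's suggestion."""
    if suggested == 0:
        # No traffic: the buffer scales with requests, so only the baseline runs.
        return min_replicas
    buffered = suggested + scaling_buffer
    return max(min_replicas, min(max_replicas, buffered))


print(replicas_with_buffer(0, 3, min_replicas=1, max_replicas=10))  # 1
print(replicas_with_buffer(1, 3, min_replicas=1, max_replicas=10))  # 4
print(replicas_with_buffer(9, 3, min_replicas=1, max_replicas=10))  # 10
```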

Evaluation Interval

Requires CLI version 2.1.5 or higher.

The evaluation_interval parameter controls the time window (in seconds) over which the autoscaler evaluates metrics before making scaling decisions. The default is 30 seconds, with a valid range of 6-300 seconds.

[scaling]
evaluation_interval = 30  # Evaluate metrics over 30-second windows

A shorter interval makes the autoscaler more responsive to traffic spikes but may cause more frequent scaling events. A longer interval smooths out transient spikes but may delay scaling responses.

For bursty workloads, a shorter evaluation_interval (e.g., 10-15 seconds) helps the system respond quickly to demand. For steady workloads, a longer interval (e.g., 60 seconds) reduces unnecessary scaling churn.
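The effect of the window length can be seen with a simple windowed average. The `WindowedAverage` class is a hypothetical stand-in for however the autoscaler aggregates its metric; only the 6-300 second range and the 30 second default come from this page.

```python
from collections import deque


class WindowedAverage:
    """Averages metric samples that fall inside the evaluation window."""

    def __init__(self, evaluation_interval=30):
        if not 6 <= evaluation_interval <= 300:
            raise ValueError("evaluation_interval must be between 6 and 300")
        self.evaluation_interval = evaluation_interval
        self._samples = deque()  # (timestamp_seconds, value) pairs

    def add(self, timestamp, value):
        self._samples.append((timestamp, value))
        cutoff = timestamp - self.evaluation_interval
        # Drop samples that have aged out of the window.
        while self._samples and self._samples[0][0] <= cutoff:
            self._samples.popleft()

    def average(self):
        if not self._samples:
            return 0.0
        return sum(value for _, value in self._samples) / len(self._samples)


short, long = WindowedAverage(10), WindowedAverage(60)
for second in range(60):
    load = 10 if second >= 55 else 1  # a 5-second spike at the end
    short.add(second, load)
    long.add(second, load)

print(short.average(), long.average())  # 5.5 1.75
```

The 10-second window reports a much higher average during the spike and would trigger scaling sooner, while the 60-second window barely moves.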

Load Balancing

Requires CLI version 2.1.5 or higher.

The load_balancing parameter controls how incoming requests are distributed across your replicas. When not specified, the system automatically selects the best algorithm based on your replica_concurrency setting.

[scaling]
load_balancing = "min-connections"  # Explicitly set load balancing algorithm

Default behavior: When load_balancing is not set, the system uses first-available for replica_concurrency <= 3 (typical for GPU workloads) and round-robin for higher concurrency.
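The default selection rule can be expressed directly. The function name `resolve_load_balancing` is hypothetical; the threshold of 3 and the algorithm names are taken from this page.

```python
ALGORITHMS = {"round-robin", "first-available", "min-connections", "random-choice-2"}


def resolve_load_balancing(replica_concurrency, load_balancing=None):
    """Return the explicit algorithm, or the documented default if unset."""
    if load_balancing is not None:
        if load_balancing not in ALGORITHMS:
            raise ValueError(f"unknown load_balancing value: {load_balancing!r}")
        return load_balancing
    return "first-available" if replica_concurrency <= 3 else "round-robin"


print(resolve_load_balancing(1))                      # first-available
print(resolve_load_balancing(8))                      # round-robin
print(resolve_load_balancing(8, "min-connections"))   # min-connections
```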

Available Algorithms

round-robin

Cycles through replicas starting from the last successful target. Each replica’s concurrency limit is respected: if a replica is at capacity, the algorithm proceeds to the next one in rotation.

Characteristic	Value
Selection complexity	O(1) typical, O(N) worst case when scanning for available capacity
Latency profile	Consistent p50, good p90 under uniform load
Strategy	Stateful index rotation with mutex synchronization; skips full replicas

Best for: Workloads with predictable request times where you want even distribution across replicas over time.
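A minimal sketch of round-robin selection with capacity skipping is shown below. The `Replica` and `RoundRobin` classes are hypothetical illustrations of the strategy described in this section, not Cerebrium's code.

```python
import threading
from dataclasses import dataclass


@dataclass
class Replica:
    name: str
    capacity: int       # plays the role of replica_concurrency
    in_flight: int = 0

    def reserve(self):
        """Claim one slot if the replica is below capacity."""
        if self.in_flight < self.capacity:
            self.in_flight += 1
            return True
        return False


class RoundRobin:
    def __init__(self, replicas):
        self._replicas = list(replicas)
        self._next = 0                 # rotation index
        self._lock = threading.Lock()  # guards the index across threads

    def pick(self):
        """Return the next replica with free capacity, or None if all are full."""
        with self._lock:
            count = len(self._replicas)
            for offset in range(count):
                index = (self._next + offset) % count
                if self._replicas[index].reserve():
                    self._next = (index + 1) % count
                    return self._replicas[index]
            return None
```

When the replica at the rotation index has capacity, `pick()` returns after one check; it scans further only when replicas are full, which is the O(N) worst case noted in the table.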

first-available

Scans replicas from the start of the list and selects the first one with available capacity.

Characteristic	Value
Selection complexity	O(1) typical, O(N) worst case
Latency profile	Optimal p50 when load is light, may degrade p90 under high load
Strategy	Linear scan from list start; returns first replica that accepts via Reserve()

Best for: GPU workloads with low concurrency (replica_concurrency <= 3). Maximizes utilization of warm replicas before spreading load, reducing cold starts and keeping models in VRAM.

Tradeoff: Earlier replicas in the list handle more traffic. This is desirable for GPU workloads but may cause uneven distribution for CPU workloads.
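The first-available strategy amounts to a linear scan, sketched here. The `Replica` class and its lowercase `reserve()` method are hypothetical stand-ins for the Reserve() call mentioned in the table.

```python
from dataclasses import dataclass


@dataclass
class Replica:
    name: str
    capacity: int       # plays the role of replica_concurrency
    in_flight: int = 0

    def reserve(self):
        """Claim one slot if the replica is below capacity."""
        if self.in_flight < self.capacity:
            self.in_flight += 1
            return True
        return False


def pick_first_available(replicas):
    """Return the first replica in list order that accepts the request."""
    for replica in replicas:
        if replica.reserve():
            return replica
    return None


pool = [Replica("a", 2), Replica("b", 2)]
order = [pick_first_available(pool).name for _ in range(4)]
print(order)  # ['a', 'a', 'b', 'b']
```

The output shows the tradeoff: replica "a" fills completely before "b" receives any traffic.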

min-connections

Performs a linear scan to find the replica with the fewest in-flight requests, then attempts to reserve it. If that replica cannot accept the request because it is at capacity, the algorithm falls back to trying the other replicas in iteration order.

Characteristic	Value
Selection complexity	Θ(N) - always scans all replicas to find minimum
Latency profile	Best p90/p99 tail latency
Strategy	Single pass to find minimum in-flight; fallback in iteration order

Best for: Workloads with variable request times (e.g., LLM inference where output length varies). Routes new requests to the least busy replica, preventing fast requests from queuing behind slow ones.
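The min-connections strategy, including its fallback, can be sketched as follows. The `Replica` class and function name are hypothetical; the logic follows the description in this section.

```python
from dataclasses import dataclass


@dataclass
class Replica:
    name: str
    capacity: int       # plays the role of replica_concurrency
    in_flight: int = 0

    def reserve(self):
        """Claim one slot if the replica is below capacity."""
        if self.in_flight < self.capacity:
            self.in_flight += 1
            return True
        return False


def pick_min_connections(replicas):
    """Prefer the replica with the fewest in-flight requests."""
    if not replicas:
        return None
    # Single pass over all replicas to find the least busy one.
    least_busy = min(replicas, key=lambda replica: replica.in_flight)
    if least_busy.reserve():
        return least_busy
    # Fallback: the least busy replica is full, so try the rest in order.
    for replica in replicas:
        if replica is not least_busy and replica.reserve():
            return replica
    return None


pool = [Replica("a", 4, in_flight=3), Replica("b", 4, in_flight=1)]
print(pick_min_connections(pool).name)  # b
```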

random-choice-2

Implements the “Power of Two Choices” algorithm: randomly samples two replicas and routes to the one with lower weight (based on active request tracking). Ties are broken randomly.

Characteristic	Value
Selection complexity	Θ(1) - constant time regardless of replica count
Latency profile	Good balance of p50 and p90
Strategy	Sample 2 random replicas, compare weights, pick lighter one

Best for: High-throughput scenarios with many replicas where selection overhead matters. Research shows this achieves exponentially better load distribution than pure random selection.

Note: Uses weight-based tracking rather than reservation-based concurrency limiting, making it suitable for unlimited concurrency scenarios.
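A compact sketch of the "Power of Two Choices" selection is shown below. The `WeightedReplica` class, the `weight` field, and the `rng` parameter are hypothetical; the weight here is just a count of active requests, and no capacity limit is enforced, in line with the weight-based tracking described in this section.

```python
import random
from dataclasses import dataclass


@dataclass
class WeightedReplica:
    name: str
    weight: int = 0  # number of active requests tracked for this replica


def pick_random_choice_2(replicas, rng=random):
    """Sample two distinct replicas and route to the lighter one."""
    if not replicas:
        return None
    if len(replicas) == 1:
        chosen = replicas[0]
    else:
        first, second = rng.sample(replicas, 2)
        if first.weight == second.weight:
            chosen = rng.choice([first, second])  # break ties randomly
        else:
            chosen = first if first.weight < second.weight else second
    chosen.weight += 1  # the caller should decrement when the request ends
    return chosen


rng = random.Random(0)
pool = [WeightedReplica(f"r{i}") for i in range(10)]
for _ in range(1000):
    pick_random_choice_2(pool, rng)
print(sorted(replica.weight for replica in pool))
```

Selection cost stays constant regardless of pool size because only two replicas are inspected per request.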

Choosing an Algorithm

Scenario	Recommended	Reason
GPU inference, `replica_concurrency=1`	`first-available` (default)	Maximizes GPU utilization, keeps models warm
LLMs with variable output lengths	`min-connections`	Prevents head-of-line blocking, best tail latency
High-throughput, many replicas	`random-choice-2`	Θ(1) selection with near-optimal distribution
Uniform request times, even distribution	`round-robin`	Predictable rotation, no hot spots over time
Latency-sensitive with variable load	`min-connections`	Minimizes p90/p99 by routing to least busy replica
