
Proxy - Load Balancing, Fallbacks, Retries

Quick Start - Load Balancing

Step 1 - Set deployments on config

Example config below. Requests with model=gpt-3.5-turbo will be load balanced across multiple instances of azure/gpt-3.5-turbo.

model_list:
  - model_name: gpt-3.5-turbo
    litellm_params:
      model: azure/<your-deployment-name>
      api_base: <your-azure-endpoint>
      api_key: <your-azure-api-key>
      rpm: 6 # Rate limit for this deployment: in requests per minute (rpm)
  - model_name: gpt-3.5-turbo
    litellm_params:
      model: azure/gpt-turbo-small-ca
      api_base: https://my-endpoint-canada-berri992.openai.azure.com/
      api_key: <your-azure-api-key>
      rpm: 6
  - model_name: gpt-3.5-turbo
    litellm_params:
      model: azure/gpt-turbo-large
      api_base: https://openai-france-1234.openai.azure.com/
      api_key: <your-azure-api-key>
      rpm: 1440

router_settings:
  routing_strategy: simple-shuffle # Literal["simple-shuffle", "least-busy", "usage-based-routing", "latency-based-routing"], default="simple-shuffle"
  model_group_alias: {"gpt-4": "gpt-3.5-turbo"} # all requests with `gpt-4` will be routed to models with `gpt-3.5-turbo`
  num_retries: 2
  timeout: 30 # 30 seconds
  redis_host: <your redis host> # set this when using multiple litellm proxy deployments, load balancing state stored in redis
  redis_password: <your redis password>
  redis_port: 1992
Info: Detailed information about routing strategies can be found in the LiteLLM routing docs.

Step 2: Start Proxy with config

$ litellm --config /path/to/config.yaml

Test - Simple Call

Requests with model=gpt-3.5-turbo will be load balanced across multiple instances of azure/gpt-3.5-turbo.

👉 Key Change: model="gpt-3.5-turbo"

Check the model_id in the response headers to make sure the requests are being load balanced (a header-check sketch follows the example below).

import openai

client = openai.OpenAI(
    api_key="anything",
    base_url="http://0.0.0.0:4000"
)

response = client.chat.completions.create(
    model="gpt-3.5-turbo",
    messages=[
        {
            "role": "user",
            "content": "this is a test request, write a short poem"
        }
    ]
)

print(response)
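
To confirm load balancing, a minimal sketch that reads the x-litellm-model-id response header via the OpenAI SDK's with_raw_response wrapper (the header name is taken from the "Fallback to Specific Model ID" validation step later in this doc):

import openai

client = openai.OpenAI(api_key="anything", base_url="http://0.0.0.0:4000")

# with_raw_response exposes the raw HTTP response, including headers set by the proxy
raw = client.chat.completions.with_raw_response.create(
    model="gpt-3.5-turbo",
    messages=[{"role": "user", "content": "ping"}],
)

# x-litellm-model-id identifies which deployment served this request;
# repeat the call and the id should vary across your deployments
print(raw.headers.get("x-litellm-model-id"))

completion = raw.parse()  # recover the usual ChatCompletion object
print(completion.choices[0].message.content)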

Test - Loadbalancing

In this request, the following will occur:

  1. A rate limit exception will be raised
  2. LiteLLM proxy will retry the request across the other deployments in the model group (the default number of retries is 3)

curl -X POST 'http://0.0.0.0:4000/chat/completions' \
-H 'Content-Type: application/json' \
-H 'Authorization: Bearer sk-1234' \
-d '{
  "model": "gpt-3.5-turbo",
  "messages": [
    {"role": "user", "content": "Hi there!"}
  ],
  "mock_testing_rate_limit_error": true
}'
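
The same mock test through the OpenAI Python SDK, passing the proxy-specific mock_testing_rate_limit_error flag via extra_body (a sketch mirroring the curl above):

import openai

client = openai.OpenAI(api_key="sk-1234", base_url="http://0.0.0.0:4000")

# mock_testing_rate_limit_error triggers a mocked rate limit error so you can
# observe the proxy's retry behavior in the proxy logs; the call may still
# raise once retries are exhausted
response = client.chat.completions.create(
    model="gpt-3.5-turbo",
    messages=[{"role": "user", "content": "Hi there!"}],
    extra_body={"mock_testing_rate_limit_error": True},
)

print(response)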


Test - Client Side Fallbacks

In this request, the following will occur:

  1. The request to model="zephyr-beta" will fail
  2. LiteLLM proxy will loop through all the model groups specified in fallbacks=["gpt-3.5-turbo"]
  3. The request to model="gpt-3.5-turbo" will succeed, and the client making the request will get a response from gpt-3.5-turbo

👉 Key Change: "fallbacks": ["gpt-3.5-turbo"]

import openai

client = openai.OpenAI(
    api_key="anything",
    base_url="http://0.0.0.0:4000"
)

response = client.chat.completions.create(
    model="zephyr-beta",
    messages=[
        {
            "role": "user",
            "content": "this is a test request, write a short poem"
        }
    ],
    extra_body={
        "fallbacks": ["gpt-3.5-turbo"]
    }
)

print(response)

Advanced

Fallbacks + Retries + Timeouts + Cooldowns

To set fallbacks, just do:

litellm_settings:
  fallbacks: [{"zephyr-beta": ["gpt-3.5-turbo"]}]

This covers all errors (429, 500, etc.).

Set via config

model_list:
  - model_name: zephyr-beta
    litellm_params:
      model: huggingface/HuggingFaceH4/zephyr-7b-beta
      api_base: http://0.0.0.0:8001
  - model_name: zephyr-beta
    litellm_params:
      model: huggingface/HuggingFaceH4/zephyr-7b-beta
      api_base: http://0.0.0.0:8002
  - model_name: zephyr-beta
    litellm_params:
      model: huggingface/HuggingFaceH4/zephyr-7b-beta
      api_base: http://0.0.0.0:8003
  - model_name: gpt-3.5-turbo
    litellm_params:
      model: gpt-3.5-turbo
      api_key: <my-openai-key>
  - model_name: gpt-3.5-turbo-16k
    litellm_params:
      model: gpt-3.5-turbo-16k
      api_key: <my-openai-key>

litellm_settings:
  num_retries: 3 # retry call 3 times on each model_name (e.g. zephyr-beta)
  request_timeout: 10 # raise Timeout error if call takes longer than 10s. Sets litellm.request_timeout
  fallbacks: [{"zephyr-beta": ["gpt-3.5-turbo"]}] # fallback to gpt-3.5-turbo if call fails after num_retries retries
  allowed_fails: 3 # cooldown a deployment if it fails more than 3 calls in a minute
  cooldown_time: 30 # how long (in seconds) to cooldown a deployment once fails/min > allowed_fails
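
If you use the litellm Python SDK directly instead of the proxy, the same knobs exist on litellm.Router. A minimal sketch; the parameter names mirror the config keys above, and the deployments are abbreviated from that config:

from litellm import Router

router = Router(
    model_list=[
        {
            "model_name": "zephyr-beta",
            "litellm_params": {
                "model": "huggingface/HuggingFaceH4/zephyr-7b-beta",
                "api_base": "http://0.0.0.0:8001",
            },
        },
        {
            "model_name": "gpt-3.5-turbo",
            "litellm_params": {"model": "gpt-3.5-turbo", "api_key": "<my-openai-key>"},
        },
    ],
    num_retries=3,                                   # retries per deployment
    timeout=10,                                      # request timeout in seconds
    fallbacks=[{"zephyr-beta": ["gpt-3.5-turbo"]}],  # fall back once retries are exhausted
    allowed_fails=3,                                 # failures/min before a deployment is cooled down
    cooldown_time=30,                                # cooldown duration in seconds
)

response = router.completion(
    model="zephyr-beta",
    messages=[{"role": "user", "content": "ping"}],
)
print(response)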

Fallback to Specific Model ID

If all models in a group are in cooldown (e.g. rate limited), LiteLLM will fall back to the deployment with the specified model ID.

This skips any cooldown check for the fallback model.

  1. Specify the model ID in model_info
model_list:
  - model_name: gpt-4
    litellm_params:
      model: openai/gpt-4
    model_info:
      id: my-specific-model-id # 👈 KEY CHANGE
  - model_name: gpt-4
    litellm_params:
      model: azure/chatgpt-v-2
      api_base: os.environ/AZURE_API_BASE
      api_key: os.environ/AZURE_API_KEY
  - model_name: anthropic-claude
    litellm_params:
      model: anthropic/claude-3-opus-20240229
      api_key: os.environ/ANTHROPIC_API_KEY

Note: This will only fall back to the deployment with the specified model ID. If you want to fall back to another model group, you can set fallbacks=[{"gpt-4": ["anthropic-claude"]}]

  2. Set fallbacks in config

litellm_settings:
  fallbacks: [{"gpt-4": ["my-specific-model-id"]}]

  3. Test it!

curl -X POST 'http://0.0.0.0:4000/chat/completions' \
-H 'Content-Type: application/json' \
-H 'Authorization: Bearer sk-1234' \
-d '{
  "model": "gpt-4",
  "messages": [
    {
      "role": "user",
      "content": "ping"
    }
  ],
  "mock_testing_fallbacks": true
}'

Validate it works by checking the x-litellm-model-id response header:

x-litellm-model-id: my-specific-model-id
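
The same validation from Python; a sketch that combines the mock_testing_fallbacks flag with a raw-response header check (the assert just encodes the expectation above):

import openai

client = openai.OpenAI(api_key="anything", base_url="http://0.0.0.0:4000")

# mock_testing_fallbacks forces the primary deployment to fail,
# so the request should land on the fallback model id
raw = client.chat.completions.with_raw_response.create(
    model="gpt-4",
    messages=[{"role": "user", "content": "ping"}],
    extra_body={"mock_testing_fallbacks": True},
)

assert raw.headers.get("x-litellm-model-id") == "my-specific-model-id"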

Test Fallbacks!

Check if your fallbacks are working as expected.

Regular Fallbacks

👉 Key Change: "mock_testing_fallbacks": true

curl -X POST 'http://0.0.0.0:4000/chat/completions' \
-H 'Content-Type: application/json' \
-H 'Authorization: Bearer sk-1234' \
-d '{
  "model": "my-bad-model",
  "messages": [
    {
      "role": "user",
      "content": "ping"
    }
  ],
  "mock_testing_fallbacks": true
}'

Content Policy Fallbacks

👉 Key Change: "mock_testing_content_policy_fallbacks": true

curl -X POST 'http://0.0.0.0:4000/chat/completions' \
-H 'Content-Type: application/json' \
-H 'Authorization: Bearer sk-1234' \
-d '{
  "model": "my-bad-model",
  "messages": [
    {
      "role": "user",
      "content": "ping"
    }
  ],
  "mock_testing_content_policy_fallbacks": true
}'

Context Window Fallbacks

👉 Key Change: "mock_testing_context_window_fallbacks": true

curl -X POST 'http://0.0.0.0:4000/chat/completions' \
-H 'Content-Type: application/json' \
-H 'Authorization: Bearer sk-1234' \
-d '{
  "model": "my-bad-model",
  "messages": [
    {
      "role": "user",
      "content": "ping"
    }
  ],
  "mock_testing_context_window_fallbacks": true
}'

Context Window Fallbacks (Pre-Call Checks + Fallbacks)

Before a call is made, check that the request fits within the model's context window by setting enable_pre_call_checks: true.

1. Setup config

For Azure deployments, set the base model. Pick the base model from the LiteLLM list of supported models; all the Azure models start with azure/.

This filters out older instances of a model (e.g. gpt-3.5-turbo) with smaller context windows.

router_settings:
  enable_pre_call_checks: true # 1. Enable pre-call checks

model_list:
  - model_name: gpt-3.5-turbo
    litellm_params:
      model: azure/chatgpt-v-2
      api_base: os.environ/AZURE_API_BASE
      api_key: os.environ/AZURE_API_KEY
      api_version: "2023-07-01-preview"
    model_info:
      base_model: azure/gpt-4-1106-preview # 2. 👈 (azure-only) SET BASE MODEL

  - model_name: gpt-3.5-turbo
    litellm_params:
      model: gpt-3.5-turbo-1106
      api_key: os.environ/OPENAI_API_KEY

2. Start proxy

litellm --config /path/to/config.yaml

# RUNNING on http://0.0.0.0:4000

3. Test it!

import openai

client = openai.OpenAI(
    api_key="anything",
    base_url="http://0.0.0.0:4000"
)

text = "What is the meaning of 42?" * 5000

# with pre-call checks enabled, this oversized request is routed to the
# deployment whose base model has a large enough context window
response = client.chat.completions.create(
    model="gpt-3.5-turbo",
    messages=[
        {"role": "system", "content": text},
        {"role": "user", "content": "Who was Alexander?"},
    ],
)

print(response)

Content Policy Fallbacks

Fallback across providers (e.g. from Azure OpenAI to Anthropic) if you hit content policy violation errors.

model_list:
  - model_name: gpt-3.5-turbo-small
    litellm_params:
      model: azure/chatgpt-v-2
      api_base: os.environ/AZURE_API_BASE
      api_key: os.environ/AZURE_API_KEY
      api_version: "2023-07-01-preview"

  - model_name: claude-opus
    litellm_params:
      model: claude-3-opus-20240229
      api_key: os.environ/ANTHROPIC_API_KEY

litellm_settings:
  content_policy_fallbacks: [{"gpt-3.5-turbo-small": ["claude-opus"]}]
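
To exercise this config, a sketch reusing the mock_testing_content_policy_fallbacks flag shown earlier in this doc, sent via extra_body:

import openai

client = openai.OpenAI(api_key="sk-1234", base_url="http://0.0.0.0:4000")

# mock a content policy violation on the primary model;
# the proxy should answer via the claude-opus fallback
response = client.chat.completions.create(
    model="gpt-3.5-turbo-small",
    messages=[{"role": "user", "content": "ping"}],
    extra_body={"mock_testing_content_policy_fallbacks": True},
)

print(response)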

Default Fallbacks

You can also set default_fallbacks, used in case a specific model group is misconfigured or failing.

model_list:
  - model_name: gpt-3.5-turbo-small
    litellm_params:
      model: azure/chatgpt-v-2
      api_base: os.environ/AZURE_API_BASE
      api_key: os.environ/AZURE_API_KEY
      api_version: "2023-07-01-preview"

  - model_name: claude-opus
    litellm_params:
      model: claude-3-opus-20240229
      api_key: os.environ/ANTHROPIC_API_KEY

litellm_settings:
  default_fallbacks: ["claude-opus"]

This will default to claude-opus in case any model fails.

Model-specific fallbacks (e.g. {"gpt-3.5-turbo-small": ["claude-opus"]}) override the default fallbacks.
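
For litellm SDK users, a minimal sketch of the same precedence with litellm.Router's default_fallbacks parameter (the model entries mirror the config above):

from litellm import Router

router = Router(
    model_list=[
        {
            "model_name": "gpt-3.5-turbo-small",
            "litellm_params": {
                "model": "azure/chatgpt-v-2",
                "api_base": "os.environ/AZURE_API_BASE",
                "api_key": "os.environ/AZURE_API_KEY",
                "api_version": "2023-07-01-preview",
            },
        },
        {
            "model_name": "claude-opus",
            "litellm_params": {
                "model": "claude-3-opus-20240229",
                "api_key": "os.environ/ANTHROPIC_API_KEY",
            },
        },
    ],
    # used for any model group without its own fallbacks entry
    default_fallbacks=["claude-opus"],
    # a model-specific entry would take precedence over default_fallbacks:
    # fallbacks=[{"gpt-3.5-turbo-small": ["claude-opus"]}],
)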

EU-Region Filtering (Pre-Call Checks)

Before a call is made, filter out deployments outside the allowed region by setting enable_pre_call_checks: true.

Set the 'region_name' of each deployment.

Note: LiteLLM can automatically infer region_name for Vertex AI, Bedrock, and IBM WatsonxAI based on your litellm params. For Azure, set litellm.enable_preview = True.

1. Set Config

router_settings:
  enable_pre_call_checks: true # 1. Enable pre-call checks

model_list:
  - model_name: gpt-3.5-turbo
    litellm_params:
      model: azure/chatgpt-v-2
      api_base: os.environ/AZURE_API_BASE
      api_key: os.environ/AZURE_API_KEY
      api_version: "2023-07-01-preview"
      region_name: "eu" # 👈 SET EU-REGION

  - model_name: gpt-3.5-turbo
    litellm_params:
      model: gpt-3.5-turbo-1106
      api_key: os.environ/OPENAI_API_KEY

  - model_name: gemini-pro
    litellm_params:
      model: vertex_ai/gemini-pro-1.5
      vertex_project: adroit-crow-1234
      vertex_location: us-east1 # 👈 AUTOMATICALLY INFERS 'region_name'

2. Start proxy

litellm --config /path/to/config.yaml

# RUNNING on http://0.0.0.0:4000

3. Test it!

import openai

client = openai.OpenAI(
    api_key="anything",
    base_url="http://0.0.0.0:4000"
)

# use with_raw_response so we can inspect the proxy's response headers
response = client.chat.completions.with_raw_response.create(
    model="gpt-3.5-turbo",
    messages=[{"role": "user", "content": "Who was Alexander?"}]
)

print(response)

# check which deployment (api base) served the request
print(response.headers.get('x-litellm-model-api-base'))

Custom Timeouts, Stream Timeouts - Per Model

For each model you can set timeout and stream_timeout under litellm_params:

model_list:
  - model_name: gpt-3.5-turbo
    litellm_params:
      model: azure/gpt-turbo-small-eu
      api_base: https://my-endpoint-europe-berri-992.openai.azure.com/
      api_key: <your-key>
      timeout: 0.1 # timeout in (seconds)
      stream_timeout: 0.01 # timeout for stream requests (seconds)
      max_retries: 5
  - model_name: gpt-3.5-turbo
    litellm_params:
      model: azure/gpt-turbo-small-ca
      api_base: https://my-endpoint-canada-berri992.openai.azure.com/
      api_key: <your-key>
      timeout: 0.1 # timeout in (seconds)
      stream_timeout: 0.01 # timeout for stream requests (seconds)
      max_retries: 5

Start Proxy

$ litellm --config /path/to/config.yaml

Setting Dynamic Timeouts - Per Request

LiteLLM Proxy supports setting a timeout per request.

Example Usage

curl --location 'http://0.0.0.0:4000/chat/completions' \
--header 'Content-Type: application/json' \
--data-raw '{
  "model": "gpt-3.5-turbo",
  "messages": [
    {"role": "user", "content": "what color is red"}
  ],
  "logit_bias": {"12481": 100},
  "timeout": 1
}'
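
The same per-request timeout through the OpenAI Python SDK; a minimal sketch passing the proxy-specific timeout field via extra_body, mirroring the curl above:

import openai

client = openai.OpenAI(api_key="anything", base_url="http://0.0.0.0:4000")

# "timeout" in the request body is read by the LiteLLM proxy,
# separate from the OpenAI SDK's own client-side timeout setting
response = client.chat.completions.create(
    model="gpt-3.5-turbo",
    messages=[{"role": "user", "content": "what color is red"}],
    extra_body={"timeout": 1},  # seconds
)

print(response)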

Setting Fallbacks for Wildcard Models

You can set fallbacks for wildcard models (e.g. azure/*) in your config file.

  1. Setup config
model_list:
  - model_name: "gpt-4o"
    litellm_params:
      model: "openai/gpt-4o"
      api_key: os.environ/OPENAI_API_KEY
  - model_name: "azure/*"
    litellm_params:
      model: "azure/*"
      api_key: os.environ/AZURE_API_KEY
      api_base: os.environ/AZURE_API_BASE

litellm_settings:
  fallbacks: [{"gpt-4o": ["azure/gpt-4o"]}]

  2. Start Proxy

litellm --config /path/to/config.yaml

  3. Test it!
curl -L -X POST 'http://0.0.0.0:4000/v1/chat/completions' \
-H 'Content-Type: application/json' \
-H 'Authorization: Bearer sk-1234' \
-d '{
  "model": "gpt-4o",
  "messages": [
    {
      "role": "user",
      "content": [
        {
          "type": "text",
          "text": "what color is red"
        }
      ]
    }
  ],
  "max_tokens": 300,
  "mock_testing_fallbacks": true
}'

Disable Fallbacks per key

You can disable fallbacks per key by setting disable_fallbacks: true in your key metadata.

curl -L -X POST 'http://0.0.0.0:4000/key/generate' \
-H 'Authorization: Bearer sk-1234' \
-H 'Content-Type: application/json' \
-d '{
  "metadata": {
    "disable_fallbacks": true
  }
}'
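
Requests made with the returned key will then surface errors instead of falling back; a short sketch, where "sk-generated-key" is a placeholder for the key value returned by /key/generate above:

import openai

# placeholder: use the "key" value returned by /key/generate
client = openai.OpenAI(api_key="sk-generated-key", base_url="http://0.0.0.0:4000")

# with disable_fallbacks set on this key, a failing model raises an error
# instead of being rerouted, even if fallbacks are configured on the proxy
response = client.chat.completions.create(
    model="gpt-3.5-turbo",
    messages=[{"role": "user", "content": "ping"}],
)
print(response)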