Azure API Management (API-M) helps you publish and securely manage custom Application Programming Interfaces (APIs), acting as a gateway between clients and backend APIs.
Azure OpenAI (AOAI) lets you deploy and use OpenAI's powerful large language models (LLMs) like GPT-4o on Azure to process and generate multimodal content and easily integrate with other solutions of your choice.
In this repo, I'll demonstrate how to combine the functionalities of API-M and AOAI to enable the following use-case scenarios:
- Enforce custom token limit, so that calling apps can co-share AOAI backends without causing "noisy neighbour" situations;
- Get detailed token usage breakdown, to understand consumption and accurately re-charge cost to customers or business functions;
- Enable load-balancing between target AOAI deployments, to ensure data residency, performance and reliability of your AI solutions.
- Scenario 1: Enforcing custom token limit
- Scenario 2: Usage analysis by specific customer
- Scenario 3: Load-balancing between several AOAI endpoints
This section describes setting up API-M and then performing end-to-end testing of the token limit enforcement scenario.
- In the Azure portal, navigate to your API Management settings. Under APIs, click Add API and then select the "Azure OpenAI Service" tile under the "Create from Azure resource" category.
- Select your existing AOAI resource, and enter values for the Display Name and Name fields. Optionally, you can tick the SDK Compatibility field, to enable OpenAI-compatible consumption of exposed APIs from popular Generative AI frameworks and libraries:
- After clicking Next, enable the "Manage token consumption" API-M policy and set your desired Tokens-per-Minute (TPM) limit. You can optionally add "consumed tokens" and "remaining tokens" headers to the API-M endpoint's responses.
Note: The provided Jupyter notebook assumes that both headers are enabled; it was tested against an API-M endpoint with a 100 TPM limit.
- Once you click the Create button, a new set of APIs will be provisioned to support interactions with various AOAI models. API-M will also add the token limit policy to all new API operations. Technical aspects of this policy can be found in this reference document:
<policies>
    <inbound>
        <set-backend-service id="apim-generated-policy" backend-id="aoai-tpm-limit-openai-endpoint" />
        <azure-openai-token-limit tokens-per-minute="100" counter-key="@(context.Subscription.Id)" estimate-prompt-tokens="false" tokens-consumed-header-name="consumed-tokens" remaining-tokens-header-name="remaining-tokens" />
        <authentication-managed-identity resource="https://cognitiveservices.azure.com/" />
        <base />
    </inbound>
    <backend>
        <base />
    </backend>
    <outbound>
        <base />
    </outbound>
    <on-error>
        <base />
    </on-error>
</policies>
- To test your TPM limit, ensure that you set the following 4 environment variables before running the notebook:
| Environment Variable | Description |
| --- | --- |
| APIM_TPM_AOAI_DEPLOY | Name of the AOAI deployment |
| APIM_TPM_API_VERSION | API version of the AOAI endpoint |
| APIM_TPM_SUB_KEY | Subscription key scoped to the target API-M APIs |
| APIM_TPM_URL | URL of the provisioned API-M API for the AOAI endpoint |
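A minimal sketch of how the notebook might map these variables to the names used in the helper function below (the Python-side names are assumptions for illustration):

```python
import os

AOAI_DEPLOYMENT = os.environ["APIM_TPM_AOAI_DEPLOY"]     # name of your AOAI model deployment
AOAI_API_VERSION = os.environ["APIM_TPM_API_VERSION"]    # an API version supported by your endpoint
APIM_TPM_SUB_KEY = os.environ["APIM_TPM_SUB_KEY"]        # API-M subscription key
APIM_TPM_URL = os.environ["APIM_TPM_URL"]                # base URL of the API-M API, ending with "/"
```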
- We can now use a helper function to interact with the AOAI backend through the API-M endpoint:
import requests

def get_rest_completion(system_prompt, user_prompt):
    # Call the AOAI chat completions API through the API-M gateway endpoint
    response = requests.post(
        url = f"{APIM_TPM_URL}openai/deployments/{AOAI_DEPLOYMENT}/chat/completions",
        headers = {
            "Content-Type": "application/json",
            "api-key": APIM_TPM_SUB_KEY
        },
        params = {"api-version": AOAI_API_VERSION},
        json = {
            "messages": [
                {
                    "role": "system",
                    "content": system_prompt
                },
                {
                    "role": "user",
                    "content": user_prompt
                }
            ]
        }
    )
    return response
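The notebook then calls this helper in a loop and reads the token counters from the response headers configured in Step 3. A minimal sketch of such a loop, with NUMBER_OF_RUNS, SLEEP_TIME and the prompts as illustrative placeholders:

```python
import time

NUMBER_OF_RUNS = 5                 # illustrative values; adjust to your TPM limit
SLEEP_TIME = 15
SYSTEM_PROMPT = "You are a helpful assistant."
USER_PROMPT = "Briefly explain what Azure API Management does."

for i in range(NUMBER_OF_RUNS):
    start_time = time.time()
    response = get_rest_completion(SYSTEM_PROMPT, USER_PROMPT)
    print(f"Run # {i} completed in {time.time() - start_time:.2f} seconds")
    if response.status_code == 200:
        # Headers added by the azure-openai-token-limit policy (see Step 3)
        print(f"Consumed tokens: {response.headers.get('consumed-tokens')}")
        print(f"Remaining tokens: {response.headers.get('remaining-tokens')}")
    else:
        print(f"Response code: {response.status_code}")
        print(f"Response message: {response.text}")
    if i < NUMBER_OF_RUNS - 1:
        print(f"Pausing for {SLEEP_TIME} seconds...")
        time.sleep(SLEEP_TIME)
    print("-----------------------------")
```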
- If you set your TPM value to 100 and the average consumption of tokens in your request is about 50, then after a few API calls, you should reach the token limit, with API-M enforcing the new policy as shown in the testing results below:
Run # 0 completed in 1.93 seconds
Consumed tokens: 59
Remaining tokens: 41
Pausing for 15 seconds...
-----------------------------
Run # 1 completed in 0.78 seconds
Consumed tokens: 55
Remaining tokens: 0
Pausing for 15 seconds...
-----------------------------
Run # 2 completed in 0.40 seconds
Response code: 429
Response message: Token limit is exceeded. Try again in 29 seconds.
Pausing for 15 seconds...
-----------------------------
Run # 3 completed in 0.35 seconds
Response code: 429
Response message: Token limit is exceeded. Try again in 14 seconds.
Pausing for 15 seconds...
-----------------------------
Run # 4 completed in 0.91 seconds
Consumed tokens: 55
Remaining tokens: 0
-----------------------------
- If you enabled SDK compatibility in Step 2 above, you could use the OpenAI Python SDK to interact with your AOAI models through the API-M endpoint. Here's how to instantiate the AzureOpenAI class with your API-M's subscription key:
from openai import AzureOpenAI

client = AzureOpenAI(
    azure_endpoint = APIM_TPM_URL,
    api_key = APIM_TPM_SUB_KEY,
    api_version = AOAI_API_VERSION
)
- This enables an OpenAI-compatible interface, with an example helper function shown below:
def get_sdk_completion(system_prompt, prompt):
    messages = [
        {"role": "system", "content": system_prompt},
        {"role": "user", "content": prompt}
    ]
    response = client.chat.completions.create(
        model = AOAI_DEPLOYMENT,
        messages = messages
    )
    return response
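For example, a single SDK call and its token usage (the prompts are illustrative; usage is a standard attribute of the chat completion response object):

```python
response = get_sdk_completion(
    "You are a helpful assistant.",
    "Briefly explain what Azure API Management does."
)
print(response.choices[0].message.content)
print(f"Total tokens: {response.usage.total_tokens}")
```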
- When using the SDK interface, you may not see explicit 429 errors: the SDK automatically retries throttled requests, so API-M's TPM limit policy manifests as longer response times instead, as shown in the test results below:
Run # 0 completed in 45.82 seconds
Consumed tokens: 55
Remaining tokens: 0
Pausing for 15 seconds...
-----------------------------
Run # 1 completed in 0.65 seconds
Consumed tokens: 55
Remaining tokens: 0
Pausing for 15 seconds...
-----------------------------
Run # 2 completed in 30.87 seconds
Consumed tokens: 55
Remaining tokens: 0
Pausing for 15 seconds...
-----------------------------
Run # 3 completed in 1.20 seconds
Consumed tokens: 63
Remaining tokens: 0
Pausing for 15 seconds...
-----------------------------
Run # 4 completed in 29.88 seconds
Consumed tokens: 56
Remaining tokens: 0
-----------------------------
This section describes setting up API-M and then performing end-to-end testing of the token usage collection and visualisation process.
- Repeat Steps # 1 and 2 from Scenario 1 above.
- After clicking Next, enable the "Track token usage" API-M policy. Select an existing Application Insights instance to log token metrics to, and add the dimensions that you want the metrics to be grouped by:
Note: The provided Jupyter notebook assumes that you have added Subscription ID as one of the logging dimensions.
- Once you click the Create button, a new set of APIs will be provisioned to support interactions with various AOAI models. API-M will also add the token usage metrics policy to all new API operations. Technical aspects of this policy can be found in this reference document:
<policies>
    <inbound>
        <set-backend-service id="apim-generated-policy" backend-id="aoai-usage-by-cx-openai-endpoint" />
        <azure-openai-emit-token-metric namespace="AzureOpenAI">
            <dimension name="Subscription ID" value="@(context.Subscription.Id)" />
        </azure-openai-emit-token-metric>
        <authentication-managed-identity resource="https://cognitiveservices.azure.com/" />
        <base />
    </inbound>
    <backend>
        <base />
    </backend>
    <outbound>
        <base />
    </outbound>
    <on-error>
        <base />
    </on-error>
</policies>
- If you want to log and visualise token usage, ensure that you set the following 5 environment variables before running the notebook:
| Environment Variable | Description |
| --- | --- |
| APIM_USAGE_AOAI_DEPLOY | Name of the AOAI deployment |
| APIM_USAGE_API_VERSION | API version of the AOAI endpoint |
| APIM_USAGE_KEY_CONTOSO | Subscription key created for the Contoso client |
| APIM_USAGE_KEY_NORTHWIND | Subscription key created for the Northwind client |
| APIM_USAGE_URL | URL of the provisioned API-M API for the AOAI endpoint |
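A minimal sketch of the setup that the workload generator below assumes; the constants, prompts and the subscription_key-aware variant of the REST helper are illustrative:

```python
import os
import random
import time
import requests

SUBSCRIPTION_KEYS = [os.environ["APIM_USAGE_KEY_CONTOSO"], os.environ["APIM_USAGE_KEY_NORTHWIND"]]
APIM_USAGE_URL = os.environ["APIM_USAGE_URL"]
AOAI_DEPLOYMENT = os.environ["APIM_USAGE_AOAI_DEPLOY"]
AOAI_API_VERSION = os.environ["APIM_USAGE_API_VERSION"]

NUMBER_OF_RUNS = 10                # illustrative values
SLEEP_TIME = 5
SYSTEM_PROMPT = "You are a helpful assistant."
USER_PROMPT = "Briefly explain what Azure OpenAI does."

def get_rest_completion(subscription_key, system_prompt, user_prompt):
    # Same REST helper as in Scenario 1, parameterised by the calling client's subscription key
    return requests.post(
        url = f"{APIM_USAGE_URL}openai/deployments/{AOAI_DEPLOYMENT}/chat/completions",
        headers = {"Content-Type": "application/json", "api-key": subscription_key},
        params = {"api-version": AOAI_API_VERSION},
        json = {
            "messages": [
                {"role": "system", "content": system_prompt},
                {"role": "user", "content": user_prompt}
            ]
        }
    )
```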
- You can now generate workload with a degree of randomness for both Contoso and Northwind clients, connected to the same Azure OpenAI deployment:
for key in SUBSCRIPTION_KEYS:
    randomness = random.randint(0, 5)
    for i in range(NUMBER_OF_RUNS - randomness):
        start_time = time.time()
        response = get_rest_completion(subscription_key=key, system_prompt=SYSTEM_PROMPT, user_prompt=USER_PROMPT)
        end_time = time.time()
        print(f"Run # {i} completed in {end_time - start_time:.2f} seconds with response code {response.status_code}")
        if i < NUMBER_OF_RUNS - 1:
            print(f"Pausing for {SLEEP_TIME} seconds...")
            time.sleep(SLEEP_TIME)
        print("-----------------------------")
- Collected token usage logs can be visualised in Application Insights charts, e.g. the total tokens split by Subscription IDs of Contoso and Northwind as shown below:
This section describes setting up API-M and then performing end-to-end testing of an AOAI load-balancing scenario.
- For each backend AOAI endpoint, you can configure circuit breaker logic using API-M's REST API. Such logic determines when to temporarily stop sending requests to an unhealthy endpoint. The provided LoadBalancer_CircuitBreaker.json can be re-used as a jump-start template, where you trip the circuit breaker for 30 seconds if the AOAI endpoint returns 429 (Too Many Requests) or 5xx (server errors) within any 2-second interval.
{
    "properties": {
        "description": "<DESCRIPTION>",
        "title": "<TITLE>",
        "type": "Single",
        "protocol": "http",
        "url": "<URL>",
        "circuitBreaker": {
            "rules": [
                {
                    "failureCondition": {
                        "count": 1,
                        "interval": "PT2S",
                        "statusCodeRanges": [
                            {
                                "min": 429,
                                "max": 429
                            },
                            {
                                "min": 500,
                                "max": 599
                            }
                        ]
                    },
                    "name": "<NAME>",
                    "tripDuration": "PT30S",
                    "acceptRetryAfter": true
                }
            ]
        }
    }
}
Note: At the time of writing, configuring circuit breakers directly within the Azure portal UI for API-M was not supported.
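One way to apply such a template is a PUT call against the API-M backends REST API. Below is a minimal sketch using the azure-identity and requests packages; the subscription, resource group, service and backend names are placeholders, and the management API version is an assumption (use any version that supports circuit breakers):

```python
import json
import requests
from azure.identity import DefaultAzureCredential

# Placeholders - replace with your own values
SUB_ID, RG, APIM_NAME, BACKEND_ID = "<SUBSCRIPTION_ID>", "<RESOURCE_GROUP>", "<APIM_NAME>", "<BACKEND_ID>"

# Acquire an ARM token for the signed-in identity
token = DefaultAzureCredential().get_token("https://management.azure.com/.default").token

url = (
    f"https://management.azure.com/subscriptions/{SUB_ID}/resourceGroups/{RG}"
    f"/providers/Microsoft.ApiManagement/service/{APIM_NAME}/backends/{BACKEND_ID}"
    "?api-version=2023-09-01-preview"
)

with open("LoadBalancer_CircuitBreaker.json") as f:
    backend_definition = json.load(f)

response = requests.put(url, headers={"Authorization": f"Bearer {token}"}, json=backend_definition)
print(response.status_code)
```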
- You can combine your backend AOAI endpoints into a load-balancing pool, using either round-robin, weight-based or priority-based logic. The provided LoadBalancer_Pool.json can be re-used as a jump-start template to configure such a pool.
{
    "properties": {
        "description": "<DESCRIPTION>",
        "title": "<TITLE>",
        "type": "Pool",
        "pool": {
            "services": [
                {
                    "id": "<BACKEND_1>",
                    "priority": 1
                },
                {
                    "id": "<BACKEND_2>",
                    "priority": 2
                }
            ]
        }
    }
}
Note: At the time of writing, configuring load-balancing pools directly within the Azure portal UI for API-M was not supported.
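The same PUT call shown above can be reused here: submit the LoadBalancer_Pool.json payload under a new backend ID for the pool, and reference that backend ID from your API's set-backend-service policy so that requests are routed through the pool.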
- If you want to test load-balancing between your defined AOAI endpoints, ensure that you set the following 4 environment variables before running the provided Jupyter notebook:
| Environment Variable | Description |
| --- | --- |
| APIM_LB_AOAI_DEPLOY | Name of the AOAI deployment |
| APIM_LB_API_VERSION | API version of the AOAI endpoint |
| APIM_LB_SUB_KEY | Subscription key created for the load-balancing API-M endpoint |
| APIM_LB_URL | URL of the load-balancing API-M API for the AOAI endpoint |
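To report which regional deployment served each request, the notebook can inspect the response headers. A minimal sketch, assuming the backend returns the serving region in the x-ms-region header (helper and variable names are illustrative):

```python
import os
import requests

APIM_LB_URL = os.environ["APIM_LB_URL"]
AOAI_DEPLOYMENT = os.environ["APIM_LB_AOAI_DEPLOY"]
AOAI_API_VERSION = os.environ["APIM_LB_API_VERSION"]
APIM_LB_SUB_KEY = os.environ["APIM_LB_SUB_KEY"]

def get_serving_region(system_prompt, user_prompt):
    # Returns the region that handled the request and the HTTP status code
    response = requests.post(
        url = f"{APIM_LB_URL}openai/deployments/{AOAI_DEPLOYMENT}/chat/completions",
        headers = {"Content-Type": "application/json", "api-key": APIM_LB_SUB_KEY},
        params = {"api-version": AOAI_API_VERSION},
        json = {
            "messages": [
                {"role": "system", "content": system_prompt},
                {"role": "user", "content": user_prompt}
            ]
        }
    )
    return response.headers.get("x-ms-region"), response.status_code
```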
- Consider a use-case where you configure an AOAI deployment of GPT-4o in Sweden Central with an ultra-low Tokens-per-Minute (TPM) quota of 1K. You then load-balance it with a GPT-4 deployment in France Central that has a higher TPM quota. Your test results might be similar to what is shown below, with successful routing to the France Central endpoint when the circuit breaker trips for the Sweden Central endpoint:
Run # 0: Sweden Central, Duration: 0.84, Response Code: 200
Pausing for 2 seconds...
Run # 1: None, Duration: 1.23, Response Code: 503
Pausing for 2 seconds...
Run # 2: France Central, Duration: 2.57, Response Code: 200
Pausing for 2 seconds...
Run # 3: France Central, Duration: 1.94, Response Code: 200
Pausing for 2 seconds...
Run # 4: France Central, Duration: 1.97, Response Code: 200
Pausing for 2 seconds...
Run # 5: France Central, Duration: 2.18, Response Code: 200
Pausing for 2 seconds...
Run # 6: France Central, Duration: 1.72, Response Code: 200
Pausing for 2 seconds...
Run # 7: France Central, Duration: 2.17, Response Code: 200
Pausing for 2 seconds...
Run # 8: France Central, Duration: 1.99, Response Code: 200
Pausing for 2 seconds...
Run # 9: Sweden Central, Duration: 0.88, Response Code: 200