Azure API Management (API-M) helps you publish and securely manage custom Application Programming Interfaces (APIs), acting as a gateway between clients and backend APIs.
Azure OpenAI (AOAI) lets you deploy and use OpenAI's powerful large language models (LLMs) like GPT-4o on Azure to process and generate multimodal content and easily integrate with other solutions of your choice.
In this repo, I'll demonstrate how to combine the functionalities of API-M and AOAI to enable the following use-case scenarios:
- Enforce custom token limit, so that calling apps can co-share AOAI backends without causing "noisy neighbour" situations;
- Get detailed token usage breakdown, to understand consumption and accurately re-charge cost to customers or business functions;
- Enable load-balancing between target AOAI deployments, to ensure data residency, performance and reliability of your AI solutions.
- Scenario 1: Enforcing custom token limit
- Scenario 2: Usage analysis by specific customer
- Scenario 3: Load-balancing between several AOAI endpoints
This section describes setting up API-M and then performing end-to-end testing of the token limit enforcement scenario.
- In the Azure portal, navigate to your API Management settings. Under APIs, click Add API and then select the "Azure OpenAI Service" tile under the "Create from Azure resource" category.
- Select your existing AOAI resource, and enter values for the Display Name and Name fields. Optionally, you can tick the SDK Compatibility field, to enable OpenAI-compatible consumption of exposed APIs from popular Generative AI frameworks and libraries:
- After clicking Next, enable the "Manage token consumption" API-M policy and set your desired Tokens-per-Minute (TPM) limit. You can optionally add "consumed tokens" and "remaining tokens" headers to the API-M endpoint's responses.
Note: The provided Jupyter notebook assumes that both headers are enabled; it was tested against an API-M endpoint with a 100 TPM limit.
- Once you click the Create button, a new set of APIs will be provisioned to support interactions with various AOAI models. API-M will also add the token limit policy to all new API operations. Technical aspects of this policy can be found in this reference document:
<policies>
    <inbound>
        <set-backend-service id="apim-generated-policy" backend-id="aoai-tpm-limit-openai-endpoint" />
        <azure-openai-token-limit tokens-per-minute="100" counter-key="@(context.Subscription.Id)" estimate-prompt-tokens="false" tokens-consumed-header-name="consumed-tokens" remaining-tokens-header-name="remaining-tokens" />
        <authentication-managed-identity resource="https://cognitiveservices.azure.com/" />
        <base />
    </inbound>
    <backend>
        <base />
    </backend>
    <outbound>
        <base />
    </outbound>
    <on-error>
        <base />
    </on-error>
</policies>
- To test your TPM limit, ensure that you set the following 4 environment variables before running the notebook:
| Environment Variable | Description |
| --- | --- |
| APIM_TPM_AOAI_DEPLOY | Name of the AOAI deployment |
| APIM_TPM_API_VERSION | API version of the AOAI endpoint |
| APIM_TPM_SUB_KEY | Subscription key scoped to the target API-M APIs |
| APIM_TPM_URL | URL of the provisioned API-M API for the AOAI endpoint |
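A minimal sketch of how the notebook might map these variables to the names used in the helper function below (the Python-side names are assumptions for illustration):

```python
import os

AOAI_DEPLOYMENT = os.environ["APIM_TPM_AOAI_DEPLOY"]     # name of your AOAI model deployment
AOAI_API_VERSION = os.environ["APIM_TPM_API_VERSION"]    # an API version supported by your endpoint
APIM_TPM_SUB_KEY = os.environ["APIM_TPM_SUB_KEY"]        # API-M subscription key
APIM_TPM_URL = os.environ["APIM_TPM_URL"]                # base URL of the API-M API, ending with "/"
```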
- We can now use a helper function to interact with the AOAI backend through the API-M endpoint:
import requests

def get_rest_completion(system_prompt, user_prompt):
    # Call the AOAI chat completions API through the API-M gateway endpoint
    response = requests.post(
        url = f"{APIM_TPM_URL}openai/deployments/{AOAI_DEPLOYMENT}/chat/completions",
        headers = {
            "Content-Type": "application/json",
            "api-key": APIM_TPM_SUB_KEY
        },
        params = {"api-version": AOAI_API_VERSION},
        json = {
            "messages": [
                {
                    "role": "system",
                    "content": system_prompt
                },
                {
                    "role": "user",
                    "content": user_prompt
                }
            ]
        }
    )
    return response
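The notebook then calls this helper in a loop and reads the token counters from the response headers configured in Step 3. A minimal sketch of such a loop, with NUMBER_OF_RUNS, SLEEP_TIME and the prompts as illustrative placeholders:

```python
import time

NUMBER_OF_RUNS = 5                 # illustrative values; adjust to your TPM limit
SLEEP_TIME = 15
SYSTEM_PROMPT = "You are a helpful assistant."
USER_PROMPT = "Briefly explain what Azure API Management does."

for i in range(NUMBER_OF_RUNS):
    start_time = time.time()
    response = get_rest_completion(SYSTEM_PROMPT, USER_PROMPT)
    print(f"Run # {i} completed in {time.time() - start_time:.2f} seconds")
    if response.status_code == 200:
        # Headers added by the azure-openai-token-limit policy (see Step 3)
        print(f"Consumed tokens: {response.headers.get('consumed-tokens')}")
        print(f"Remaining tokens: {response.headers.get('remaining-tokens')}")
    else:
        print(f"Response code: {response.status_code}")
        print(f"Response message: {response.text}")
    if i < NUMBER_OF_RUNS - 1:
        print(f"Pausing for {SLEEP_TIME} seconds...")
        time.sleep(SLEEP_TIME)
    print("-----------------------------")
```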
- If you set your TPM value to 100 and the average consumption of tokens in your request is about 50, then after a few API calls, you should reach the token limit, with API-M enforcing the new policy as shown in the testing results below:
Run # 0 completed in 1.93 seconds
Consumed tokens: 59
Remaining tokens: 41
Pausing for 15 seconds...
-----------------------------
Run # 1 completed in 0.78 seconds
Consumed tokens: 55
Remaining tokens: 0
Pausing for 15 seconds...
-----------------------------
Run # 2 completed in 0.40 seconds
Response code: 429
Response message: Token limit is exceeded. Try again in 29 seconds.
Pausing for 15 seconds...
-----------------------------
Run # 3 completed in 0.35 seconds
Response code: 429
Response message: Token limit is exceeded. Try again in 14 seconds.
Pausing for 15 seconds...
-----------------------------
Run # 4 completed in 0.91 seconds
Consumed tokens: 55
Remaining tokens: 0
-----------------------------
- If you enabled SDK compatibility in Step 2 above, you could use the OpenAI Python SDK to interact with your AOAI models through the API-M endpoint. Here's how to instantiate the AzureOpenAI class with your API-M's subscription key:
from openai import AzureOpenAI

client = AzureOpenAI(
    azure_endpoint = APIM_TPM_URL,
    api_key = APIM_TPM_SUB_KEY,
    api_version = AOAI_API_VERSION
)
- This enables an OpenAI-compatible interface, with an example helper function shown below:
def get_sdk_completion(system_prompt, prompt):
    messages = [
        {"role": "system", "content": system_prompt},
        {"role": "user", "content": prompt}
    ]
    response = client.chat.completions.create(
        model = AOAI_DEPLOYMENT,
        messages = messages
    )
    return response
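For example, a single SDK call and its token usage (the prompts are illustrative; usage is a standard attribute of the chat completion response object):

```python
response = get_sdk_completion(
    "You are a helpful assistant.",
    "Briefly explain what Azure API Management does."
)
print(response.choices[0].message.content)
print(f"Total tokens: {response.usage.total_tokens}")
```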
- When using the SDK interface, you may not see explicit 429 errors: the SDK automatically retries throttled requests, so API-M's TPM limit policy manifests as longer response times instead, as shown in the test results below:
Run # 0 completed in 45.82 seconds
Consumed tokens: 55
Remaining tokens: 0
Pausing for 15 seconds...
-----------------------------
Run # 1 completed in 0.65 seconds
Consumed tokens: 55
Remaining tokens: 0
Pausing for 15 seconds...
-----------------------------
Run # 2 completed in 30.87 seconds
Consumed tokens: 55
Remaining tokens: 0
Pausing for 15 seconds...
-----------------------------
Run # 3 completed in 1.20 seconds
Consumed tokens: 63
Remaining tokens: 0
Pausing for 15 seconds...
-----------------------------
Run # 4 completed in 29.88 seconds
Consumed tokens: 56
Remaining tokens: 0
-----------------------------
This section describes setting up API-M and then performing end-to-end testing of the token usage collection and visualisation process.
- Repeat Steps # 1 and 2 from Scenario 1 above.
- After clicking Next, enable the "Track token usage" API-M policy. Select an existing Application Insights instance to log token metrics to, and add the dimensions that you want the metrics to be grouped by:
Note: The provided Jupyter notebook assumes that you have added Subscription ID as one of the logging dimensions.
- Once you click the Create button, a new set of APIs will be provisioned to support interactions with various AOAI models. API-M will also add the token usage metrics policy to all new API operations. Technical aspects of this policy can be found in this reference document:
<policies>
    <inbound>
        <set-backend-service id="apim-generated-policy" backend-id="aoai-usage-by-cx-openai-endpoint" />
        <azure-openai-emit-token-metric namespace="AzureOpenAI">
            <dimension name="Subscription ID" value="@(context.Subscription.Id)" />
        </azure-openai-emit-token-metric>
        <authentication-managed-identity resource="https://cognitiveservices.azure.com/" />
        <base />
    </inbound>
    <backend>
        <base />
    </backend>
    <outbound>
        <base />
    </outbound>
    <on-error>
        <base />
    </on-error>
</policies>
- If you want to log and visualise token usage, ensure that you set the following 5 environment variables before running the notebook:
| Environment Variable | Description |
| --- | --- |
| APIM_USAGE_AOAI_DEPLOY | Name of the AOAI deployment |
| APIM_USAGE_API_VERSION | API version of the AOAI endpoint |
| APIM_USAGE_KEY_CONTOSO | Subscription key created for the Contoso client |
| APIM_USAGE_KEY_NORTHWIND | Subscription key created for the Northwind client |
| APIM_USAGE_URL | URL of the provisioned API-M API for the AOAI endpoint |
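A minimal sketch of the setup that the workload generator below assumes; the constants, prompts and the subscription_key-aware variant of the REST helper are illustrative:

```python
import os
import random
import time
import requests

SUBSCRIPTION_KEYS = [os.environ["APIM_USAGE_KEY_CONTOSO"], os.environ["APIM_USAGE_KEY_NORTHWIND"]]
APIM_USAGE_URL = os.environ["APIM_USAGE_URL"]
AOAI_DEPLOYMENT = os.environ["APIM_USAGE_AOAI_DEPLOY"]
AOAI_API_VERSION = os.environ["APIM_USAGE_API_VERSION"]

NUMBER_OF_RUNS = 10                # illustrative values
SLEEP_TIME = 5
SYSTEM_PROMPT = "You are a helpful assistant."
USER_PROMPT = "Briefly explain what Azure OpenAI does."

def get_rest_completion(subscription_key, system_prompt, user_prompt):
    # Same REST helper as in Scenario 1, parameterised by the calling client's subscription key
    return requests.post(
        url = f"{APIM_USAGE_URL}openai/deployments/{AOAI_DEPLOYMENT}/chat/completions",
        headers = {"Content-Type": "application/json", "api-key": subscription_key},
        params = {"api-version": AOAI_API_VERSION},
        json = {
            "messages": [
                {"role": "system", "content": system_prompt},
                {"role": "user", "content": user_prompt}
            ]
        }
    )
```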
- You can now generate workload with a degree of randomness for both Contoso and Northwind clients, connected to the same Azure OpenAI deployment:
for key in SUBSCRIPTION_KEYS:
    randomness = random.randint(0, 5)
    for i in range(NUMBER_OF_RUNS - randomness):
        start_time = time.time()
        response = get_rest_completion(subscription_key=key, system_prompt=SYSTEM_PROMPT, user_prompt=USER_PROMPT)
        end_time = time.time()
        print(f"Run # {i} completed in {end_time - start_time:.2f} seconds with response code {response.status_code}")
        if i < NUMBER_OF_RUNS - 1:
            print(f"Pausing for {SLEEP_TIME} seconds...")
            time.sleep(SLEEP_TIME)
        print("-----------------------------")
- Collected token usage logs can be visualised in Application Insights charts, e.g. the total tokens split by Subscription IDs of Contoso and Northwind as shown below:
This section describes setting up API-M and then performing end-to-end testing of an AOAI load-balancing scenario.
- For each backend AOAI endpoint, you can configure circuit breaker logic using API-M's REST API. Such logic determines when to temporarily stop sending requests to an unhealthy endpoint. The provided LoadBalancer_CircuitBreaker.json can be re-used as a jump-start template, where you trip the circuit breaker for 30 seconds if the AOAI endpoint returns 429 (Too Many Requests) or 5xx (server errors) within any 2-second interval.
{
    "properties": {
        "description": "<DESCRIPTION>",
        "title": "<TITLE>",
        "type": "Single",
        "protocol": "http",
        "url": "<URL>",
        "circuitBreaker": {
            "rules": [
                {
                    "failureCondition": {
                        "count": 1,
                        "interval": "PT2S",
                        "statusCodeRanges": [
                            {
                                "min": 429,
                                "max": 429
                            },
                            {
                                "min": 500,
                                "max": 599
                            }
                        ]
                    },
                    "name": "<NAME>",
                    "tripDuration": "PT30S",
                    "acceptRetryAfter": true
                }
            ]
        }
    }
}
Note: At the time of writing, configuring circuit breakers directly within the Azure portal UI for API-M was not supported.
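One way to apply such a template is a PUT call against the API-M backends REST API. Below is a minimal sketch using the azure-identity and requests packages; the subscription, resource group, service and backend names are placeholders, and the management API version is an assumption (use any version that supports circuit breakers):

```python
import json
import requests
from azure.identity import DefaultAzureCredential

# Placeholders - replace with your own values
SUB_ID, RG, APIM_NAME, BACKEND_ID = "<SUBSCRIPTION_ID>", "<RESOURCE_GROUP>", "<APIM_NAME>", "<BACKEND_ID>"

# Acquire an ARM token for the signed-in identity
token = DefaultAzureCredential().get_token("https://management.azure.com/.default").token

url = (
    f"https://management.azure.com/subscriptions/{SUB_ID}/resourceGroups/{RG}"
    f"/providers/Microsoft.ApiManagement/service/{APIM_NAME}/backends/{BACKEND_ID}"
    "?api-version=2023-09-01-preview"
)

with open("LoadBalancer_CircuitBreaker.json") as f:
    backend_definition = json.load(f)

response = requests.put(url, headers={"Authorization": f"Bearer {token}"}, json=backend_definition)
print(response.status_code)
```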
- You can combine your backend AOAI endpoints into a load-balancing pool, using either round-robin, weight-based or priority-based logic. The provided LoadBalancer_Pool.json can be re-used as a jump-start template to configure such a pool.
{
    "properties": {
        "description": "<DESCRIPTION>",
        "title": "<TITLE>",
        "type": "Pool",
        "pool": {
            "services": [
                {
                    "id": "<BACKEND_1>",
                    "priority": 1
                },
                {
                    "id": "<BACKEND_2>",
                    "priority": 2
                }
            ]
        }
    }
}
Note: At the time of writing, configuring load-balancing pools directly within the Azure portal UI for API-M was not supported.
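The same PUT call shown above can be reused here: submit the LoadBalancer_Pool.json payload under a new backend ID for the pool, and reference that backend ID from your API's set-backend-service policy so that requests are routed through the pool.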
- If you want to test load-balancing between your defined AOAI endpoints, ensure that you set the following 4 environment variables before running the provided Jupyter notebook:
| Environment Variable | Description |
| --- | --- |
| APIM_LB_AOAI_DEPLOY | Name of the AOAI deployment |
| APIM_LB_API_VERSION | API version of the AOAI endpoint |
| APIM_LB_SUB_KEY | Subscription key created for the load-balancing API-M endpoint |
| APIM_LB_URL | URL of the load-balancing API-M API for the AOAI endpoint |
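To report which regional deployment served each request, the notebook can inspect the response headers. A minimal sketch, assuming the backend returns the serving region in the x-ms-region header (helper and variable names are illustrative):

```python
import os
import requests

APIM_LB_URL = os.environ["APIM_LB_URL"]
AOAI_DEPLOYMENT = os.environ["APIM_LB_AOAI_DEPLOY"]
AOAI_API_VERSION = os.environ["APIM_LB_API_VERSION"]
APIM_LB_SUB_KEY = os.environ["APIM_LB_SUB_KEY"]

def get_serving_region(system_prompt, user_prompt):
    # Returns the region that handled the request and the HTTP status code
    response = requests.post(
        url = f"{APIM_LB_URL}openai/deployments/{AOAI_DEPLOYMENT}/chat/completions",
        headers = {"Content-Type": "application/json", "api-key": APIM_LB_SUB_KEY},
        params = {"api-version": AOAI_API_VERSION},
        json = {
            "messages": [
                {"role": "system", "content": system_prompt},
                {"role": "user", "content": user_prompt}
            ]
        }
    )
    return response.headers.get("x-ms-region"), response.status_code
```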
- Consider a use-case where you configure an AOAI deployment of GPT-4o in Sweden Central with an ultra-low Tokens-per-Minute (TPM) quota of 1K. You then load-balance it with a GPT-4 deployment in France Central that has a higher TPM quota. Your test results might be similar to what is shown below, with successful routing to the France Central endpoint when the circuit breaker trips for the Sweden Central endpoint:
Run # 0: Sweden Central, Duration: 0.84, Response Code: 200
Pausing for 2 seconds...
Run # 1: None, Duration: 1.23, Response Code: 503
Pausing for 2 seconds...
Run # 2: France Central, Duration: 2.57, Response Code: 200
Pausing for 2 seconds...
Run # 3: France Central, Duration: 1.94, Response Code: 200
Pausing for 2 seconds...
Run # 4: France Central, Duration: 1.97, Response Code: 200
Pausing for 2 seconds...
Run # 5: France Central, Duration: 2.18, Response Code: 200
Pausing for 2 seconds...
Run # 6: France Central, Duration: 1.72, Response Code: 200
Pausing for 2 seconds...
Run # 7: France Central, Duration: 2.17, Response Code: 200
Pausing for 2 seconds...
Run # 8: France Central, Duration: 1.99, Response Code: 200
Pausing for 2 seconds...
Run # 9: Sweden Central, Duration: 0.88, Response Code: 200