Azure API Management's AI Gateway for AOAI use-case scenarios

Azure API Management (API-M) helps you publish and securely manage custom Application Programming Interfaces (APIs), acting as a gateway between clients and backend APIs.

Azure OpenAI (AOAI) lets you deploy and use OpenAI's powerful large language models (LLMs), such as GPT-4o, on Azure to process and generate multimodal content and to integrate them easily with other solutions of your choice.

In this repo, I'll demonstrate how to combine the functionalities of API-M and AOAI to enable the following use-case scenarios:

  • Enforce a custom token limit, so that calling apps can share AOAI backends without causing "noisy neighbour" situations;
  • Get a detailed token usage breakdown, to understand consumption and accurately cross-charge costs to customers or business functions;
  • Enable load-balancing between target AOAI deployments to ensure data residency, performance and reliability of your AI solutions.

Table of contents:

  • Scenario 1: Enforcing custom token limit
  • Scenario 2: Usage analysis by specific customer
  • Scenario 3: Load-balancing between several AOAI endpoints

Scenario 1: Enforcing custom token limit

This section describes setting up API-M and then performing end-to-end testing of the token limit enforcement scenario.

  1. In the Azure portal, navigate to your API Management resource. Under APIs, click Add API and then select the "Azure OpenAI Service" tile under the "Create from Azure resource" category. (Screenshot: APIM - Adding APIs for AOAI)
  2. Select your existing AOAI resource and enter values for the Display Name and Name fields. Optionally, you can tick the SDK Compatibility box to enable OpenAI-compatible consumption of the exposed APIs from popular Generative AI frameworks and libraries. (Screenshot: APIM - Defining AOAI endpoint)
  3. After clicking Next, enable the "Manage token consumption" API-M policy and set your desired Tokens-per-Minute (TPM) limit. You can optionally add "consumed tokens" and "remaining tokens" headers to the API-M endpoint's responses. (Screenshot: APIM - Enabling TPM policy)

Note: The provided Jupyter notebook assumes you have both headers enabled and it was tested against an API-M endpoint with a 100 TPM limit.

  4. Once you click the Create button, a new set of APIs will be provisioned to support interactions with various AOAI models. API-M will also add the token limit policy to all new API operations. Technical details of this policy can be found in the azure-openai-token-limit policy reference:
<policies>
    <inbound>
        <set-backend-service id="apim-generated-policy" backend-id="aoai-tpm-limit-openai-endpoint" />
        <azure-openai-token-limit tokens-per-minute="100" counter-key="@(context.Subscription.Id)" estimate-prompt-tokens="false" tokens-consumed-header-name="consumed-tokens" remaining-tokens-header-name="remaining-tokens" />
        <authentication-managed-identity resource="https://cognitiveservices.azure.com/" />
        <base />
    </inbound>
    <backend>
        <base />
    </backend>
    <outbound>
        <base />
    </outbound>
    <on-error>
        <base />
    </on-error>
</policies>
  5. To test your TPM limit, ensure that you set the following 4 environment variables before running the notebook (Screenshot: APIM - Setting TPM environment variables):

Environment Variable    Description
APIM_TPM_AOAI_DEPLOY    Name of the AOAI deployment
APIM_TPM_API_VERSION    API version of the AOAI endpoint
APIM_TPM_SUB_KEY        Subscription key, scoped to the target API-M APIs
APIM_TPM_URL            URL of the provisioned API-M API for the AOAI endpoint
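For reference, below is a minimal sketch of how these variables might be read inside the notebook; the local variable names simply match those used by the helper functions further down:
import os

# Scenario 1 settings, read from the environment variables listed above
APIM_TPM_URL = os.environ["APIM_TPM_URL"]              # e.g. https://<your-apim>.azure-api.net/<api-suffix>/
APIM_TPM_SUB_KEY = os.environ["APIM_TPM_SUB_KEY"]      # API-M subscription key
AOAI_DEPLOYMENT = os.environ["APIM_TPM_AOAI_DEPLOY"]   # AOAI deployment name, e.g. gpt-4o
AOAI_API_VERSION = os.environ["APIM_TPM_API_VERSION"]  # AOAI API version string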
  6. We can now use a helper function to interact with the AOAI backend through the API-M endpoint:
import requests

def get_rest_completion(system_prompt, user_prompt):
    # Call the chat completions API exposed by API-M, authenticating with the API-M subscription key
    response = requests.post(
        url = f"{APIM_TPM_URL}openai/deployments/{AOAI_DEPLOYMENT}/chat/completions",
        headers = {
            "Content-Type": "application/json",
            "api-key": APIM_TPM_SUB_KEY
        },
        params={'api-version': AOAI_API_VERSION},
        json = {
            "messages": [
                {
                   "role": "system",
                    "content": system_prompt
                },
                {
                    "role": "user",
                    "content": user_prompt
                }
            ]
        }
    )
    return response
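The notebook then calls this helper in a loop and reads the two optional headers enabled in Step 3. Below is a minimal sketch of such a test loop; the run count and pause are chosen to match the sample output in the next step, and SYSTEM_PROMPT / USER_PROMPT are illustrative placeholders:
import time

NUMBER_OF_RUNS = 5
SLEEP_TIME = 15  # seconds between calls
SYSTEM_PROMPT = "You are a helpful assistant."                      # illustrative prompt
USER_PROMPT = "Describe API gateways in a couple of sentences."     # illustrative prompt

for i in range(NUMBER_OF_RUNS):
    start_time = time.time()
    response = get_rest_completion(SYSTEM_PROMPT, USER_PROMPT)
    print(f"Run # {i} completed in {time.time() - start_time:.2f} seconds")

    if response.status_code == 200:
        # Headers added by the "Manage token consumption" policy in Step 3
        print(f"Consumed tokens: {response.headers.get('consumed-tokens')}")
        print(f"Remaining tokens: {response.headers.get('remaining-tokens')}")
    else:
        print(f"Response code: {response.status_code}")
        print(f"Response message: {response.text}")

    if i < NUMBER_OF_RUNS - 1:
        print(f"Pausing for {SLEEP_TIME} seconds...")
        time.sleep(SLEEP_TIME)
    print("-----------------------------")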
  7. If you set your TPM value to 100 and the average consumption of tokens in your request is about 50, then after a few API calls you should reach the token limit, with API-M enforcing the new policy as shown in the testing results below:
Run # 0 completed in 1.93 seconds
Consumed tokens: 59
Remaining tokens: 41
Pausing for 15 seconds...
-----------------------------
Run # 1 completed in 0.78 seconds
Consumed tokens: 55
Remaining tokens: 0
Pausing for 15 seconds...
-----------------------------
Run # 2 completed in 0.40 seconds
Response code: 429
Response message: Token limit is exceeded. Try again in 29 seconds.
Pausing for 15 seconds...
-----------------------------
Run # 3 completed in 0.35 seconds
Response code: 429
Response message: Token limit is exceeded. Try again in 14 seconds.
Pausing for 15 seconds...
-----------------------------
Run # 4 completed in 0.91 seconds
Consumed tokens: 55
Remaining tokens: 0
-----------------------------
  8. If you enabled SDK compatibility in Step 2 above, you can use the OpenAI Python SDK to interact with your AOAI models through the API-M endpoint. Here's how to instantiate the AzureOpenAI class with your API-M subscription key:
from openai import AzureOpenAI

# Point the SDK at the API-M endpoint and authenticate with the API-M subscription key
client = AzureOpenAI(
    azure_endpoint = APIM_TPM_URL,
    api_key = APIM_TPM_SUB_KEY,
    api_version = AOAI_API_VERSION
)
  9. This provides an OpenAI-compatible interface; an example helper function is shown below:
def get_sdk_completion(system_prompt, prompt):
    messages = [
        {"role": "system", "content": system_prompt},
        {"role": "user", "content": prompt}
    ]

    response = client.chat.completions.create(
        model = AOAI_DEPLOYMENT,
        messages = messages
    )
    return response
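Calling this helper returns the usual SDK response object, so the completion text and token usage can be read directly from it (the prompts below are illustrative):
response = get_sdk_completion(
    system_prompt = "You are a helpful assistant.",
    prompt = "Explain in one sentence what an API gateway does."
)
print(response.choices[0].message.content)             # generated completion
print(f"Total tokens: {response.usage.total_tokens}")  # token usage reported by AOAI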
  10. When using the SDK interface you might not see 429 errors: the OpenAI SDK retries throttled requests automatically, so API-M's enforcement of the TPM limit policy shows up as longer call durations instead:
Run # 0 completed in 45.82 seconds
Consumed tokens: 55
Remaining tokens: 0
Pausing for 15 seconds...
-----------------------------
Run # 1 completed in 0.65 seconds
Consumed tokens: 55
Remaining tokens: 0
Pausing for 15 seconds...
-----------------------------
Run # 2 completed in 30.87 seconds
Consumed tokens: 55
Remaining tokens: 0
Pausing for 15 seconds...
-----------------------------
Run # 3 completed in 1.20 seconds
Consumed tokens: 63
Remaining tokens: 0
Pausing for 15 seconds...
-----------------------------
Run # 4 completed in 29.88 seconds
Consumed tokens: 56
Remaining tokens: 0
-----------------------------
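If you also want to inspect the policy's "consumed-tokens" and "remaining-tokens" headers when going through the SDK, the OpenAI Python library (v1.x) can expose the raw HTTP response; a minimal sketch:
# Issue the same call, but keep access to the raw HTTP response and its headers
raw = client.chat.completions.with_raw_response.create(
    model = AOAI_DEPLOYMENT,
    messages = [
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Hello!"}
    ]
)
print(f"Consumed tokens: {raw.headers.get('consumed-tokens')}")
print(f"Remaining tokens: {raw.headers.get('remaining-tokens')}")

completion = raw.parse()  # regular ChatCompletion object
print(completion.choices[0].message.content)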

Scenario 2: Usage analysis by specific customer

This section describes setting up API-M and then performing end-to-end testing of the token usage collection and visualisation process.

  1. Repeat Steps # 1 and 2 from Scenario 1 above.
  2. After clicking Next, enable the "Track token usage" API-M policy. Select an existing Application Insights instance to log token metrics to, and add the dimensions that you want the metrics to be grouped by. (Screenshot: APIM - Enabling Usage policy)

Note: The provided Jupyter notebook assumes that you have added Subscription ID as one of the logging dimensions.

  3. Once you click the Create button, a new set of APIs will be provisioned to support interactions with various AOAI models. API-M will also add the token usage metrics policy to all new API operations. Technical details of this policy can be found in the azure-openai-emit-token-metric policy reference:
<policies>
    <inbound>
        <set-backend-service id="apim-generated-policy" backend-id="aoai-usage-by-cx-openai-endpoint" />
        <azure-openai-emit-token-metric namespace="AzureOpenAI">
            <dimension name="Subscription ID" value="@(context.Subscription.Id)" />
        </azure-openai-emit-token-metric>
        <authentication-managed-identity resource="https://cognitiveservices.azure.com/" />
        <base />
    </inbound>
    <backend>
        <base />
    </backend>
    <outbound>
        <base />
    </outbound>
    <on-error>
        <base />
    </on-error>
</policies>
  4. If you want to log and visualise token usage, ensure that you set the following 5 environment variables before running the notebook (Screenshot: APIM - Setting Usage environment variables):

Environment Variable        Description
APIM_USAGE_AOAI_DEPLOY      Name of the AOAI deployment
APIM_USAGE_API_VERSION      API version of the AOAI endpoint
APIM_USAGE_KEY_CONTOSO      Subscription key created for the Contoso client
APIM_USAGE_KEY_NORTHWIND    Subscription key created for the Northwind client
APIM_USAGE_URL              URL of the provisioned API-M API for the AOAI endpoint
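The workload generator in the next step relies on the two client subscription keys and on a variant of the Scenario 1 helper that accepts a per-client key. Below is a minimal sketch of these prerequisites; the workload constants and prompts are illustrative:
import os
import random
import time
import requests

# Scenario 2 settings, read from the environment variables listed above
APIM_USAGE_URL = os.environ["APIM_USAGE_URL"]
AOAI_DEPLOYMENT = os.environ["APIM_USAGE_AOAI_DEPLOY"]
AOAI_API_VERSION = os.environ["APIM_USAGE_API_VERSION"]
SUBSCRIPTION_KEYS = [
    os.environ["APIM_USAGE_KEY_CONTOSO"],     # Contoso client
    os.environ["APIM_USAGE_KEY_NORTHWIND"]    # Northwind client
]

# Illustrative workload settings
NUMBER_OF_RUNS = 10
SLEEP_TIME = 2
SYSTEM_PROMPT = "You are a helpful assistant."
USER_PROMPT = "Give me a one-sentence fun fact."

def get_rest_completion(subscription_key, system_prompt, user_prompt):
    # Same REST call as in Scenario 1, but the API-M subscription key is passed per client
    return requests.post(
        url = f"{APIM_USAGE_URL}openai/deployments/{AOAI_DEPLOYMENT}/chat/completions",
        headers = {
            "Content-Type": "application/json",
            "api-key": subscription_key
        },
        params = {"api-version": AOAI_API_VERSION},
        json = {
            "messages": [
                {"role": "system", "content": system_prompt},
                {"role": "user", "content": user_prompt}
            ]
        }
    )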
  5. You can now generate a workload with a degree of randomness for both the Contoso and Northwind clients, each connected to the same Azure OpenAI deployment through the helper sketched above:
for key in SUBSCRIPTION_KEYS:
    randomness = random.randint(0, 5)
    for i in range(NUMBER_OF_RUNS - randomness):    
        start_time = time.time()
        response = get_rest_completion(subscription_key=key, system_prompt=SYSTEM_PROMPT, user_prompt=USER_PROMPT)
        end_time = time.time()
        print(f"Run # {i} completed in {end_time - start_time:.2f} seconds with response code {response.status_code}")
    
        if i < NUMBER_OF_RUNS - 1:
            print(f"Pausing for {SLEEP_TIME} seconds...")
            time.sleep(SLEEP_TIME)
    print("-----------------------------")
  6. The collected token usage logs can be visualised in Application Insights charts, e.g. total tokens split by the Subscription IDs of Contoso and Northwind, as shown below. (Screenshot: APIM - Visualising usage stats)

Scenario 3: Load-balancing between several AOAI endpoints

This section describes setting up API-M and then performing end-to-end testing of an AOAI load-balancing scenario.

  1. For each backend AOAI endpoint, you can configure circuit breaker logic using API-M's REST API. Such logic determines when to temporarily stop sending requests to an unhealthy endpoint. The provided LoadBalancer_CircuitBreaker.json can be re-used as a jump-start template; it trips the circuit breaker for 30 seconds if the AOAI endpoint returns a 429 (Too Many Requests) or a 5xx (server error) status within any 2-second interval.
{
    "properties": {
        "description": "<DESCRIPTION>",
        "title": "<TITLE>",
        "type": "Single",
        "protocol": "http",
        "url": "<URL>",
        "circuitBreaker": {
            "rules": [
                {
                    "failureCondition": {
                        "count": 1,
                        "interval": "PT2S",
                        "statusCodeRanges": [
                            {
                                "min": 429,
                                "max": 429
                            },
                            {
                                "min": 500,
                                "max": 599
                            }
                        ]
                    },
                    "name": "<NAME>",
                    "tripDuration": "PT30S",
                    "acceptRetryAfter": true
                }
            ]
        }
    }
}

Note: At the time of writing, configuring circuit breakers directly within the Azure portal UI for API-M was not supported.
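Since the portal UI does not expose this setting, the backend can be created (or updated) through the Azure Resource Manager REST API instead. Below is a minimal sketch using the requests and azure-identity packages; the subscription, resource group, service and backend names are placeholders, and the api-version only needs to be one that supports the circuitBreaker property (e.g. 2023-09-01-preview):
import json
import requests
from azure.identity import DefaultAzureCredential

SUBSCRIPTION_ID = "<AZURE_SUBSCRIPTION_ID>"
RESOURCE_GROUP = "<RESOURCE_GROUP>"
APIM_SERVICE = "<APIM_SERVICE_NAME>"
API_VERSION = "2023-09-01-preview"

def create_backend(backend_id, payload_file):
    # Acquire an ARM token and PUT the backend definition from the JSON template
    token = DefaultAzureCredential().get_token("https://management.azure.com/.default").token
    url = (
        f"https://management.azure.com/subscriptions/{SUBSCRIPTION_ID}"
        f"/resourceGroups/{RESOURCE_GROUP}/providers/Microsoft.ApiManagement"
        f"/service/{APIM_SERVICE}/backends/{backend_id}?api-version={API_VERSION}"
    )
    with open(payload_file) as f:
        payload = json.load(f)
    return requests.put(
        url,
        headers = {"Authorization": f"Bearer {token}", "Content-Type": "application/json"},
        json = payload
    )

# One backend per AOAI endpoint, each carrying the circuit-breaker rule from the template
print(create_backend("aoai-sweden-central", "LoadBalancer_CircuitBreaker.json").status_code)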

  2. You can combine your backend AOAI endpoints into a load-balancing pool, using round-robin, weight-based or priority-based logic. The provided LoadBalancer_Pool.json can be re-used as a jump-start template to configure such a pool.
{
    "properties": {
        "description": "<DESCRIPTION>",
        "title": "<TITLE>",
        "type": "Pool",
        "pool": {
            "services": [
                {
                    "id": "<BACKEND_1>",
                    "priority": 1
                },
                {
                    "id": "<BACKEND_2>",
                    "priority": 2
                }
            ]
        }
    }
}

Note: At the time of writing, configuring load-balancing pools directly within the Azure portal UI for API-M was not supported.
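The create_backend helper sketched in the previous step can be reused to register the pool itself, once the <BACKEND_1> / <BACKEND_2> placeholders in the template reference the backend IDs created above:
# Register the load-balancing pool over the previously created backends
print(create_backend("aoai-lb-pool", "LoadBalancer_Pool.json").status_code)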

  3. If you want to test load-balancing between your defined AOAI endpoints, ensure that you set the following 4 environment variables before running the provided Jupyter notebook (Screenshot: APIM - Setting Usage environment variables):

Environment Variable    Description
APIM_LB_AOAI_DEPLOY     Name of the AOAI deployment
APIM_LB_API_VERSION     API version of the AOAI endpoint
APIM_LB_SUB_KEY         Subscription key created for the load-balancing API-M endpoint
APIM_LB_URL             URL of the load-balancing API-M API for the AOAI endpoint
  4. Consider a use case where you configure a GPT-4o AOAI deployment in Sweden Central with an ultra-low Tokens-per-Minute (TPM) quota of 1K, and then load-balance it with a higher-quota GPT-4 deployment in France Central. Your test results might be similar to those shown below, with requests successfully routed to the France Central endpoint while the circuit breaker is tripped for the Sweden Central endpoint:
Run # 0: Sweden Central, Duration: 0.84, Response Code: 200
Pausing for 2 seconds...
Run # 1: None, Duration: 1.23, Response Code: 503
Pausing for 2 seconds...
Run # 2: France Central, Duration: 2.57, Response Code: 200
Pausing for 2 seconds...
Run # 3: France Central, Duration: 1.94, Response Code: 200
Pausing for 2 seconds...
Run # 4: France Central, Duration: 1.97, Response Code: 200
Pausing for 2 seconds...
Run # 5: France Central, Duration: 2.18, Response Code: 200
Pausing for 2 seconds...
Run # 6: France Central, Duration: 1.72, Response Code: 200
Pausing for 2 seconds...
Run # 7: France Central, Duration: 2.17, Response Code: 200
Pausing for 2 seconds...
Run # 8: France Central, Duration: 1.99, Response Code: 200
Pausing for 2 seconds...
Run # 9: Sweden Central, Duration: 0.88, Response Code: 200
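For reference, the region name printed for each run above comes from the x-ms-region header that Azure OpenAI adds to its responses. Below is a minimal sketch of such a test loop, assuming a REST helper like the one from Scenario 1, pointed at the APIM_LB_URL endpoint:
import time

NUMBER_OF_RUNS = 10
SLEEP_TIME = 2  # seconds, matching the output above

for i in range(NUMBER_OF_RUNS):
    start_time = time.time()
    response = get_rest_completion(SYSTEM_PROMPT, USER_PROMPT)
    region = response.headers.get("x-ms-region")  # AOAI region that served the request, if available
    print(f"Run # {i}: {region}, Duration: {time.time() - start_time:.2f}, Response Code: {response.status_code}")
    if i < NUMBER_OF_RUNS - 1:
        print(f"Pausing for {SLEEP_TIME} seconds...")
        time.sleep(SLEEP_TIME)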
