[BUG] Agent auto discovery conflicts with explicit task type mapping #6349

Open
2 tasks done
thomas-maschler opened this issue Mar 19, 2025 · 1 comment
Labels
bug Something isn't working

Comments


thomas-maschler commented Mar 19, 2025

Describe the bug

When using agent canary deployments, requests for every task_type registered with an agent service are routed to that service, even if the plugin config contains an explicit agentForTaskTypes mapping. When developing agents with flytekit, the sensor and webhook agents are always installed and registered with the service, and there is no way to unregister them other than patching the flytekit code when building the Docker image.

If the service account for the canary agent service has different permissions, or the agent service is configured differently (for example, different create/get timeouts), sensor or webhook calls routed to it might behave unexpectedly or fail.

Because task_type auto-discovery ignores the explicit mapping, agent development also becomes harder: developers have to keep track of which agents are installed on the image and cannot simply rely on the mapping in the configuration.

Expected behavior

Option A:

  • Respect the explicit task_type mapping over the implicit (auto-discovery) mapping. If the config lists only selected task_types for an agent service, route only those task_types to that agent.
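
A minimal sketch of the precedence Option A asks for, written in plain Python and not taken from the propeller code. The names `build_routing_table`, `explicit`, and `discovered` are hypothetical, and the task types advertised by `defaultAgent` are assumed for illustration. Auto-discovery fills the table first, then the explicit mapping is applied last so it always wins.

```python
from typing import Dict, List


def build_routing_table(
    explicit: Dict[str, str],
    discovered: Dict[str, List[str]],
) -> Dict[str, str]:
    """Map each task type to an agent ID, letting the explicit config win."""
    routing: Dict[str, str] = {}
    # Auto-discovery: the first agent advertising a task type claims it.
    for agent_id, task_types in discovered.items():
        for task_type in task_types:
            routing.setdefault(task_type, agent_id)
    # The explicit mapping is applied last, so it overrides discovery.
    routing.update(explicit)
    return routing


# Explicit mapping in the spirit of agentForTaskTypes (task type -> agent ID).
explicit = {
    "test_task": "test_agent",
    "sensor": "defaultAgent",
    "webhook": "defaultAgent",
}
# Task types each agent service advertises (assumed values for illustration).
discovered = {
    "test_agent": ["sensor", "test_task", "webhook"],
    "defaultAgent": ["sensor", "webhook"],
}

routing = build_routing_table(explicit, discovered)
print(routing)
```

With this ordering, `sensor` and `webhook` stay on `defaultAgent` even though the canary service also advertises them.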

Option B:

  • Disallow explicit task_type mapping and rely only on the implicit (auto-discovery) mapping. In that case, developers need a way to skip registering default agents such as sensor and webhook, so that they are in full control of which task_types their service supports.
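
One way the opt-out in Option B could look, sketched with a toy registry. `ToyAgentRegistry` and its `disabled_task_types` parameter are hypothetical and are not part of flytekit's `AgentRegistry`; the sketch only illustrates letting a service decline default agents at registration time.

```python
from typing import Dict, Iterable, List, Set


class ToyAgentRegistry:
    """Stand-in for an agent registry; NOT flytekit's AgentRegistry."""

    def __init__(self, disabled_task_types: Iterable[str] = ()) -> None:
        # Task types the service owner does not want to serve (hypothetical knob).
        self._disabled: Set[str] = set(disabled_task_types)
        self._agents: Dict[str, str] = {}

    def register(self, task_type: str, agent_name: str) -> bool:
        """Register an agent unless its task type is disabled."""
        if task_type in self._disabled:
            return False
        self._agents[task_type] = agent_name
        return True

    def task_types(self) -> List[str]:
        """Return the task types this service would advertise."""
        return sorted(self._agents)


registry = ToyAgentRegistry(disabled_task_types=["sensor", "webhook"])
registry.register("sensor", "Sensor")
registry.register("webhook", "Webhook Agent")
registry.register("test_task", "Base Async Agent")
print(registry.task_types())  # ['test_task']
```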

Additional context to reproduce

Develop an agent

agent.py

from flytekit.extend.backend.base_agent import (
    AsyncAgentBase,
    ResourceMeta,
)

class AsciBaseAgent(AsyncAgentBase):
    def __init__(self) -> None:
        super().__init__(task_type_name="test_task", metadata_type=ResourceMeta)
    
    def create(self, *args, **kwargs):
        pass
    
    def get(self, *args, **kwargs):
        pass

    def delete(self, *args, **kwargs):
        pass

__init__.py

from flytekit.extend.backend.base_agent import (
    AgentRegistry,
)

from flyte_agent_auto_discovery.agent import AsciBaseAgent
AgentRegistry.register(AsciBaseAgent())

pyproject.toml

[project]
authors = []
dependencies = [
    "flytekit>=1.15.3,<2",
    "prometheus_client>=0.21.1,<0.22",
    "grpcio-health-checking>=1.62.2,<2",
    "httpx>=0.28.1,<0.29",
]
name = "flyte-agent-auto-discovery"
requires-python = ">= 3.11"
version = "0.1.0"

# install the Flyte agent as a flytekit plugin
[project.entry-points."flytekit.plugins"]
test_agent = "flyte_agent_auto_discovery"

Test locally. The output shows that the sensor and webhook agents are always installed:

pyflyte serve agent

🚀 Starting the agent service...
Starting up the server to expose the prometheus metrics...
                  Agent Metadata                   
┏━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━┓
┃ Agent Name       ┃ Support Task Types ┃ Is Sync ┃
┡━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━┩
│ Sensor           │ sensor (v0)        │ False   │
│ Base Async Agent │ test_task (v0)     │ False   │
│ Webhook Agent    │ webhook (v0)       │ True    │
└──────────────────┴────────────────────┴─────────┘

Now configure the cluster

plugins:
  agent-service:
    # By default, all requests will be sent to the default agent.
    defaultAgent:
      endpoint: "k8s://flyteagent.flyte:8000"
      insecure: true
      timeouts:
        # CreateTask, GetTask and DeleteTask are for async agents.
        # ExecuteTaskSync is for sync agents.
        CreateTask: 5s
        GetTask: 5s
        DeleteTask: 5s
        ExecuteTaskSync: 10s
      defaultTimeout: 10s
    agents:
      test_agent:
        endpoint: "dns:///test-flyteagent.flyte.svc.cluster.local:8000"
        insecure: false
        defaultServiceConfig: '{"loadBalancingConfig": [{"round_robin":{}}]}'
        timeouts:
          GetTask: 5s
        defaultTimeout: 10s
    agentForTaskTypes:
      # Overrides the default agent for the listed task types, which means propeller will send those requests to the mapped agent.
      - test_task: test_agent
      - sensor: defaultAgent
      - webhook: defaultAgent

Build a Flyte workflow that calls the sensor or webhook task. The request is routed to the canary agent (test_agent) instead of defaultAgent.
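
The mismatch can also be spotted before running a workflow by comparing the explicit mapping with what the canary service advertises. The sketch below is plain Python with hypothetical names (`find_conflicts`, `explicit`, `advertised`). The advertised task types are copied from the `pyflyte serve agent` table, and the mapping mirrors the intent of the `agentForTaskTypes` section.

```python
from typing import Dict, List, Tuple


def find_conflicts(
    explicit: Dict[str, str],
    advertised: Dict[str, List[str]],
) -> List[Tuple[str, str, str]]:
    """Return (task_type, configured_agent, advertising_agent) mismatches."""
    conflicts = []
    for agent_id, task_types in advertised.items():
        for task_type in task_types:
            configured = explicit.get(task_type)
            # A conflict: the config names one agent, but another also claims the type.
            if configured is not None and configured != agent_id:
                conflicts.append((task_type, configured, agent_id))
    return sorted(conflicts)


# Task type -> agent ID, as intended by the config.
explicit = {
    "test_task": "test_agent",
    "sensor": "defaultAgent",
    "webhook": "defaultAgent",
}
# What the canary service reports when it starts.
advertised = {"test_agent": ["sensor", "test_task", "webhook"]}

conflicts = find_conflicts(explicit, advertised)
for task_type, configured, advertiser in conflicts:
    print(f"{task_type}: configured for {configured}, also advertised by {advertiser}")
```

Here the check flags `sensor` and `webhook`, the two task types that end up on the canary agent despite being mapped to `defaultAgent`.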

Screenshots

No response

Are you sure this issue hasn't been raised already?

  • Yes

Have you read the Code of Conduct?

  • Yes
@thomas-maschler thomas-maschler added bug Something isn't working untriaged This issues has not yet been looked at by the Maintainers labels Mar 19, 2025
@eapolinario eapolinario removed the untriaged This issues has not yet been looked at by the Maintainers label Mar 27, 2025
@pingsutw
Member

Respect the explicit task_type mapping over the implicit (auto-discovery) mapping. If the config lists only selected task_types for an agent service only route those task_types to that agent

Agree with you. We should always override the agent registry here.

Projects
Status: Backlog
Development

No branches or pull requests

3 participants