Skip to content

feat(BA-1213): Add detection and event notifications for kernel/container mismatches #4252

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Open
wants to merge 30 commits into
base: main
Choose a base branch
from

Conversation

fregataa
Copy link
Member

@fregataa fregataa commented Apr 22, 2025

resolves #4242 (BA-1213)

Checklist: (if applicable)

  • Milestone metadata specifying the target backport version
  • Mention to the original issue

@fregataa fregataa added this to the 25Q1 milestone Apr 22, 2025
@fregataa fregataa self-assigned this Apr 22, 2025
@fregataa fregataa requested a review from HyeockJinKim April 22, 2025 13:34
@github-actions github-actions bot added size:L 100~500 LoC comp:manager Related to Manager component comp:agent Related to Agent component comp:common Related to Common component labels Apr 22, 2025
@fregataa fregataa changed the title feat(BA-1213): Produce kernel/container dangling events feat(BA-1213): Add detection and event notifications for kernel/container mismatches Apr 22, 2025
for private_port, host_ports in src["NetworkSettings"]["Ports"].items():
private_port = int(private_port.split("/")[0])
if host_ports is None:
host_ip = "127.0.0.1"

Check notice

Code scanning / devskim

Accessing localhost could indicate debug code, or could hinder scaling. Note

Do not leave debug code in production
version: int
agent_config: Mapping[str, Any]
resource_spec: KernelResourceSpec
service_ports: Any # TODO: type-annotation

Check notice

Code scanning / devskim

A "TODO" or similar was left in source code, possibly indicating incomplete functionality Note

Suspicious comment
@github-actions github-actions bot added size:XL 500~ LoC and removed size:L 100~500 LoC labels Apr 28, 2025
@fregataa fregataa marked this pull request as ready for review April 28, 2025 14:39
async def handle_dangling_kernel(
context: AgentRegistry, source: AgentId, event: DanglingKernelDetected
) -> None:
# TODO: Impl dangling kernel handler

Check notice

Code scanning / devskim

A "TODO" or similar was left in source code, possibly indicating incomplete functionality Note

Suspicious comment
async def handle_dangling_container(
context: AgentRegistry, source: AgentId, event: DanglingContainerDetected
) -> None:
# TODO: Impl dangling container handler

Check notice

Code scanning / devskim

A "TODO" or similar was left in source code, possibly indicating incomplete functionality Note

Suspicious comment
Comment on lines 1423 to 1429
def _get_probe_runner(self) -> ProbeRunner:
probe = AgentProbe(
self.enumerate_containers,
self.get_kernel_registry,
self.event_producer,
)
return ProbeRunner(11.0, [probe])
Copy link
Collaborator

@HyeockJinKim HyeockJinKim Apr 29, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Don't make new object in getter.

Comment on lines -49 to +53
network_id: str,
image: ImageRef,
version: int,
args: KernelInitArgs,
network_driver: str,
*,
agent_config: Mapping[str, Any],
resource_spec: KernelResourceSpec,
service_ports: Any, # TODO: type-annotation
environ: Mapping[str, Any],
data: Dict[str, Any],
) -> None:
super().__init__(
ownership_data,
network_id,
image,
version,
agent_config=agent_config,
resource_spec=resource_spec,
service_ports=service_ports,
data=data,
environ=environ,
)
super().__init__(args)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍

Comment on lines 33 to 34
def _get_probe_runner(self) -> ProbeRunner:
return ProbeRunner.nop()
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Rather than having private methods in common, it seems better to have interfaces injected.

Comment on lines 37 to 67
class BaseKernelProbe(ABC):
def __init__(
self,
kernel_id: KernelId,
kernel_state_getter: Callable[..., KernelLifecycleStatus],
container_id_getter: Callable[..., Optional[ContainerId]],
event_producer: EventProducer,
) -> None:
self._kernel_id = kernel_id
self._container_id_getter = container_id_getter
self._kernel_state_getter = kernel_state_getter
self._event_producer = event_producer

@abstractmethod
async def _get_container_info(self) -> Optional[Container]:
raise NotImplementedError

def _compare_with_container(self, container: Optional[Container]) -> None:
kernel_state = self._kernel_state_getter()
match kernel_state:
case KernelLifecycleStatus.PREPARING:
if container is not None:
# container exists but kernel is hanging in PREPARING state
raise DanglingKernel
case KernelLifecycleStatus.RUNNING:
if container is None or container.status != ContainerStatus.RUNNING:
raise DanglingKernel
case KernelLifecycleStatus.TERMINATING:
# There might be a delay in the container status change
# after the kernel is being terminated.
pass
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If possible, I'd like to keep the implementation out of ABC.

@fregataa fregataa requested a review from HyeockJinKim May 3, 2025 08:49
@fregataa fregataa modified the milestones: 25.6, 25Q2 May 6, 2025
Comment on lines +289 to +291
@abstractmethod
def _init_probe_runner_obj(self) -> ProbeRunner:
raise NotImplementedError
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think private abstractmethod is harmful.

agent_config: Mapping[str, Any]
resource_spec: KernelResourceSpec
service_ports: Any # TODO: type-annotation
data: dict[Any, Any]
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fix the type of key as possible.

pass


class ProbeRunner(Generic[TResourceCtx]):
Copy link
Collaborator

@HyeockJinKim HyeockJinKim May 6, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Remove Probe prefix. but Runner is so generic name. please rename proper name.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
comp:agent Related to Agent component comp:common Related to Common component comp:manager Related to Manager component size:XL 500~ LoC
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Trigger Kernel/Container Dangling Events
2 participants