Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

New heartbeat automation marks running flow as Crashed #16459

Open
bkempe opened this issue Dec 19, 2024 · 2 comments
Open

New heartbeat automation marks running flow as Crashed #16459

bkempe opened this issue Dec 19, 2024 · 2 comments
Labels
bug Something isn't working

Comments

@bkempe
Copy link

bkempe commented Dec 19, 2024

Bug summary

I've set up the new automation to detect and respond to zombie flows.

However, the automation marked a running flow run as Crashed despite a recent heartbeat event.
The flow run continued to run, sent more heartbeat events, and got into Complete state.

Version info

Version: 3.1.8
Python: 3.12.8

Additional context

PREFECT_RUNNER_HEARTBEAT_FREQUENCY set to 31.

Automation trigger:

{
  "type": "event",
  "match": {
    "prefect.resource.id": [
      "prefect.flow-run.*"
    ]
  },
  "match_related": {},
  "after": [
    "prefect.flow-run.heartbeat"
  ],
  "expect": [
    "prefect.flow-run.Crashed",
    "prefect.flow-run.Failed",
    "prefect.flow-run.heartbeat",
    "prefect.flow-run.Completed",
    "prefect.flow-run.Cancelled"
  ],
  "for_each": [
    "prefect.resource.id"
  ],
  "posture": "Proactive",
  "threshold": 1,
  "within": 90
}

Logs

prefect.flow-run.Scheduled  06:02:07 PM
Worker 'KubernetesWorker 4c653bcf-cd8e-4b36-b9f5-03d1b53556cb' submitting flow run '117fa91f-8d45-4019-815b-32c0221713c3'  06:02:08 PM
Running on worker id: ddb8e354-cd55-4c77-bb2e-d67c24f77bcd. See worker logs here: https://app.prefect.cloud/account/f230f7a2-7128-44f5-ba98-3ac534612f5d/workspace/d212164e-16e9-4fd0-9fce-34f9fb0f953d/work-pools/work-pool/Kubernetes Production/worker/ddb8e354-cd55-4c77-bb2e-d67c24f77bcd  06:02:08 PM
prefect.flow-run.Pending  06:02:08 PM
prefect.worker.submitted-flow-run  06:02:08 PM
Creating Kubernetes job...  06:02:08 PM
Job 'my-job-4e5a232d-0acf-446d-9ff8-3-cs8kh': Starting watch for pod start...  06:02:09 PM
Completed submission of flow run '117fa91f-8d45-4019-815b-32c0221713c3'  06:02:09 PM
prefect.kubernetes.pod.pending  06:02:09 PM
Job 'my-job-4e5a232d-0acf-446d-9ff8-3-cs8kh': Pod 'my-job-4e5a232d-0acf-446d-9ff8-3-cs8kh-4n52x' has started.  06:02:09 PM
Job 'my-job-4e5a232d-0acf-446d-9ff8-3-cs8kh': Pod has status 'Pending'.  06:02:09 PM
Job 'my-job-4e5a232d-0acf-446d-9ff8-3-cs8kh': Pod 'my-job-4e5a232d-0acf-446d-9ff8-3-cs8kh-4n52x' has started.  06:02:09 PM
Job 'my-job-4e5a232d-0acf-446d-9ff8-3-cs8kh': Pod 'my-job-4e5a232d-0acf-446d-9ff8-3-cs8kh-4n52x' has started.  06:02:09 PM
Job 'my-job-4e5a232d-0acf-446d-9ff8-3-cs8kh': Pod 'my-job-4e5a232d-0acf-446d-9ff8-3-cs8kh-4n52x' has started.  06:02:12 PM
Job 'my-job-4e5a232d-0acf-446d-9ff8-3-cs8kh': Pod 'my-job-4e5a232d-0acf-446d-9ff8-3-cs8kh-4n52x' has started.  06:02:13 PM
prefect.kubernetes.pod.running  06:02:14 PM
Job 'my-job-4e5a232d-0acf-446d-9ff8-3-cs8kh': Pod 'my-job-4e5a232d-0acf-446d-9ff8-3-cs8kh-4n52x' has started.  06:02:14 PM
Job 'my-job-4e5a232d-0acf-446d-9ff8-3-cs8kh': Pod has status 'Running'.  06:02:14 PM
Opening process...  06:02:23 PM
prefect.flow-run.heartbeat  06:02:23 PM
Downloading flow code from storage at '.'  06:02:29 PM
prefect.flow-run.Running  06:02:31 PM
prefect.flow-run.heartbeat  06:02:48 PM
prefect.flow-run.heartbeat  06:03:12 PM
prefect.flow-run.heartbeat  06:03:46 PM
prefect.flow-run.heartbeat  06:04:23 PM
prefect.flow-run.heartbeat  06:05:03 PM
prefect.flow-run.heartbeat  06:05:27 PM
prefect.flow-run.heartbeat  06:06:04 PM
prefect.flow-run.heartbeat  06:06:36 PM
prefect.flow-run.heartbeat  06:07:01 PM
prefect.flow-run.heartbeat  06:07:26 PM
prefect.flow-run.heartbeat  06:08:05 PM
prefect.flow-run.heartbeat  06:08:38 PM
prefect.flow-run.heartbeat  06:09:05 PM
prefect.flow-run.heartbeat  06:09:33 PM
prefect.flow-run.heartbeat  06:10:04 PM
prefect.flow-run.heartbeat  06:10:40 PM
prefect.flow-run.heartbeat  06:11:16 PM
prefect.flow-run.heartbeat  06:11:45 PM
prefect.flow-run.heartbeat  06:12:22 PM
prefect.flow-run.heartbeat  06:12:53 PM
prefect.flow-run.heartbeat  06:13:19 PM
prefect.flow-run.heartbeat  06:13:54 PM
prefect.flow-run.heartbeat  06:14:33 PM
prefect.flow-run.heartbeat  06:15:09 PM
prefect.flow-run.heartbeat  06:15:47 PM
prefect.flow-run.heartbeat  06:16:14 PM
prefect.flow-run.heartbeat  06:16:40 PM
prefect.flow-run.heartbeat  06:17:04 PM
prefect.flow-run.heartbeat  06:17:43 PM
prefect.flow-run.heartbeat  06:18:12 PM
prefect.flow-run.heartbeat  06:18:36 PM
prefect.flow-run.heartbeat  06:19:05 PM
prefect.flow-run.heartbeat  06:19:29 PM
prefect.flow-run.heartbeat  06:19:55 PM
prefect.flow-run.heartbeat  06:20:21 PM
prefect.flow-run.heartbeat  06:20:56 PM
prefect.flow-run.heartbeat  06:21:21 PM
prefect.flow-run.heartbeat  06:21:45 PM
prefect.flow-run.heartbeat  06:22:21 PM
prefect.flow-run.heartbeat  06:22:46 PM
prefect.flow-run.heartbeat  06:23:10 PM
prefect.flow-run.heartbeat  06:23:38 PM
prefect.flow-run.heartbeat  06:24:11 PM
prefect.flow-run.heartbeat  06:24:48 PM
prefect.flow-run.heartbeat  06:25:26 PM
prefect.flow-run.heartbeat  06:26:06 PM
prefect.flow-run.heartbeat  06:26:37 PM
prefect.flow-run.heartbeat  06:27:15 PM
prefect.flow-run.heartbeat  06:27:42 PM
prefect.flow-run.heartbeat  06:28:13 PM
prefect.flow-run.heartbeat  06:28:44 PM
prefect.flow-run.heartbeat  06:29:15 PM
prefect.flow-run.heartbeat  06:29:39 PM
prefect.flow-run.heartbeat  06:30:10 PM
prefect.flow-run.heartbeat  06:30:35 PM
prefect.flow-run.heartbeat  06:31:09 PM
prefect.flow-run.heartbeat  06:31:42 PM
prefect.flow-run.heartbeat  06:32:17 PM
prefect.flow-run.heartbeat  06:32:51 PM
prefect.flow-run.heartbeat  06:33:29 PM
prefect.flow-run.heartbeat  06:34:00 PM
prefect.flow-run.heartbeat  06:34:28 PM
prefect.flow-run.heartbeat  06:34:54 PM
prefect.flow-run.heartbeat  06:35:27 PM
prefect.flow-run.heartbeat  06:36:02 PM
prefect.flow-run.heartbeat  06:36:28 PM
prefect.flow-run.heartbeat  06:37:00 PM
prefect.flow-run.heartbeat  06:37:33 PM
prefect.flow-run.heartbeat  06:38:05 PM
prefect.flow-run.heartbeat  06:38:38 PM
prefect.flow-run.heartbeat  06:39:19 PM
prefect.flow-run.heartbeat  06:39:44 PM
prefect.flow-run.heartbeat  06:40:09 PM
prefect.flow-run.heartbeat  06:40:43 PM
prefect.flow-run.heartbeat  06:41:06 PM
prefect.flow-run.heartbeat  06:41:31 PM
prefect.flow-run.heartbeat  06:41:59 PM
prefect.flow-run.heartbeat  06:42:27 PM
prefect.flow-run.heartbeat  06:42:55 PM
prefect.flow-run.heartbeat  06:43:27 PM
prefect.flow-run.heartbeat  06:43:52 PM
prefect-cloud.automation.action.triggered  06:43:53 PM
prefect.flow-run.Crashed  06:43:54 PM
prefect-cloud.automation.action.triggered  06:43:54 PM
prefect-cloud.automation.action.triggered  06:43:54 PM
prefect-cloud.automation.action.executed  06:43:54 PM
prefect-cloud.automation.action.triggered  06:43:54 PM
prefect-cloud.automation.action.failed  06:43:54 PM
prefect-cloud.automation.action.triggered  06:43:54 PM
prefect-cloud.automation.action.triggered  06:43:54 PM
prefect-cloud.automation.action.failed  06:43:54 PM
prefect-cloud.automation.action.failed  06:43:54 PM
prefect-cloud.automation.action.failed  06:43:55 PM
prefect-cloud.automation.action.triggered  06:43:55 PM
prefect-cloud.automation.action.failed  06:43:55 PM
prefect-cloud.automation.action.failed  06:43:55 PM
prefect.flow-run.heartbeat  06:44:24 PM
prefect.flow-run.heartbeat  06:44:51 PM
prefect.flow-run.heartbeat  06:45:16 PM
prefect.flow-run.heartbeat  06:45:42 PM
prefect.flow-run.heartbeat  06:46:13 PM
prefect.flow-run.heartbeat  06:46:37 PM
prefect.flow-run.Completed  06:46:41 PM
Finished in state Completed()  06:46:41 PM
Process for flow run 'my-job-4e5a232d-0acf-446d-9ff8-3fd5158d7413-2024-12-19_17-02-06' exited cleanly.  06:46:43 PM
prefect.worker.executed-flow-run  06:46:57 PM

Details for event 06:43:52 PM prefect-cloud.automation.action.triggered:

{
  "id": "c72e5b22-fc0b-4eca-9704-4fa53b118b6d",
  "account": "f230f7a2-7128-44f5-ba98-3ac534612f5d",
  "event": "prefect-cloud.automation.action.triggered",
  "occurred": "2024-12-19T17:43:53.962Z",
  "payload": {
    "action_index": 0,
    "action_type": "change-flow-run-state",
    "invocation": "32e4bae6-19a4-4977-8fe8-206b4de2fd36"
  },
  "received": "2024-12-19T17:43:54.203Z",
  "related": [
    {
      "prefect.resource.id": "prefect.flow-run.117fa91f-8d45-4019-815b-32c0221713c3",
      "prefect.resource.role": "target"
    }
  ],
  "resource": {
    "prefect.resource.id": "prefect-cloud.automation.905058bc-c58b-4659-b04b-6885a0567043",
    "prefect.resource.name": "Crash zombie flows",
    "prefect-cloud.trigger-type": "event",
    "prefect-cloud.posture": "Proactive"
  },
  "workspace": "d212164e-16e9-4fd0-9fce-34f9fb0f953d"
}

Details for event 06:43:54 PM prefect-cloud.automation.action.failed:

{
  "id": "d2be71dc-c61a-49cd-baa1-dde7c2456d77",
  "account": "f230f7a2-7128-44f5-ba98-3ac534612f5d",
  "event": "prefect-cloud.automation.action.failed",
  "occurred": "2024-12-19T17:43:54.264Z",
  "payload": {
    "action_index": 0,
    "action_type": "change-flow-run-state",
    "invocation": "14560e8a-090c-4533-8ac1-2faf42378353",
    "reason": "Failed to set state: Run is already in terminal state CRASHED.",
    "status_code": 200
  },
  "received": "2024-12-19T17:43:54.264Z",
  "related": [
    {
      "prefect.resource.id": "prefect.flow-run.117fa91f-8d45-4019-815b-32c0221713c3",
      "prefect.resource.role": "target"
    }
  ],
  "resource": {
    "prefect.resource.id": "prefect-cloud.automation.905058bc-c58b-4659-b04b-6885a0567043",
    "prefect.resource.name": "Crash zombie flows",
    "prefect-cloud.trigger-type": "event",
    "prefect-cloud.posture": "Proactive"
  },
  "workspace": "d212164e-16e9-4fd0-9fce-34f9fb0f953d"
}
@desertaxle
Copy link
Member

Thanks for the bug report @bkempe! It looks like there are several heartbeat events clustered together in your log. Did you have multiple flow runs or running at the same time?

@bkempe
Copy link
Author

bkempe commented Dec 19, 2024

@desertaxle Sorry, something went wrong the first time I copy&pasted the event log. I fixed it above. Do any events still look clustered?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working
Projects
None yet
Development

No branches or pull requests

2 participants