Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Monitoring Serverless Workloads with OpenTelemetry and Prometheus #21

Merged
merged 3 commits into from
Jul 16, 2024
Merged
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
38 changes: 38 additions & 0 deletions talks/monitoring-serverless-workloads-opentelemetry-prometheus.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,38 @@
---
title: "Monitoring Serverless Workloads with OpenTelemetry and Prometheus"
labels: "cloud native, monitoring, observability, opentelemetry, prometheus, serverless"
year: 2024
---

# Monitoring Serverless Workloads with OpenTelemetry and Prometheus

**Speaker(s):** Ridwan Sharif, Google

**Video URL:** https://www.youtube.com/watch?v=r6qb3OivFyQ

**Summary:** The talk explains the challenges of observability in serverless environments, such as the transient nature of instances. To address these issues, Google introduced sidecars in Cloud Run to run alongside main containers for collecting telemetry data, and they implemented lifecycle dependencies to ensure proper telemetry collection during startup and shutdown. Adjustments were also made to support both push-based (e.g., OpenTelemetry) and pull-based (e.g., Prometheus) telemetry systems in serverless environments.

## Timestamps

- `00:00` - Introduction and Agenda
- `02:00` - Focus on Serverless
- `05:00` - Serverless Observability (O11y)
- `07:25` - Problem: serverless instances are expected to be short-lived
- `08:20` - Problem: the need to normalize cumulative metrics
- `08:50` - Problem: prometheus-based scraping expect metrics to be pulled
- `10:45` - Problem: need more control of how to configure collections
- `11:40` - Cloud Run infrastructure changes needed to accommodate o11y
- `14:35` - Adjustments needed for push-based OpenTelemetry Protocol (OTLP) ingest
- `15:05` - Adjustments needed for pull-based ingest (Prometheus metrics)
- `17:25` - How the productionize (deploy and configure) the collectors
- `19:00` - Next steps

## Key Takeaways

- Traditional monitoring tools face difficulties due to the ephemeral nature of serverless instances.
- Google introduced sidecars and lifecycle dependencies in Cloud Run to improve observability.
- Ongoing efforts include reducing resource overhead and addressing CPU throttling issues.

### Questions/Discussion Points

- These were the changes done in Cloud Run to accommodate o11y - what about other serverless services?