
bloom-builder: "panic: duplicate metrics collector registration attempted" #14083

Closed
diranged opened this issue Sep 9, 2024 · 11 comments · Fixed by #15994
Labels
feature/blooms, type/bug (Something is not working as expected)

Comments

diranged commented Sep 9, 2024

Describe the bug
We are trying to use the new bloom-builder and bloom-planner components introduced by @chaudum in #14003, but even after creating the /var/loki volume (see #14082), the builder crashes on startup with this error:

level=info ts=2024-09-09T17:23:51.681576259Z caller=main.go:126 msg="Starting Loki" version="(version=release-3.1.x-89fe788, branch=release-3.1.x, revision=89fe788d)"
level=info ts=2024-09-09T17:23:51.681628661Z caller=main.go:127 msg="Loading configuration file" filename=/etc/loki/config/config.yaml
level=info ts=2024-09-09T17:23:51.681644678Z caller=modules.go:748 component=bloomstore msg="no metas cache configured"
level=info ts=2024-09-09T17:23:51.681730499Z caller=blockscache.go:420 component=bloomstore msg="run ttl evict job"
level=info ts=2024-09-09T17:23:51.681753203Z caller=blockscache.go:380 component=bloomstore msg="run lru evict job"
level=info ts=2024-09-09T17:23:51.681816379Z caller=blockscache.go:365 component=bloomstore msg="run metrics collect job"
level=info ts=2024-09-09T17:23:51.686655187Z caller=server.go:352 msg="server listening on addresses" http=[::]:3100 grpc=[::]:9095
panic: duplicate metrics collector registration attempted

goroutine 1 [running]:
github.com/prometheus/client_golang/prometheus.(*Registry).MustRegister(0x4737440, {0x4000f04610?, 0x0?, 0x0?})
	/src/loki/vendor/github.com/prometheus/client_golang/prometheus/registry.go:405 +0x78
github.com/prometheus/client_golang/prometheus/promauto.Factory.NewCounter({{0x2fadad0?, 0x4737440?}}, {{0x25ded97, 0x4}, {0x0, 0x0}, {0x260e2dc, 0x14}, {0x261d67e, 0x18}, ...})
	/src/loki/vendor/github.com/prometheus/client_golang/prometheus/promauto/auto.go:265 +0x128
github.com/grafana/loki/v3/pkg/storage/bloom/v1.NewMetrics({0x2fadad0, 0x4737440})
	/src/loki/pkg/storage/bloom/v1/metrics.go:62 +0x7c
github.com/grafana/loki/v3/pkg/bloombuild/builder.New({{0x6400000, 0x6400000, {0x0, 0x0}, 0x0, 0x0, 0x0, {0x5f5e100, 0x2540be400, 0xa}, ...}, ...}, ...)
	/src/loki/pkg/bloombuild/builder/builder.go:65 +0x154
github.com/grafana/loki/v3/pkg/loki.(*Loki).initBloomBuilder(0x4000fe3008)
	/src/loki/pkg/loki/modules.go:1586 +0x2b4
github.com/grafana/dskit/modules.(*Manager).initModule(0x40000d88e8, {0xffffc8fbdc0a, 0xd}, 0x4001b78fe8, 0x4000b72b70)
	/src/loki/vendor/github.com/grafana/dskit/modules/modules.go:136 +0x194
github.com/grafana/dskit/modules.(*Manager).InitModuleServices(0x40000d88e8, {0x4000c5cb10, 0x1, 0x1?})
	/src/loki/vendor/github.com/grafana/dskit/modules/modules.go:108 +0xb0
github.com/grafana/loki/v3/pkg/loki.(*Loki).Run(0x4000fe3008, {0x0?, {0x4?, 0x2?, 0x4737aa0?}})
	/src/loki/pkg/loki/loki.go:458 +0x74
main.main()
	/src/loki/cmd/loki/main.go:129 +0x10ac
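
The panic comes from client_golang's MustRegister, which refuses a second collector with an already-registered name on the same registry. A minimal standalone Go sketch, with illustrative names rather than Loki's actual ones, reproduces the same panic:

package main

import (
	"github.com/prometheus/client_golang/prometheus"
	"github.com/prometheus/client_golang/prometheus/promauto"
)

// newMetrics stands in for a component constructor (such as v1.NewMetrics
// in the trace above) that registers its collectors on a shared registry.
func newMetrics(reg prometheus.Registerer) prometheus.Counter {
	return promauto.With(reg).NewCounter(prometheus.CounterOpts{
		Namespace: "loki",
		Name:      "example_operations_total", // hypothetical metric name
		Help:      "Illustrative counter.",
	})
}

func main() {
	reg := prometheus.NewRegistry()
	_ = newMetrics(reg) // first call: registration succeeds
	_ = newMetrics(reg) // second call panics: duplicate metrics collector registration attempted
}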

chaudum commented Sep 11, 2024

Hi @diranged, did you run Loki using the vanilla Helm chart?

I got a different panic (see #14110) but could not reproduce the duplicate metrics registration.


chaudum commented Sep 12, 2024

I am able to reproduce this state now. Loki built from main does not have this issue, so it needs to be fixed on the release-3.1.x branch only.

diranged (Author) commented

Thank you for working to reproduce the issue!

fculpo commented Oct 7, 2024

Hi, same issue with the Helm chart, loki@6.16.0.

JStickler added the feature/blooms and type/bug (Something is not working as expected) labels on Oct 29, 2024
vladst3f commented

getting this panic now on the latest main-aec8e96 and k236-with-agg-metric-payload-fix-c5bd2ad tags.

vladst3f commented Jan 17, 2025

on k237:

level=debug ts=2025-01-17T20:44:08.946008947Z caller=index_set.go:316 table-name=loki_index_tsdb_20104 user-id=fake msg="syncing files for table loki_index_tsdb_20104"
panic: duplicate metrics collector registration attempted
goroutine 1 [running]:
github.com/prometheus/client_golang/prometheus.(*Registry).MustRegister(0x6b67fe0, {0xc0014a8420?, 0x3f59e1f?, 0x14?})
	/src/loki/vendor/github.com/prometheus/client_golang/prometheus/registry.go:406 +0x66
github.com/prometheus/client_golang/prometheus/promauto.Factory.NewCounterVec({{0x46c7450?, 0x6b67fe0?}}, {{0x3f215d7, 0x4}, {0x3f59e1f, 0x14}, {0x3f41a0a, 0xe}, {0x3ffa55d, 0x32}, ...}, ...)
	/src/loki/vendor/github.com/prometheus/client_golang/prometheus/promauto/auto.go:276 +0x163
github.com/grafana/loki/v3/pkg/bloomgateway.newClientMetrics({0x46c7450, 0x6b67fe0})
	/src/loki/pkg/bloomgateway/metrics.go:33 +0x9b
github.com/grafana/loki/v3/pkg/bloomgateway.NewClient({{0x37e11d600}, {0x6400000, 0x6400000, {0x0, 0x0}, 0x0, 0x0, 0x0, {0x5f5e100, 0x2540be400, ...}, ...}, ...}, ...)
	/src/loki/pkg/bloomgateway/client.go:147 +0xd4
github.com/grafana/loki/v3/pkg/loki.(*Loki).initBloomBuilder(0xc0018c4000)
	/src/loki/pkg/loki/modules.go:1678 +0x72c
github.com/grafana/dskit/modules.(*Manager).initModule(0xc00104de90, {0x7ffc71a5079d, 0x7}, 0xc001ca9848, 0xc0012ba600)
	/src/loki/vendor/github.com/grafana/dskit/modules/modules.go:136 +0x1ea
github.com/grafana/dskit/modules.(*Manager).InitModuleServices(0xc00104de90, {0xc001031090, 0x1, 0x7510c18f88e2e5ce?})
	/src/loki/vendor/github.com/grafana/dskit/modules/modules.go:108 +0xe8
github.com/grafana/loki/v3/pkg/loki.(*Loki).Run(0xc0018c4000, {0x0?, {0x4?, 0x2?, 0x6b68760?}})
	/src/loki/pkg/loki/loki.go:491 +0x97
main.main()
	/src/loki/cmd/loki/main.go:129 +0x1305


chaudum commented Jan 24, 2025

@vladst3f Could you post your Loki config.yaml? Which -target= value do you run with?

vladst3f commented

@chaudum, it's an SSD deployment, and it panics on the backend pods.
The requested config of the lab where I tested the upgrade from 3.3.0 is:

config.yaml: |

    analytics:
      reporting_enabled: false
    auth_enabled: false
    bloom_build:
      builder:
        planner_address: loki-backend-headless.observability.svc.cluster.local:9095
      enabled: true
      planner:
        max_table_offset: 7
        planning_interval: 2h
        queue:
          max_queued_tasks_per_tenant: 300000
        retention:
          enabled: true
    bloom_gateway:
      block_query_concurrency: 12
      client:
        addresses: dns+loki-backend-headless.observability.svc.cluster.local:9095
      enabled: true
      max_outstanding_per_tenant: 10240
      num_multiplex_tasks: 512
      worker_concurrency: 6
    chunk_store_config:
      chunk_cache_config:
        background:
          writeback_buffer: 500000
          writeback_goroutines: 1
          writeback_size_limit: 500MB
        default_validity: 18h
        memcached:
          batch_size: 256
          parallelism: 10
        memcached_client:
          addresses: dnssrvnoa+_memcached-client._tcp.loki-chunks-cache.observability.svc
          consistent_hash: true
          max_idle_conns: 72
          timeout: 2000ms
    common:
      compactor_address: 'http://loki-backend:3100'
      path_prefix: /var/loki
      replication_factor: 3
      storage:
        s3:
          access_key_id: ${S3_ACCESS_KEY}
          bucketnames: ${S3_MULTIBUCKET}
          endpoint: ${S3_ENDPOINT}
          http_config:
            insecure_skip_verify: true
          insecure: false
          region: eu-west-1
          s3forcepathstyle: true
          secret_access_key: ${S3_SECRET_ACCESS_KEY}
    compactor:
      compaction_interval: 5m
      delete_batch_size: 2100
      delete_request_cancel_period: 10m
      delete_request_store: s3-multibucket
      max_compaction_parallelism: 2
      retention_delete_worker_count: 300
      retention_enabled: true
      upload_parallelism: 20
    frontend:
      log_queries_longer_than: 10s
      max_outstanding_per_tenant: 4096
      scheduler_address: ""
      tail_proxy_url: ""
    frontend_worker:
      scheduler_address: ""
    index_gateway:
      mode: simple
    ingester:
      chunk_encoding: snappy
      chunk_target_size: 4194304
      flush_op_timeout: 10m
      max_chunk_age: 168h
      wal:
        enabled: false
    limits_config:
      allow_structured_metadata: true
      bloom_creation_enabled: true
      bloom_gateway_enable_filtering: true
      cardinality_limit: 1000000
      discover_service_name:
      - service_name
      - job
      ingestion_burst_size_mb: 300
      ingestion_rate_mb: 200
      ingestion_rate_strategy: local
      max_cache_freshness_per_query: 5m
      max_entries_limit_per_query: 50000
      max_global_streams_per_user: 0
      max_line_size: 0
      max_querier_bytes_read: 0
      max_query_parallelism: 64
      max_query_series: 20000
      max_streams_matchers_per_query: 5000
      per_stream_rate_limit: 200MB
      per_stream_rate_limit_burst: 500MB
      query_timeout: 5m
      reject_old_samples: false
      reject_old_samples_max_age: 168h
      retention_period: 744h
      shard_streams:
        enabled: false
      split_queries_by_interval: 15m
      tsdb_max_query_parallelism: 300
      tsdb_sharding_strategy: bounded
      unordered_writes: true
      volume_enabled: true
    memberlist:
      cluster_label: loki
      join_members:
      - loki-memberlist
    pattern_ingester:
      enabled: true
    querier:
      max_concurrent: 10
      query_ingesters_within: 169h
    query_range:
      align_queries_with_step: true
      cache_results: true
      parallelise_shardable_queries: true
      results_cache:
        cache:
          background:
            writeback_buffer: 500000
            writeback_goroutines: 1
            writeback_size_limit: 500MB
          default_validity: 12h
          memcached_client:
            addresses: dnssrvnoa+_memcached-client._tcp.loki-results-cache.observability.svc
            consistent_hash: true
            timeout: 500ms
            update_interval: 1m
    query_scheduler:
      max_outstanding_requests_per_tenant: 32768
    ruler:
      alertmanager_url: http://prometheus-alertmanager-headless.monitoring-system.svc.cluster.local:9093/
      enable_alertmanager_v2: true
      enable_api: true
      enable_sharding: true
      evaluation:
        mode: remote
        query_frontend:
          address: dns:///loki-read.observability.svc.cluster.local.:9095
      external_url: 'REDACTED'
      remote_write:
        clients:
          prometheusReplica0:
            queue_config:
              capacity: 10000
              retry_on_http_429: true
            url: http://kps-prometheus-replica-0.monitoring-system.svc.cluster.local:9090/api/v1/write
          prometheusReplica1:
            queue_config:
              capacity: 10000
              retry_on_http_429: true
            url: http://kps-prometheus-replica-1.monitoring-system.svc.cluster.local:9090/api/v1/write
        enabled: true
      ring:
        kvstore:
          store: inmemory
      rule_path: /var/loki/scratch
      sharding_algo: by-rule
      storage:
        local:
          directory: /var/loki/rules
        type: local
      wal:
        dir: /var/loki/wal
    runtime_config:
      file: /etc/loki/runtime-config/runtime-config.yaml
    schema_config:
      configs:
      - from: "2023-05-01"
        index:
          period: 24h
          prefix: loki_index_tsdb_
        object_store: s3
        row_shards: 32
        schema: v12
        store: tsdb
      - from: "2023-11-29"
        index:
          period: 24h
          prefix: loki_index_tsdb_
        object_store: s3-multibucket
        row_shards: 32
        schema: v12
        store: tsdb
      - from: "2024-05-09"
        index:
          period: 24h
          prefix: loki_index_tsdb_
        object_store: s3-multibucket
        row_shards: 32
        schema: v13
        store: tsdb
    server:
      grpc_listen_port: 9095
      grpc_server_max_concurrent_streams: 2000
      grpc_server_max_recv_msg_size: 90971520
      grpc_server_max_send_msg_size: 90971520
      http_listen_port: 3100
      http_server_idle_timeout: 20m
      http_server_read_timeout: 10m
      http_server_write_timeout: 10m
      log_level: debug
    storage_config:
      bloom_shipper:
        working_directory: /var/loki/data/blooms
      boltdb_shipper:
        index_gateway_client:
          server_address: ""
      hedging:
        at: 250ms
        max_per_second: 20
        up_to: 3
      named_stores:
        aws:
          s3-multibucket:
            access_key_id: ${S3_ACCESS_KEY}
            bucketnames: ${S3_MULTIBUCKET}
            endpoint: ${S3_ENDPOINT}
            http_config:
              insecure_skip_verify: true
            region: eu-west-1
            s3forcepathstyle: true
            secret_access_key: ${S3_SECRET_ACCESS_KEY}
      tsdb_shipper:
        index_gateway_client:
          server_address: dns+loki-backend-headless.observability.svc.cluster.local:9095
    tracing:
      enabled: false

vladst3f commented

Hi @chaudum, were you able to reproduce? I can try some things out if you think this might be a configuration issue.

chaudum added a commit that referenced this issue Jan 29, 2025
The bloom gateway client is used both in the bloom builder and in the
index gateways. When running Loki in SSD mode, both services are part of
the `backend` target, and therefore the client is initialised twice,
leading to duplicate metrics registration and a subsequent panic.

This commit extracts the initialisation of the bloom gateway client into
a separate service that is started once and becomes a dependency of both
bloom builder and index gateway.

Ref: #14083

Signed-off-by: Christian Haudum <christian.haudum@gmail.com>
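
A minimal sketch of the pattern the commit describes, using hypothetical names rather than Loki's actual identifiers: the client is constructed once, so its metrics are registered a single time, and both dependents receive the shared instance.

package main

import (
	"github.com/prometheus/client_golang/prometheus"
	"github.com/prometheus/client_golang/prometheus/promauto"
)

// gatewayClient is a stand-in for the bloom gateway client.
type gatewayClient struct {
	requests prometheus.Counter
}

// initGatewayClient registers the client's metrics exactly once.
func initGatewayClient(reg prometheus.Registerer) *gatewayClient {
	return &gatewayClient{
		requests: promauto.With(reg).NewCounter(prometheus.CounterOpts{
			Name: "bloom_gateway_client_requests_total", // hypothetical name
			Help: "Illustrative counter.",
		}),
	}
}

// Both consumers use the shared client instead of constructing
// (and re-registering) their own.
func initBloomBuilder(c *gatewayClient) { c.requests.Inc() }
func initIndexGateway(c *gatewayClient) { c.requests.Inc() }

func main() {
	reg := prometheus.NewRegistry()
	client := initGatewayClient(reg) // single registration
	initBloomBuilder(client)         // shared by the bloom builder...
	initIndexGateway(client)         // ...and the index gateway
}

In Loki itself this is a separate service that is started once and becomes a dependency of both components, as the commit message above describes.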

chaudum commented Jan 29, 2025

@vladst3f I was able to reproduce the issue and pushed a fix: #15994

vladst3f commented

cheers @chaudum, much appreciated
