Pushing a ruby application fails with DNS server misbehaving Ubuntu Noble stemcell #987

Open
Tracked by #1224
davewalter opened this issue Jan 23, 2025 · 8 comments

@davewalter
Member

Current behavior

The noble-stemcell-validation pipeline is failing to stage an application with the Ruby buildpack when running the CF Acceptance tests:

  [2025-01-20 09:24:59.76 (UTC)]> cf push CATS-3-BRKR-031938ce78aeb3aa -b ruby_buildpack -m 256M -p assets/service_broker --health-check-type http --endpoint /v2/catalog 
  Pushing app CATS-3-BRKR-031938ce78aeb3aa to org CATS-3-ORG-26d97ea0e18920fe / space CATS-3-SPACE-fd8ffc94232fef3a as CATS-3-USER-d8b3e9e08c300515...
  Packaging files to upload...
  Uploading files...
  269.15 KiB / 269.15 KiB [====================] 100.00% 1s

  Waiting for API to complete processing files...

  Staging app and tracing logs...
     Downloading ruby_buildpack...
     Downloaded ruby_buildpack
     Cell a6736615-7775-4211-9041-f79ce23c5668 creating container for instance 8d5cf9a5-cded-4773-a925-74363d5b2e30
     Security group rules were updated
     Cell a6736615-7775-4211-9041-f79ce23c5668 successfully created container for instance 8d5cf9a5-cded-4773-a925-74363d5b2e30
     Downloading app package...
     Downloaded app package (956.7K)
     -----> Ruby Buildpack version 1.10.19
     -----> Bootstrapping Ruby
     -----> Installing ruby 3.2.5
     Download [https://buildpacks.cloudfoundry.org/dependencies/ruby/ruby_3.2.5_linux_x64_cflinuxfs4_7cb2e65f.tgz]
     error: Get "https://buildpacks.cloudfoundry.org/dependencies/ruby/ruby_3.2.5_linux_x64_cflinuxfs4_7cb2e65f.tgz": dial tcp: lookup buildpacks.cloudfoundry.org on 169.254.0.2:53: server misbehaving, retrying in 1.333232717s...
     error: Get "https://buildpacks.cloudfoundry.org/dependencies/ruby/ruby_3.2.5_linux_x64_cflinuxfs4_7cb2e65f.tgz": dial tcp: lookup buildpacks.cloudfoundry.org on 169.254.0.2:53: server misbehaving, retrying in 1.540439299s...
     error: Get "https://buildpacks.cloudfoundry.org/dependencies/ruby/ruby_3.2.5_linux_x64_cflinuxfs4_7cb2e65f.tgz": dial tcp: lookup buildpacks.cloudfoundry.org on 169.254.0.2:53: server misbehaving, retrying in 1.51748035s...
     error: Get "https://buildpacks.cloudfoundry.org/dependencies/ruby/ruby_3.2.5_linux_x64_cflinuxfs4_7cb2e65f.tgz": dial tcp: lookup buildpacks.cloudfoundry.org on 169.254.0.2:53: server misbehaving, retrying in 2.376751591s...
     error: Get "https://buildpacks.cloudfoundry.org/dependencies/ruby/ruby_3.2.5_linux_x64_cflinuxfs4_7cb2e65f.tgz": dial tcp: lookup buildpacks.cloudfoundry.org on 169.254.0.2:53: server misbehaving, retrying in 3.506204711s...
     error: Get "https://buildpacks.cloudfoundry.org/dependencies/ruby/ruby_3.2.5_linux_x64_cflinuxfs4_7cb2e65f.tgz": dial tcp: lookup buildpacks.cloudfoundry.org on 169.254.0.2:53: server misbehaving, retrying in 8.971668577s...
     error: Get "https://buildpacks.cloudfoundry.org/dependencies/ruby/ruby_3.2.5_linux_x64_cflinuxfs4_7cb2e65f.tgz": dial tcp: lookup buildpacks.cloudfoundry.org on 169.254.0.2:53: server misbehaving, retrying in 9.466592854s...
     error: Get "https://buildpacks.cloudfoundry.org/dependencies/ruby/ruby_3.2.5_linux_x64_cflinuxfs4_7cb2e65f.tgz": dial tcp: lookup buildpacks.cloudfoundry.org on 169.254.0.2:53: server misbehaving, retrying in 13.170218427s...
     error: Get "https://buildpacks.cloudfoundry.org/dependencies/ruby/ruby_3.2.5_linux_x64_cflinuxfs4_7cb2e65f.tgz": dial tcp: lookup buildpacks.cloudfoundry.org on 169.254.0.2:53: server misbehaving, retrying in 16.675823105s...
  BuildpackCompileFailed - App staging failed in the buildpack compile phase
  FAILED

Desired behavior

The application should successfully stage and start.

Affected Version

2.112.0

@MarcPaquette
Contributor

Hi @davewalter

What indication do you have this is a Diego issue at this time?

Is there any place where we can access the Diego cell logs?

Thanks!

@dimivel

dimivel commented Jan 27, 2025

hi @MarcPaquette,

To access the Diego cell logs, you need to log in to the stable-bellatrix BBL environment.
This is how you can do it:

  • go to /relint-envs/environments/test/bellatrix

  • and execute
    docker run -it -v $PWD:/home/bbl cloudfoundry/cf-deployment-concourse-tasks

  • after that, navigate to the bbl-state directory and load the BBL environment
    cd /home/bbl/bbl-state
    eval "$(bbl print-env)"

  • list the BOSH VMs and SSH into one of the diego-cell VMs
    sudo su -
    bosh vms
    bosh -d cf ssh diego-cell/79a1de3b-e814-4c0a-b431-a5f2c2985970

  • under /var/vcap/sys/log/ you can see logs for all jobs
    diego-cell/79a1de3b-e814-4c0a-b431-a5f2c2985970:/var/vcap/sys/log/

Let me know if you have issues accessing the environment.

You can also have a look at the run-cats Concourse job here:
https://concourse.wg-ard.ci.cloudfoundry.org/teams/main/pipelines/noble-stemcell-validation/jobs/run-cats/builds/1

Thanks!
Dimitar

@dimivel

dimivel commented Feb 3, 2025

hi @MarcPaquette,

Did you manage to access the Diego cell logs?
Do you have any findings and feedback related to this issue?

Thanks and regards,
Dimitar

@MarcPaquette
Contributor

MarcPaquette commented Feb 13, 2025

Unfortunately, we do not have permissions to access the linked environments.

From a Slack thread discussing the issue:

**ERROR** Error during bootstrap: could not download: Get "https://buildpacks.cloudfoundry.org/dependencies/ruby/ruby_3.2.5_linux_x64_cflinuxfs4_7cb2e65f.tgz": dial tcp: lookup buildpacks.cloudfoundry.org on 169.254.0.2:53: server misbehaving
 Failed to compile droplet: Failed to run all supply scripts: exit status 15
 Exit status 223

Questions we need answered before we can proceed:

  • Is this only for the Ruby application?
  • What do the BOSH DNS logs look like?
  • Is it only the Ruby buildpack experiencing this issue or all online buildpacks?
  • Do offline buildpacks work?

@dimivel

dimivel commented Feb 13, 2025

The issue is not limited to the Ruby buildpack. It affects all online buildpacks.
Here is the error message for the Nginx buildpack:

     Downloading app package...
     Downloaded app package (275B)
     -----> Nginx Buildpack version 1.2.20
     -----> Supplying nginx
     -----> No nginx version specified - using mainline => 1.27.2
     -----> Installing nginx 1.27.2
     Download [https://buildpacks.cloudfoundry.org/dependencies/nginx/nginx_1.27.2_linux_x64_cflinuxfs4_25323d95.tgz]
     error: Get "https://buildpacks.cloudfoundry.org/dependencies/nginx/nginx_1.27.2_linux_x64_cflinuxfs4_25323d95.tgz": dial tcp: lookup buildpacks.cloudfoundry.org on 169.254.0.2:53: server misbehaving, retrying in 597.761789ms...
     error: Get "https://buildpacks.cloudfoundry.org/dependencies/nginx/nginx_1.27.2_linux_x64_cflinuxfs4_25323d95.tgz": dial tcp: lookup buildpacks.cloudfoundry.org on 169.254.0.2:53: server misbehaving, retrying in 1.654064522s...
     error: Get "https://buildpacks.cloudfoundry.org/dependencies/nginx/nginx_1.27.2_linux_x64_cflinuxfs4_25323d95.tgz": dial tcp: lookup buildpacks.cloudfoundry.org on 169.254.0.2:53: server misbehaving, retrying in 2.272499387s...
     error: Get "https://buildpacks.cloudfoundry.org/dependencies/nginx/nginx_1.27.2_linux_x64_cflinuxfs4_25323d95.tgz": dial tcp: lookup buildpacks.cloudfoundry.org on 169.254.0.2:53: server misbehaving, retrying in 2.207866693s...
     error: Get "https://buildpacks.cloudfoundry.org/dependencies/nginx/nginx_1.27.2_linux_x64_cflinuxfs4_25323d95.tgz": dial tcp: lookup buildpacks.cloudfoundry.org on 169.254.0.2:53: server misbehaving, retrying in 6.944909715s...
     error: Get "https://buildpacks.cloudfoundry.org/dependencies/nginx/nginx_1.27.2_linux_x64_cflinuxfs4_25323d95.tgz": dial tcp: lookup buildpacks.cloudfoundry.org on 169.254.0.2:53: server misbehaving, retrying in 3.926117394s...
     error: Get "https://buildpacks.cloudfoundry.org/dependencies/nginx/nginx_1.27.2_linux_x64_cflinuxfs4_25323d95.tgz": dial tcp: lookup buildpacks.cloudfoundry.org on 169.254.0.2:53: server misbehaving, retrying in 13.728837133s...
     error: Get "https://buildpacks.cloudfoundry.org/dependencies/nginx/nginx_1.27.2_linux_x64_cflinuxfs4_25323d95.tgz": dial tcp: lookup buildpacks.cloudfoundry.org on 169.254.0.2:53: server misbehaving, retrying in 9.703373385s...
     **ERROR** Could not install nginx: could not download: Get "https://buildpacks.cloudfoundry.org/dependencies/nginx/nginx_1.27.2_linux_x64_cflinuxfs4_25323d95.tgz": dial tcp: lookup buildpacks.cloudfoundry.org on 169.254.0.2:53: server misbehaving
     Failed to compile droplet: Failed to run all supply scripts: exit status 14
     Exit status 223
     Cell f8adee76-2440-412d-93e1-4b0a738b1561 stopping instance f66ca808-ac36-4beb-95d9-6b2ec9db3c3d
     Cell f8adee76-2440-412d-93e1-4b0a738b1561 destroying container for instance f66ca808-ac36-4beb-95d9-6b2ec9db3c3d
  BuildpackCompileFailed - App staging failed in the buildpack compile phase
  FAILED
  [FAILED] in [It] - /tmp/build/33ac16d1/cf-acceptance-tests/detect/buildpacks.go:224 @ 02/13/25 08:19:40.388

@beyhan
Member

beyhan commented Feb 14, 2025

Adding the Slack message below from @jpalermo because it describes well what has changed in Noble for bosh-dns:

Bosh-dns is VERY different under Noble and if Diego is setting up the container to use 169.254.0.2:53 as a DNS resolver, that might be the root of the problem.
In Noble we use systemd-resolved as the primary DNS resolver, and it’s got its own address that it listens on. Bosh-dns still listens on 169.254.0.2, but since systemd-resolved is now responsible for queries, bosh-dns isn’t configured with any recursors, so if you ask it for anything other than bosh-dns name lookups, it won’t be able to answer them.
I think systemd-resolved is listening on 127.0.0.53:53, so it might just be a matter of Diego needing to change the container config to use that instead (and maybe other container-related things to allow network traffic through).

More details about the change are available here.
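
To compare how the two resolvers mentioned above answer, a minimal Python sketch along these lines could be run on a diego-cell. It sends one hand-built DNS query over UDP to a given server and prints the response code. The function names are made up for illustration, and the assumption that the Go resolver's "server misbehaving" message corresponds to an error response code such as SERVFAIL or REFUSED is not confirmed anywhere in this thread.

```python
import socket
import struct

# Response code names from the DNS header (RFC 1035, section 4.1.1).
RCODES = {0: "NOERROR", 1: "FORMERR", 2: "SERVFAIL",
          3: "NXDOMAIN", 4: "NOTIMP", 5: "REFUSED"}


def build_query(hostname: str, query_id: int = 0x1234) -> bytes:
    """Build a minimal DNS query for an A record with recursion desired."""
    # Header: id, flags (RD=1), QDCOUNT=1, ANCOUNT=0, NSCOUNT=0, ARCOUNT=0
    header = struct.pack("!HHHHHH", query_id, 0x0100, 1, 0, 0, 0)
    # QNAME: each label prefixed by its length, terminated by a zero byte
    qname = b"".join(
        bytes([len(label)]) + label.encode("ascii")
        for label in hostname.rstrip(".").split(".")
    ) + b"\x00"
    # QTYPE=1 (A), QCLASS=1 (IN)
    return header + qname + struct.pack("!HH", 1, 1)


def parse_rcode(response: bytes) -> str:
    """Return the response code name from the flags field of a DNS header."""
    if len(response) < 12:
        raise ValueError("response shorter than a DNS header")
    flags = struct.unpack("!H", response[2:4])[0]
    rcode = flags & 0x000F
    return RCODES.get(rcode, f"RCODE{rcode}")


def query_rcode(server: str, hostname: str, timeout: float = 2.0) -> str:
    """Send one UDP query to server:53; return the rcode name or the error."""
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.settimeout(timeout)
        try:
            sock.sendto(build_query(hostname), (server, 53))
            response, _ = sock.recvfrom(512)
        except OSError as exc:  # includes timeouts
            return f"error: {exc}"
    return parse_rcode(response)


if __name__ == "__main__":
    # Addresses taken from this thread: bosh-dns and systemd-resolved.
    for server in ("169.254.0.2", "127.0.0.53"):
        print(server, query_rcode(server, "buildpacks.cloudfoundry.org"))
```

If bosh-dns really has no recursors configured, the first address would be expected to return an error code or no answer for an external name, while the second should return NOERROR on the cell. Running the same script inside a container would show whether either address is usable from there.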

@dimivel

dimivel commented Feb 17, 2025

hi @MarcPaquette,
I managed to redeploy CF with an adjusted DNS server IP:

  - name: silk-cni
    release: silk
    properties:
      dns_servers:
      - 127.0.0.53

As a result, I was able to push a sample Go application, but DNS resolution inside the application container doesn't work:

root@956f2846d91a:/home/bbl/test-app# cf ssh test-app
vcap@1632893c-1927-4f73-7ec1-54d3:~$

vcap@1632893c-1927-4f73-7ec1-54d3:~$ nslookup google.com
;; communications error to 127.0.0.53#53: connection refused
;; communications error to 127.0.0.53#53: connection refused
;; communications error to 127.0.0.53#53: connection refused
;; no servers could be reached


vcap@1632893c-1927-4f73-7ec1-54d3:~$ cat /etc/resolv.conf
nameserver 127.0.0.53

On the diego-cell itself, DNS works as expected:

diego-cell/5e18ae14-4d95-49b9-9c0d-363725b5dd1b:~$ cat /etc/resolv.conf
nameserver 127.0.0.53


diego-cell/5e18ae14-4d95-49b9-9c0d-363725b5dd1b:~$ nslookup google.com
Server:		127.0.0.53
Address:	127.0.0.53#53

Non-authoritative answer:
Name:	google.com
Address: 142.250.185.110
Name:	google.com
Address: 2a00:1450:4001:81c::200e
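
One possible explanation for the difference, which nothing in this thread confirms: 127.0.0.53 is a loopback address, and an application container normally has its own network namespace with its own loopback interface. A query to 127.0.0.53 from inside the container would then never reach systemd-resolved on the cell, which would be consistent with "connection refused". The small Python sketch below only classifies the nameserver entries of a resolv.conf, to flag addresses that are unlikely to be reachable across a namespace boundary. The function name is hypothetical.

```python
import ipaddress


def classify_nameservers(resolv_conf_text: str) -> dict:
    """Map each nameserver address in resolv.conf text to its address kind."""
    result = {}
    for line in resolv_conf_text.splitlines():
        parts = line.split()
        if len(parts) >= 2 and parts[0] == "nameserver":
            addr = ipaddress.ip_address(parts[1])
            if addr.is_loopback:
                kind = "loopback"      # only reachable inside the same namespace
            elif addr.is_link_local:
                kind = "link-local"    # e.g. the bosh-dns address 169.254.0.2
            else:
                kind = "other"
            result[parts[1]] = kind
    return result


if __name__ == "__main__":
    with open("/etc/resolv.conf") as f:
        print(classify_nameservers(f.read()))
```

For the resolv.conf shown above, this reports 127.0.0.53 as loopback both on the cell and in the container, even though the two loopback interfaces would be different ones.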

@beyhan
Member

beyhan commented Feb 18, 2025

Just a clarification for the comment above by @dimivel: the simple Go application was pushed with the binary buildpack and didn't require any internet connection during staging. That is why the push succeeded. It does not mean that the staging container has working DNS resolution.
