Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Decouple VA1 network stage from control plane stage #489

Open
wants to merge 1 commit into
base: main
Choose a base branch
from

Conversation

abays
Copy link
Contributor

@abays abays commented Jan 30, 2025

This removes the networking components...

  1. MetalLB
  2. NetConfig
  3. Net-attach-def

...from the control plane stage of VA1 and gives them their own separate stage in between the NNCP stage and the control plane stage. This is needed for the UniGamma adoption job, which relies upon this VA. They want to create the NNCPs, MetalLB CRs, NetConfig CR and Net-attach-defs, but do NOT want to create the OpenStackControlPlane. Now they can do so by creating an automation file with only two stages, one for NNCPs and the other for the other networking components.

I have tested this in my local env and the VA deployed successfully via CIFMW.

@openshift-ci openshift-ci bot requested review from cjeanner and raukadah January 30, 2025 20:38
Copy link

openshift-ci bot commented Jan 30, 2025

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: abays

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

Copy link
Contributor

Build failed (check pipeline). Post recheck (without leading slash)
to rerun all jobs. Make sure the failure cause has been resolved before
you rerun jobs.

https://softwarefactory-project.io/zuul/t/rdoproject.org/buildset/5c9ea5c7d8284ddcb63e13f31762b9ba

✔️ noop SUCCESS in 0s
✔️ rhoso-architecture-validate-hci SUCCESS in 5m 01s
rhoso-architecture-validate-hci-adoption FAILURE in 4m 07s

@abays abays force-pushed the hci_va_network_reorg branch 4 times, most recently from 527e2fc to b3aef90 Compare January 30, 2025 21:32
@abays abays requested review from fultonj and removed request for cjeanner and raukadah January 31, 2025 00:01
@ciecierski ciecierski self-assigned this Jan 31, 2025
@arxcruz
Copy link
Contributor

arxcruz commented Feb 4, 2025

We need this patch also for all other unijobs, I would suggest to have it splited in different patches so we can test it separately

@abays abays force-pushed the hci_va_network_reorg branch from b3aef90 to 81c1b68 Compare February 4, 2025 10:04
@abays
Copy link
Contributor Author

abays commented Feb 4, 2025

The latest push moves the patches and replacements from the examples/va/hci/control-plane/network dir into a more proper location: va/hci/network. The output from the new network build stage is still the same.

@abays
Copy link
Contributor Author

abays commented Feb 4, 2025

The latest push moves the patches and replacements from the examples/va/hci/control-plane/network dir into a more proper location: va/hci/network. The output from the new network build stage is still the same.

I wonder if these network references should actually be renamed to networking instead to align with the underlying lib dir [1].

[1] https://github.com/openstack-k8s-operators/architecture/tree/main/lib/networking

@abays
Copy link
Contributor Author

abays commented Feb 4, 2025

@ciecierski @arxcruz Has anyone had a chance to test this downstream at all?

@fultonj
Copy link
Contributor

fultonj commented Feb 4, 2025

I'm +2 to merge this. Let's just confirm the jobs are green first.

@ciecierski
Copy link
Contributor

@ciecierski @arxcruz Has anyone had a chance to test this downstream at all?

Hi @abays,

I was able to test it downstream with manual run and @arxcruz too in the job.
Everything worked as expected for decoupling. There is only one problem with web hook observed when change is run from job. Patches are applied, but web hook is complaiing with error.

`TASK [kustomize_deploy : Apply generated content for examples/va/hci/control-plane _raw_params=oc apply -f {{ _cr }}] ***
Sunday 02 February 2025 20:57:09 +0200 (0:00:00.076) 0:05:18.397 *******
fatal: [localhost]: FAILED! => changed=true
cmd:

  • oc
  • apply
  • -f
  • /home/zuul/architecture/examples/va/hci/control-plane/networking.yaml
    delta: '0:00:00.641656'
    end: '2025-02-02 20:57:10.176771'
    msg: non-zero return code
    rc: 1
    start: '2025-02-02 20:57:09.535115'
    stderr: 'Error from server (InternalError): error when creating "/home/zuul/architecture/examples/va/hci/control-plane/networking.yaml": Internal error occurred: failed calling webhook "mnetconfig.kb.io": failed to call webhook: Post "https://infra-operator-webhook-service.openstack-operators.svc:443/mutate-network-openstack-org-v1beta1-netconfig?timeout=10s": no endpoints available for service "infra-operator-webhook-service"'
    stderr_lines:
    stdout: |-
    networkattachmentdefinition.k8s.cni.cncf.io/ctlplane created
    networkattachmentdefinition.k8s.cni.cncf.io/datacentre created
    networkattachmentdefinition.k8s.cni.cncf.io/internalapi created
    networkattachmentdefinition.k8s.cni.cncf.io/storage created
    networkattachmentdefinition.k8s.cni.cncf.io/tenant created
    ipaddresspool.metallb.io/ctlplane created
    ipaddresspool.metallb.io/internalapi created
    ipaddresspool.metallb.io/storage created
    ipaddresspool.metallb.io/tenant created
    l2advertisement.metallb.io/ctlplane created
    l2advertisement.metallb.io/internalapi created
    l2advertisement.metallb.io/storage created
    l2advertisement.metallb.io/tenant created
    stdout_lines: `

@fultonj
Copy link
Contributor

fultonj commented Feb 4, 2025

Good idea to s/network/networking/ in this patch just for consistency

@abays abays force-pushed the hci_va_network_reorg branch from 81c1b68 to acb7cd0 Compare February 4, 2025 15:32
@ciecierski
Copy link
Contributor

I think @arxcruz and team would like to test downstream uni greenfield job to be sure we don't break existing job, before this change is merged.

@abays
Copy link
Contributor Author

abays commented Feb 4, 2025

@ciecierski @arxcruz Has anyone had a chance to test this downstream at all?

Hi @abays,

I was able to test it downstream with manual run and @arxcruz too in the job. Everything worked as expected for decoupling. There is only one problem with web hook observed when change is run from job. Patches are applied, but web hook is complaiing with error.

`TASK [kustomize_deploy : Apply generated content for examples/va/hci/control-plane _raw_params=oc apply -f {{ _cr }}] *** Sunday 02 February 2025 20:57:09 +0200 (0:00:00.076) 0:05:18.397 ******* fatal: [localhost]: FAILED! => changed=true cmd:

* oc

* apply

* -f

* /home/zuul/architecture/examples/va/hci/control-plane/networking.yaml
  delta: '0:00:00.641656'
  end: '2025-02-02 20:57:10.176771'
  msg: non-zero return code
  rc: 1
  start: '2025-02-02 20:57:09.535115'
  stderr: 'Error from server (InternalError): error when creating "/home/zuul/architecture/examples/va/hci/control-plane/networking.yaml": Internal error occurred: failed calling webhook "mnetconfig.kb.io": failed to call webhook: Post "https://infra-operator-webhook-service.openstack-operators.svc:443/mutate-network-openstack-org-v1beta1-netconfig?timeout=10s": no endpoints available for service "infra-operator-webhook-service"'
  stderr_lines: 
  stdout: |-
  networkattachmentdefinition.k8s.cni.cncf.io/ctlplane created
  networkattachmentdefinition.k8s.cni.cncf.io/datacentre created
  networkattachmentdefinition.k8s.cni.cncf.io/internalapi created
  networkattachmentdefinition.k8s.cni.cncf.io/storage created
  networkattachmentdefinition.k8s.cni.cncf.io/tenant created
  ipaddresspool.metallb.io/ctlplane created
  ipaddresspool.metallb.io/internalapi created
  ipaddresspool.metallb.io/storage created
  ipaddresspool.metallb.io/tenant created
  l2advertisement.metallb.io/ctlplane created
  l2advertisement.metallb.io/internalapi created
  l2advertisement.metallb.io/storage created
  l2advertisement.metallb.io/tenant created
  stdout_lines: `

This likely another instance of this issue [1]. I do not think it is related to these changes, but rather due to the new install paradigm for OpenStack operator.

[1] https://issues.redhat.com/browse/OSPCIX-633

@arxcruz
Copy link
Contributor

arxcruz commented Feb 4, 2025

We will test it, I just ask a little bit of patience since unijobs are not quickly to test, we don’t t have enough resources to test everything at once.

@abays
Copy link
Contributor Author

abays commented Feb 4, 2025

We will test it, I just ask a little bit of patience since unijobs are not quickly to test, we don’t t have enough resources to test everything at once.

Cool. There's no rush on my end. This change is solely to support what you guys are trying to do, so whatever pace works for you is good for me too.

@dsariel
Copy link
Contributor

dsariel commented Feb 5, 2025

@jirimacku took OSPNW-908 which I guess relates to the downstream verification but better to wait for Jiri back from PTO.

I have "accidentally" verified that this change is not breaking uni07eta but it was expected without any verification :-)

Meanwhile created #490 to decouple uni07eta.

@ciecierski ciecierski self-requested a review February 5, 2025 08:57
@ciecierski ciecierski removed their assignment Feb 5, 2025
@abays abays force-pushed the hci_va_network_reorg branch from acb7cd0 to 6ac61dd Compare February 5, 2025 09:27
@abays
Copy link
Contributor Author

abays commented Feb 5, 2025

The latest push moves the patches and replacements from the examples/va/hci/control-plane/network dir into a more proper location: va/hci/network. The output from the new network build stage is still the same.

I wonder if these network references should actually be renamed to networking instead to align with the underlying lib dir [1].

[1] https://github.com/openstack-k8s-operators/architecture/tree/main/lib/networking

I believe all network dirs for VA1 have now been renamed to networking. Please let me know if anyone sees any spots that I missed.

Comment on lines +20 to +22
oc -n metallb-system wait pod
-l app=metallb -l component=speaker
--for condition=Ready
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This doesn't really check whether the resources created in this stage are in fact ready. Unfortunately, those resources (NetConfig, Net-Attach-Def, L2Advertisement and IPAddressPool) do not have a status that can be examined. But our CI validation requires that we have at least one wait_conditions listed here, so I've opted to use MetalLB speaker pods as the target here...but it's really just a placeholder.

@ciecierski
Copy link
Contributor

@ciecierski @arxcruz Has anyone had a chance to test this downstream at all?

Hi @abays,
I was able to test it downstream with manual run and @arxcruz too in the job. Everything worked as expected for decoupling. There is only one problem with web hook observed when change is run from job. Patches are applied, but web hook is complaiing with error.
`TASK [kustomize_deploy : Apply generated content for examples/va/hci/control-plane _raw_params=oc apply -f {{ _cr }}] *** Sunday 02 February 2025 20:57:09 +0200 (0:00:00.076) 0:05:18.397 ******* fatal: [localhost]: FAILED! => changed=true cmd:

* oc

* apply

* -f

* /home/zuul/architecture/examples/va/hci/control-plane/networking.yaml
  delta: '0:00:00.641656'
  end: '2025-02-02 20:57:10.176771'
  msg: non-zero return code
  rc: 1
  start: '2025-02-02 20:57:09.535115'
  stderr: 'Error from server (InternalError): error when creating "/home/zuul/architecture/examples/va/hci/control-plane/networking.yaml": Internal error occurred: failed calling webhook "mnetconfig.kb.io": failed to call webhook: Post "https://infra-operator-webhook-service.openstack-operators.svc:443/mutate-network-openstack-org-v1beta1-netconfig?timeout=10s": no endpoints available for service "infra-operator-webhook-service"'
  stderr_lines: 
  stdout: |-
  networkattachmentdefinition.k8s.cni.cncf.io/ctlplane created
  networkattachmentdefinition.k8s.cni.cncf.io/datacentre created
  networkattachmentdefinition.k8s.cni.cncf.io/internalapi created
  networkattachmentdefinition.k8s.cni.cncf.io/storage created
  networkattachmentdefinition.k8s.cni.cncf.io/tenant created
  ipaddresspool.metallb.io/ctlplane created
  ipaddresspool.metallb.io/internalapi created
  ipaddresspool.metallb.io/storage created
  ipaddresspool.metallb.io/tenant created
  l2advertisement.metallb.io/ctlplane created
  l2advertisement.metallb.io/internalapi created
  l2advertisement.metallb.io/storage created
  l2advertisement.metallb.io/tenant created
  stdout_lines: `

This likely another instance of this issue [1]. I do not think it is related to these changes, but rather due to the new install paradigm for OpenStack operator.

[1] https://issues.redhat.com/browse/OSPCIX-633

So I'm now facing this error on my manual re-run of uni gamma adoption procedure. Everything is created except netconfig. Same output as in the error. Do you know what I need to change to fulfill requirements of new install paradigm for OpenStack operators?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants