Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix container logs #68

Merged

Conversation

Akrog
Copy link
Contributor

@Akrog Akrog commented Jun 25, 2024

This PR improves the log gathering in the following way:

  • Make services and operator namespaces configurable instead of statically set to openstack and openstack-operators
  • Go back to retrieving the logs via the OpenShift API
  • Only get rotated logs in SOS gathering
  • Fix gathering not waiting for background tasks!!
  • Create symlinks to rotated logs

Akrog added 5 commits June 25, 2024 20:46
In this patch we replace the static namespaces `openstack` and
`openstack-operators` with environmental variables `OSP_NS` and
`OSP_OPERATORS_NS` that have those values as default.

This is necessary because deployments could use a different namespace
and then we wouldn't gather the right information.
For the time being go back to getting the pod logs using the OpenShift
API.

This is a partial revert of commit
4baaa87.

This resolves a number of existing limitations:

- Not following the namespaces/<NS>/pods/<podname>/logs/ structure
- Not having logs directly accessible on the CI jobs web server
- Not having logs if SOS report gathering fails
- Not having operator logs if the OSP services have not been deployed
  yet.

But it introduces duplication of gathered logs and it's less efficient.
To remove the duplicated logs, using `oc logs` and via sos reports, this
patch changes how sos reports gather logs and limits them to only
rotated logs.

We are forced to ignore tar errors because a file can be removed while
processing it due to the log rotation mechanism.

The code only adds the logs if there are any rotated logs available.
Our current `gather_run` script doesn't wait until all background tasks
have completed when the database is not being gathered.

This is caused by the `gather_db` script using `exit 0` instead of
`return` when sourced.

This patch makes that small change and now the code will wait for all
background operations as intended.

This can be problematic in some cases because we can find the
must-gather tar file missing some of the data that we want collected.
We currently have 2 locations that have container logs:

- `must-gather/namespaces/*/pods/*/logs/`
- `must-gather/sos-reports/_all_nodes/*/var/log/pods/*/...`

This means that we could inadvertently miss some of the files if we
don't know of the two locations.

To alleviate this situation this patch builds symbolic links to the
rotated log files in the `namespaces` logs location.

Because there can be 2 log files with the same name on 2 different
nodes, the code renames them.
@openshift-ci openshift-ci bot requested review from fmount and viroel June 25, 2024 18:49
Copy link
Contributor

@fmount fmount left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

/lgtm

@@ -2,10 +2,19 @@

source "${DIR_NAME}/bg.sh"

export OSP_NS="${OSP_NS-openstack}"
export OSP_OPERATORS_NS="${OSP_OPERATORS_NS-openstack-operators}"
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1 this is a good idea. Originally DEFAULT_NAMESPACES was a collection of default values that might or might not be present in the cluster, and this array was supposed to be grown by passing ADDITIONAL_NAMESPACES, but this improves that logic, and I like it.

# describe pod
run_bg oc -n "$NS" describe pod "$pod" '>' "${pods_dir}/${pod}-describe"
# We make a single request to get lines in the form <pod> <container> <crash_status>
data=$(oc -n "$NS" get pod -o go-template='{{range $indexp,$pod := .items}}{{range $index,$element := $pod.status.containerStatuses}}{{printf "%s %s" $pod.metadata.name $element.name}} {{ if ne $element.lastState.terminated nil }}{{ printf "%s" $element.lastState.terminated }}{{ end }}{{ printf "\n"}}{{end}}{{end}}')
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1 here, I re-tested locally (just because CI is weird, it returns "forbidden" and we can't actually inspect the logs :/) and works as expected.

@@ -25,7 +25,7 @@ if [ "${OPENSTACK_DATABASES-unset}" = "unset" ] || [[ -z "${OPENSTACK_DATABASES}
# If no databases options are passed, skip the database dump
echo "Skip Database dump: an empty list is provided"
[[ $CALLED -eq 1 ]] && exit 0
exit 0
return
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1 thanks, good catch on this

# For visual comfort when it's not a gz file we'll use a different pattern
# [<container_name>.]<filename>.<logfile_extension>.<nodename>
# That way we'll end up with:
# manager.1.log.20240624-185849.sosreport-crc-vlf7c-master-0.gz
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you for this comment, now I understand now better the logic of building the file name.

Copy link

openshift-ci bot commented Jun 26, 2024

[APPROVALNOTIFIER] This PR is APPROVED

Approval requirements bypassed by manually added approval.

This pull-request has been approved by:

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@openshift-merge-bot openshift-merge-bot bot merged commit 3d15255 into openstack-k8s-operators:main Jun 26, 2024
4 checks passed
@gibizer
Copy link
Contributor

gibizer commented Jun 26, 2024

Testing the collection of the rotated logs in openstack-k8s-operators/nova-operator#804

@gibizer
Copy link
Contributor

gibizer commented Jun 27, 2024

Testing the collection of the rotated logs in openstack-k8s-operators/nova-operator#804

cool. The ocp sosreport now only contains the rotated logs, and the namespaces/pods tree contains the non-rotated one.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants