Excessive secret resources generation issue with Starboard scanning #936

Open
gurugautm opened this issue Jan 31, 2022 · 10 comments
Labels
🙅 wontfix This will not be worked on

@gurugautm

What steps did you take and what happened:

Environment : OpenShift v.4.7
Aqua v6.2.x
Aqua Enforcer installed with non-privileged mode
Kube Enforcer with starboard installed

When we run a scan with Starboard, it creates a scan job and a secret. When the scan fails, the secret is not deleted. In the customer environment, the secret is not deleted even when the scan succeeds.
As a result, around 80k secrets have accumulated in the customer environment.

What did you expect to happen:

Temporary secrets should be deleted automatically whether the scan succeeds or fails.

@shadowbreakerr

I'm seeing the same symptoms on EKS: hundreds of secrets created in the starboard namespace.

Environment: EKS (1.20)
Starboard-Operator 0.13.2

@danielpacak
Contributor

danielpacak commented Feb 15, 2022

It would be very helpful to see some logs streamed by the Starboard Operator's pod and minimal reproduction steps on an upstream K8s cluster. We have limited capacity to support managed platforms with custom configurations. In particular, I'd like to see the root cause of the scan jobs failing, which probably prevents us from cleaning up orphaned Secrets properly. I can only assume it's related to some PSP or admission control that prevents scan jobs from running successfully, but we need more details to advise.

It's also very useful to look at events created in the starboard-system namespace with kubectl get events -n starboard-system. (Sometimes pods do not have enough information under ContainerStatuses, but we can figure out from events why certain pods failed.)
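
As a rough illustration of the event check suggested above, here is a minimal Python sketch that keeps only Warning events from the JSON form of that command's output (kubectl get events -n starboard-system -o json). The function name and the sample events are hypothetical; only the standard Event fields type, reason, message and involvedObject.name are assumed.

```python
import json
from typing import List, Tuple


def warning_events(events_json: str) -> List[Tuple[str, str, str]]:
    """Return (object name, reason, message) for each Warning event."""
    items = json.loads(events_json).get("items", [])
    return [
        (
            event.get("involvedObject", {}).get("name", ""),
            event.get("reason", ""),
            event.get("message", ""),
        )
        for event in items
        if event.get("type") == "Warning"
    ]


# Hypothetical sample data; real input would come from
# `kubectl get events -n starboard-system -o json`.
sample = json.dumps({
    "items": [
        {"type": "Normal", "reason": "SuccessfulCreate",
         "involvedObject": {"name": "scan-job-a"}, "message": "Created pod"},
        {"type": "Warning", "reason": "FailedCreate",
         "involvedObject": {"name": "scan-job-b"}, "message": "pod creation was rejected"},
    ]
})

for name, reason, message in warning_events(sample):
    print(f"{name}: {reason}: {message}")
```

Filtering on the event type keeps the output short on a busy namespace; the reason and message fields are usually where an admission or policy rejection shows up.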

@danielpacak danielpacak added the ⏳ additional info required Additional information required to close an issue label Feb 15, 2022
@markussiebert

markussiebert commented Feb 24, 2022

Today this killed the secrets api in one of our clusters ....

kubectl get secrets -n starboard-operator | grep -c Opaque
Error from server (Timeout): the server was unable to return a response in the time allotted, but may still be processing the request (get secrets)
7500

@danielpacak
In our case the root cause was the GitHub API limit while updating ...

When deleting the starboard namespace, I found that well over 20k secrets had been created in 9 days.
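
Since a single request over thousands of secrets timed out above, one possible alternative to deleting the whole namespace is to remove leftover secrets in small batches. The sketch below only builds the kubectl delete secret command lines; it does not run them. The function name, the batch size and the secret names are hypothetical, and the names passed in are assumed to have been confirmed as leftovers first.

```python
from typing import Iterable, Iterator, List


def delete_commands(names: Iterable[str], namespace: str,
                    batch_size: int = 200) -> Iterator[List[str]]:
    """Yield kubectl argv lists, each deleting at most batch_size secrets."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    batch: List[str] = []
    for name in names:
        batch.append(name)
        if len(batch) == batch_size:
            yield ["kubectl", "delete", "secret", "-n", namespace, *batch]
            batch = []
    if batch:  # flush the final, possibly smaller, batch
        yield ["kubectl", "delete", "secret", "-n", namespace, *batch]


# Hypothetical secret names, for illustration only.
for argv in delete_commands(["secret-a", "secret-b", "secret-c"],
                            "starboard-operator", batch_size=2):
    print(" ".join(argv))
```

Each yielded list could be passed to subprocess.run; keeping the batches small limits the work the API server does per request.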

@MPritsch

MPritsch commented Mar 2, 2022

Edit: Apparently I was not actually able to disable Polaris. I just removed our ImageReference, which somehow delayed the error messages. I'm not sure how that works, but it explains why I initially didn't see secrets being created.

In our case the issue was the plugin "polaris", which kept failing. I let it run for a day, during which it produced 203 error logs in the starboard-operator and left 1569 secrets behind. I'm not sure how these numbers correlate; maybe there are 7-8 retries on average? Removing the plugin stopped the errors and stopped leaving secrets behind.
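
The 7-8 estimate follows directly from the two counts above. The snippet below only checks that arithmetic; it says nothing about whether retries are the actual cause.

```python
error_logs = 203         # error log entries in the starboard-operator
leftover_secrets = 1569  # secrets left behind over the same day

ratio = leftover_secrets / error_logs
print(f"{ratio:.2f} leftover secrets per logged error")  # prints 7.73
```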

The secrets which are left behind contain the values worker.password and worker.username
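
A minimal sketch of how such leftovers might be picked out of kubectl get secrets -o json output, assuming that worker.username and worker.password are the keys under each Secret's data field and that the secrets are of type Opaque (as the grep in an earlier comment suggests). The function name and the sample secrets are hypothetical.

```python
import json
from typing import List

WORKER_KEYS = {"worker.username", "worker.password"}


def leftover_candidates(secrets_json: str) -> List[str]:
    """Return names of Opaque secrets whose data holds both worker keys."""
    items = json.loads(secrets_json).get("items", [])
    return [
        secret["metadata"]["name"]
        for secret in items
        if secret.get("type") == "Opaque"
        and WORKER_KEYS <= set(secret.get("data") or {})
    ]


# Hypothetical sample data shaped like `kubectl get secrets -o json` output.
sample = json.dumps({
    "items": [
        {"type": "Opaque", "metadata": {"name": "leftover-1"},
         "data": {"worker.username": "dQ==", "worker.password": "cA=="}},
        {"type": "Opaque", "metadata": {"name": "app-config"},
         "data": {"token": "dA=="}},
        {"type": "kubernetes.io/service-account-token",
         "metadata": {"name": "default-token"}, "data": {}},
    ]
})

print(leftover_candidates(sample))
```

The result is only a list of candidates; a secret that belongs to a scan job still in progress would match as well, so it is worth comparing against the running jobs before deleting anything.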

I'm unable to track the issue down further because the scan jobs die immediately and leave no logs. Here is the log entry from the starboard-operator (reformatted for readability):

{
  "level": "error",
  "ts": 1646135225.9482412,
  "logger": "reconciler.configauditreport",
  "msg": "Scan job container",
  "job": "starboard-operator/scan-configauditreport-797f6d9d6d",
  "container": "polaris",
  "status.reason": "Error",
  "status.message": "",
  "stacktrace": "github.com/aquasecurity/starboard/pkg/operator/controller.(*ConfigAuditReportReconciler).reconcileJobs.func1
	/home/runner/work/starboard/starboard/pkg/operator/controller/configauditreport.go:363
sigs.k8s.io/controller-runtime/pkg/reconcile.Func.Reconcile
	/home/runner/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.11.0/pkg/reconcile/reconcile.go:102
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Reconcile
	/home/runner/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.11.0/pkg/internal/controller/controller.go:114
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).reconcileHandler
	/home/runner/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.11.0/pkg/internal/controller/controller.go:311
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem
	/home/runner/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.11.0/pkg/internal/controller/controller.go:266
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func2.2
	/home/runner/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.11.0/pkg/internal/controller/controller.go:227"
}
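
To see which scanner containers fail most often, entries like the one above can be tallied. This is a sketch that assumes the operator writes one compact JSON object per line (the entry above was reformatted by hand); the function name is hypothetical, and the sample lines are abbreviated copies of that entry with the stacktrace omitted.

```python
import json
from collections import Counter
from typing import Iterable


def count_scan_errors(lines: Iterable[str]) -> Counter:
    """Count "Scan job container" error entries per (logger, container)."""
    counts: Counter = Counter()
    for line in lines:
        line = line.strip()
        if not line:
            continue
        try:
            entry = json.loads(line)
        except json.JSONDecodeError:
            continue  # skip anything that is not a JSON log line
        if not isinstance(entry, dict):
            continue
        if entry.get("level") == "error" and entry.get("msg") == "Scan job container":
            counts[(entry.get("logger", ""), entry.get("container", ""))] += 1
    return counts


# Abbreviated sample lines modelled on the entry above.
sample_lines = [
    '{"level":"error","logger":"reconciler.configauditreport","msg":"Scan job container","container":"polaris"}',
    '{"level":"error","logger":"reconciler.configauditreport","msg":"Scan job container","container":"polaris"}',
    '{"level":"info","logger":"main","msg":"Starting operator"}',
    'not a json line',
]

for (logger, container), n in count_scan_errors(sample_lines).items():
    print(f"{logger} / {container}: {n}")
```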

Versions we have been using:
Environment: AWS EKS 1.21
Starboard-Operator: aquasec/starboard-operator:0.14.1
Starboard-Operator Helm Chart: 0.9.1
Trivy: aquasec/trivy:0.24.0
Trivy Helm Chart: 0.4.11
Polaris: fairwinds/polaris:5.0

@MPritsch

MPritsch commented Mar 3, 2022

I found the parameter to disable Polaris and let the operator run for over an hour. So far there are no error logs regarding Polaris and no secrets left behind. Alternatively, I tried to switch to Conftest instead of Polaris, but received different errors and abandoned the idea.

Here is the parameter to disable configAuditScanner and therefore polaris as well.

operator: {
  configAuditScannerEnabled: false,
}

@danielpacak
Contributor

Thank you for the feedback, @MPritsch. We are actually working on a so-called built-in configuration audit scanner that is going to replace the Polaris and Conftest plugins in the upcoming release. It won't create Kubernetes Job objects or Secrets, and it will be much faster. See #971 for more details.

@cdesaintleger

cdesaintleger commented Mar 3, 2022

Same issue on a cluster with smart jobs (with a private registry).
Example: the job is created and the scan begins, but the job is terminated and deleted before the scan finishes. The secret remains.

@MPritsch

MPritsch commented Mar 4, 2022

We now have a working version with Polaris. The underlying issue was missing IAM permissions. We also needed to use Polaris 4.2 instead of 5.0.

On every startup of the starboard-operator we received a "401 Unauthorized: Not Authorized" error for AWS images from ECR. E.g.:

{
  "level": "error",
  "ts": 1646123793.6945415,
  "logger": "reconciler.vulnerabilityreport",
  "msg": "Scan job container",
  "job": "starboard-operator/scan-vulnerabilityreport-6cd9546b84",
  "container": "fluent-bit",
  "status.reason": "Error",
  "status.message": "2022-03-01T08:36:33.091Z\t\u001b[31mFATAL\u001b[0m\tscanner initialize error: unable to initialize the docker scanner: 3 errors occurred:
	* unable to inspect the image (906394416424.dkr.ecr.eu-central-1.amazonaws.com/aws-for-fluent-bit:2.21.5): Cannot connect to the Docker daemon at unix:///var/run/docker.sock. Is the docker daemon running?
	* unable to initialize Podman client: no podman socket found: stat podman/podman.sock: no such file or directory
	* GET https://906394416424.dkr.ecr.eu-central-1.amazonaws.com/v2/aws-for-fluent-bit/manifests/2.21.5: unexpected status code 401 Unauthorized: Not Authorized",
  "stacktrace": "github.com/aquasecurity/starboard/pkg/operator/controller.(*VulnerabilityReportReconciler).reconcileJobs.func1
	/home/runner/work/starboard/starboard/pkg/operator/controller/vulnerabilityreport.go:320
sigs.k8s.io/controller-runtime/pkg/reconcile.Func.Reconcile
	/home/runner/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.11.0/pkg/reconcile/reconcile.go:102
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Reconcile
	/home/runner/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.11.0/pkg/internal/controller/controller.go:114
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).reconcileHandler
	/home/runner/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.11.0/pkg/internal/controller/controller.go:311
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem
	/home/runner/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.11.0/pkg/internal/controller/controller.go:266
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func2.2
	/home/runner/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.11.0/pkg/internal/controller/controller.go:227"
}

These were the images that produced the error. The account IDs belong to AWS, not to us:

602401143452.dkr.ecr.eu-central-1.amazonaws.com/amazon/aws-load-balancer-controller:v2.3.1
602401143452.dkr.ecr.eu-central-1.amazonaws.com/amazon-k8s-cni:v1.10.1
602401143452.dkr.ecr.eu-central-1.amazonaws.com/eks/kube-proxy:v1.21.2-eksbuild.2
602401143452.dkr.ecr.eu-central-1.amazonaws.com/eks/coredns:v1.8.4-eksbuild.1
906394416424.dkr.ecr.eu-central-1.amazonaws.com/aws-for-fluent-bit:2.21.5

The solution for these images was to grant Starboard the following permissions (as described in the 'Important' block here: https://docs.aws.amazon.com/AmazonECR/latest/userguide/repository-policy-examples.html):

{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Action": [
                "ecr:BatchCheckLayerAvailability",
                "ecr:GetDownloadUrlForLayer",
                "ecr:GetRepositoryPolicy",
                "ecr:DescribeRepositories",
                "ecr:ListImages",
                "ecr:BatchGetImage"
            ],
            "Resource": [
                "arn:aws:ecr:*:602401143452:repository/*",
                "arn:aws:ecr:*:906394416424:repository/*"
            ],
            "Effect": "Allow"
        },
        {
            "Action": "ecr:GetAuthorizationToken",
            "Resource": "*",
            "Effect": "Allow"
        }
    ]
}
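
The Resource entries in the policy above follow from the registry account IDs in the image list. As a sketch, the Python below derives them from image references; the function name is hypothetical, and the regular expression assumes the usual <account>.dkr.ecr.<region>.amazonaws.com registry host format seen in that list.

```python
import re
from typing import Iterable, List

# Matches the registry host of an ECR image reference and captures the account ID.
ECR_HOST = re.compile(r"^(\d{12})\.dkr\.ecr\.[a-z0-9-]+\.amazonaws\.com/")


def ecr_resource_arns(images: Iterable[str]) -> List[str]:
    """Return one repository ARN pattern per distinct ECR account ID."""
    accounts = set()
    for image in images:
        match = ECR_HOST.match(image)
        if match:  # non-ECR images are ignored
            accounts.add(match.group(1))
    return [f"arn:aws:ecr:*:{account}:repository/*" for account in sorted(accounts)]


images = [
    "602401143452.dkr.ecr.eu-central-1.amazonaws.com/amazon/aws-load-balancer-controller:v2.3.1",
    "602401143452.dkr.ecr.eu-central-1.amazonaws.com/amazon-k8s-cni:v1.10.1",
    "602401143452.dkr.ecr.eu-central-1.amazonaws.com/eks/kube-proxy:v1.21.2-eksbuild.2",
    "602401143452.dkr.ecr.eu-central-1.amazonaws.com/eks/coredns:v1.8.4-eksbuild.1",
    "906394416424.dkr.ecr.eu-central-1.amazonaws.com/aws-for-fluent-bit:2.21.5",
    "aquasec/trivy:0.24.0",  # not an ECR image, so it contributes nothing
]

for arn in ecr_resource_arns(images):
    print(arn)
```

For the images listed in this comment, this yields the same two ARN patterns used in the policy.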

This probably would have been much easier to debug with proper error messages from polaris...

@danielpacak danielpacak added 🙅 wontfix This will not be worked on and removed ⏳ additional info required Additional information required to close an issue labels Mar 9, 2022
@danielpacak
Contributor

We've marked this issue as won't fix because we merged #971, which performs configuration audits without creating Kubernetes Jobs and Secrets. We call it a built-in configuration audit scanner, and it will be enabled by default in the upcoming v0.15.0 release. Polaris and Conftest will be deprecated at some point.

We'll keep this issue open until v0.15.0 is released.

@MPritsch

Just a quick update. While we were able to fix the secret creation and errors on one cluster, another one keeps creating secrets. Not sure if this is a permission problem again, although we don't see any errors regarding them. We will disable Polaris completely and wait for your v0.15.0 release to replace it.
