Profiling ArgoCD with Pprof

ArgoCD profiling

One of the things I love about engineering is the endless puzzles it throws our way. I can easily get carried away by a single unexpected performance degradation in a service, because it's just so much fun!

Recently we experienced one of these cases with ArgoCD, a continuous delivery tool for Kubernetes clusters.

Problem

As part of our ArgoCD adoption we deployed it to a huge Kubernetes cluster with ~5,000 physical nodes and hundreds of thousands of resources. Once we started using it, we noticed that ArgoCD status updates and reconciliation were noticeably slower than on smaller clusters.

It was visible in the metrics as well; the internal workqueue depth was consistently high:

ArgoCD workqueue

The reconciliation performance heatmap showed consistently high reconciliation times for all of our applications:

ArgoCD reconciliation

Upon closer inspection we figured out that the actual resource sync is pretty fast, but the status updates in the UI and in the CRD were taking minutes to complete. Based on the code, these updates are handled by so-called "status processors", like so:

	// Spawn a pool of status processors: each goroutine keeps draining the
	// app refresh queue, and restarts a second later if it ever stops
	// (until the controller context is cancelled).
	for i := 0; i < statusProcessors; i++ {
		go wait.Until(func() {
			for ctrl.processAppRefreshQueueItem() {
			}
		}, time.Second, ctx.Done())
	}

This will turn out to be an important detail later.
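If you want to see that worker-pool pattern end to end, here is a small self-contained sketch built on client-go's generic workqueue. The names (queue, processItem, the hard-coded app name) are made up for illustration; it is not ArgoCD's actual implementation.

    package main

    import (
        "context"
        "fmt"
        "time"

        "k8s.io/apimachinery/pkg/util/wait"
        "k8s.io/client-go/util/workqueue"
    )

    func main() {
        ctx, cancel := context.WithCancel(context.Background())
        defer cancel()

        // Hypothetical refresh queue holding application names to be refreshed.
        queue := workqueue.New()

        // processItem plays the role of processAppRefreshQueueItem: pop one item,
        // do the (potentially expensive) refresh work, report whether to continue.
        processItem := func() bool {
            item, shutdown := queue.Get()
            if shutdown {
                return false
            }
            defer queue.Done(item)
            fmt.Println("refreshing app:", item)
            return true
        }

        // The same worker-pool shape as the controller: N goroutines that keep
        // draining the queue and restart a second after they stop.
        const statusProcessors = 20
        for i := 0; i < statusProcessors; i++ {
            go wait.Until(func() {
                for processItem() {
                }
            }, time.Second, ctx.Done())
        }

        queue.Add("guestbook")
        time.Sleep(time.Second)
        queue.ShutDown()
    }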

Fix Attempt #1

As a first attempt we relied on community knowledge for configuring ArgoCD at a larger scale. You can find some great material here and here. Things we tried:

  • Bumped controller.status.processors and controller.operation.processors to recommended higher values
  • Bumped controller/repo-server/server resource requests/limits to the point where they were no longer throttled or OOM-killed
  • Bumped clientQPS and burstQPS
  • Bumped repo-server/server replicas

We use one ArgoCD instance per cluster, so we left the sharding capabilities untouched; they would have no impact on our setup. Unfortunately, nothing from the list above helped, and arguably it made things slightly worse 🤔

Let's look closer

If you get stuck on anything performance-related in Go, pprof is your best friend. We discussed it in detail in our Debug Golang Memory Leaks newsletter issue some time ago: how to run it, how to read flamegraphs, what profiling data is available, etc. Check it out if you need a refresher.
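As a quick reminder of how little it takes: exposing pprof in your own Go service is just a blank import plus an HTTP listener. A minimal standalone sketch (not ArgoCD's code):

    package main

    import (
        "log"
        "net/http"
        _ "net/http/pprof" // registers the /debug/pprof/* handlers on the default mux
    )

    func main() {
        // Serve the profiling endpoints on localhost only.
        log.Println(http.ListenAndServe("localhost:6060", nil))
    }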

ArgoCD has the pprof endpoint enabled by default, which is convenient (but make sure not to expose it to untrusted networks, or your infosec team might knock on your door).

First I wanted to look at why it needs ~4 CPU cores and ~8 GB of memory to operate. Depending on how your ArgoCD is set up, you should be able to do something like this to get access to the CPU profile:

$ kubectl port-forward pod/argocd-application-controller-0 8082 -n argocd
$ go tool pprof -http=localhost:8080 'localhost:8082/debug/pprof/profile?seconds=5'

It will open the web UI, and by navigating to the flamegraph view I found this:

ArgoCD profiling

This flamegraph basically shows that most of the CPU cycles are spent in processAppRefreshQueueItem, which is exactly the functionality we see lagging. Looking further down the stack, we can see that the CPU time goes into iterating through the hierarchy of resources held in the ArgoCD cache implementation.

Fix Attempt #2

Looking at the code we can see that the controller builds a cache of all the resources in the cluster and maintains an "internal view" of the cluster, so that it can accurately compare the state of resources in the cluster with the desired state in Git. That comes with a caveat: even if you only have a handful of applications with a couple hundred resources in them, ArgoCD will cache every resource that exists in your cluster, even ones it will never use. To confirm this we can look at the "Resource Objects Count" in the cache:

ArgoCD object count

As you can see, it holds ~370k resources, and ArgoCD spends a lot of time churning through them, which explains the slow processing, high CPU and high memory usage.
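To make that cost concrete, here is a deliberately simplified sketch (not ArgoCD's actual code) of what resolving a resource hierarchy over one big flat cache looks like: every walk has to consider the entire map, including the objects no Application will ever own.

    package main

    import "fmt"

    // resource is a hypothetical, stripped-down stand-in for a cached Kubernetes object.
    type resource struct {
        uid      string
        ownerUID string // UID of the parent object, "" for top-level resources
    }

    // childrenOf scans the whole cache to find direct children of a parent.
    // With ~370k cached objects, walks like this dominate the refresh time.
    func childrenOf(cache map[string]resource, parentUID string) []resource {
        var out []resource
        for _, r := range cache {
            if r.ownerUID == parentUID {
                out = append(out, r)
            }
        }
        return out
    }

    func main() {
        // Tiny toy cache; in our cluster this map held hundreds of thousands of
        // entries, most of them belonging to no Application at all.
        cache := map[string]resource{
            "deploy-1": {uid: "deploy-1"},
            "rs-1":     {uid: "rs-1", ownerUID: "deploy-1"},
            "pod-1":    {uid: "pod-1", ownerUID: "rs-1"},
            "noise-1":  {uid: "noise-1"}, // unrelated object that still lives in the cache
        }
        fmt.Println(childrenOf(cache, "deploy-1"))
    }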

So what if we reduce the number of resources ArgoCD keeps in its cache by removing things we don't need? Then the status update performance and resource utilization should improve.

You can configure resource exclusion like so:

    resource.exclusions: |
      - apiGroups:
          - autoscaling.k8s.io
        kinds:
          - '*'
        clusters:
          - '*'
      - apiGroups:
          - discovery.k8s.io
        kinds:
          - EndpointSlice
        clusters:
          - '*'
...

After doing the above and excluding resources we don't care about, here are the results:

  • The number of resources in the cache dropped 3x, to ~100k
  • Memory usage dropped 5x
  • Average CPU usage dropped 2x
  • Reconciliation speed improved, but only slightly, maybe by 10% 😢

We were back to square one.

Let's dig more

Another useful feature of pprof is the ability to take goroutine dumps. You can do so by querying the pprof endpoint like so:

# full goroutine dump
$ curl 'localhost:8082/debug/pprof/goroutine?debug=2'

# "grouped" view
$ curl 'localhost:8082/debug/pprof/goroutine?debug=2'
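
The same data is also available from inside any Go process via the standard runtime/pprof package; a minimal sketch, where the debug argument matches the query parameter above:

    package main

    import (
        "os"
        "runtime/pprof"
    )

    func main() {
        // debug=1 prints goroutines grouped by identical stacks,
        // debug=2 prints the full, unaggregated dump.
        pprof.Lookup("goroutine").WriteTo(os.Stdout, 1)
        pprof.Lookup("goroutine").WriteTo(os.Stdout, 2)
    }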
