Profiling ArgoCD with Pprof
One of the things I love about engineering is the endless puzzles it throws our way. I can easily get carried away by a single unexpected performance degradation of a service, because chasing it down is so much fun!
Recently we experienced one of these cases with ArgoCD, a continuous delivery tool for Kubernetes clusters.
Problem
As part of the ArgoCD adoption we deployed it to a huge Kubernetes cluster, with ~5,000 physical nodes and hundreds of thousands of resources. Once we started using it, we noticed that ArgoCD status updates and reconciliation were noticeably slower than on smaller clusters.
This was visible in the metrics as well; the internal workqueue was always high:
The reconciliation performance heatmap showed consistently high reconciliation times for all of our applications:
Upon a closer look we figured out that the actual resource sync is pretty fast, but the status updates in the UI and in the CRD were taking minutes to complete. Based on the code, status updates are handled by so-called "status processors", like so:
// spin up statusProcessors workers, each continuously draining the app refresh queue
for i := 0; i < statusProcessors; i++ {
	go wait.Until(func() {
		for ctrl.processAppRefreshQueueItem() {
		}
	}, time.Second, ctx.Done())
}
This will become an important detail later.
Fix Attempt #1
As a first attempt, we relied on community knowledge for configuring ArgoCD at larger scale. You can find some great material here and here. Things we tried:
Bumped controller.status.processors and controller.operation.processors to the recommended higher values
Bumped controller/repo-server/server resource requests/limits to the point where nothing was throttled or OOMed anymore
Bumped clientQPS and burstQPS
Bumped repo-server/server replicas
We use one ArgoCD instance per cluster, so we left the sharding capabilities untouched, as they would have no impact on our setup. Unfortunately, nothing from the list above helped, and arguably it made things slightly worse 🤔
Let's look closer
If you get stuck on anything performance-related in Go, pprof is your best friend. We discussed it in detail in our Debug Golang Memory Leaks newsletter issue some time ago - how to run it, how to read flamegraphs, what profiling data is available, etc. Check it out if you need a refresher.
ArgoCD has its pprof endpoint enabled by default, which is convenient (but make sure not to expose it to untrusted networks, or your infosec team might knock on your door).
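ArgoCD already exposes this for us, but if your own Go service doesn't, wiring it up is mostly a blank import of the standard library's net/http/pprof package. Here is a minimal sketch of that stock pattern (this is not ArgoCD's code):

package main

import (
	"log"
	"net/http"
	_ "net/http/pprof" // registers the /debug/pprof/* handlers on http.DefaultServeMux
)

func main() {
	// Serve the profiling endpoints on localhost only, so they are not
	// reachable from untrusted networks.
	go func() {
		log.Println(http.ListenAndServe("localhost:6060", nil))
	}()

	// ... the rest of your service runs here; block forever for this sketch.
	select {}
}

Once something like this is in place, the same go tool pprof and curl commands used throughout this post work against that address.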
First, I wanted to understand why it uses ~4 CPU cores and ~8 GB of memory to operate. Depending on how your ArgoCD is set up, you should be able to do something like this to get to the CPU profiling information.
$ kubectl port-forward pod/argocd-application-controller-0 8082 -n argocd
$ go tool pprof -http=localhost:8080 'localhost:8082/debug/pprof/profile?seconds=5'
This opens the web UI, and navigating to the flamegraph view I found this:
This flamegraph shows that most of the CPU cycles are spent in processAppRefreshQueueItem, which is exactly the functionality we saw lagging. Looking further down the stack, we can see that the CPU time goes into iterating through the hierarchy of resources held in ArgoCD's cache implementation.
Fix Attempt #2
Looking at the code, we can see that the controller builds a cache of all the resources in the cluster and maintains this "internal view" of the cluster in order to accurately compare the live state of resources with the desired state in Git. That comes with a caveat: even though you might have only a handful of applications with a couple of hundred resources in them, ArgoCD will cache every resource that exists in your cluster, even ones it will never use. To confirm this, we can look at the "Resource Objects Count" in the cache:
As you can see, it holds ~370k resources, and ArgoCD spends a lot of time churning through them, which would explain the slow processing, high CPU, and high memory usage.
So what if we reduce the number of resources ArgoCD keeps in cache by removing things we don't need? Then the status update performance and resource utilization should improve.
You can configure resource exclusions like so:
resource.exclusions: |
  - apiGroups:
      - autoscaling.k8s.io
    kinds:
      - '*'
    clusters:
      - '*'
  - apiGroups:
      - discovery.k8s.io
    kinds:
      - EndpointSlice
    clusters:
      - '*'
  ...
After doing the above and excluding resources we don't care about, here are the results:
The number of resources in cache dropped ~3x, to ~100k
Memory usage dropped ~5x
Average CPU usage dropped ~2x
Reconciliation speed improved, but only slightly, maybe by 10% 😢
We were back to square one
Let's dig more
Another useful pprof feature is the ability to take goroutine dumps. You can do so by querying the pprof endpoint like so:
# full goroutine dump
$ curl 'localhost:8082/debug/pprof/goroutine?debug=2'
# "grouped" view (one entry per unique stack, with counts)
$ curl 'localhost:8082/debug/pprof/goroutine?debug=1'
I was looking for processAppRefreshQueueItem to understand where it gets stuck. The goroutine dump looked something like this for that call:
# 0x46dba4 sync.runtime_SemacquireRWMutexR+0x24 /usr/local/go/src/runtime/sema.go:82
# 0x1eec619 sync.(*RWMutex).RLock+0x99 /usr/local/go/src/sync/rwmutex.go:71
# 0x1eec5f4 github.com/argoproj/gitops-engine/pkg/cache.(*clusterCache).IterateHierarchy+0x74 /go/pkg/mod/github.com/argoproj/gitops-engine@v0.7.1-0.20240122213038-792124280fcc/pkg/cache/cluster.go:972
# 0x2259aa4 github.com/argoproj/argo-cd/v2/controller/cache.(*liveStateCache).IterateHierarchy+0xc4 /go/src/github.com/argoproj/argo-cd/controller/cache/cache.go:604
# 0x22c43c2 github.com/argoproj/argo-cd/v2/controller.(*ApplicationController).getResourceTree+0x782 /go/src/github.com/argoproj/argo-cd/controller/appcontroller.go:535
# 0x22c3675 github.com/argoproj/argo-cd/v2/controller.(*ApplicationController).setAppManagedResources+0x95 /go/src/github.com/argoproj/argo-cd/controller/appcontroller.go:455
# 0x22d04b6 github.com/argoproj/argo-cd/v2/controller.(*ApplicationController).processAppRefreshQueueItem+0x14b6 /go/src/github.com/argoproj/argo-cd/controller/appcontroller.go:1596
# 0x22c83c4 github.com/argoproj/argo-cd/v2/controller.(*ApplicationController).Run.func3+0x24 /go/src/github.com/argoproj/argo-cd/controller/appcontroller.go:819
# 0x12fda52 k8s.io/apimachinery/pkg/util/wait.BackoffUntil.func1+0x32 /go/pkg/mod/k8s.io/apimachinery@v0.26.11/pkg/util/wait/wait.go:157
# 0x12fd90e k8s.io/apimachinery/pkg/util/wait.BackoffUntil+0xae /go/pkg/mod/k8s.io/apimachinery@v0.26.11/pkg/util/wait/wait.go:158
# 0x12fd7fe k8s.io/apimachinery/pkg/util/wait.JitterUntil+0x7e /go/pkg/mod/k8s.io/apimachinery@v0.26.11/pkg/util/wait/wait.go:135
# 0x12fd73d k8s.io/apimachinery/pkg/util/wait.Until+0x1d /go/pkg/mod/k8s.io/apimachinery@v0.26.11/pkg/util/wait/wait.go:92
As you can see, this goroutine is stuck waiting on sync.(*RWMutex).RLock at the top of the stack.
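That is one goroutine; the interesting question is how many of the 50 status processors are in the same state. You can eyeball that from the grouped debug=1 output, or count the full-dump stacks by the first gitops-engine cache frame they contain with a small throwaway script along these lines (a hypothetical helper, not part of ArgoCD or pprof):

package main

import (
	"bufio"
	"fmt"
	"net/http"
	"strings"
)

func main() {
	// Fetch the full goroutine dump from the port-forwarded pprof endpoint.
	resp, err := http.Get("http://localhost:8082/debug/pprof/goroutine?debug=2")
	if err != nil {
		panic(err)
	}
	defer resp.Body.Close()

	// Count goroutines by the first gitops-engine cache frame in their stack.
	counts := map[string]int{}
	counted := false
	scanner := bufio.NewScanner(resp.Body)
	for scanner.Scan() {
		line := scanner.Text()
		if strings.HasPrefix(line, "goroutine ") {
			counted = false // a new goroutine stack starts here
			continue
		}
		if counted || !strings.Contains(line, "gitops-engine/pkg/cache.") {
			continue
		}
		// Trim the module path prefix and the call arguments, keeping
		// something like "cache.(*clusterCache).IterateHierarchy".
		fn := line[strings.Index(line, "cache."):]
		if i := strings.LastIndex(fn, "("); i > 0 {
			fn = fn[:i]
		}
		counts[fn]++
		counted = true
	}
	for fn, n := range counts {
		fmt.Printf("%4d  %s\n", n, fn)
	}
}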
Looking at the grouped view, I had a 💡 moment: 49 out of 50 status processors were waiting on the lock, so this was a lock contention problem:
24 goroutines blocked on github.com/argoproj/gitops-engine/pkg/cache.(*clusterCache).IterateHierarchy
14 goroutines blocked on github.com/argoproj/gitops-engine/pkg/cache.(*clusterCache).GetAPIResources
6 goroutines blocked on github.com/argoproj/gitops-engine/pkg/cache.(*clusterCache).GetManagedLiveObjs
5 goroutines blocked on github.com/argoproj/gitops-engine/pkg/cache.(*clusterCache).FindResources
1 active on github.com/argoproj/gitops-engine/pkg/cache.(*Resource).iterateChildren
Unfortunately, in Go there is no easy way to tell who is holding a lock, but going through the code it's most likely the watch handler, which updates the cache (and thus takes the write lock) on every Kubernetes resource update. In large clusters, where resource updates occur very frequently, this lock contention could well explain the performance issues.
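To make the failure mode concrete, here is a stripped-down model of that pattern - not the actual gitops-engine code, just a minimal sketch with the same RWMutex shape, where a watch-driven writer takes the write lock for every cluster event while many status-processor readers take the read lock to walk the cache:

package main

import (
	"fmt"
	"sync"
	"time"
)

// clusterCacheModel is a toy stand-in for a resource cache: one RWMutex
// guarding a map shared by a writer (the watch) and many readers (the
// status processors).
type clusterCacheModel struct {
	mu        sync.RWMutex
	resources map[string]string
}

// onWatchEvent models the watch handler: every cluster event takes the
// write lock to update the cache, blocking all readers while it runs.
func (c *clusterCacheModel) onWatchEvent(key, val string) {
	c.mu.Lock()
	defer c.mu.Unlock()
	c.resources[key] = val
	time.Sleep(2 * time.Millisecond) // pretend the update does some work
}

// iterateHierarchy models a status processor: it takes the read lock and
// walks the cached resources, reporting how long it waited for the lock.
func (c *clusterCacheModel) iterateHierarchy() time.Duration {
	start := time.Now()
	c.mu.RLock()
	waited := time.Since(start)
	defer c.mu.RUnlock()
	for range c.resources {
		// walk the resource hierarchy...
	}
	return waited
}

func main() {
	cache := &clusterCacheModel{resources: map[string]string{}}

	// A busy cluster: a constant stream of watch events.
	go func() {
		for i := 0; ; i++ {
			cache.onWatchEvent(fmt.Sprintf("pod-%d", i), "updated")
		}
	}()

	// 50 "status processors" reading concurrently.
	var wg sync.WaitGroup
	for p := 0; p < 50; p++ {
		wg.Add(1)
		go func(p int) {
			defer wg.Done()
			var waited time.Duration
			for i := 0; i < 20; i++ {
				waited += cache.iterateHierarchy()
			}
			fmt.Printf("processor %2d spent %v waiting for RLock\n", p, waited)
		}(p)
	}
	wg.Wait()
}

Note that Go's sync.RWMutex blocks new readers as soon as a writer is waiting for the lock, so with a steady stream of writes the readers end up mostly queued behind the writer - which is exactly the picture the goroutine dump painted.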
Fix Attempt #3
We were running ArgoCD 2.5, and I found two issues related to these locks (1 and 2). The first one looked likely to improve the lock contention, because it removes an unnecessary lock from the GetAPIResources call, which we saw blocking a significant number of goroutines.
So we upgraded to ArgoCD 2.10 and confirmed that GetAPIResources no longer waits for a lock 🚀
The reconciliation speed improved... somewhat 😔. The workqueue visibly shrank on the graphs during periods of low pod churn, but reconciliation still took minutes when a lot of pods were being scheduled.
Taking a fresh goroutine dump:
36 goroutines blocked on github.com/argoproj/gitops-engine/pkg/cache.(*clusterCache).IterateHierarchy
10 goroutines blocked on github.com/argoproj/gitops-engine/pkg/cache.(*clusterCache).GetManagedLiveObjs
3 goroutines blocked on github.com/argoproj/gitops-engine/pkg/cache.(*clusterCache).FindResources
1 active on github.com/argoproj/gitops-engine/pkg/cache.(*Resource).iterateChildren
So the lock contention had just skewed towards the IterateHierarchy call. Ugh!
Next Steps
So after trying all of the above, we were down to two options at this point - a last resort, but ones that should address the performance issues we were seeing:
Significantly bump the reconciliation timeout from 45s to 5 minutes or more. This will reduce the workqueue size, since applications would be re-queued on the timer less often
Filter out resource updates that are not related to a change in the application's desired state. By default, an ArgoCD application is refreshed every time a resource that belongs to it changes, so if we filter out frequently updated but not-so-important fields like /status, we can also reduce the workqueue size and improve speed.
In conclusion, troubleshooting performance issues in complex systems can be challenging, but with the right tools and approach, it can also be a rewarding learning experience. Hope you enjoyed this deep dive! 🕵️