Debug Golang Memory Leaks with Pprof
Managing memory effectively is important for the performance of any application. While Golang's garbage collector typically does an excellent job of managing memory, memory leaks can still occur. A memory leak arises when an application doesn't release memory back to the operating system, even though it's no longer in use. In large applications, these leaks can lead to Out of Memory (OOM) errors and can impact application availability.
A Golang memory leak often happens due to infinite loops, improper use of goroutines, or holding on to references to memory that is no longer needed. In this post we'll discuss how `pprof` can be used for memory profiling and fixing leaks in Go.
Analyze the Signals
High memory usage might indicate a memory leak, but you need to confirm this first, since Linux uses an aggressive page caching mechanism that can lead to high reported memory usage (which can be especially alarming in containerized environments where you have a fairly low resource ceiling).
Here are some key signals that can help with your golang memory leak detection:
- Increasing Memory Usage: Keep track of your application's memory usage over time. If the memory usage continuously rises and never decreases, you might be facing a memory leak.
- Frequent Garbage Collection: If the garbage collector is running more frequently than usual and the system's memory usage remains high, this might be a sign of a memory leak. The Go runtime exposes several garbage collection metrics like `go_memstats_gc_cpu_fraction`, `go_memstats_gc_sys_bytes`, `go_memstats_last_gc_time_seconds`, etc.
- Goroutine Count: An unexpected or uncontrolled increase in the number of goroutines can also indicate a memory leak. Goroutines are cheap but not free, and if not managed properly, they can consume a lot of memory, leading to a leak. You can track this using `go_goroutines` in your application (see the sketch at the end of this section for one way to watch these runtime stats).
- Application restarts: A high number of restarts for an application may indicate that it's running into memory issues. In a Kubernetes environment, for example, if a pod is using more memory than its limit, the kubelet will kill the offending process to reclaim memory, resulting in a pod restart.
$ kubectl top pod POD_NAME
NAME                         CPU(cores)   MEMORY(bytes)
my-app-pod-5d9f7d56c-nfztv   153m         324Mi
$ kubectl describe pod POD_NAME
Events:
Type     Reason     Age                From               Message
----     ------     ----               ----               -------
Normal   Scheduled  50m                default-scheduler  Successfully assigned default/my-app-6d4884d985-9mb2z to minikube
Warning  OOMKilled  15m (x2 over 20m)  kubelet            System OOM encountered, victim process: my-app, pid: 8915
Warning  BackOff    10m (x4 over 15m)  kubelet            Back-off restarting failed container
- OOM Killer: When your system runs out of memory due to excessive consumption, the Linux kernel invokes the OOM Killer to terminate processes and recover memory. These events are logged and can be viewed using the `dmesg` command, which prints out kernel and system logs. If you see OOM Killer logs related to your application, it's a strong indication of a memory issue, potentially a memory leak.
$ dmesg | grep -i OOM
[86907.015264] Out of memory: Killed process 12345 (my-app) total-vm:162492kB, anon-rss:65320kB, file-rss:0kB, shmem-rss:0kB, UID:1000 pgtables:450kB oom_score_adj:0
[86907.341661] oom_reaper: reaped process 12345 (my-app), now anon-rss:0kB, file-rss:0kB, shmem-rss:0kB
By keeping an eye on these signals and using the relevant commands, you can confirm whether or not your application is suffering from a memory leak.
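If you don't already export these metrics through a monitoring system, here is a minimal sketch of watching the underlying runtime stats with only the standard library (the 30-second interval and the log format are arbitrary choices for illustration):

package main

import (
    "log"
    "runtime"
    "time"
)

func main() {
    // Periodically log the runtime stats that back metrics such as
    // go_goroutines and go_memstats_*.
    ticker := time.NewTicker(30 * time.Second)
    defer ticker.Stop()
    for range ticker.C {
        var m runtime.MemStats
        runtime.ReadMemStats(&m)
        log.Printf("goroutines=%d heap_alloc_bytes=%d num_gc=%d gc_cpu_fraction=%.4f",
            runtime.NumGoroutine(), m.HeapAlloc, m.NumGC, m.GCCPUFraction)
    }
}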
Enable Profiling
If you have confirmed that you actually have a leak, the first debugging tool to reach for should be `pprof`, a built-in Go library for profiling Go programs.
The `net/http/pprof` package allows you to serve runtime profiling data over HTTP. To use `pprof`, you need to import `_ "net/http/pprof"` in your main package and start an HTTP server with `http.ListenAndServe`.
package main

import (
    "log"
    "net/http"
    _ "net/http/pprof" // registers the /debug/pprof/* handlers on the default mux
)

func main() {
    // Serve the pprof endpoints in the background.
    go func() {
        log.Println(http.ListenAndServe("localhost:8080", nil))
    }()
    // your application code here
}
You can then use `go tool pprof` to interactively explore the data.
Here's a command to start a pprof session:
$ go tool pprof http://localhost:8080/debug/pprof/heap
The `pprof` tool provides various commands to help analyze the profile:
- `top`: Shows you the functions that allocate the most memory.
- `list`: Lists the source code of a specific function. For example, `list myFunction`.
- `web`: Generates a graph of the memory allocations and the function calls.
Let's Try It!
Consider the following Go HTTP server:
package main

import (
    "fmt"
    "log"
    "net/http"
    _ "net/http/pprof"
    "sync"
)

type UserData struct {
    Data []byte
}

type UserCache struct {
    mu    sync.Mutex
    Cache map[string]*UserData
}

func NewUserCache() *UserCache {
    return &UserCache{
        Cache: make(map[string]*UserData),
    }
}

var userCache = NewUserCache()

func handleRequest(w http.ResponseWriter, r *http.Request) {
    userCache.mu.Lock()
    defer userCache.mu.Unlock()

    userData := &UserData{
        Data: make([]byte, 1000000),
    }

    userID := fmt.Sprintf("%d", len(userCache.Cache))
    userCache.Cache[userID] = userData
    log.Printf("Added data for user %s. Total users: %d\n", userID, len(userCache.Cache))
}

func main() {
    http.HandleFunc("/leaky-endpoint", handleRequest)
    http.ListenAndServe(":8080", nil)
}
In this example, the server stores data for each user in the `UserCache`. On every request to `/leaky-endpoint`, new user data is created and added to the cache. However, there's no code to remove old user data from the cache.
You can simulate the leak by bombarding the server with a large number of requests using a tool like `curl` or `ab`.
$ ab -n 1000 -c 10 http://localhost:8080/leaky-endpoint
...
Finished 1000 requests
Once the requests are completed, you can generate a heap profile by executing the following command in another terminal:
$ go tool pprof -alloc_space http://localhost:8080/debug/pprof/heap
(pprof) top
Showing nodes accounting for 521.18MB, 100% of 521.18MB total
flat flat% sum% cum cum%
521.18MB 100% 100% 521.18MB 100% main.handleRequest
0 0% 100% 521.18MB 100% net/http.(*ServeMux).ServeHTTP
0 0% 100% 521.18MB 100% net/http.(*conn).serve
0 0% 100% 521.18MB 100% net/http.HandlerFunc.ServeHTTP
0 0% 100% 521.18MB 100% net/http.serverHandler.ServeHTTP
(pprof)
As we can see, `handleRequest` is where most of the allocations happen. This can also be confirmed with a visual representation by running:
(pprof) web
Let's look at `handleRequest` more closely to identify where the leak comes from:
(pprof) list handleRequest
Total: 521.18MB
ROUTINE ======================== main.handleRequest in /Users/codereliant/dev/code-snippets/leaky_endpoint.go
521.18MB 521.18MB (flat, cum) 100% of Total
. . 28:func handleRequest(w http.ResponseWriter, r *http.Request) {
. . 29: userCache.mu.Lock()
. . 30: defer userCache.mu.Unlock()
. . 31:
. . 32: userData := &UserData{
521.18MB 521.18MB 33: Data: make([]byte, 1000000),
. . 34: }
. . 35:
. . 36: userID := fmt.Sprintf("%d", len(userCache.Cache))
. . 37: userCache.Cache[userID] = userData
. . 38: log.Printf("Added data for user %s. Total users: %d\n", userID, len(userCache.Cache))
We were able to identify the exact line where the allocations happen, so now we can fix it by, for example, introducing a cache eviction policy, as sketched below.
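Here is a minimal sketch of one possible eviction policy, reusing the `UserData` type and `sync` import from the example above; the FIFO ordering and the `maxCacheSize` cap of 1000 are arbitrary choices for illustration, not something the original server defines:

const maxCacheSize = 1000 // arbitrary cap for this sketch

type UserCache struct {
    mu    sync.Mutex
    Cache map[string]*UserData
    order []string // insertion order, used for FIFO eviction
}

func (c *UserCache) Add(userID string, data *UserData) {
    c.mu.Lock()
    defer c.mu.Unlock()
    if _, ok := c.Cache[userID]; !ok {
        c.order = append(c.order, userID)
    }
    c.Cache[userID] = data
    // Evict the oldest entries once the cap is exceeded.
    for len(c.Cache) > maxCacheSize {
        oldest := c.order[0]
        c.order = c.order[1:]
        delete(c.Cache, oldest)
    }
}

The handler would then call `userCache.Add(userID, userData)` instead of locking the mutex and writing to the map directly; a real implementation might prefer an LRU or TTL-based policy over this simple FIFO.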
Bonus: More Pprof Goodies
In addition to the techniques discussed earlier, `pprof` provides several other features that can further enhance your profiling experience. Let's explore a few of them:
Profiling Goroutines and Threads
You can inspect your application's concurrency behavior using the `goroutine` and `threadcreate` profiles, which show what your goroutines are currently doing and which call sites have created OS threads. To generate these profiles, you can use the following commands:
$ go tool pprof -http=localhost:8081 http://localhost:8080/debug/pprof/goroutine
$ go tool pprof -http=localhost:8081 http://localhost:8080/debug/pprof/threadcreate
These commands launch the `pprof` web interface for analyzing these profiles. You can see how many goroutines are running and how many OS threads have been created, and identify potential issues such as goroutine leaks or excessive thread creation.
Profiling Mutex Contention
Mutex contention can impact the performance of your application. `pprof` provides a profile for analyzing mutex contention. You can generate the mutex profile using the following command:
$ go tool pprof -http=localhost:8081 http://localhost:8080/debug/pprof/mutex
This command opens the `pprof` web interface, where you can explore the mutex profile and identify sections of code where contention occurs. By reducing mutex contention, you can improve the concurrency and overall performance of your application.
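Note that the mutex profile is empty unless you opt in by setting a sampling rate via `runtime.SetMutexProfileFraction`; here is a minimal sketch (the rate of 5 is an arbitrary choice):

import "runtime"

func init() {
    // Report roughly 1 in every 5 mutex contention events;
    // a rate of 0 disables the mutex profile entirely.
    runtime.SetMutexProfileFraction(5)
}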
Heap Diff Profiling
Heap diff profiling allows you to compare two heap profiles to identify differences in memory allocations between two points in time. You can generate a heap profile and save it to a file using the following command:
$ curl -s http://localhost:8080/debug/pprof/heap > base.heap
After making some changes to your code or performing specific operations, you can generate another heap profile and save it to a different file:
$ curl -s http://localhost:8080/debug/pprof/heap > current.heap
To compare the two heap profiles and identify differences, use the following command:
$ go tool pprof -http=localhost:8081 --base base.heap current.heap
This command opens the `pprof` web interface, where you can analyze the differences in memory allocations and identify potential memory leaks or areas of improvement.
OOM errors can be daunting, but with the right tools and techniques, you can systematically trace and address them. Golang's `pprof` provides powerful features for understanding memory utilization, making it easier to track down and fix these issues. Remember, the best way to fix memory leaks is not to have them in the first place :)