Profile-Guided Optimization: A Hands-On Guide to Reducing Computational Wastage

Over the past few months, I’ve had numerous discussions with practitioners and colleagues on the benefits of Profile-Guided Optimization (PGO). While it’s a topic that generates significant interest, many find it challenging to get started or simply lack the time to explore it fully. As a Product Manager in the continuous profiling domain, my curiosity drove me to delve deeper into this subject. After studying various academic papers and articles, I decided to implement PGO myself, benchmarking its impact to assess its true value.

My primary goal was to understand the challenges hindering PGO adoption and to identify key questions that could reveal customers’ real pain points. Additionally, I aimed to explore the business value for end users. Specifically, I wanted to quantify how PGO impacts critical business KPIs such as conversion rates, latency, and even SLOs and SLAs.

This blog summarizes my initial findings.

Key Takeaways:

  • Efficient software is both cheaper and greener 🌿.

  • PGO can boost your code’s efficiency by up to 14% for free—without requiring any code changes.

  • A practical guide to implementing PGO is presented, including insights on measuring compute and end-user performance gains using inlining output, binary size, go-wrk, and flamegraphs.

  • In the example Go code provided, my analysis revealed a notable performance gain of ~ 6.92% in compute efficiency—an impressive result considering it’s based on a small JSON unmarshalling task. The potential savings in a production environment could be even more substantial.

  • Many developers and SREs are missing out on potential cost savings by not leveraging PGO. You can learn from Cloudflare’s experience in reducing costs through PGO.

  • Continuous profiling in production is essential to unlock the benefits of PGO fully.

PGO: Your Code’s Personal Coach for Peak Performance

Imagine you’re an athlete training for the Olympics. A generic training regimen might get you started, but it won’t fully optimize your performance. This is why most athletes hire a personal coach. A personal coach observes your training, identifies areas for improvement, and adjusts your regimen accordingly. This personalized approach leads to better results in a shorter time.

PGO is like that coach for your code. It analyzes your code’s performance, identifies frequently executed code paths, and feeds that data to the compiler to refine its optimization decisions. By informing the compiler of your code’s hotspots, PGO helps your code achieve peak performance, just as a coach helps an athlete win an Olympic gold medal.

To connect the dots: the way a coach observes an athlete’s training regimen and physiology is what profiling does in this context. Just as a coach scouts their athletes, continuous profiling analyzes your code’s behavior on an ongoing basis. And your gold medal, as a developer or SRE, could be a promotion, a reduced carbon footprint, or lower cloud spend.

PGO is an advanced optimization technique that leverages profiling data to guide the compiler in making more informed optimization decisions. By using PGO, the compiler can optimize hot code paths, potentially leading to significant performance improvements.

Further, PGO can improve your code’s resource utilization efficiency by up to 14%. For most medium- to large-sized organizations, that is a lot of cost savings without making any code changes: free money on the table!

As of Go 1.22, benchmarks for a representative set of Go programs show that building with PGO improves performance by around 2-14%. We expect performance gains to generally increase over time as additional optimizations take advantage of PGO in future versions of Go. (source: PGO Overview, Google)

Cloudflare recently shared a blog post detailing how they significantly reduced cloud spending by implementing PGO.

This indicates that following the release, we’re using ~97 cores fewer than before the release, a ~3.5% reduction. (source: Colin Douch, Cloudflare)

Comparing the before and after flamegraphs of PGO. See details in the results section below.

Compilers and the One-Size-Fits-All Problem

As described in the athlete analogy above (inspired by the recent 2024 Paris Olympics), one of the compiler’s primary tasks is to make optimal decisions about your code during compilation. Compilers come equipped with heuristics that guide various optimization techniques—such as dead code elimination, register allocation, constant folding, and function inlining. These strategies are designed to streamline your code and enhance its efficiency.

However, the heuristics that guide the compiler’s decisions have the same problem inherent in most one-size-fits-all designs. There are limits to what a compiler can do on its own. For example, while inlining functions can reduce the overhead of function calls, the compiler can’t inline everything. Inlining too much can lead to bloated binaries, increased cache usage, and ultimately, performance degradation.

This is where PGO comes in. By providing the compiler with profiling data—information about how your code actually runs in real-world scenarios—the compiler can make more informed decisions. It can identify the “hot” functions that are frequently called and optimize them more aggressively, such as by inlining these critical functions or better allocating registers. This targeted optimization helps reduce the overhead of function calls and can lead to significant performance gains. PGO allows the compiler to go beyond generic optimizations, tailoring the final binary to the specific needs of your program based on actual usage.

How does PGO work?

Several other languages support PGO; see the Languages that support PGO section below. In this example, let’s consider a simple Go application that unmarshals JSON data containing personal bios. The application reads data from a file, processes it, and then serves it over HTTP. Here’s the code: https://github.com/iogbole/go-pgo

package main

import (
	"encoding/json"
	"fmt"
	"log"
	"net/http"
	_ "net/http/pprof" // Import pprof for profiling
	"os"
)

type BioWrapper struct {
	Bio Bio `json:"bio"`
}

//truncated 

// extractNames is a leaf function that processes course names.
func extractNames(data []string) []string {
	names := make([]string, 0, len(data))
	for _, name := range data {
		names = append(names, name)
	}
	return names
}

func main() {
	// Set up the HTTP server with pprof enabled
	http.HandleFunc("/", processor)
	log.Println("Starting server on :8080")
	log.Fatal(http.ListenAndServe(":8080", nil))
}

// processor handles incoming HTTP requests, processes the JSON, and returns the JSON data.
func processor(w http.ResponseWriter, r *http.Request) {
	// Read the JSON file
	file, err := os.ReadFile("./bio.json")
	if err != nil {
		http.Error(w, "Error reading bio.json", http.StatusInternalServerError)
		return
	}

	// Unmarshal the JSON into the BioWrapper struct
	var bioWrapper BioWrapper
	err = json.Unmarshal(file, &bioWrapper)
	if err != nil {
		http.Error(w, "Error decoding JSON", http.StatusBadRequest)
		return
	}

	// Access the Bio data
	bio := bioWrapper.Bio

	// Process the Bio data (e.g., print to console)
	fmt.Println("Name:", bio.PersonalInfo.Name)
	fmt.Println("University:", bio.Education.University.SchoolName)
	fmt.Println("Current Job:", bio.WorkExperience.Job2.Role)

	// Use the leaf function to extract course names
	courseNames := extractNames(bio.Education.University.Courses)
	fmt.Println("Courses:", courseNames)

	// Set the response header to indicate JSON content
	w.Header().Set("Content-Type", "application/json")

	// Using json.MarshalIndent for pretty-printing
	responseData, err := json.MarshalIndent(bioWrapper, "", "    ")
	if err != nil {
		http.Error(w, "Error encoding JSON", http.StatusInternalServerError)
		return
	}

	w.Write(responseData)
}
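
The struct definitions above are truncated. As a rough orientation, a minimal sketch of the shape implied by the fields that processor accesses might look like the following; the field and type names are inferred from usage, the JSON tags are assumptions, and the real definitions live in the repository.

// Sketch only: types inferred from the fields processor accesses.
// JSON tags are guesses; see the repository for the actual definitions.
type Bio struct {
	PersonalInfo   PersonalInfo   `json:"personal_info"`
	Education      Education      `json:"education"`
	WorkExperience WorkExperience `json:"work_experience"`
}

type PersonalInfo struct {
	Name string `json:"name"`
}

type Education struct {
	University University `json:"university"`
}

type University struct {
	SchoolName string   `json:"school_name"`
	Courses    []string `json:"courses"`
}

type WorkExperience struct {
	Job2 Job `json:"job2"`
}

type Job struct {
	Role string `json:"role"`
}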

Step 1: Compile and Run Without PGO

First, let’s compile the program without PGO and see how it performs under load.

go build -o pre_pgo -gcflags="-m"  main.go

./pre_pgo
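
The -gcflags="-m" flag asks the compiler to print its optimization decisions (inlining, escape analysis) to stderr. The output is verbose, so assuming a POSIX shell, one way to keep only the inlining decisions is to filter it; the -a flag forces a rebuild so the diagnostics are printed even when the build cache would otherwise skip compilation.

go build -a -o pre_pgo -gcflags="-m" main.go 2>&1 | grep -i "inlin"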

Step 2: Load Testing with go-wrk

To simulate real-world usage, we can use go-wrk, a Go-based benchmarking tool, to send concurrent HTTP requests to the service. Install it from here.

go-wrk -d 20 http://localhost:8080

This command runs the load test for 20 seconds (the -d flag is the duration in seconds), generating HTTP requests to the server. The results will give us a baseline of how the program performs without PGO.
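
Assuming the tsliwowicz/go-wrk flags, -c controls the number of concurrent connections. A heavier, longer run also ensures the load comfortably outlasts the 30-second profile collected in the next step, for example:

go-wrk -c 80 -d 60 http://localhost:8080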

Step 3: Generate Profiling Data

While that’s running, let’s profile the code for 30 seconds.

curl -o default.pgo "http://localhost:8080/debug/pprof/profile?seconds=30"

Note that default.pgo is the pprof CPU profile; with -pgo=auto (the default since Go 1.21), the Go toolchain picks up a file named default.pgo in the main package’s directory at build time. An alternative approach is to use the pprof tool directly.

go tool pprof -proto -output=cpu.prof http://localhost:8080/debug/pprof/profile?seconds=60
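
If you would rather not expose the net/http/pprof endpoints at all, a profile can also be written programmatically with runtime/pprof. This is a minimal sketch, not what the sample app in this post does; the 30-second window and the default.pgo filename simply mirror the curl example above. The PGO docs also note that profiles collected from several instances can be merged with something like go tool pprof -proto a.pprof b.pprof > merged.pgo before the build.

package main

import (
	"log"
	"os"
	"runtime/pprof"
	"time"
)

func main() {
	// Write the CPU profile to default.pgo, the filename that
	// "go build -pgo=auto" looks for in the main package directory.
	f, err := os.Create("default.pgo")
	if err != nil {
		log.Fatal(err)
	}
	defer f.Close()

	if err := pprof.StartCPUProfile(f); err != nil {
		log.Fatal(err)
	}
	defer pprof.StopCPUProfile()

	// Keep the profiler sampling while a representative workload runs.
	time.Sleep(30 * time.Second)
}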

Step 4: Compile and Run post PGO

Now, let’s compile the code with PGO enabled using the profiling data we collected:

go build -o post_pgo -pgo=auto -gcflags="-m"  main.go

./post_pgo
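
To double-check that the profile was actually consumed, recent Go toolchains record the -pgo setting in the binary’s embedded build info, so something like the following should show the path to default.pgo:

go version -m ./post_pgo | grep pgo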

Step 5: Add load and Profile

Similar to steps 2 and 3, generate load and profile again, but this time change the curl output file to post_pgo.pprof.

curl -o post_pgo.pprof  "http://localhost:8080/debug/pprof/profile?seconds=30"

The Results!!

This was my first experience with PGO, and I was eager to see if it was worth the effort. I’ve saved all the raw benchmark files in the GitHub repository for future comparisons.

Let’s begin!

Inlining Optimization Analysis

Inlining is a compiler optimization technique that directly inserts the body of a function into the calling code, eliminating function call overhead and potentially enabling further optimizations.
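Here is a toy illustration of what inlining does (not the compiler’s literal output), sketched under the assumption of a trivial helper function:

package example

// Before inlining: every loop iteration pays the overhead of calling add.
func add(a, b int) int { return a + b }

func sum(xs []int) int {
	total := 0
	for _, x := range xs {
		total = add(total, x)
	}
	return total
}

// After inlining, the compiler effectively rewrites the loop so the call
// disappears and follow-on optimizations (e.g. keeping total in a register)
// become possible.
func sumInlined(xs []int) int {
	total := 0
	for _, x := range xs {
		total += x // body of add inserted at the call site
	}
	return total
}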

Binary Size: Inlining Has a Small Storage Cost

A quick way to check for better inlining is to examine the binary size. While inlining can improve performance, it may increase the binary size due to code duplication.

Let’s check it out:

$ ls -atrl

-rwxr-xr-x 1 israel staff 8162050 Aug 23  18:45 pre_pgo
-rwxr-xr-x 1 israel staff 8244658 Aug 23  21:52 post_pgo

The difference in file sizes between the pre_pgo and post_pgo binaries can be summarized as follows:

File size difference analysis:

| Binary | Size | Difference | Analysis |
|---|---|---|---|
| pre_pgo | 8,162,050 bytes | N/A | Size of the binary before applying PGO. |
| post_pgo | 8,244,658 bytes | 82,608 bytes larger (~1.01%) | The binary size increased slightly after PGO, likely due to additional inlining and optimizations. |

Analysis of the compiler output

For the record, I nearly jumped out of my chair when I realized that the processor function was inlined. It makes all the difference, especially as it sits directly on the hot path.

Let’s dig a bit deeper by comparing the build outputs:

Before PGO

Refer to the repo for the full build output.

Now let’s compare the compiler output for PGO.

After PGO

Refer to the repo for the full build output.

The inlining optimization looks way better! Here’s a table comparing the key differences between the compiler output with and without PGO, based on the provided gcflags="-m" output:

| Metric | Pre PGO | Post PGO | Difference | Analysis |
|---|---|---|---|---|
| Inlining | extractNames, fmt.Println, http.ListenAndServe, etc. | extractNames, processor, http.(*response).Write, etc. | More functions inlined, including processor and HTTP response writes | More extensive inlining with PGO due to better optimization hints |
| Parameter leaks | Multiple parameters escape to the heap | Similar, with some optimizations such as better handling in extractNames | Slight improvement in heap allocation handling | PGO provides better guidance for optimizing heap allocations |
| Moved to heap | Variables like bios moved to heap | Variables like bioWrapper moved to heap | Slight change in what gets moved to heap | Heap allocations are slightly optimized with PGO, but not eliminated |
| PGO devirtualization | Not applicable | PGO devirtualizes interface calls | Devirtualization is applied | PGO enables devirtualization, improving performance by optimizing interface calls |
| Leaking params | data, w, etc. escape to the heap | Similar parameters leak, but with better optimizations such as extractNames | Slight reduction in the number of heap escapes | PGO guides more efficient memory management, but some leaks persist |
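
PGO devirtualization deserves a quick illustration. When the profile shows that an interface call almost always lands on one concrete type, the compiler can insert a type test plus a direct, inlinable call for that common case. The sketch below is conceptual (it is not the compiler’s actual transformation, and the *bytes.Buffer choice is an assumption for illustration):

package example

import (
	"bytes"
	"io"
)

// Indirect interface call: the compiler cannot inline w.Write as written.
func writeAll(w io.Writer, p []byte) (int, error) {
	return w.Write(p)
}

// Roughly what PGO-guided devirtualization turns the hot call into when the
// profile shows w is almost always a *bytes.Buffer.
func writeAllDevirtualized(w io.Writer, p []byte) (int, error) {
	if b, ok := w.(*bytes.Buffer); ok {
		return b.Write(p) // direct call on the common concrete type; inlinable
	}
	return w.Write(p) // rare case: still an indirect interface call
}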

Flamegraph analysis

Using https://www.speedscope.app, I uploaded the before and after pprof files for quick analysis. Since Speedscope lacks a differential flamegraph feature, I compared both flamegraphs side-by-side.
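
As an aside, pprof itself can diff two profiles. Assuming the file names from the steps above, the following opens the pprof web UI (which includes a flame graph view) showing the delta between the two runs:

go tool pprof -http=:8081 -diff_base=default.pgo post_pgo.pprof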

But first, let’s identify the top functions and verify the weight of the callers and callees.

The sandwich view, showing callers and callees.

The sandwich view shows that we need to focus on the main.processor function, which is somewhat expected. Let’s verify this hypothesis with a flamegraph.

Side-by-side comparison: pre-PGO (left) vs post-PGO (right) flamegraphs.

The table below summarizes the analysis of the before and after PGO flamegraphs:

| Metric | Pre PGO (left) | Post PGO (right) | Time Gained/Lost | Percentage Improvement |
|---|---|---|---|---|
| main.processor total time | 48.25 seconds | 44.91 seconds | 3.34 seconds gained | 6.92% improvement |
| main.processor self time | 10.00 ms | 0.00 ns | 10.00 ms gained | 100% improvement |
| HTTP handler functions total time | Higher | Lower | Not quantified | Slight improvement (qualitative) |
| Syscall execution time | Higher | Lower | Not quantified | Slight improvement (qualitative) |

Looking a bit further, the 6.92% improvement is most likely driven by the inlining of the main.processor function.

End-user/business impact analysis

The flamegraph analysis revealed a 6.92% improvement in overall code efficiency. While this is a substantial gain, especially considering cloud costs, it’s crucial to understand how it translates into real-world benefits for end users. To quantify the impact, I turned to the go-wrk benchmarking data, located here. By analyzing these metrics, I can assess the practical performance improvements that users will experience.

The raw pre-PGO (baseline) and post-PGO go-wrk outputs are saved in the repository; the tables below summarize them.

Let’s delve into the details and see how these optimizations could translate to business KPIs.

| Metric | Pre-PGO | Post-PGO | Difference | Percentage Gain/Loss |
|---|---|---|---|---|
| Total Requests | 1,557,674 | 1,626,603 | +68,929 requests | +4.43% |
| Total Data Transferred | 1.47 GB | 1.54 GB | +0.07 GB | +4.76% |
| Requests/sec | 26,100.68 | 27,257.19 | +1,156.51 requests/sec | +4.43% |
| Transfer/sec | 25.29 MB | 26.41 MB | +1.12 MB/sec | +4.43% |
| Overall Requests/sec | 25,959.83 | 27,108.41 | +1,148.58 requests/sec | +4.42% |
| Overall Transfer/sec | 25.15 MB | 26.27 MB | +1.12 MB/sec | +4.45% |
| Fastest Request | 65 µs | 66 µs | +1 µs | -1.54% |
| Average Request Time | 382 µs | 366 µs | -16 µs | +4.19% |
| Slowest Request | 56.609 ms | 57.569 ms | +0.960 ms | -1.70% |
| Standard Deviation | 651 µs | 655 µs | +4 µs | -0.61% |

Percentile Response Times

| Percentile | Pre-PGO | Post-PGO | Difference | Percentage Gain/Loss |
|---|---|---|---|---|
| 10% | 82 µs | 80 µs | -2 µs | +2.44% |
| 50% | 90 µs | 88 µs | -2 µs | +2.22% |
| 75% | 94 µs | 90 µs | -4 µs | +4.26% |
| 99% | 96 µs | 92 µs | -4 µs | +4.17% |
| 99.9% | 97 µs | 92 µs | -5 µs | +5.15% |
| 99.9999% | 97 µs | 92 µs | -5 µs | +5.15% |
| 99.99999% | 97 µs | 92 µs | -5 µs | +5.15% |

In summary:

  • Requests/sec and Total Data Transferred saw a +4.43% and +4.76% improvement, respectively.
  • Average Request Time decreased by 16 µs, showing a +4.19% improvement.
  • Slowest Request and Fastest Request showed minor changes with slight percentage losses.
  • Percentile Response Times improved across the board, with the most significant gains in the higher percentiles, up to +5.15%.

These results highlight the performance improvements achieved with PGO, reflected in better throughput and reduced latency.

What’s not to like?

Languages that support PGO

PGO is supported by many popular programming languages and compilers. Key examples include:

  1. C/C++: Supported by GCC, Clang/LLVM, and MSVC with options like -fprofile-generate, -fprofile-use, and /LTCG:PGOptimize.
  2. Go: Available since Go 1.20 (and enabled by default via -pgo=auto since Go 1.21), using pprof CPU profiles.
  3. Rust: Supported via rustc with -Cprofile-generate and -Cprofile-use.
  4. Java: Optimized by Oracle HotSpot JVM and GraalVM through JIT compilation.
  5. .NET: The .NET Runtime uses PGO to optimize JIT compilation.
  6. Swift: Leverages LLVM’s PGO infrastructure.
  7. Python: CPython can be compiled with PGO using ./configure --enable-optimizations.
  8. Fortran: Supported in GFortran as part of GCC with the same options as C/C++. Does anyone still write Fortran code?

Conclusion

This blog post explored Profile-Guided Optimization (PGO) and its potential benefits for improving code performance.

We used a simple Go application as an example, demonstrating the steps involved in applying PGO:

  1. Compile and Run Without PGO: Establish a baseline performance metric.
  2. Load Testing: Simulate real-world usage with tools like go-wrk.
  3. Generate Profiling Data: Capture runtime behavior using pprof.
  4. Compile and Run with PGO: Leverage profiling data for optimizations.
  5. Analyze Results: Compare performance improvements and assess optimizations through techniques like flamegraph analysis.

The example showcased potential performance gains through PGO, including reduced execution time and optimized memory management. It’s important to note that specific results may vary depending on the application and the profiling data.

Why aren’t more developers leveraging the power of PGO? Please share your thoughts in the comments.

References