Profile-Guided Optimization: A Hands-On Guide to Reducing Computational Wastage

Over the past few months, I’ve had numerous discussions with practitioners and colleagues on the benefits of Profile-Guided Optimization (PGO). While it’s a topic that generates significant interest, many find it challenging to get started or simply lack the time to explore it fully. As a Product Manager in the continuous profiling domain, my curiosity drove me to delve deeper into this subject. After studying various academic papers and articles, I decided to implement PGO myself, benchmarking its impact to assess its true value.

My primary goal was to understand the challenges hindering PGO adoption and to identify key questions that could reveal customers’ real pain points. Additionally, I aimed to explore the business value for end users. Specifically, I wanted to quantify how PGO impacts critical business KPIs such as conversion rates, latency, and even SLOs and SLAs.

This blog summarizes my initial findings.

Key Takeaways:

  • Efficient software is both cheaper and greener 🌿.

  • PGO can boost your code’s efficiency by up to 14% for free—without requiring any code changes.

  • A practical guide to implementing PGO is presented, including insights on measuring compute and end-user performance gains using inlining output, binary size, go-wrk, and flamegraphs.

  • In the example Go code provided, my analysis revealed a notable performance gain of ~ 6.92% in compute efficiency—an impressive result considering it’s based on a small JSON unmarshalling task. The potential savings in a production environment could be even more substantial.

  • Many developers and SREs are missing out on potential cost savings by not leveraging PGO. You can learn from Cloudflare’s experience in reducing costs through PGO.

  • Continuous profiling in production is essential to unlock the benefits of PGO fully.

PGO: Your Code’s Personal Coach for Peak Performance

Imagine you’re an athlete training for the Olympics. A generic training regimen might get you started, but it won’t fully optimize your performance. This is why most athletes hire a personal coach. A personal coach observes your training, identifies areas for improvement, and adjusts your regimen accordingly. This personalized approach leads to better results in a shorter time.

PGO is like that coach for your code. It analyzes your code’s performance, identifies frequently executed code paths, and feeds that data to the compiler to refine its optimization decisions. By informing the compiler of your code’s hotspots, PGO helps your code achieve peak performance, just as a coach helps an athlete win an Olympic gold medal.

To connect the dots: the way a coach observes an athlete’s training regimen and physiology is what profiling does in this context. Just as a coach scouts their athletes, continuous profiling analyzes your code’s behavior on an ongoing basis. And your gold medal, as a developer or SRE, could be a promotion, a reduced carbon footprint, or lower cloud spend.

PGO is an advanced optimization technique that leverages profiling data to guide the compiler in making more informed optimization decisions. By using PGO, the compiler can optimize hot code paths, potentially leading to significant performance improvements.

Further, PGO can improve your code’s resource utilization efficiency by up to 14%. For most medium- to large-sized organizations, that is a lot of cost savings without making any code changes: free money on the table!

As of Go 1.22, benchmarks for a representative set of Go programs show that building with PGO improves performance by around 2-14%. We expect performance gains to generally increase over time as additional optimizations take advantage of PGO in future versions of Go. (source: PGO Overview, Google)

Cloudflare recently shared a blog post detailing how they significantly reduced cloud spending by implementing PGO.

This indicates that following the release, we’re using ~97 cores fewer than before the release, a ~3.5% reduction. (source: Colin Douch, Cloudflare)

Comparing the before and after flamegraphs of PGO. See details in the results section below.

Compilers and the One-Size-Fits-All Problem

As described in the athlete analogy above (inspired by the recent 2024 Paris Olympics), one of the compiler’s primary tasks is to make optimal decisions about your code during compilation. Compilers come equipped with heuristics that guide various optimization techniques—such as dead code elimination, register allocation, constant folding, and function inlining. These strategies are designed to streamline your code and enhance its efficiency.

However, the heuristics that guide the compiler’s decisions have the same problem inherent in most one-size-fits-all designs. There are limits to what a compiler can do on its own. For example, while inlining functions can reduce the overhead of function calls, the compiler can’t inline everything. Inlining too much can lead to bloated binaries, increased cache usage, and ultimately, performance degradation.

This is where PGO comes in. By providing the compiler with profiling data—information about how your code actually runs in real-world scenarios—the compiler can make more informed decisions. It can identify the “hot” functions that are frequently called and optimize them more aggressively, such as by inlining these critical functions or better allocating registers. This targeted optimization helps reduce the overhead of function calls and can lead to significant performance gains. PGO allows the compiler to go beyond generic optimizations, tailoring the final binary to the specific needs of your program based on actual usage.

How does PGO work?

Several other languages support PGO; see the Languages that support PGO section below. In this example, let’s consider a simple Go application that unmarshals JSON data containing personal bios. The application reads data from a file, processes it, and then serves it over HTTP. Here’s the code: https://github.com/iogbole/go-pgo

package main

import (
	"encoding/json"
	"fmt"
	"log"
	"net/http"
	_ "net/http/pprof" // Import pprof for profiling
	"os"
)

type BioWrapper struct {
	Bio Bio `json:"bio"`
}

//truncated 

// extractNames is a leaf function that processes course names.
func extractNames(data []string) []string {
	names := make([]string, 0, len(data))
	for _, name := range data {
		names = append(names, name)
	}
	return names
}

func main() {
	// Set up the HTTP server with pprof enabled
	http.HandleFunc("/", processor)
	log.Println("Starting server on :8080")
	log.Fatal(http.ListenAndServe(":8080", nil))
}

// processor handles incoming HTTP requests, processes the JSON, and returns the JSON data.
func processor(w http.ResponseWriter, r *http.Request) {
	// Read the JSON file
	file, err := os.ReadFile("./bio.json")
	if err != nil {
		http.Error(w, "Error reading bio.json", http.StatusInternalServerError)
		return
	}

	// Unmarshal the JSON into the BioWrapper struct
	var bioWrapper BioWrapper
	err = json.Unmarshal(file, &bioWrapper)
	if err != nil {
		http.Error(w, "Error decoding JSON", http.StatusBadRequest)
		return
	}

	// Access the Bio data
	bio := bioWrapper.Bio

	// Process the Bio data (e.g., print to console)
	fmt.Println("Name:", bio.PersonalInfo.Name)
	fmt.Println("University:", bio.Education.University.SchoolName)
	fmt.Println("Current Job:", bio.WorkExperience.Job2.Role)

	// Use the leaf function to extract course names
	courseNames := extractNames(bio.Education.University.Courses)
	fmt.Println("Courses:", courseNames)

	// Set the response header to indicate JSON content
	w.Header().Set("Content-Type", "application/json")

	// Using json.MarshalIndent for pretty-printing
	responseData, err := json.MarshalIndent(bioWrapper, "", "    ")
	if err != nil {
		http.Error(w, "Error encoding JSON", http.StatusInternalServerError)
		return
	}

	w.Write(responseData)
}
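
The struct definitions above are truncated. As a rough orientation, a minimal sketch of the shape implied by the fields that processor accesses might look like the following; the field and type names are inferred from usage, the JSON tags are assumptions, and the real definitions live in the repository.

// Sketch only: types inferred from the fields processor accesses.
// JSON tags are guesses; see the repository for the actual definitions.
type Bio struct {
	PersonalInfo   PersonalInfo   `json:"personal_info"`
	Education      Education      `json:"education"`
	WorkExperience WorkExperience `json:"work_experience"`
}

type PersonalInfo struct {
	Name string `json:"name"`
}

type Education struct {
	University University `json:"university"`
}

type University struct {
	SchoolName string   `json:"school_name"`
	Courses    []string `json:"courses"`
}

type WorkExperience struct {
	Job2 Job `json:"job2"`
}

type Job struct {
	Role string `json:"role"`
}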

Step 1: Compile and Run Without PGO

First, let’s compile the program without PGO and see how it performs under load.

go build -o pre_pgo -gcflags="-m"  main.go

./pre_pgo
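
The -gcflags="-m" flag asks the compiler to print its optimization decisions (inlining, escape analysis) to stderr. The output is verbose, so assuming a POSIX shell, one way to keep only the inlining decisions is to filter it; the -a flag forces a rebuild so the diagnostics are printed even when the build cache would otherwise skip compilation.

go build -a -o pre_pgo -gcflags="-m" main.go 2>&1 | grep -i "inlin"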

Step 2: Load Testing with go-wrk

To simulate real-world usage, we can use go-wrk, a Go-based benchmarking tool, to send concurrent HTTP requests to the service. Install it from here.

go-wrk -d 20 http://localhost:8080

This command runs the load test for 20 seconds (the -d flag is the duration in seconds), generating HTTP requests to the server. The results will give us a baseline of how the program performs without PGO.
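
Assuming the tsliwowicz/go-wrk flags, -c controls the number of concurrent connections. A heavier, longer run also ensures the load comfortably outlasts the 30-second profile collected in the next step, for example:

go-wrk -c 80 -d 60 http://localhost:8080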

Step 3: Generate Profiling Data

While that’s running, let’s profile the code for 30 seconds.

curl -o default.pgo "http://localhost:8080/debug/pprof/profile?seconds=30"

Note that default.pgo is the pprof CPU profile; with -pgo=auto (the default since Go 1.21), the Go toolchain picks up a file named default.pgo in the main package’s directory at build time. An alternative approach is to use the pprof tool directly.

go tool pprof -proto -output=cpu.prof http://localhost:8080/debug/pprof/profile?seconds=60
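
If you would rather not expose the net/http/pprof endpoints at all, a profile can also be written programmatically with runtime/pprof. This is a minimal sketch, not what the sample app in this post does; the 30-second window and the default.pgo filename simply mirror the curl example above. The PGO docs also note that profiles collected from several instances can be merged with something like go tool pprof -proto a.pprof b.pprof > merged.pgo before the build.

package main

import (
	"log"
	"os"
	"runtime/pprof"
	"time"
)

func main() {
	// Write the CPU profile to default.pgo, the filename that
	// "go build -pgo=auto" looks for in the main package directory.
	f, err := os.Create("default.pgo")
	if err != nil {
		log.Fatal(err)
	}
	defer f.Close()

	if err := pprof.StartCPUProfile(f); err != nil {
		log.Fatal(err)
	}
	defer pprof.StopCPUProfile()

	// Keep the profiler sampling while a representative workload runs.
	time.Sleep(30 * time.Second)
}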

Step 4: Compile and Run post PGO

Now, let’s compile the code with PGO enabled using the profiling data we collected:

go build -o post_pgo -pgo=auto -gcflags="-m"  main.go

./post_pgo
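
To double-check that the profile was actually consumed, recent Go toolchains record the -pgo setting in the binary’s embedded build info, so something like the following should show the path to default.pgo:

go version -m ./post_pgo | grep pgo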

Step 5: Add load and Profile

Similar to steps 2 and 3, generate load and profile again, but this time change the curl output file to post_pgo.pprof.

curl -o post_pgo.pprof  "http://localhost:8080/debug/pprof/profile?seconds=30"

The Results!!

This was my first experience with PGO, and I was eager to see if it was worth the effort. I’ve saved all the raw benchmark files in the GitHub repository for future comparisons.

Let’s begin!

Inlining Optimization Analysis

Inlining is a compiler optimization technique that directly inserts the body of a function into the calling code, eliminating function call overhead and potentially enabling further optimizations.
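Here is a toy illustration of what inlining does (not the compiler’s literal output), sketched under the assumption of a trivial helper function:

package example

// Before inlining: every loop iteration pays the overhead of calling add.
func add(a, b int) int { return a + b }

func sum(xs []int) int {
	total := 0
	for _, x := range xs {
		total = add(total, x)
	}
	return total
}

// After inlining, the compiler effectively rewrites the loop so the call
// disappears and follow-on optimizations (e.g. keeping total in a register)
// become possible.
func sumInlined(xs []int) int {
	total := 0
	for _, x := range xs {
		total += x // body of add inserted at the call site
	}
	return total
}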

Binary Size: Inlining Has a Small Storage Cost

A quick way to check for better inlining is to examine the binary size. While inlining can improve performance, it may increase the binary size due to code duplication.

Let’s check it out:

$ ls -atrl

-rwxr-xr-x 1 israel staff 8162050 Aug 23  18:45 pre_pgo
-rwxr-xr-x 1 israel staff 8244658 Aug 23  21:52 post_pgo

The difference in file sizes between the pre_pgo and post_pgo binaries can be summarized as follows:

File size difference analysis:

| Binary | Size | Difference | Analysis |
|---|---|---|---|
| pre_pgo | 8,162,050 bytes | N/A | Size of the binary before applying PGO. |
| post_pgo | 8,244,658 bytes | 82,608 bytes larger (~1.01%) | The binary size increased slightly after PGO, likely due to additional inlining and optimizations. |

Analysis of the compiler output

For the record, I nearly jumped out of my chair when I realized that the processor function was inlined. It makes all the difference, especially as it sits directly on the hot path.

Let’s dig a bit deeper by comparing the build outputs:

Before PGO

Refer to the repo for the full build output.

Now let’s compare the compiler output for PGO.

After PGO

Refer to the repo for the full build output.

The inlining optimization looks way better! Here’s a table comparing the key differences between the compiler output with and without PGO, based on the provided gcflags="-m" output:

| Metric | Pre PGO | Post PGO | Difference | Analysis |
|---|---|---|---|---|
| Inlining | extractNames, fmt.Println, http.ListenAndServe, etc. | extractNames, processor, http.(*response).Write, etc. | More functions inlined, including processor and HTTP response writes | More extensive inlining with PGO due to better optimization hints |
| Parameter leaks | Multiple parameters escape to the heap | Similar, with some optimizations such as better handling in extractNames | Slight improvement in heap allocation handling | PGO provides better guidance for optimizing heap allocations |
| Moved to heap | Variables like bios moved to heap | Variables like bioWrapper moved to heap | Slight change in what gets moved to heap | Heap allocations are slightly optimized with PGO, but not eliminated |
| PGO devirtualization | Not applicable | PGO devirtualizes interface calls | Devirtualization is applied | PGO enables devirtualization, improving performance by optimizing interface calls |
| Leaking params | data, w, etc. escape to the heap | Similar parameters leak, but with better optimizations such as extractNames | Slight reduction in the number of heap escapes | PGO guides more efficient memory management, but some leaks persist |
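
PGO devirtualization deserves a quick illustration. When the profile shows that an interface call almost always lands on one concrete type, the compiler can insert a type test plus a direct, inlinable call for that common case. The sketch below is conceptual (it is not the compiler’s actual transformation, and the *bytes.Buffer choice is an assumption for illustration):

package example

import (
	"bytes"
	"io"
)

// Indirect interface call: the compiler cannot inline w.Write as written.
func writeAll(w io.Writer, p []byte) (int, error) {
	return w.Write(p)
}

// Roughly what PGO-guided devirtualization turns the hot call into when the
// profile shows w is almost always a *bytes.Buffer.
func writeAllDevirtualized(w io.Writer, p []byte) (int, error) {
	if b, ok := w.(*bytes.Buffer); ok {
		return b.Write(p) // direct call on the common concrete type; inlinable
	}
	return w.Write(p) // rare case: still an indirect interface call
}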

Flamegraph analysis

Using https://www.speedscope.app, I uploaded the before and after pprof files for quick analysis. Since Speedscope lacks a differential flamegraph feature, I compared both flamegraphs side-by-side.
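
As an aside, pprof itself can diff two profiles. Assuming the file names from the steps above, the following opens the pprof web UI (which includes a flame graph view) showing the delta between the two runs:

go tool pprof -http=:8081 -diff_base=default.pgo post_pgo.pprof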

But first, let’s identify the top functions and verify the weight of the callers and callees.

The sandwich view, showing callers and callees.

The sandwich view shows that we need to focus on the main.processor function, which is somewhat expected. Let’s verify this hypothesis with a flamegraph.

Side-by-side comparison: pre-PGO (left) vs post-PGO (right) flamegraphs.

The table below summarizes the analysis of the before and after PGO flamegraphs:

| Metric | Pre PGO (left) | Post PGO (right) | Time Gained/Lost | Percentage Improvement |
|---|---|---|---|---|
| main.processor total time | 48.25 seconds | 44.91 seconds | 3.34 seconds gained | 6.92% improvement |
| main.processor self time | 10.00 ms | 0.00 ns | 10.00 ms gained | 100% improvement |
| HTTP handler functions total time | Higher | Lower | Not quantified | Slight improvement (qualitative) |
| Syscall execution time | Higher | Lower | Not quantified | Slight improvement (qualitative) |

Looking a bit further, the 6.92% improvement is most likely driven by the inlining of the main.processor function.

End-user/business impact analysis

The flamegraph analysis revealed a 6.92% improvement in overall code efficiency. While this is a substantial gain, especially considering cloud costs, it’s crucial to understand how it translates into real-world benefits for end users. To quantify the impact, I turned to the go-wrk benchmarking data, located here. By analyzing these metrics, I can assess the practical performance improvements that users will experience.

The raw pre-PGO (baseline) and post-PGO go-wrk outputs are saved in the repository; the tables below summarize them.

Let’s delve into the details and see how these optimizations could translate to business KPIs.

| Metric | Pre-PGO | Post-PGO | Difference | Percentage Gain/Loss |
|---|---|---|---|---|
| Total Requests | 1,557,674 | 1,626,603 | +68,929 requests | +4.43% |
| Total Data Transferred | 1.47 GB | 1.54 GB | +0.07 GB | +4.76% |
| Requests/sec | 26,100.68 | 27,257.19 | +1,156.51 requests/sec | +4.43% |
| Transfer/sec | 25.29 MB | 26.41 MB | +1.12 MB/sec | +4.43% |
| Overall Requests/sec | 25,959.83 | 27,108.41 | +1,148.58 requests/sec | +4.42% |
| Overall Transfer/sec | 25.15 MB | 26.27 MB | +1.12 MB/sec | +4.45% |
| Fastest Request | 65 µs | 66 µs | +1 µs | -1.54% |
| Average Request Time | 382 µs | 366 µs | -16 µs | +4.19% |
| Slowest Request | 56.609 ms | 57.569 ms | +0.960 ms | -1.70% |
| Standard Deviation | 651 µs | 655 µs | +4 µs | -0.61% |

Percentile Response Times

| Percentile | Pre-PGO | Post-PGO | Difference | Percentage Gain/Loss |
|---|---|---|---|---|
| 10% | 82 µs | 80 µs | -2 µs | +2.44% |
| 50% | 90 µs | 88 µs | -2 µs | +2.22% |
| 75% | 94 µs | 90 µs | -4 µs | +4.26% |
| 99% | 96 µs | 92 µs | -4 µs | +4.17% |
| 99.9% | 97 µs | 92 µs | -5 µs | +5.15% |
| 99.9999% | 97 µs | 92 µs | -5 µs | +5.15% |
| 99.99999% | 97 µs | 92 µs | -5 µs | +5.15% |

In summary:

  • Requests/sec and Total Data Transferred saw a +4.43% and +4.76% improvement, respectively.
  • Average Request Time decreased by 16 µs, showing a +4.19% improvement.
  • Slowest Request and Fastest Request showed minor changes with slight percentage losses.
  • Percentile Response Times improved across the board, with the most significant gains in the higher percentiles, up to +5.15%.

These results highlight the performance improvements achieved with PGO, reflected in better throughput and reduced latency.

What’s not to like?

Languages that support PGO

PGO is supported by many popular programming languages and compilers. Key examples include:

  1. C/C++: Supported by GCC, Clang/LLVM, and MSVC with options like -fprofile-generate, -fprofile-use, and /LTCG:PGOptimize.
  2. Go: Available since Go 1.20 (and enabled by default via -pgo=auto since Go 1.21), using pprof CPU profiles.
  3. Rust: Supported via rustc with -Cprofile-generate and -Cprofile-use.
  4. Java: Optimized by Oracle HotSpot JVM and GraalVM through JIT compilation.
  5. .NET: The .NET Runtime uses PGO to optimize JIT compilation.
  6. Swift: Leverages LLVM’s PGO infrastructure.
  7. Python: CPython can be compiled with PGO using ./configure --enable-optimizations.
  8. Fortran: Supported in GFortran as part of GCC with the same options as C/C++. Does anyone still write Fortran code?

Conclusion

This blog post explored Profile-Guided Optimization (PGO) and its potential benefits for improving code performance.

We used a simple Go application as an example, demonstrating the steps involved in applying PGO:

  1. Compile and Run Without PGO: Establish a baseline performance metric.
  2. Load Testing: Simulate real-world usage with tools like go-wrk.
  3. Generate Profiling Data: Capture runtime behavior using pprof.
  4. Compile and Run with PGO: Leverage profiling data for optimizations.
  5. Analyze Results: Compare performance improvements and assess optimizations through techniques like flamegraph analysis.

The example showcased potential performance gains through PGO, including reduced execution time and optimized memory management. It’s important to note that specific results may vary depending on the application and the profiling data.

Why aren’t more developers leveraging the power of PGO? Please share your thoughts in the comments.

References