Over the past few months, I’ve had numerous discussions with practitioners and colleagues on the benefits of Profile-Guided Optimization (PGO). While it’s a topic that generates significant interest, many find it challenging to get started or simply lack the time to explore it fully. As a Product Manager in the continuous profiling domain, my curiosity drove me to delve deeper into this subject. After studying various academic papers and articles, I decided to implement PGO myself, benchmarking its impact to assess its true value.
My primary goal was to understand the challenges hindering PGO adoption and to identify key questions that could reveal customers’ real pain points. Additionally, I aimed to explore the business value for end users. Specifically, I wanted to quantify how PGO impacts critical business KPIs such as conversion rates, latency, and even SLOs and SLAs.
This blog summarizes my initial findings.
Key Takeaways:
- Efficient software is both cheaper and greener 🌿.
- PGO can boost your code’s efficiency by up to 14% for free, without requiring any code changes.
- A practical guide to implementing PGO is presented, including insights on measuring compute and end-user performance gains using inlining output, binary size, go-wrk, and flamegraphs.
- In the example Go code provided, my analysis revealed a notable performance gain of ~6.92% in compute efficiency, an impressive result considering it’s based on a small JSON unmarshalling task. The potential savings in a production environment could be even more substantial.
- Many developers and SREs are missing out on potential cost savings by not leveraging PGO. You can learn from Cloudflare’s experience in reducing costs through PGO.
- Continuous profiling in production is essential to unlock the benefits of PGO fully.
PGO: Your Code’s Personal Coach for Peak Performance
Imagine you’re an athlete training for the Olympics. A generic training regimen might get you started, but it won’t fully optimize your performance. This is why most athletes hire a personal coach. A personal coach observes your training, identifies areas for improvement, and adjusts your regimen accordingly. This personalized approach leads to better results in a shorter time.
PGO is like that coach for your code. It analyzes your code’s performance, identifies frequently executed code paths, and feeds that data to the compiler to refine its optimization decisions. By informing the compiler of your code’s hotspots, PGO helps your code achieve peak performance, just as a coach helps an athlete win an Olympic gold medal.
To connect the dots: the way a coach observes an athlete’s training regimen and physiology is similar to profiling in this context. Just as a coach scouts their athletes, continuous profiling analyzes your code’s behavior on an ongoing basis. And your gold medal, as a developer or SRE, could be a promotion, a reduced carbon footprint, or lower cloud spend.
PGO is an advanced optimization technique that leverages profiling data to guide the compiler in making more informed optimization decisions. By using PGO, the compiler can optimize hot code paths, potentially leading to significant performance improvements.
Further, PGO can improve your code’s resource utilization efficiency by up to 14%. That’s a lot of cost savings for most medium to large organizations, and since it requires no code changes, it’s free money on the table!
As of Go 1.22, benchmarks for a representative set of Go programs show that building with PGO improves performance by around 2-14%. We expect performance gains to generally increase over time as additional optimizations take advantage of PGO in future versions of Go. (source: PGO Overview, Google)
Cloudflare recently shared a blog post detailing how they significantly reduced cloud spending by implementing PGO.
This indicates that following the release, we’re using ~97 cores fewer than before the release, a ~3.5% reduction. (source: Colin Douch, Cloudflare)
Comparing the before and after PGO flamegraphs. See details in the results section below.
Compilers and the One-Size-Fits-All Problem
As described in the athlete analogy above (inspired by the recent 2024 Paris Olympics), one of the compiler’s primary tasks is to make optimal decisions about your code during compilation. Compilers come equipped with heuristics that guide various optimization techniques—such as dead code elimination, register allocation, constant folding, and function inlining. These strategies are designed to streamline your code and enhance its efficiency.
However, the heuristics that guide the compiler’s decisions have the same problem inherent in most one-size-fits-all designs. There are limits to what a compiler can do on its own. For example, while inlining functions can reduce the overhead of function calls, the compiler can’t inline everything. Inlining too much can lead to bloated binaries, increased cache usage, and ultimately, performance degradation.
This is where PGO comes in. By providing the compiler with profiling data—information about how your code actually runs in real-world scenarios—the compiler can make more informed decisions. It can identify the “hot” functions that are frequently called and optimize them more aggressively, such as by inlining these critical functions or better allocating registers. This targeted optimization helps reduce the overhead of function calls and can lead to significant performance gains. PGO allows the compiler to go beyond generic optimizations, tailoring the final binary to the specific needs of your program based on actual usage.
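To make this concrete, here is a minimal sketch of a call site PGO can devirtualize (we’ll see “PGO devirtualization” again in the compiler output later). The Encoder and jsonEncoder names are invented for illustration and are not part of the demo app:

```go
package main

import "fmt"

// Encoder is a hypothetical interface. The compiler normally cannot
// inline enc.Encode below, because the concrete type behind the
// interface is unknown at compile time.
type Encoder interface {
	Encode(v string) string
}

type jsonEncoder struct{}

func (jsonEncoder) Encode(v string) string { return `"` + v + `"` }

func encodeAll(enc Encoder, items []string) []string {
	out := make([]string, 0, len(items))
	for _, it := range items {
		// Indirect call through the interface. If profiles show enc is
		// almost always a jsonEncoder, PGO lets the compiler emit a type
		// check plus a direct (and inlinable) call to the concrete method.
		out = append(out, enc.Encode(it))
	}
	return out
}

func main() {
	fmt.Println(encodeAll(jsonEncoder{}, []string{"a", "b"}))
}
```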
How does PGO work?
Several other languages support PGO; see the Languages that Support PGO section below. In this example, let’s consider a simple Go application that unmarshals JSON data containing personal bios. The application reads the data from a file, processes it, and then serves it over HTTP. Here’s the code: https://github.com/iogbole/go-pgo
```go
package main

import (
	"encoding/json"
	"fmt"
	"log"
	"net/http"
	_ "net/http/pprof" // Import pprof for profiling
	"os"
)

type BioWrapper struct {
	Bio Bio `json:"bio"`
}

//truncated

// extractNames is a leaf function that processes course names.
func extractNames(data []string) []string {
	names := make([]string, 0, len(data))
	for _, name := range data {
		names = append(names, name)
	}
	return names
}

func main() {
	// Set up the HTTP server with pprof enabled
	http.HandleFunc("/", processor)
	log.Println("Starting server on :8080")
	log.Fatal(http.ListenAndServe(":8080", nil))
}

// processor handles incoming HTTP requests, processes the JSON, and returns the JSON data.
func processor(w http.ResponseWriter, r *http.Request) {
	// Read the JSON file
	file, err := os.ReadFile("./bio.json")
	if err != nil {
		http.Error(w, "Error reading bio.json", http.StatusInternalServerError)
		return
	}

	// Unmarshal the JSON into the BioWrapper struct
	var bioWrapper BioWrapper
	err = json.Unmarshal(file, &bioWrapper)
	if err != nil {
		http.Error(w, "Error decoding JSON", http.StatusBadRequest)
		return
	}

	// Access the Bio data
	bio := bioWrapper.Bio

	// Process the Bio data (e.g., print to console)
	fmt.Println("Name:", bio.PersonalInfo.Name)
	fmt.Println("University:", bio.Education.University.SchoolName)
	fmt.Println("Current Job:", bio.WorkExperience.Job2.Role)

	// Use the leaf function to extract course names
	courseNames := extractNames(bio.Education.University.Courses)
	fmt.Println("Courses:", courseNames)

	// Set the response header to indicate JSON content
	w.Header().Set("Content-Type", "application/json")

	// Using json.MarshalIndent for pretty-printing
	responseData, err := json.MarshalIndent(bioWrapper, "", " ")
	if err != nil {
		http.Error(w, "Error encoding JSON", http.StatusInternalServerError)
		return
	}
	w.Write(responseData)
}
```
Step 1: Compile and Run Without PGO
First, let’s compile the program without PGO and see how it performs under load. The -gcflags="-m" flag asks the compiler to print its optimization decisions, such as inlining and escape analysis, which we’ll compare later.

```bash
go build -o pre_pgo -gcflags="-m" main.go
./pre_pgo
```
Step 2: Load Testing with go-wrk
To simulate real-world usage, we can use go-wrk, a Go-based benchmarking tool, to send concurrent HTTP requests to the service. Install it from here.

```bash
go-wrk -d 20 http://localhost:8080
```

This command runs the load test for 20 seconds (the -d flag sets the duration in seconds; use a longer run if you want the load to continue while you profile in the next step), generating HTTP requests to the server. The results will give us a baseline of how the program performs without PGO.
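If you’d rather not install another tool, here is a minimal, illustrative Go load generator that approximates what go-wrk does. The worker count and duration below are arbitrary choices for the sketch, not go-wrk’s defaults:

```go
// loadgen.go: a minimal, illustrative stand-in for go-wrk.
package main

import (
	"fmt"
	"io"
	"net/http"
	"sync"
	"sync/atomic"
	"time"
)

func main() {
	var requests int64
	deadline := time.Now().Add(20 * time.Second) // mirrors go-wrk -d 20
	var wg sync.WaitGroup

	for i := 0; i < 10; i++ { // 10 concurrent workers (arbitrary)
		wg.Add(1)
		go func() {
			defer wg.Done()
			for time.Now().Before(deadline) {
				resp, err := http.Get("http://localhost:8080/")
				if err != nil {
					continue
				}
				io.Copy(io.Discard, resp.Body) // drain so keep-alive connections are reused
				resp.Body.Close()
				atomic.AddInt64(&requests, 1)
			}
		}()
	}

	wg.Wait()
	fmt.Println("total requests:", atomic.LoadInt64(&requests))
}
```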
Step 3: Generate Profiling Data
While that’s running, let’s profile the code for 30 seconds.
```bash
curl -o default.pgo "http://localhost:8080/debug/pprof/profile?seconds=30"
```

Note that default.pgo is the pprof output. The file name matters: when building with -pgo=auto, the Go toolchain looks for a file named default.pgo in the main package’s directory. An alternative approach is to use the pprof tool:

```bash
go tool pprof -output=cpu.prof http://localhost:8080/debug/pprof/profile?seconds=60
```
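Since this server imports net/http/pprof, curl is the simplest option. For programs that don’t expose an HTTP endpoint, you could instead capture the CPU profile in-process with runtime/pprof. Here’s a rough sketch; doWork is a hypothetical stand-in for your real workload:

```go
// Sketch: write a CPU profile to default.pgo from inside the program,
// for binaries that don't expose net/http/pprof.
package main

import (
	"log"
	"os"
	"runtime/pprof"
	"time"
)

func main() {
	f, err := os.Create("default.pgo")
	if err != nil {
		log.Fatal(err)
	}
	defer f.Close()

	if err := pprof.StartCPUProfile(f); err != nil {
		log.Fatal(err)
	}
	defer pprof.StopCPUProfile()

	// Run the representative workload you want the compiler to optimize for.
	doWork(30 * time.Second)
}

// doWork is a placeholder; replace it with the code paths you care about.
func doWork(d time.Duration) {
	deadline := time.Now().Add(d)
	for time.Now().Before(deadline) {
		// busy loop standing in for real work
	}
}
```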
Step 4: Compile and Run with PGO
Now, let’s compile the code with PGO enabled using the profiling data we collected:
```bash
go build -o post_pgo -pgo=auto -gcflags="-m" main.go
./post_pgo
```
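To confirm the profile was actually picked up, you can inspect the build settings the Go toolchain embeds in the binary; a PGO build records the -pgo setting there. The exact path in the output will vary on your machine:

```bash
$ go version -m ./post_pgo | grep pgo
	build	-pgo=/path/to/default.pgo
```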
Step 5: Add Load and Profile
Similar to steps 2 and 3, generate load and profile again, but this time change the curl output file to post_pgo.pprof:

```bash
curl -o post_pgo.pprof "http://localhost:8080/debug/pprof/profile?seconds=30"
```
The Results!!
This was my first experience with PGO, and I was eager to see if it was worth the effort. I’ve saved all the raw benchmark files in the GitHub repository for future comparisons.
Let’s begin!
Inlining Optimization Analysis
Inlining is a compiler optimization technique that directly inserts the body of a function into the calling code, eliminating function call overhead and potentially enabling further optimizations.
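As a hand-written illustration of what the compiler does (add, sum, and sumInlined are invented for this example; sumInlined shows roughly what the compiler emits, you would not write it yourself):

```go
package main

import "fmt"

// add is small enough that the Go compiler will happily inline it.
func add(a, b int) int { return a + b }

// sum as written: every loop iteration pays the overhead of a call to add.
func sum(xs []int) int {
	total := 0
	for _, x := range xs {
		total = add(total, x)
	}
	return total
}

// sumInlined is roughly what the compiler produces after inlining: the
// body of add is inserted directly, removing call overhead and enabling
// further optimizations inside the loop.
func sumInlined(xs []int) int {
	total := 0
	for _, x := range xs {
		total = total + x
	}
	return total
}

func main() {
	xs := []int{1, 2, 3}
	fmt.Println(sum(xs), sumInlined(xs))
}
```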
Binary Size: Inlining Has a Small Storage Cost
A quick way to check for better inlining is to examine the binary size. While inlining can improve performance, it may increase the binary size due to code duplication.
Let’s check it out:
```bash
$ ls -atrl
-rwxr-xr-x  1 israel  staff  8162050 Aug 23 18:45 pre_pgo
-rwxr-xr-x  1 israel  staff  8244658 Aug 23 21:52 post_pgo
```
The difference in file sizes between the pre_pgo and post_pgo binaries can be summarized as follows:

| File | Size | Difference | Analysis |
|---|---|---|---|
| pre_pgo | 8,162,050 bytes | N/A | This is the size of the binary before applying PGO. |
| post_pgo | 8,244,658 bytes | 82,608 bytes larger, i.e. ~1.01% | The binary size increased slightly after PGO, likely due to additional inlining and optimizations. |
Analysis of the compiler output
For the record, I nearly jumped out of my chair when I realized that the processor function was inlined. It makes all the difference, especially as it’s the hottest function in this workload.
Let’s dig a bit deeper by comparing the build outputs:
Before PGO
Refer to the repo for the full build output
Now let’s compare the compiler output for PGO.
After PGO
Refer to the repo for the full build output
The inlining optimization looks way better! Here’s a table comparing the key differences between the compiler output with and without PGO, based on the gcflags="-m" output:
| Metric | Pre PGO | Post PGO | Difference | Analysis |
|---|---|---|---|---|
| Inlining | `extractNames`, `fmt.Println`, `http.ListenAndServe`, etc. | `extractNames`, `processor`, `http.(*response).Write`, etc. | More functions inlined, including `processor` and HTTP response writes | More extensive inlining with PGO due to better optimization hints |
| Parameter leaks | Multiple parameters escape to the heap | Similar, with some optimizations such as better handling in `extractNames` | Slight improvement in heap allocation handling | PGO provides better guidance for optimizing heap allocations |
| Moved to heap | Variables like `bios` moved to heap | Variables like `bioWrapper` moved to heap | Slight change in what gets moved to heap | Heap allocations are slightly optimized with PGO, but not eliminated |
| PGO devirtualization | Not applicable | PGO devirtualizes interface calls | Devirtualization is applied | PGO enables devirtualization, improving performance by optimizing interface calls |
| Leaking param | `data`, `w`, etc. escape to heap | Similar parameters leak, but with better optimizations such as in `extractNames` | Slight reduction in the number of heap escapes | PGO guides more efficient memory management, but some leaks persist |
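If messages like “leaking param” and “moved to heap” are unfamiliar, here is a tiny, self-contained example of the behavior behind them (the names are invented; the exact -gcflags="-m" wording varies by Go version):

```go
package main

import "fmt"

// sink is a package-level pointer; storing s here makes the
// parameter outlive the call, so the compiler reports something
// like "leaking param: s" for leak.
var sink *string

func leak(s *string) {
	sink = s
}

func main() {
	// The compiler reports "moved to heap: msg" here: msg must live
	// beyond main's stack frame because leak stores a pointer to it.
	msg := "hello"
	leak(&msg)
	fmt.Println(*sink)
}
```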
Flamegraph analysis
Using https://www.speedscope.app, I uploaded the before and after pprof files for quick analysis. Since Speedscope lacks a differential flamegraph feature, I compared both flamegraphs side-by-side.
But first, let’s identify the top functions and verify the weight of the callers and callees.
The sandwich view, showing callers and callees
The sandwich view shows that we need to focus on the main.processor function, which is somewhat expected. Let’s verify this hypothesis with a flamegraph.
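As a side note: while Speedscope can’t diff two profiles, go tool pprof can compare them directly via -diff_base; the top command at its interactive prompt then shows deltas between the two captures rather than absolute values:

```bash
go tool pprof -diff_base=default.pgo post_pgo.pprof
```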
Differential flamegraphs
The table below summarizes the analysis of the before and after PGO flamegraphs
| Metric | Pre PGO (Left) | Post PGO (Right) | Time Gained/Lost | Percentage Improvement |
|---|---|---|---|---|
| `main.processor` total time | 48.25 seconds | 44.91 seconds | 3.34 seconds gained | 6.92% improvement |
| `main.processor` self time | 10.00 ms | 0.00 ns | 10.00 ms gained | 100% improvement |
| HTTP handler functions total time | Higher | Lower | Not quantified | Slight improvement (qualitative) |
| Syscall execution time | Higher | Lower | Not quantified | Slight improvement (qualitative) |
Looking a bit further, the 6.92% improvement most likely derives from inlining the main.processor function.
End-user/business impact analysis
The flamegraph analysis revealed a 6.92% improvement in overall code efficiency. While this is a substantial gain, especially considering cloud costs, it’s crucial to understand how it translates into real-world benefits for end users.
To quantify the impact, I turned to the go-wrk benchmarking data, located here. By analyzing these metrics, I can assess the practical performance improvements that users will experience.
Let’s delve into the details and see how these optimizations translate into tangible benefits.
Baseline pre_pgo output:
and the comparison post_pgo go-wrk result:
The table below shows how these optimizations could translate to business KPIs.
| Metric | Pre-PGO | Post-PGO | Difference | Percentage Gain/Loss |
|---|---|---|---|---|
| Total Requests | 1,557,674 | 1,626,603 | +68,929 requests | +4.43% |
| Total Data Transferred | 1.47 GB | 1.54 GB | +0.07 GB | +4.76% |
| Requests/sec | 26,100.68 | 27,257.19 | +1,156.51 requests/sec | +4.43% |
| Transfer/sec | 25.29 MB | 26.41 MB | +1.12 MB/sec | +4.43% |
| Overall Requests/sec | 25,959.83 | 27,108.41 | +1,148.58 requests/sec | +4.42% |
| Overall Transfer/sec | 25.15 MB | 26.27 MB | +1.12 MB/sec | +4.45% |
| Fastest Request | 65 µs | 66 µs | +1 µs | -1.54% |
| Average Request Time | 382 µs | 366 µs | -16 µs | +4.19% |
| Slowest Request | 56.609 ms | 57.569 ms | +0.960 ms | -1.70% |
| Standard Deviation | 651 µs | 655 µs | +4 µs | -0.61% |
Percentile Response Times
| Percentile | Pre-PGO | Post-PGO | Difference | Percentage Gain/Loss |
|---|---|---|---|---|
| 10% | 82 µs | 80 µs | -2 µs | +2.44% |
| 50% | 90 µs | 88 µs | -2 µs | +2.22% |
| 75% | 94 µs | 90 µs | -4 µs | +4.26% |
| 99% | 96 µs | 92 µs | -4 µs | +4.17% |
| 99.9% | 97 µs | 92 µs | -5 µs | +5.15% |
| 99.9999% | 97 µs | 92 µs | -5 µs | +5.15% |
| 99.99999% | 97 µs | 92 µs | -5 µs | +5.15% |
In summary:
- Requests/sec and Total Data Transferred saw a +4.43% and +4.76% improvement, respectively.
- Average Request Time decreased by 16 µs, showing a +4.19% improvement.
- Slowest Request and Fastest Request showed minor changes with slight percentage losses.
- Percentile Response Times improved across the board, with the most significant gains in the higher percentiles, up to +5.15%.
These results highlight the performance improvements achieved with PGO, reflected in better throughput and reduced latency.
What’s not to like?
Languages that Support PGO
PGO is supported by many popular programming languages and compilers. Key examples include:
- C/C++: Supported by GCC, Clang/LLVM, and MSVC with options like `-fprofile-generate`, `-fprofile-use`, and `/LTCG:PGOptimize`.
- Go: Available from Go 1.20, using the `pprof` tool for profiling.
- Rust: Supported via `rustc` with `-Cprofile-generate` and `-Cprofile-use`.
- Java: Optimized by Oracle HotSpot JVM and GraalVM through JIT compilation.
- .NET: The .NET Runtime uses PGO to optimize JIT compilation.
- Swift: Leverages LLVM’s PGO infrastructure.
- Python: CPython can be compiled with PGO using `./configure --enable-optimizations`.
- Fortran: Supported in GFortran as part of GCC with the same options as C/C++. Does anyone still write Fortran code?
Conclusion
This blog post explored Profile-Guided Optimization (PGO) and its potential benefits for improving code performance.
We used a simple Go application as an example, demonstrating the steps involved in applying PGO (condensed into a single script after this list):
- Compile and Run Without PGO: Establish a baseline performance metric.
- Load Testing: Simulate real-world usage with tools like `go-wrk`.
- Generate Profiling Data: Capture runtime behavior using `pprof`.
- Compile and Run with PGO: Leverage profiling data for optimizations.
- Analyze Results: Compare performance improvements and assess optimizations through techniques like flamegraph analysis.
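For quick reference, here is the whole workflow from this post condensed into one script, using the same file names and durations as above:

```bash
# 1. Baseline build and run
go build -o pre_pgo -gcflags="-m" main.go
./pre_pgo &

# 2. Load test
go-wrk -d 20 http://localhost:8080

# 3. Capture a CPU profile while the server is under load
curl -o default.pgo "http://localhost:8080/debug/pprof/profile?seconds=30"

# 4. Rebuild with PGO; -pgo=auto picks up default.pgo automatically
#    (stop pre_pgo first, since both binaries bind :8080)
go build -o post_pgo -pgo=auto -gcflags="-m" main.go
./post_pgo &

# 5. Load and profile again for comparison
go-wrk -d 20 http://localhost:8080
curl -o post_pgo.pprof "http://localhost:8080/debug/pprof/profile?seconds=30"
```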
The example showcased potential performance gains through PGO, including reduced execution time and optimized memory management. It’s important to note that specific results may vary depending on the application and the profiling data.
Why aren’t more developers leveraging the power of PGO? Please share your thoughts in the comments.