Three areas where AI can help SPE
In an earlier post, Bruce talked about how the weight of improving computer performance has shifted to software, algorithms, and hardware architecture, and how that shift makes Software Performance Engineering (SPE) more important than ever in the post-Moore world. With AI becoming more powerful and drawing so much attention, I started wondering if there are opportunities to use it to maximize performance on today’s complex and heterogeneous systems. This post is a summary of what I found.
To chart the territory, I reviewed the materials from MIT's Performance Engineering of Software Systems course, which covers everything from profiling and benchmarking to cache behavior, vectorization, and parallelism. The course makes it clear: performance work spans a wide range of topics, and doing it well requires both deep systems knowledge and methodical analysis. To keep things focused, I decided to zoom in on the three areas where AI seems most promising: algorithm discovery, compiler decisions, and microarchitecture tuning.
Algorithm discovery
Performance begins with the algorithm. The biggest gains often come before the first counter is read, when you decide what work to do at all. Machine learning systems can now explore spaces of candidate programs at a scale and depth no human can match, and in some cases they uncover algorithms more efficient than the best human-designed ones.
One example is DeepMind's AlphaDev, a deep reinforcement learning system that discovers fast, low-level sorting algorithms from scratch. It outperforms the human-written benchmarks by optimizing directly for latency at the assembly level, and its algorithms have been integrated into the LLVM C++ standard library, improving sorting performance by up to 70 percent on small inputs.
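To make the flavor of the result concrete, here is a hand-written sketch of the kind of code AlphaDev operates on: a fixed-size, branch-free sorting network. The exact instruction sequence AlphaDev discovered is different; this is only my illustration of the pattern.

```cpp
#include <algorithm>
#include <cstdio>

// Branch-free sort of three integers. On integer types, compilers usually
// lower std::min/std::max to conditional moves, so no branches are emitted.
inline void sort3(int& a, int& b, int& c) {
    int lo = std::min(a, b);
    int hi = std::max(a, b);        // (lo, hi) is the sorted pair (a, b)
    a = std::min(lo, c);            // smallest of the three
    int rest = std::max(lo, c);
    b = std::min(hi, rest);         // middle value
    c = std::max(hi, rest);         // largest value
}

int main() {
    int a = 3, b = 1, c = 2;
    sort3(a, b, c);
    std::printf("%d %d %d\n", a, b, c);   // prints: 1 2 3
}
```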
Another example is AlphaTensor, also from DeepMind, which applies similar learning techniques to matrix multiplication. The system discovered new algorithms that reduce the number of multiplications needed for 4×4 matrices in modular (mod 2) arithmetic from 49 to 47, surpassing Strassen's 1969 result. AlphaTensor also generated faster algorithms for real-number matrices and showed hardware-specific gains, improving throughput by 8.5 percent on Nvidia V100 GPUs and by 10.3 percent on TPUs.
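The idea is easiest to see at the smallest scale. The sketch below is Strassen's classic 2×2 construction, which uses 7 multiplications where the schoolbook method needs 8; AlphaTensor searches for decompositions of this kind at larger block sizes.

```cpp
#include <cstdio>

// Strassen's 2x2 block multiplication: 7 multiplications instead of 8.
void strassen2x2(const double A[2][2], const double B[2][2], double C[2][2]) {
    double m1 = (A[0][0] + A[1][1]) * (B[0][0] + B[1][1]);
    double m2 = (A[1][0] + A[1][1]) * B[0][0];
    double m3 = A[0][0] * (B[0][1] - B[1][1]);
    double m4 = A[1][1] * (B[1][0] - B[0][0]);
    double m5 = (A[0][0] + A[0][1]) * B[1][1];
    double m6 = (A[1][0] - A[0][0]) * (B[0][0] + B[0][1]);
    double m7 = (A[0][1] - A[1][1]) * (B[1][0] + B[1][1]);
    C[0][0] = m1 + m4 - m5 + m7;
    C[0][1] = m3 + m5;
    C[1][0] = m2 + m4;
    C[1][1] = m1 - m2 + m3 + m6;
}

int main() {
    double A[2][2] = {{1, 2}, {3, 4}};
    double B[2][2] = {{5, 6}, {7, 8}};
    double C[2][2];
    strassen2x2(A, B, C);
    std::printf("%g %g\n%g %g\n", C[0][0], C[0][1], C[1][0], C[1][1]);
    // prints the same result as the ordinary 8-multiplication product:
    // 19 22
    // 43 50
}
```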
Compiler decisions
Once the algorithm is in place, the compiler decides how to translate it into executable code. This involves hundreds of small decisions about inlining, register allocation, vectorization, and pass ordering. Traditional compilers use fixed heuristics for these tasks, but those heuristics are not always optimal, especially for large and complex applications.
Google's MLGO replaces two of LLVM's most heavily hand-tuned heuristics with policies learned from production builds. It currently supports inlining-for-size and register-allocation-for-performance. MLGO has been deployed in Chrome and Android, where it has reduced binary sizes by up to 20 percent, all without requiring changes to source code.
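To give a feel for what replacing a heuristic with a learned policy means, here is a toy sketch of an inlining decision. The features, weights, and threshold are invented for illustration and are not MLGO's; the real advisors use richer features and trained models inside LLVM.

```cpp
#include <cstdio>

// Made-up features describing one call site.
struct CallSiteFeatures {
    int callee_instruction_count;
    int call_site_hotness;       // e.g., derived from profile data
    int callee_num_call_sites;   // how many places call the callee
};

// Classic hand-tuned rule: inline if an estimated cost is under a threshold.
bool heuristic_should_inline(const CallSiteFeatures& f) {
    int cost = f.callee_instruction_count - 2 * f.call_site_hotness;
    return cost < 45;   // threshold picked by hand
}

// Learned policy: a stand-in for evaluating a trained model on the features.
bool learned_should_inline(const CallSiteFeatures& f) {
    double score = 0.02 * f.call_site_hotness
                 - 0.01 * f.callee_instruction_count
                 - 0.05 * f.callee_num_call_sites;   // pretend model weights
    return score > 0.0;
}

int main() {
    CallSiteFeatures f{/*instructions=*/120, /*hotness=*/50, /*call sites=*/3};
    std::printf("heuristic: %d, learned: %d\n",
                heuristic_should_inline(f), learned_should_inline(f));
}
```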
Meta's RLCompOpt takes a different approach. It uses reinforcement learning to select from a small set of highly effective compiler pass sequences, called a coreset. A graph neural network learns to pick the right sequence for each input program. This approach has outperformed standard pass pipelines such as -O3 and -Oz at reducing code size.
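Here is a toy sketch of the coreset idea, with a hand-written rule standing in for the graph neural network. The pass names are real LLVM passes, but these particular pipelines and the selection rule are made up.

```cpp
#include <array>
#include <cstdio>
#include <string>

// Made-up features summarizing the input program.
struct ProgramFeatures {
    int num_loops;
    int num_branches;
    int num_calls;
};

// Hypothetical coreset of pass pipelines, written as opt -passes=... strings.
const std::array<std::string, 3> kCoreset = {
    "inline,instcombine,simplifycfg",
    "loop-rotate,licm,loop-unroll,instcombine",
    "mem2reg,gvn,dce",
};

// Stand-in for the learned policy: a hand-written rule instead of a GNN.
const std::string& choose_pipeline(const ProgramFeatures& f) {
    if (f.num_loops > f.num_calls) return kCoreset[1];   // loop-heavy code
    if (f.num_calls > 10)          return kCoreset[0];   // call-heavy code
    return kCoreset[2];                                   // everything else
}

int main() {
    ProgramFeatures f{/*num_loops=*/4, /*num_branches=*/12, /*num_calls=*/2};
    std::printf("chosen pipeline: %s\n", choose_pipeline(f).c_str());
}
```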
Intel's NeuroVectorizer focuses on loop vectorization. It uses deep reinforcement learning to choose vectorization and interleaving factors for each loop based on learned code embeddings. Integrated into LLVM, NeuroVectorizer consistently outperforms LLVM's built-in vectorizer on compute-intensive loops.
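NeuroVectorizer's action space corresponds to knobs Clang already exposes per loop: the vectorization width and the interleave count. The sketch below sets them by hand with Clang's loop pragmas; the factors are my own picks for illustration, not what the learned policy would choose.

```cpp
#include <cstddef>
#include <cstdio>

// Clang's loop pragmas expose the same knobs the RL agent tunes:
// the vectorization width and the interleave count of a given loop.
void saxpy(float* x, const float* y, float a, std::size_t n) {
    // Factors hand-picked for illustration only.
    #pragma clang loop vectorize_width(8) interleave_count(4)
    for (std::size_t i = 0; i < n; ++i) {
        x[i] = a * x[i] + y[i];
    }
}

int main() {
    float x[16], y[16];
    for (int i = 0; i < 16; ++i) { x[i] = 1.0f; y[i] = static_cast<float>(i); }
    saxpy(x, y, 2.0f, 16);
    std::printf("x[5] = %g\n", x[5]);   // 2*1 + 5 = 7
}
```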
Microarchitecture tuning
Even the best code can fall short if it fails to align with the underlying hardware. Cache behavior, core affinity, and memory allocation all affect performance, and the optimal configuration often changes as the workload evolves. This is an area where AI can help software adapt dynamically to hardware behavior.
RL-CoPref is a reinforcement learning framework that manages hardware prefetchers in real time. It learns to turn prefetchers on or off depending on access patterns, using tile coding and reward signals based on prefetch accuracy and bandwidth usage. It has shown consistent improvements in instructions per cycle (IPC) across a range of workloads.
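Here is a toy version of the control loop such a framework runs: gather prefetch statistics over an interval, compute a reward from accuracy and bandwidth cost, and decide whether to keep a prefetcher enabled. The statistics, weights, and threshold are illustrative, not RL-CoPref's.

```cpp
#include <cstdio>

// Made-up counters collected over one measurement interval.
struct PrefetchStats {
    double useful_prefetches;   // prefetched lines that were later used
    double issued_prefetches;   // all prefetched lines
    double extra_bandwidth;     // fraction of bus traffic due to prefetching
};

// Reward goes up with prefetch accuracy and down with bandwidth cost.
double reward(const PrefetchStats& s) {
    double accuracy = s.issued_prefetches > 0
                          ? s.useful_prefetches / s.issued_prefetches
                          : 0.0;
    return accuracy - 0.5 * s.extra_bandwidth;
}

int main() {
    PrefetchStats window{800, 1000, 0.15};            // last interval's stats
    bool keep_prefetcher_on = reward(window) > 0.25;  // threshold stands in
                                                      // for the learned policy
    std::printf("prefetcher %s\n", keep_prefetcher_on ? "on" : "off");
}
```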
In memory management, a recent study used reinforcement learning to train a dynamic allocator that outperforms standard schemes like first-fit and best-fit. The allocator learns to make smarter placement decisions based on recent allocation history and the current state of memory, reducing fragmentation and improving performance under memory pressure.
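For reference, here is what the classic baselines do, sketched over a simple free list. A learned allocator would replace these fixed rules with a policy that scores candidate blocks using features such as recent allocation sizes and the current fragmentation state.

```cpp
#include <cstddef>
#include <cstdio>
#include <vector>

struct FreeBlock { std::size_t offset, size; };

// First-fit: take the first free block that is large enough.
int first_fit(const std::vector<FreeBlock>& free_list, std::size_t request) {
    for (std::size_t i = 0; i < free_list.size(); ++i)
        if (free_list[i].size >= request) return static_cast<int>(i);
    return -1;  // no block large enough
}

// Best-fit: take the tightest block that still satisfies the request.
int best_fit(const std::vector<FreeBlock>& free_list, std::size_t request) {
    int best = -1;
    for (std::size_t i = 0; i < free_list.size(); ++i)
        if (free_list[i].size >= request &&
            (best < 0 || free_list[i].size < free_list[best].size))
            best = static_cast<int>(i);
    return best;
}

int main() {
    std::vector<FreeBlock> free_list = {{0, 64}, {128, 16}, {256, 32}};
    std::printf("first-fit: block %d, best-fit: block %d\n",
                first_fit(free_list, 16), best_fit(free_list, 16));
    // first-fit picks block 0 (64 bytes), best-fit picks block 1 (16 bytes)
}
```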
Thread scheduling has also seen progress. A 2024 paper used neural networks, including LSTM models, to predict thread performance on the different core types of a heterogeneous CPU. Based on these predictions, the scheduler maps each thread to the most suitable core. Compared to traditional schedulers like Linux's CFS, this approach improved system throughput by around 20 percent.
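Here is a toy version of the final mapping step, assuming the per-thread, per-core throughput predictions already exist (in the paper they come from the LSTM models). The numbers and the greedy assignment rule are illustrative only.

```cpp
#include <cstddef>
#include <cstdio>
#include <vector>

int main() {
    // predicted_ips[t][c] = predicted instructions/sec for thread t on core c
    std::vector<std::vector<double>> predicted_ips = {
        {3.1e9, 1.2e9},   // thread 0: much faster on the big core
        {1.4e9, 1.3e9},   // thread 1: nearly the same on either core
    };
    // Greedily place each thread on the core where it is predicted to run best.
    for (std::size_t t = 0; t < predicted_ips.size(); ++t) {
        std::size_t best_core = 0;
        for (std::size_t c = 1; c < predicted_ips[t].size(); ++c)
            if (predicted_ips[t][c] > predicted_ips[t][best_core]) best_core = c;
        std::printf("thread %zu -> core %zu\n", t, best_core);
    }
}
```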


