The due diligence of optimizing parallel BFS
How RGBSM worked hard to win the Fastcode Programming Challenge
This is my third post on the Fastcode Programming Challenge (FCPC). Previously, I reported on the overall competition and Track 1 (SSSP).
RGBSM wins BFS track at FCPC
Congratulations to RGBSM for besting a crowded field of talented coders to win Track 1 (BFS) at last month’s FCPC. We received 13 submissions for BFS. The competition was fierce, but RGBSM left no doubt about overall victory, as the team scored best on both low-diameter and high-diameter graphs. Here are the overall edges-per-second scores of the top three teams:
660.77M — RGBSM (IIT Tirupati)
592.43M — TD-Badger (University of Wisconsin–Madison)
525.73M — TuT (Institute of Computing Technology, Chinese Academy of Sciences)
The RGBSM team consists of second-year PhD student Marati Bhaskar and his advisor, Raghavendra Kanakagiri. Marati told me how RGBSM made it to the competition:
We have a course here on parallel computing, and my advisor told us about FCPC during the course. But I had my final exams at the time, so we actually started after December. At that point we had less than a month, but I fully dedicated that month to the competition.
What a month! But Marati was prepared, because for over a year his PhD focus has been high-performance computing. Specifically, he’s concentrating on distributing tensor contractions on heterogeneous platforms. He explained to me how he got into his field:
Performance engineering is interesting because nowadays all the machines are heterogeneous and have multiple nodes. If you’re just doing sequential programming, you’re leaving resources behind. You need to use the resources efficiently. So that’s what got me interested in learning about parallel programming.
To appreciate Track 1 at FCPC (both BFS and SSSP), it helps to know about direction-optimizing breadth-first search. This is the canonical domain knowledge that tells us to use different algorithms for low-diameter and high-diameter graphs. The title of the original tech report captures the distinction as “searching for a parent instead of fighting over children.” Very briefly, traditional top-down BFS grows a tree from the inside, with a frontier whose nodes are “fighting over children.” On low-diameter graphs this is inefficient, because the frontier quickly comes to contain a large fraction of all vertices, and most edge inspections fail to claim a new child. In that case, it’s better to go bottom-up, growing the tree from the outside with a frontier whose nodes are “searching for a parent.” The figure below labels the two directions as “push” (fighting over children) and “pull” (searching for a parent).

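To make the push/pull distinction concrete, here is a minimal sequential sketch of one step in each direction over a CSR graph. This is my own illustration, not RGBSM’s code; a real direction-optimizing BFS would parallelize both loops and claim children atomically on the push side.

```cpp
#include <vector>

// CSR graph: the neighbors of u are neighbors[offsets[u] .. offsets[u+1]).
// parent[v] == -1 marks v as undiscovered.

// "Push" (top-down): every frontier node scans its edges and fights
// to claim undiscovered children for the next frontier.
std::vector<int> push_step(const std::vector<int>& offsets,
                           const std::vector<int>& neighbors,
                           const std::vector<int>& frontier,
                           std::vector<int>& parent) {
    std::vector<int> next;
    for (int u : frontier) {
        for (int i = offsets[u]; i < offsets[u + 1]; ++i) {
            int v = neighbors[i];
            if (parent[v] == -1) {  // claim an undiscovered child
                parent[v] = u;
                next.push_back(v);
            }
        }
    }
    return next;
}

// "Pull" (bottom-up): every undiscovered node scans its edges,
// searching for any parent already in the frontier, and may stop
// at the first one it finds.
std::vector<int> pull_step(const std::vector<int>& offsets,
                           const std::vector<int>& neighbors,
                           const std::vector<bool>& in_frontier,
                           std::vector<int>& parent) {
    std::vector<int> next;
    for (int v = 0; v < (int)parent.size(); ++v) {
        if (parent[v] != -1) continue;        // already in the tree
        for (int i = offsets[v]; i < offsets[v + 1]; ++i) {
            if (in_frontier[neighbors[i]]) {  // found a parent
                parent[v] = neighbors[i];
                next.push_back(v);
                break;                        // early exit is the payoff
            }
        }
    }
    return next;
}
```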
I asked Marati about the RGBSM approach, and he sketched the basics:
We were implementing that algorithm. But there was overhead in calculating the frontier edges, so we thought, how can we reduce the overhead of switching between the top-down and bottom-up approaches? Then we came up with a simple heuristic: use the frontier size (i.e., the number of nodes) as a proxy for counting the frontier edges.
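For readers who want to see the shape of such a heuristic, here is a minimal sketch of a size-based direction switch. The `Direction` type, function name, and cutoff fractions are hypothetical placeholders of mine, not values from the RGBSM paper.

```cpp
#include <cstddef>

// Hypothetical sketch of a size-based direction switch. Counting the
// edges incident to the frontier would cost an extra pass, so the
// frontier's node count is compared against fixed fractions of |V|.
enum class Direction { TopDown, BottomUp };

Direction choose_direction(std::size_t frontier_nodes,
                           std::size_t num_vertices,
                           Direction current) {
    const double grow_cutoff   = 0.05;  // placeholder: go bottom-up above 5% of |V|
    const double shrink_cutoff = 0.01;  // placeholder: back to top-down below 1% of |V|
    double frac = static_cast<double>(frontier_nodes) /
                  static_cast<double>(num_vertices);
    if (current == Direction::TopDown && frac > grow_cutoff)
        return Direction::BottomUp;
    if (current == Direction::BottomUp && frac < shrink_cutoff)
        return Direction::TopDown;
    return current;
}
```

The appeal of the node-count proxy is that the frontier size is already known the moment the frontier is built, whereas counting frontier edges requires summing degrees in an extra pass.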
Later in our conversation, I realized that much (most?) of what Marati learned in FCPC never made it into his paper:
Of the optimizations mentioned in our paper, most were pretty easy to implement. We tried implementing many other optimizations, which were often very difficult. We were sure they would help, but in the end they didn’t, and that was surprising for us. We worked very hard on threads: launching each thread once and reusing threads without destroying them. That was especially challenging. And we tried vector instructions and local frontier flushing, for example. Logically, we believed they would help performance. But they didn’t, because the actual performance depends on the specific graphs being processed, and also on the underlying hardware.
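For flavor, here is one way “launch each thread once and reuse it” can be structured in C++20. This is my own sketch under assumed hooks (a hypothetical `process_level` and `frontier_empty`), not what RGBSM implemented.

```cpp
#include <atomic>
#include <barrier>
#include <functional>
#include <thread>
#include <vector>

// Sketch of thread reuse across BFS levels: workers are created a
// single time and synchronize at a barrier after every level, instead
// of being spawned and joined per level.
void run_levels(int num_threads,
                const std::function<void(int level, int tid)>& process_level,
                const std::function<bool()>& frontier_empty) {
    std::atomic<bool> done{false};
    // The completion function runs once per phase, after all workers
    // arrive and before any is released.
    std::barrier sync(num_threads, [&]() noexcept {
        if (frontier_empty()) done = true;
    });

    auto worker = [&](int tid) {
        for (int level = 0; !done; ++level) {
            process_level(level, tid);  // this thread's share of the level
            sync.arrive_and_wait();     // level boundary, no thread teardown
        }
    };

    std::vector<std::jthread> pool;
    for (int t = 0; t < num_threads; ++t)
        pool.emplace_back(worker, t);
    // std::jthread joins on destruction once every worker observes done.
}
```

The barrier replaces per-level thread creation and joining, which is the teardown cost Marati described trying to avoid.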
Marati’s story sounds to me like a perfect example of Brian Wheatman’s essence of software performance engineering (SPE): “doing your due diligence, and doing it well.”
How much diligence is due? I got a sense of the answer after I asked Marati what differentiated RGBSM from the competition.
The Standard Template Library (STL) is a very good library, but its goal is making things easier, not making the program as efficient as possible. So we didn't use any of the pre-built libraries. We wrote everything from scratch.