At the time of writing, GCC beats Clang on several SPEC CPU 2017 benchmarks on RISC-V [1]:

[Image: LNT results comparing GCC and Clang]

LLVM developers upstream have been working hard on the performance of generated code, in every part of the pipeline from the frontend all the way through to the backend. So when we first saw these results we were naturally a bit surprised. But as it turns out, the GCC developers have been hard at work too.

Sometimes a bit of healthy competition isn’t a bad thing, so this blog post is the first in a series looking at the work going on upstream to improve performance and catch up to GCC.

Please note that this series focuses on RISC-V. Other targets may have more competitive performance, but we haven’t measured them yet. We’ll specifically be focusing on the high-performance application processor use case for RISC-V, e.g. compiling for a profile like RVA23. Unfortunately, since we don’t have access to RVA23-compatible hardware just yet, we’ll be benchmarking on a SpacemiT X60-powered Banana Pi BPI-F3 with -march=rva22u64_v. We don’t want to use -mcpu=spacemit-x60, since we want to emulate the portable configuration that an OS distribution might compile packages with. And we do want to include the vector extension: as we’ll see in later blog posts, optimizations like auto-vectorization can have a major impact on performance.
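
To make that distinction concrete, here’s roughly how the two configurations differ on the command line (an illustrative sketch; the full benchmark flags appear below):

# Portable distro-style configuration, runnable on any RVA22+V core:
clang --target=riscv64-linux-gnu -march=rva22u64_v -O3 ...
# Core-specific tuning, which we deliberately avoid here:
clang --target=riscv64-linux-gnu -mcpu=spacemit-x60 -O3 ...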

Where to start?

It goes without saying that a vague task like “make LLVM faster” is easier said than done. The first thing is to find something to make fast, and while you could read through the couple dozen million lines of code in LLVM until inspiration strikes, it’s generally easier to start the other way around by analyzing the code it generates.

Sometimes you’ll get lucky by just stumbling across something that could be made faster while hacking or poring through generated assembly. But there’s an endless number of optimizations to be implemented and not all of them are equally impactful. If we really want to make large strides in performance we need to take a step back and triage what’s actually worth spending time on.

LNT, LLVM’s nightly testing infrastructure, is a great tool for this task. It’s both a web server that allows you to analyze benchmark results, and a command line tool to help run the benchmarks and submit the results to said web server.

As the name might imply, it’s normally used for detecting performance regressions by running benchmarks daily with the latest revision of Clang, flagging any tests that may have become slower or faster since the last revision.

But it also allows you to compare benchmark results across arbitrary configurations. You can run experiments to see what effect a flag has, or see the difference in performance between two pieces of hardware.

Moreover, you can pass in different compilers. In our case, we can do two “runs” with Clang and GCC. Here’s how we would kick these off:

for CC in clang riscv64-linux-gnu-gcc
do
  # --toolchain/--remote-host: cross-compile, then run on another machine
  # over ssh. --exec-multisample/--run-under: fight noise by running each
  # benchmark 3 times pinned to the same core. --submit: send the results
  # to a web server for easy viewing.
  lnt runtest test-suite bpi-f3-rva22u64_v-ReleaseLTO \
    --sandbox /var/lib/lnt/ \
    --test-suite=path/to/llvm-test-suite \
    -DTEST_SUITE_SPEC2017_ROOT=path/to/cpu2017 \
    --cc=$CC \
    --cflags="-O3 -flto -march=rva22u64_v" \
    --cxxflags="-O3 -flto -march=rva22u64_v" \
    --benchmarking-only \
    --build-threads=16 \
    --toolchain=rva22u64_v.cmake \
    --remote-host=bpi-f3 \
    --exec-multisample=3 \
    --run-under="taskset -c 5" \
    --submit=https://mylntserver.com/submitRun
done

This command does a lot of heavy lifting. First off, it invokes CMake to configure a new build of llvm-test-suite and SPEC CPU 2017 with -O3 -flto -march=rva22u64_v. But because compiling the benchmarks on the Banana Pi BPI-F3 would be painfully slow, we’ve specified a CMake toolchain file to cross-compile to riscv64-linux-gnu from an x86-64 build machine. Here’s what the toolchain file looks like:

# rva22u64_v.cmake
set(CMAKE_SYSTEM_NAME Linux)
set(CMAKE_C_COMPILER_TARGET riscv64-linux-gnu)
set(CMAKE_CXX_COMPILER_TARGET riscv64-linux-gnu)
set(ARCH riscv64)
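
LNT passes this file through to CMake via the --toolchain flag in the invocation above; configuring a build of the test suite by hand would look something like this (a sketch, assuming CMake 3.21+ for the --toolchain option):

cmake -B build --toolchain rva22u64_v.cmake \
  -DCMAKE_C_COMPILER=clang -DCMAKE_CXX_COMPILER=clang++ \
  path/to/llvm-test-suite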

If you’ve got your cross-toolchain sysroot set up in the right place, /usr/riscv64-linux-gnu/, things should “just work” and CMake will magically build RISC-V binaries. On Debian distros you can install the *-cross packages for this:

$ apt install libc6-dev-riscv64-cross libgcc-14-dev-riscv64-cross \
    libstdc++-12-dev-riscv64-cross

(You could also use mmdebstrap, or see Alex Bradbury’s guide to this.)

After the benchmarks are built, LNT rsyncs the binaries over to the remote machine and then sshes into it to begin running them. It expects the sandbox path where the binaries were built to exist at the same location on both machines, so something like /var/lib/lnt that’s present on both the host and the BPI-F3 should work. The BPI-F3 can also produce some noisy results, so --exec-multisample=3 and --run-under="taskset -c 5" tell LNT to run each benchmark multiple times and pin it to the same core.
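
Conceptually, what LNT does here amounts to something like the following (an illustrative sketch rather than its exact commands; the sandbox paths are made up):

# Copy the cross-compiled binaries to the board...
rsync -a /var/lib/lnt/test-2025-11-25/ bpi-f3:/var/lib/lnt/test-2025-11-25/
# ...then run each benchmark remotely, pinned to core 5.
ssh bpi-f3 'taskset -c 5 /var/lib/lnt/test-2025-11-25/.../508.namd_r <args>'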

Finally it generates a report.json file and submits it to the web server of choice. Navigate to the web interface and you’ll be shown two “machines”, LNT’s parlance for a specific combination of hardware, compiler and flags. You should see something like: bpif3-rva22u64_v-ReleaseLTO__clang_DEV__riscv64 and bpif3-rva22u64_v-ReleaseLTO__gcc_DEV__riscv64. Clicking into one of these machines will allow you to compare it against the other.

[Image: LNT UI for comparing results across two machines]

Profiling

Once on the LNT web interface you’ll be presented with a list of benchmarks with a lot of red percentages beside them. We now know which benchmarks are slower, but next we need to know why. We need to profile them to see where all the cycles are spent and to figure out what Clang is doing differently from GCC.

LNT makes this easy: all you need to do is add --use-perf=profile to the lnt runtest invocation and it will perform an additional run of each benchmark wrapped in perf record. Profiling impacts run time, so LNT performs this extra run separately to avoid interfering with the final results. If you want to override the default events that are sampled you can specify them with --perf-events=cycles:u,instructions:u,....
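
Under the hood, each profiled run boils down to a wrapper along these lines (a sketch; the exact file names LNT uses may differ):

perf record -e cycles:u,instructions:u -o 508.namd_r.perf_data \
  taskset -c 5 ./508.namd_r <args>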

LNT will take care of copying the collected profiles back to the host machine and encoding them in the report, and in the web interface you’ll notice a “Profile” button beside each benchmark. Click on that and you’ll be brought to a side-by-side comparison of the profiles from the two machines:

[Image: LNT UI for comparing profiles]

From here you can dive in and see where the benchmark spends most of its time. Select a function from the dropdown, choosing one with a particularly high percentage: this is the function’s share of whatever counter is active in the top right, like cycles or instructions. Then do the same for the other run and you’ll be presented with the two disassemblies side by side below. Most importantly, counter information is displayed inline with each instruction, much like the output of perf annotate.

You might find the per-instruction counter information a bit too fine-grained, so personally I like to use the “Control-Flow Graph” view mode in the top left. This groups the instructions into blocks and lets you see which blocks are the hottest. It also shows the edges between branches and their destinations, which makes identifying loops a lot easier.

A real example

Let’s take a look at how we can use LNT’s web interface to identify something that GCC does but Clang doesn’t (but should). Going back to the list of SPEC benchmark results, we can see 508.namd_r is about 17% slower, so hopefully we can find something to optimize in there.

Jumping into the profile, we can see there’s a bunch of functions that all contribute a similar amount to the runtime. We’ll just pick the hottest one at 14.3%, ComputeNonbondedUtil::calc_pair_energy_fullelect(nonbonded*). It’s a pretty big function, but in GCC’s profile 71% of the dynamic instruction count comes from this single, albeit large, block.

[Image: A hot block in the profile for GCC's 508.namd_r]

Looking at Clang’s profile on the opposite side, we see a similar block that accounts for 85% of the function’s instruction count. This slightly higher proportion is a small hint that the block Clang is producing is sub-optimal. If we take the hint and stare at it for long enough, one thing that starts to stand out is that Clang generates a handful of fneg.d instructions which GCC doesn’t:

	fneg.d  fa0, fa0
	fneg.d  ft0, ft0
	fneg.d  ft2, ft2
	fmul.d  fa3, ft5, fa3
	fmul.d  fa0, fa3, fa0
	fmul.d  ft0, fa3, ft0
	fmul.d  fa3, fa3, ft2
	fmadd.d fa2, fa4, fa2, fa0
	fmadd.d ft6, fa4, ft6, ft0
	fmadd.d fa4, fa4, ft1, fa3

fneg.d rd, rs1 negates a double and fmul.d multiplies two doubles. fmadd.d rd, rs1, rs2, rs3 computes (rs1*rs2)+rs3, so here we’re doing some calculation like (a*b)+(c*-d).
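
For intuition, a source expression shaped like the following (a hypothetical reduction, not the actual namd source) lowers to exactly that fneg.d + fmul.d + fmadd.d sequence once Clang contracts the multiply and add:

// a*b + c*(-d): the fneg.d comes from -d, the fmul.d from c * -d,
// and the outer multiply-add is fused into an fmadd.d.
double energy_term(double a, double b, double c, double d) {
  return a * b + c * -d;
}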

These fneg.ds and fmadd.ds are missing from GCC’s code. Instead it emits fmsub.d, which is entirely absent from the Clang output:

	fmul.d fa1,fa4,fa1
	fmul.d ft10,fa4,fa5
	fmsub.d ft10,ft7,fa0,ft10
	fmsub.d fa5,ft7,fa5,fa1
	fmul.d fa1,fa4,fa1
	fmsub.d fa1,ft7,fa0,fa1

fmsub.d rd, rs1, rs2, rs3 computes (rs1*rs2)-rs3, so GCC is instead doing something like (a*b)-(c*d), and in doing so avoids the need for the fneg.d. This sounds like a missed optimization in LLVM, so let’s take a look at fixing it.

Writing the (right) fix

The LLVM RISC-V scalar backend is pretty mature at this stage so it’s surprising that we aren’t able to match fmsub.d. But if you take a look in RISCVInstrInfoD.td, you’ll see that the pattern already exists:

// fmsub: rs1 * rs2 - rs3
def : Pat<(any_fma FPR64:$rs1, FPR64:$rs2, (fneg FPR64:$rs3)),
          (FMSUB_D FPR64:$rs1, FPR64:$rs2, FPR64:$rs3, FRM_DYN)>;

We’ll need to figure out why this pattern isn’t getting selected, so let’s start by extracting the build commands to look under the hood and dump the LLVM IR:

$ cmake -B build -C cmake/caches/ReleaseLTO.cmake --toolchain=...
$ ninja -C build 508.namd_r -t clean
$ ninja -C build 508.namd_r -v
...
[44/45] : && llvm-project/build.release/bin/clang++ --target=riscv64-linux-gnu -march=rva22u64_v -O3 -fomit-frame-pointer -flto -DNDEBUG -fuse-ld=lld ... -o External/SPEC/CFP2017rate/508.namd_r/508.namd_r

This is an LTO build so the code generation step is actually happening at link time. To dump the IR we can copy and paste the link command from the verbose output and append -Wl,--save-temps to it, which in turn tells the Clang driver to pass --save-temps to the linker [2].

$ llvm-project/build.release/bin/clang++ -Wl,--save-temps --target=riscv64-linux-gnu -march=rva22u64_v -O3 -fomit-frame-pointer -flto -DNDEBUG -fuse-ld=lld ... -o External/SPEC/CFP2017rate/508.namd_r/508.namd_r 
$ ls External/SPEC/CFP2017rate/508.namd_r/508.namd_r*
External/SPEC/CFP2017rate/508.namd_r/508.namd_r
External/SPEC/CFP2017rate/508.namd_r/508.namd_r.0.0.preopt.bc
External/SPEC/CFP2017rate/508.namd_r/508.namd_r.0.2.internalize.bc
External/SPEC/CFP2017rate/508.namd_r/508.namd_r.0.4.opt.bc
External/SPEC/CFP2017rate/508.namd_r/508.namd_r.0.5.precodegen.bc

The bitcode is dumped at various stages, and 508.namd_r.0.5.precodegen.bc is the particular stage we’re looking for. This is after all the middle-end optimizations have run, and as close as we’ll get before the backend begins. It contains the bitcode for the entire program though, so let’s find the symbol for the C++ function and extract just the corresponding LLVM IR function:

$ llvm-objdump -t 508.namd_r | grep calc_pair_energy_fullelect
...
000000000004562e l     F .text	0000000000001c92 _ZN20ComputeNonbondedUtil26calc_pair_energy_fullelectEP9nonbonded
$ llvm-extract -f 508.namd_r.0.5.precodegen.bc --func _ZN20ComputeNonbondedUtil26calc_pair_energy_fullelectEP9nonbonded \
  | llvm-dis > calc_pair_energy_fullelect.precodegen.ll

Now quickly grep the disassembled LLVM IR to see if we can find the source of the fnegs:

  %316 = fneg double %315
  %neg = fmul double %mul922, %316
  %317 = tail call double @llvm.fmuladd.f64(double %mul919, double %314, double %neg)

This looks promising. We have a @llvm.fmuladd that’s being fed by an fmul of an fneg, which is similar to the (a*b)+(c*-d) pattern in the resulting assembly. But looking back at our TableGen pattern for fmsub.d, we want (any_fma $rs1, $rs2, (fneg $rs3)), i.e. a llvm.fmuladd fed by an fneg of an fmul.
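
Side by side, here’s a hand-written sketch of the form we currently get versus the form the TableGen pattern wants:

; What the middle end gives us: the fneg feeds the fmul.
%neg = fneg double %d
%mul = fmul double %c, %neg
%r   = tail call double @llvm.fmuladd.f64(double %a, double %b, double %mul)

; What would match fmsub.d: the fneg directly feeds the fma.
%mul2 = fmul double %c, %d
%neg2 = fneg double %mul2
%r2   = tail call double @llvm.fmuladd.f64(double %a, double %b, double %neg2)

You can check which form selects fmsub.d by feeding each through llc -mtriple=riscv64 -mattr=+d.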

One thing about floating-point arithmetic is that whilst it’s generally not associative, negation is exact: all it does is flip the sign bit, so we can hoist the fneg out of the fmul without changing the result. So we could try to teach InstCombine to hoist the fneg outwards, like (fmul x, (fneg y)) -> (fneg (fmul x, y)). But if we go to try that out we’ll see that InstCombine already does the exact opposite:

Instruction *InstCombinerImpl::visitFNeg(UnaryOperator &I) {
  Value *Op = I.getOperand(0);
  // ...
  Value *OneUse;
  if (!match(Op, m_OneUse(m_Value(OneUse))))
    return nullptr;

  if (Instruction *R = hoistFNegAboveFMulFDiv(OneUse, I))
    return replaceInstUsesWith(I, R);
  // ...
}

Instruction *InstCombinerImpl::hoistFNegAboveFMulFDiv(Value *FNegOp,
                                                      Instruction &FMFSource) {
  Value *X, *Y;
  if (match(FNegOp, m_FMul(m_Value(X), m_Value(Y)))) {
    // Push into RHS which is more likely to simplify (const or another fneg).
    // FIXME: It would be better to invert the transform.
    return cast<Instruction>(Builder.CreateFMulFMF(
        X, Builder.CreateFNegFMF(Y, &FMFSource), &FMFSource));
  }
  // ...
}

InstCombine usually has good reasons for canonicalizing certain IR patterns, so we need to seriously consider whether we want to change the canonical form. InstCombine affects all targets, and it could be the case that some other backends have patterns that match (fmul x, (fneg y)), in which case we don’t want to disturb them. However, for RISC-V we know what our patterns for instruction selection are and what form we want our incoming IR to be in. So a much better place to handle this is RISCVISelLowering.cpp, which lets us massage things into shape at the SelectionDAG level, in a way that’s localized to just our target. “Un-canonicalizing” the IR is a common task that backends end up performing, and this is what the resulting patch ended up looking like:

--- a/llvm/lib/Target/RISCV/RISCVISelLowering.cpp
+++ b/llvm/lib/Target/RISCV/RISCVISelLowering.cpp
@@ -20248,6 +20248,17 @@ SDValue RISCVTargetLowering::PerformDAGCombine(SDNode *N,
       return V;
     break;
   case ISD::FMUL: {
+    using namespace SDPatternMatch;
+    SDLoc DL(N);
+    EVT VT = N->getValueType(0);
+    SDValue X, Y;
+    // InstCombine canonicalizes fneg (fmul x, y) -> fmul x, (fneg y), see
+    // hoistFNegAboveFMulFDiv.
+    // Undo this and sink the fneg so we match more fmsub/fnmadd patterns.
+    if (sd_match(N, m_FMul(m_Value(X), m_OneUse(m_FNeg(m_Value(Y))))))
+      return DAG.getNode(ISD::FNEG, DL, VT,
+                         DAG.getNode(ISD::FMUL, DL, VT, X, Y));
+
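
Before rebuilding the whole benchmark, a reduced IR test along these lines can confirm the combine fires (the function name here is hypothetical; an upstream patch would add something like it to the existing RISC-V codegen tests):

; RUN: llc -mtriple=riscv64 -mattr=+d < %s | FileCheck %s
define double @fmsub_of_negated_mul(double %a, double %b, double %c, double %d) {
; CHECK-LABEL: fmsub_of_negated_mul:
; CHECK: fmsub.d
  %neg = fneg double %d
  %mul = fmul double %c, %neg
  %r = tail call double @llvm.fmuladd.f64(double %a, double %b, double %mul)
  ret double %r
}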

And if we rebuild our benchmark after applying it, we can see the fmsub.ds getting matched, saving a couple of instructions:

@@ -983,18 +983,15 @@
        fld     ft2, 48(a5)
        fld     ft3, 64(a5)
        fld     ft4, 72(a5)
-       fneg.d  fa0, fa0
-       fneg.d  ft0, ft0
-       fneg.d  ft2, ft2
        fmul.d  fa3, ft5, fa3
        fmul.d  fa0, fa3, fa0
        fmul.d  ft0, fa3, ft0
        fmul.d  fa3, fa3, ft2
        fld     ft2, 0(s1)
        fmul.d  fa4, ft5, fa4
-       fmadd.d fa2, fa4, fa2, fa0
-       fmadd.d ft6, fa4, ft6, ft0
-       fmadd.d fa4, fa4, ft1, fa3
+       fmsub.d fa2, fa4, fa2, fa0
+       fmsub.d ft6, fa4, ft6, ft0
+       fmsub.d fa4, fa4, ft1, fa3

All in all, this ended up giving a 1.77% improvement in instruction count for the 508.namd_r benchmark. It’s still not nearly as fast as GCC, but we’re a little bit closer than before we started.
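
If you want to sanity-check a number like this outside of LNT, counting dynamic instructions for a single run is easy with perf (a sketch; the benchmark arguments are elided):

$ perf stat -e instructions:u taskset -c 5 ./508.namd_r ...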

What’s next?

Hopefully this has given you an overview of how to identify opportunities for optimization in LLVM, and what a typical fix might look like. The analysis is really the most important part, but if you don’t feel like setting up an LNT instance locally yourself, Igalia runs one at cc-perf.igalia.com [3]. We run llvm-test-suite and SPEC CPU 2017 nightly, built with Clang and GCC, on a small set of RISC-V hardware [4], which we hope to expand in the future. Feel free to use it to investigate some of the differences between Clang and GCC yourself, and maybe you’ll find some inspiration for optimizations.

In the next post in this series I’ll talk about a performance improvement that recently landed related to cost modelling.

  1. Compiled with -march=rva22u64_v -O3 -flto, running the train dataset on a 16GB Banana Pi BPI-F3 (SpacemiT X60), with GCC and Clang from ToT on 2025-11-25. 

  2. LLD in this case, configurable through CMake with -DCMAKE_LINKER_TYPE=LLD

  3. The LLVM Foundation is also in the process of rebooting its canonical public server, which should hopefully be up and running in the coming months. 

  4. Currently it consists of a few Banana Pi BPI-F3s and some HiFive Premier P550s, the latter of which were generously donated by RISC-V International.