Welcome back to the second post in this series looking at how we can improve the performance of RISC-V code generated by LLVM.

Previously in part 1 we looked at how we can use LNT to analyze performance gaps, then identified and fixed a missed fmsub.d opportunity during instruction selection, giving a modest 1.77% speedup on a SPEC CPU 2017 benchmark.

In this post we’ll be improving another SPEC benchmark by 7% by teaching the loop vectorizer to make smarter cost modelling decisions. It involves a relatively non-trivial analysis, but thanks to LLVM’s modular infrastructure we can do it in just a handful of lines of code. Let’s get started.

Analysis

Just like last time, all fruitful performance work begins by analysing some workloads. In the last post we had already run some comparisons of SPEC CPU 2017 benchmarks on LNT, so we can return to those results and pick another benchmark to focus on. Here’s one that’s 12% slower than GCC:

Screenshot of LNT showing the 531.deepsjeng_r benchmark being 12.14% slower on Clang vs GCC

531.deepsjeng_r is a chess engine that tied for first in the World Computer Chess Championships back in 2009. It consists of a lot of bitwise arithmetic and complex loops, since the state of the game is encoded in 64-element arrays: one element for each square on the board. Unlike 508.namd_r from last time, there’s no floating point arithmetic.

Drilling into the profile and its list of functions, right off the bat we can see that one function is much slower on LLVM. On GCC qsearch(state_t*, int, int, int, int) makes up 9.1% of the overall cycles, but on LLVM it’s 16.1%. And if we click into the function and view the cumulative total of cycles spent in user mode, Clang takes 74.6 billion cycles to do what takes GCC only 37.7 billion.

Screenshot of Clang disassembly and GCC disassembly side by side, with inline total cumulative cycle annotations showing Clang taking 74.6 billion cycles and GCC taking 37.7
Left shows Clang taking 74.6 billion cycles, right shows GCC taking 37.7 billion.

So there’s probably something we can improve upon here, but it’s not immediately obvious from staring at the disassembly. qsearch is a pretty big function with a couple hundred instructions, so switching to the CFG view gives us a better overview.

On LLVM’s side we see the offending loop that’s consuming so many cycles: it’s long, vectorized, and completely if-predicated, with no control flow inside the loop itself. This is typical of a loop that’s been auto-vectorized by the loop vectorizer. If you look at the load and store instructions you can see that they are masked with the v0.t operand, stemming from the original control flow that was flattened.

Screenshot of the disassembly from Clang, showing a very hot block with a lot of masked vector instructions.

But on the GCC side there’s no equivalent vectorized loop. The loop is in there somewhere, but all the loops are still in their original scalar form with the control flow intact. And if we look at the edges coming from the loop headers, we can see that most of the time it visits one or two basic blocks and then branches back up to the header. Most of the blocks in the loop are completely cold.

Unfortunately the sources for deepsjeng aren’t publicly available so we can’t share them in this post, but the rough structure of the loop is something like this:

for (i = 0; i < N; i++) {
    if (foo[i] == a) {
        if (bar[i] == b) {
            if (baz[i] == c) {
                qux[i] = 123;
                // lots of work here...
            }
        }
    }
}

For any given iteration, it’s statistically unlikely that we enter the first if statement. It’s even more unlikely that the second if’s condition is also true. And even more so for the third nested if where we eventually have lots of work to compute.

In a scalar loop this doesn’t matter because if an if statement’s condition is false, then we don’t execute the code inside it. We just branch back to the start of the loop. But with a vectorized loop, we execute every single instruction regardless of the condition.
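To make that concrete, here’s a rough sketch of what the if-converted loop does, written in the same pseudocode style as the tail folding example later in this post (illustrative only, not the actual generated code):

for (i = 0; i < N; i += VF) {
    // every iteration evaluates all three conditions and issues all of
    // the nested work; the mask only controls which lanes get committed
    mask = (foo[i..i+VF] == a) & (bar[i..i+VF] == b) & (baz[i..i+VF] == c);
    qux[i..i+VF] = 123 mask=mask;
}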

This is the core of the performance gap that we’re seeing versus GCC: since the majority of the work in this loop is so deeply nested in the control flow, and vectorizing requires if-converting it, it would have been better not to vectorize it at all.

Cost modelling

One of the hardest problems when making an optimizing compiler is knowing when an optimization is profitable. Some optimizations are a double-edged sword that can harm performance just as much as they can improve it (if not more), and loop vectorization falls squarely into this category. So rather than blindly applying optimizations at any given opportunity, LLVM has detailed cost models for each target to estimate how expensive or cheap a certain sequence of instructions is, which it can then use to evaluate whether or not a transform will be a net positive.
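As a minimal sketch of how a pass consults these cost models (illustrative code, not the loop vectorizer’s actual implementation; the helper name is made up):

#include "llvm/Analysis/TargetTransformInfo.h"

using namespace llvm;

// Sketch: ask the target how expensive a single instruction is.
// TCK_RecipThroughput is the cost kind used when optimizing for speed.
static InstructionCost costOf(const TargetTransformInfo &TTI,
                              const Instruction &I) {
  return TTI.getInstructionCost(&I, TargetTransformInfo::TCK_RecipThroughput);
}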

It’s hard to overstate the amount of effort in LLVM spent fine-tuning these cost models, applying various heuristics and approximations to make sure different optimizations don’t shoot themselves in the foot. In fact there are some optimizations, like loop distribution, that are in-tree but disabled by default due to the difficulty of getting the cost model right.

So naturally, we would expect that the loop vectorizer already has a sophisticated solution for the problem we’re seeing in our analysis: Given any predicated block that’s if-converted during vectorization, we would expect the scalar cost for that block to be made slightly cheaper because the scalar block may not always be executed. And the less likely it is to be executed, the cheaper it should be — the most deeply nested if block should be discounted more than the outermost if block.

So how does the loop vectorizer handle this?

/// A helper function that returns how much we should divide the cost of a
/// predicated block by. Typically this is the reciprocal of the block
/// probability, i.e. if we return X we are assuming the predicated block will
/// execute once for every X iterations of the loop header so the block should
/// only contribute 1/X of its cost to the total cost calculation, but when
/// optimizing for code size it will just be 1 as code size costs don't depend
/// on execution probabilities.
///
/// TODO: We should use actual block probability here, if available. Currently,
///       we always assume predicated blocks have a 50% chance of executing.
inline unsigned
getPredBlockCostDivisor(TargetTransformInfo::TargetCostKind CostKind) {
  return CostKind == TTI::TCK_CodeSize ? 1 : 2;
}

We’ve come across a load-bearing TODO here. Either the block is executed or it’s not, so it’s assumed to be a fifty-fifty chance.

On its own this hardcoded probability doesn’t seem like an unreasonable guess. But whilst 50% may be an accurate estimate as to whether or not a branch will be taken, it’s an inaccurate estimate as to whether or not a block will be executed. Assuming that a branch has a 1/2 chance of being taken, the most deeply nested block in our example ends up having a 1/2 * 1/2 * 1/2 = 1/8 chance of being executed.

for (i = 0; i < N; i++) {
    if (foo[i] == a) {
        // 1/2 chance of being executed
        if (bar[i] == b) {
            // 1/4 chance of being executed
            if (baz[i] == c) {
                // 1/8 chance of being executed
                // ...
            }
        }
    }
}

The fix to get the loop vectorizer to not unprofitably vectorize this loop will be to teach getPredBlockCostDivisor to take into account control flow between blocks.

It’s worth mentioning that the fact a hardcoded constant managed to work well enough up until this point is the sign of a good trade-off: 1% of the effort for 90% of the benefit. A patch can go off the rails very easily by trying to implement too much in one go, so deferring the more complex cost modelling here till later was an astute choice. Incremental development is key to making progress upstream.

VPlan cost modelling

To get a better picture of how the loop vectorizer is calculating the cost for each possible loop, let’s start with a simplified LLVM IR reproducer:

; for (int i = 0; i < 1024; i++)
;   if (c0)
;     if (c1)
;       p1[p0[i]] = 0; // extra work to increase the cost in the predicated block

define void @nested(ptr noalias %p0, ptr noalias %p1, i1 %c0, i1 %c1) {
entry:
  br label %loop

loop:
  %iv = phi i32 [ 0, %entry ], [ %iv.next, %latch ]
  br i1 %c0, label %then.0, label %latch

then.0:
  br i1 %c1, label %then.1, label %latch

then.1:
  %gep0 = getelementptr i32, ptr %p0, i32 %iv
  %x = load i32, ptr %gep0
  %gep1 = getelementptr i32, ptr %p1, i32 %x
  store i32 0, ptr %gep1
  br label %latch

latch:
  %iv.next = add i32 %iv, 1
  %done = icmp eq i32 %iv.next, 1024
  br i1 %done, label %exit, label %loop

exit:
  ret void
}

We can run opt -p loop-vectorize -debug on this example to see how the loop vectorizer decides if it’s profitable to vectorize the loop or not:

$ opt -p loop-vectorize -mtriple riscv64 -mattr=+v nested.ll -disable-output -debug
...
LV: Found an estimated cost of 0 for VF 1 For instruction:   %iv = phi i32 [ 0, %entry ], [ %iv.next, %latch ]
LV: Found an estimated cost of 0 for VF 1 For instruction:   br i1 %c0, label %then.0, label %latch
LV: Found an estimated cost of 0 for VF 1 For instruction:   br i1 %c1, label %then.1, label %latch
LV: Found an estimated cost of 0 for VF 1 For instruction:   %gep0 = getelementptr i32, ptr %p0, i32 %iv
LV: Found an estimated cost of 1 for VF 1 For instruction:   %x = load i32, ptr %gep0, align 4
LV: Found an estimated cost of 0 for VF 1 For instruction:   %gep1 = getelementptr i32, ptr %p1, i32 %x
LV: Found an estimated cost of 1 for VF 1 For instruction:   store i32 0, ptr %gep1, align 4
LV: Found an estimated cost of 0 for VF 1 For instruction:   br label %latch
LV: Found an estimated cost of 1 for VF 1 For instruction:   %iv.next = add i32 %iv, 1
LV: Found an estimated cost of 1 for VF 1 For instruction:   %done = icmp eq i32 %iv.next, 1024
LV: Found an estimated cost of 0 for VF 1 For instruction:   br i1 %done, label %exit, label %loop
LV: Scalar loop costs: 3.
...
Cost of 1 for VF vscale x 4: induction instruction   %iv.next = add i32 %iv, 1
Cost of 0 for VF vscale x 4: induction instruction   %iv = phi i32 [ 0, %entry ], [ %iv.next, %latch ]
Cost of 1 for VF vscale x 4: exit condition instruction   %done = icmp eq i32 %iv.next, 1024
Cost of 0 for VF vscale x 4: EMIT vp<%4> = CANONICAL-INDUCTION ir<0>, vp<%index.next>
Cost of 0 for VF vscale x 4: EXPLICIT-VECTOR-LENGTH-BASED-IV-PHI vp<%5> = phi ir<0>, vp<%index.evl.next>
Cost of 0 for VF vscale x 4: EMIT-SCALAR vp<%avl> = phi [ ir<1024>, vector.ph ], [ vp<%avl.next>, vector.body ]
Cost of 1 for VF vscale x 4: EMIT-SCALAR vp<%6> = EXPLICIT-VECTOR-LENGTH vp<%avl>
Cost of 0 for VF vscale x 4: vp<%7> = SCALAR-STEPS vp<%5>, ir<1>, vp<%6>
Cost of 0 for VF vscale x 4: CLONE ir<%gep0> = getelementptr ir<%p0>, vp<%7>
Cost of 0 for VF vscale x 4: vp<%8> = vector-pointer ir<%gep0>
Cost of 2 for VF vscale x 4: WIDEN ir<%x> = vp.load vp<%8>, vp<%6>, vp<%3>
Cost of 0 for VF vscale x 4: WIDEN-GEP Inv[Var] ir<%gep1> = getelementptr ir<%p1>, ir<%x>
Cost of 12 for VF vscale x 4: WIDEN vp.store ir<%gep1>, ir<0>, vp<%6>, vp<%3>
Cost of 0 for VF vscale x 4: EMIT vp<%index.evl.next> = add nuw vp<%6>, vp<%5>
Cost of 0 for VF vscale x 4: EMIT vp<%avl.next> = sub nuw vp<%avl>, vp<%6>
Cost of 0 for VF vscale x 4: EMIT vp<%index.next> = add nuw vp<%4>, vp<%0>
Cost of 0 for VF vscale x 4: EMIT branch-on-count vp<%index.next>, vp<%1>
Cost of 0 for VF vscale x 4: vector loop backedge
Cost of 0 for VF vscale x 4: EMIT-SCALAR vp<%bc.resume.val> = phi [ ir<0>, ir-bb<entry> ]
Cost of 0 for VF vscale x 4: IR   %iv = phi i32 [ 0, %entry ], [ %iv.next, %latch ] (extra operand: vp<%bc.resume.val> from scalar.ph)
Cost of 0 for VF vscale x 4: EMIT vp<%3> = logical-and ir<%c0>, ir<%c1>
Cost for VF vscale x 4: 17 (Estimated cost per lane: 2.1)
...
LV: Selecting VF: vscale x 4.
LV: Minimum required TC for runtime checks to be profitable:0
LV: Interleaving is not beneficial.
LV: Found a vectorizable loop (vscale x 4) in nested.ll
LV: Vectorizing: innermost loop.
LEV: Unable to vectorize epilogue because no epilogue is allowed.
LV: Loop does not require scalar epilogue
LV: Loop does not require scalar epilogue
Executing best plan with VF=vscale x 4, UF=1

First we see it work out the cost of the original scalar loop, or as the vectorizer sees it, the loop with a vectorization factor (VF) of 1. It goes through each instruction calling into TargetTransformInfo, and arrives at a total scalar cost of 3. If you went through and manually summed up the individual instruction costs you would have gotten a total of 4, however the load and store instructions belong to the predicated then.1 block, so their cost is divided by 2 by getPredBlockCostDivisor: (1 + 1)/2 + 1 + 1 = 3.

For the vectorized loop, the loop vectorizer uses VPlan to cost the one plan for a range of different VFs[1]. VPlan is an IR specific to the loop vectorizer that helps represent various vectorization strategies, which is why you see all the EMIT and WIDEN “recipes” in the output. It calculates a total cost for the loop and divides it by the estimated number of lanes (we’re working with scalable vectors on RISC-V, so the target needs to estimate what vscale is), arriving at 2.1 per lane. There’s no predication discount applied here because it’s a vectorized loop. 2.1 is cheaper than 3, so it ultimately picks the vectorized loop.
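Spelling out the per-lane arithmetic (this assumes a vscale estimate of 2, which is what the RISC-V target uses for tuning with a 128-bit minimum vector length):

total vector cost = 17
estimated lanes   = vscale * VF = 2 * 4 = 8
cost per lane     = 17 / 8 ≈ 2.1

2.1 < 3, so the vectorized plan looks cheaper than the scalar loop.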

BlockFrequencyInfo

Computing an accurate probability that a given block will be executed is a non-trivial task, but thankfully LLVM already has an analysis we can use for this called BlockFrequencyInfo.

BlockFrequencyInfo computes how often a block can be expected to execute relative to other blocks in a function. It in turn uses another analysis called BranchProbabilityInfo to work out how likely a branch to a specific block is to be taken. And because BranchProbabilityInfo uses profiling information when available, it can give you much more accurate block frequencies when compiling with PGO. Otherwise it falls back to guessing the probability of a branch being taken, which is just 50/50 a lot of the time, but is sometimes influenced by interesting heuristics too: for example, a branch on icmp eq i32 %x, 0 is assumed to be taken with probability 0.375 instead of 0.5, and floating point values are assumed to have a near-zero chance of being NaN.
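As a small illustration (hypothetical code, assuming BPI and BFI have already been computed for the function), these are the two basic queries the analyses answer:

#include "llvm/Analysis/BlockFrequencyInfo.h"
#include "llvm/Analysis/BranchProbabilityInfo.h"

using namespace llvm;

// Sketch: Src, Dst and BB are hypothetical blocks in some function.
void queries(const BranchProbabilityInfo &BPI, const BlockFrequencyInfo &BFI,
             const BasicBlock *Src, const BasicBlock *Dst,
             const BasicBlock *BB) {
  // How likely is the edge Src -> Dst to be taken?
  BranchProbability Prob = BPI.getEdgeProbability(Src, Dst);
  // How often does BB execute, relative to the function's entry block?
  BlockFrequency Freq = BFI.getBlockFreq(BB);
  (void)Prob;
  (void)Freq;
}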

Plugging BlockFrequencyInfo into the loop vectorizer is straightforward: all we need to do is tell the pass manager that we want to access BlockFrequencyInfo from LoopVectorizePass:

PreservedAnalyses LoopVectorizePass::run(Function &F,
                                         FunctionAnalysisManager &AM) {
   ...
   BFI = &AM.getResult<BlockFrequencyAnalysis>(F);
   ...
}

(BlockFrequencyAnalysis is the pass that computes the analysis result BlockFrequencyInfo, if you’re wondering why the names are different.)

Then we can use it to look up the relative frequency of any given block and work out the probability of it being executed in the loop:

uint64_t LoopVectorizationCostModel::getPredBlockCostDivisor(
    TargetTransformInfo::TargetCostKind CostKind, const BasicBlock *BB) {
  if (CostKind == TTI::TCK_CodeSize)
    return 1;

  uint64_t HeaderFreq =
      BFI->getBlockFreq(TheLoop->getHeader()).getFrequency();
  uint64_t BBFreq = BFI->getBlockFreq(BB).getFrequency();
  return HeaderFreq / BBFreq;
}

The frequencies returned from BlockFrequencyInfo are relative to the entry block of a function. So if a block has a frequency of 50 and the entry block has a frequency of 100, then you can expect that block to execute 50 times for every 100 times the entry block is executed.

You can use this to work out the probability of a block being executed in a function, so in this example that block has a 50/100 = 50% chance of being executed every time the function is entered. However this only works if the CFG has no loops: otherwise a block may execute more times than the entry block and we’d end up with probabilities greater than 100%.

Calculating the probability of a block being executed inside a loop still works out fine though, since the loop vectorizer currently only vectorizes innermost loops[2], i.e. loops that contain no other loops.

We can consider the frequencies of each block in the loop relative to the frequency of the header block. To give a brief loop terminology recap: the header is the first block inside the loop body, it dominates all other blocks in the loop, and it is the destination of all backedges. So the header is guaranteed to have a frequency greater than or equal to any other block in the loop; this invariant is important, as we’ll see later.

A diagram showing off terminology for different parts of a loop

Then to calculate the probability of a block in a loop being executed, we divide the block frequency by the header frequency. To work out how much we should divide the cost of the scalar block by, we return the inverse of that.
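Written out as formulas, where freq() is the frequency reported by BlockFrequencyInfo:

P(block executed) = freq(block) / freq(header)
divisor = freq(header) / freq(block)

Because the header’s frequency is greater than or equal to that of any block in the loop, the divisor is always at least 1.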

Trying out this change on our sample loop, first we’ll see the debug output from BlockFrequencyInfo as it’s computed:

$ opt -p loop-vectorize -mtriple riscv64 -mattr=+v nested.ll -disable-output -debug
...
block-frequency-info: nested
 - entry: float = 1.0, int = 562949953421312
 - loop: float = 32.0, int = 18014398509481984
 - then.0: float = 16.0, int = 9007199254740992
 - then.1: float = 8.0, int = 4503599627370496
 - latch: float = 32.0, int = 18014398509481984
 - exit: float = 1.0, int = 562949953421312

loop is the header block and then.1 is the nested if block, and from BlockFrequencyInfo’s frequencies we get a probability of 8/32 = 0.25. (The header’s frequency of 32 relative to the entry block comes from BranchProbabilityInfo’s loop-backedge heuristic, which effectively assumes a loop runs for about 32 iterations.) So we would expect then.1’s scalar cost to be divided by 4:

...
LV: Found an estimated cost of 0 for VF 1 For instruction:   %iv = phi i32 [ 0, %entry ], [ %iv.next, %latch ]
LV: Found an estimated cost of 0 for VF 1 For instruction:   br i1 %c0, label %then.0, label %latch
LV: Found an estimated cost of 0 for VF 1 For instruction:   br i1 %c1, label %then.1, label %latch
LV: Found an estimated cost of 0 for VF 1 For instruction:   %gep0 = getelementptr i32, ptr %p0, i32 %iv
LV: Found an estimated cost of 1 for VF 1 For instruction:   %x = load i32, ptr %gep0, align 4
LV: Found an estimated cost of 0 for VF 1 For instruction:   %gep1 = getelementptr i32, ptr %p1, i32 %x
LV: Found an estimated cost of 1 for VF 1 For instruction:   store i32 0, ptr %gep1, align 4
LV: Found an estimated cost of 0 for VF 1 For instruction:   br label %latch
LV: Found an estimated cost of 1 for VF 1 For instruction:   %iv.next = add i32 %iv, 1
LV: Found an estimated cost of 1 for VF 1 For instruction:   %done = icmp eq i32 %iv.next, 1024
LV: Found an estimated cost of 0 for VF 1 For instruction:   br i1 %done, label %exit, label %loop
LV: Scalar loop costs: 2.
...
Cost for VF vscale x 4: 17 (Estimated cost per lane: 2.1)
...
LV: Selecting VF: 1.
LV: Vectorization is possible but not beneficial.

then.1’s scalar cost is now 2/4 = 0 (the integer division truncates), so the total cost of the scalar loop is now 2 and the loop vectorizer no longer decides to vectorize. If we try this out on 531.deepsjeng_r, we can see that it no longer vectorizes that loop in qsearch either. Success!

Screenshot of LNT showing a 6.82% improvement on 531.deepsjeng_r

Running it again on LNT showed a ~7% speedup in execution time. Not quite as fast as GCC yet, but a welcome improvement for only a handful of lines of code.

Upstreaming

Now that we know the fix we want to land, we can start to think about how we want to upstream this into LLVM.

If we run llvm-lit --update-tests llvm/test/Transforms/LoopVectorize, we actually get quite a few unexpected test changes. One of the side effects of using BlockFrequencyInfo is that tail-folded loops no longer discount the scalar loop cost if the original loop wasn’t predicated to begin with. A tail-folded loop is a loop where the scalar epilogue is folded into the vector loop itself by predicating the vector operations:

// non-tail folded loop:
// process as many VF-sized vectors as fit in n
for (int i = 0; i < n - (n % VF); i += VF)
  x[i..i+VF] = y[i..i+VF];
// process the remaining n % VF scalar elements
for (int i = n - (n % VF); i < n; i++)
  x[i] = y[i];

// tail folded loop:
for (int i = 0; i < n; i += VF)
  x[i..i+VF] = y[i..i+VF] mask=[i<n, i+1<n, ..., i+VF-1<n];

However because the vector loop body is technically predicated, due to the mask on the vector instructions, the loop vectorizer applied getPredBlockCostDivisor to the scalar loop cost even when the original scalar loop had no control flow in its body. With BlockFrequencyInfo we can detect that a block not guarded by any control flow has a probability of 1 of being executed, so the scalar loop cost is no longer made cheaper than it should be. I split off and landed this change separately, since it makes the test changes easier to review.

Now that the remaining changes in llvm/test/Transforms/LoopVectorize looked more contained, I was almost ready to open a pull request. I just wanted to quickly kick the tyres on llvm-test-suite with a few other targets, since this wasn’t a RISC-V specific change. The plan was to quickly collect some stats on how many loops were vectorized, check for any anomalies when compared to beforehand, and then be on our way:

$ cd llvm-test-suite
$ ninja -C build
...
[222/7278] Building C object External/...nchspec/CPU/500.perlbench_r/src/pp.c.o
FAILED: External/SPEC/CINT2017rate/500.perlbench_r/CMakeFiles/500.perlbench_r.dir/root/cpu2017/benchspec/CPU/500.perlbench_r/src/pp.c.o 
/root/llvm-test-suite/build.x86_64-ReleaseLTO-a/tools/timeit --summary External/SPEC/CINT2017rate/500.perlbench_r/CMakeFiles/500.perlbench_r.dir/root/cpu2017/benchspec/CPU/500.perlbench_r/src/pp.c.o.time /root/llvm-project/build/bin/clang -DDOUBLE_SLASHES_SPECIAL=0 -DNDEBUG -DPERL_CORE -DSPEC -DSPEC_AUTO_BYTEORDER=0x12345678 -DSPEC_AUTO_SUPPRESS_OPENMP -DSPEC_CPU -DSPEC_LINUX -DSPEC_LINUX_X64 -DSPEC_LP64 -DSPEC_SUPPRESS_OPENMP -D_FILE_OFFSET_BITS=64 -D_LARGEFILE_SOURCE -D_LARGE_FILES -I/root/cpu2017/benchspec/CPU/500.perlbench_r/src -I/root/cpu2017/benchspec/CPU/500.perlbench_r/src/dist/IO -I/root/cpu2017/benchspec/CPU/500.perlbench_r/src/cpan/Time-HiRes -I/root/cpu2017/benchspec/CPU/500.perlbench_r/src/cpan/HTML-Parser -I/root/cpu2017/benchspec/CPU/500.perlbench_r/src/ext/re -I/root/cpu2017/benchspec/CPU/500.perlbench_r/src/specrand -march=x86-64-v3 -save-temps=obj     -O3 -fomit-frame-pointer -flto -DNDEBUG   -w -Werror=date-time -save-stats=obj -save-stats=obj -fno-strict-aliasing -MD -MT External/SPEC/CINT2017rate/500.perlbench_r/CMakeFiles/500.perlbench_r.dir/root/cpu2017/benchspec/CPU/500.perlbench_r/src/pp.c.o -MF External/SPEC/CINT2017rate/500.perlbench_r/CMakeFiles/500.perlbench_r.dir/root/cpu2017/benchspec/CPU/500.perlbench_r/src/pp.c.o.d -o External/SPEC/CINT2017rate/500.perlbench_r/CMakeFiles/500.perlbench_r.dir/root/cpu2017/benchspec/CPU/500.perlbench_r/src/pp.c.o -c /root/cpu2017/benchspec/CPU/500.perlbench_r/src/pp.c
PLEASE submit a bug report to https://github.com/llvm/llvm-project/issues/ and include the crash backtrace, preprocessed source, and associated run script.
Stack dump:
0.      Program arguments: /root/llvm-project/build/bin/clang-19 -cc1 -triple x86_64-unknown-linux-gnu -O3 -emit-llvm-bc -flto=full -flto-unit -save-temps=obj -disable-free -clear-ast-before-backend -main-file-name pp.c -mrelocation-model pic -pic-level 2 -pic-is-pie -mframe-pointer=none -relaxed-aliasing -fmath-errno -ffp-contract=on -fno-rounding-math -mconstructor-aliases -funwind-tables=2 -target-cpu x86-64-v3 -debugger-tuning=gdb -fdebug-compilation-dir=/root/llvm-test-suite/build.x86_64-ReleaseLTO-a -fcoverage-compilation-dir=/root/llvm-test-suite/build.x86_64-ReleaseLTO-a -resource-dir /root/llvm-project/build/lib/clang/23 -Werror=date-time -w -ferror-limit 19 -fgnuc-version=4.2.1 -fskip-odr-check-in-gmf -vectorize-loops -vectorize-slp -stats-file=External/SPEC/CINT2017rate/500.perlbench_r/CMakeFiles/500.perlbench_r.dir/root/cpu2017/benchspec/CPU/500.perlbench_r/src/pp.stats -faddrsig -fdwarf2-cfi-asm -o External/SPEC/CINT2017rate/500.perlbench_r/CMakeFiles/500.perlbench_r.dir/root/cpu2017/benchspec/CPU/500.perlbench_r/src/pp.c.o -x ir External/SPEC/CINT2017rate/500.perlbench_r/CMakeFiles/500.perlbench_r.dir/root/cpu2017/benchspec/CPU/500.perlbench_r/src/pp.bc
1.      Optimizer
2.      Running pass "function<eager-inv>(float2int,lower-constant-intrinsics,chr,loop(loop-rotate<header-duplication;prepare-for-lto>,loop-deletion),loop-distribute,inject-tli-mappings,loop-vectorize<no-interleave-forced-only;no-vectorize-forced-only;>,infer-alignment,loop-load-elim,instcombine<max-iterations=1;no-verify-fixpoint>,simplifycfg<bonus-inst-threshold=1;forward-switch-cond;switch-range-to-icmp;switch-to-arithmetic;switch-to-lookup;no-keep-loops;hoist-common-insts;no-hoist-loads-stores-with-cond-faulting;sink-common-insts;speculate-blocks;simplify-cond-branch;no-speculate-unpredictables>,slp-vectorizer,vector-combine,instcombine<max-iterations=1;no-verify-fixpoint>,loop-unroll<O3>,transform-warning,sroa<preserve-cfg>,infer-alignment,instcombine<max-iterations=1;no-verify-fixpoint>,loop-mssa(licm<allowspeculation>),alignment-from-assumptions,loop-sink,instsimplify,div-rem-pairs,tailcallelim,simplifycfg<bonus-inst-threshold=1;no-forward-switch-cond;switch-range-to-icmp;switch-to-arithmetic;no-switch-to-lookup;keep-loops;no-hoist-common-insts;hoist-loads-stores-with-cond-faulting;no-sink-common-insts;speculate-blocks;simplify-cond-branch;speculate-unpredictables>)" on module "External/SPEC/CINT2017rate/500.perlbench_r/CMakeFiles/500.perlbench_r.dir/root/cpu2017/benchspec/CPU/500.perlbench_r/src/pp.bc"
3.      Running pass "loop-vectorize<no-interleave-forced-only;no-vectorize-forced-only;>" on function "Perl_pp_coreargs"
 #0 0x0000556ff93ab158 llvm::sys::PrintStackTrace(llvm::raw_ostream&, int) (/root/llvm-project/build/bin/clang-19+0x2d5c158)
 #1 0x0000556ff93a8835 llvm::sys::RunSignalHandlers() (/root/llvm-project/build/bin/clang-19+0x2d59835)
 #2 0x0000556ff93abf01 SignalHandler(int, siginfo_t*, void*) Signals.cpp:0:0
 #3 0x00007f305ce49df0 (/lib/x86_64-linux-gnu/libc.so.6+0x3fdf0)
 #4 0x0000556ffaa0dbfb llvm::LoopVectorizationCostModel::expectedCost(llvm::ElementCount) (/root/llvm-project/build/bin/clang-19+0x43bebfb)
 #5 0x0000556ffaa22a0d llvm::LoopVectorizationPlanner::computeBestVF() (/root/llvm-project/build/bin/clang-19+0x43d3a0d)
 #6 0x0000556ffaa36f3b llvm::LoopVectorizePass::processLoop(llvm::Loop*) (/root/llvm-project/build/bin/clang-19+0x43e7f3b)
 #7 0x0000556ffaa413eb llvm::LoopVectorizePass::runImpl(llvm::Function&) (/root/llvm-project/build/bin/clang-19+0x43f23eb)
 ...
...

A crash when building for X86. No assertion message, but a backtrace that points to the loop vectorizer cost model. Unfortunately this did not turn out to be simple to debug and instead turned into a whole other ordeal, so I’ll leave the details of that rabbit hole to the next post. But in the meantime, here are some hints if you want to guess what went wrong:

  • The crash stems from a SIGFPE signal
  • It only occurs when building on X86. Building on AArch64 is unaffected, even when cross-compiling to X86
  • It only occurs with LTO

Hopefully this also gives a bit of insight into the type of upstream work that we carry out at Igalia. If you have an LLVM or RISC-V project that we could help with, feel free to reach out.

  1. The scalar loop is also modelled in VPlan, but it is currently costed with the legacy cost model rather than via the VPlan itself. This is another load-bearing TODO.

  2. Whilst not enabled default, there is experimental support for outer loop vectorization in the VPlan native path.