A few months ago Piyou Chen and I landed a sizeable change to how the LLVM RISC-V backend generates code for the vector extension.

The gist is that vector instructions on RISC-V don’t encode the vector length or element type in the instruction itself, but instead read them from two registers vl and vtype that are configured with vset[i]vli:

# setup vl and vtype so we're working with <a0 x i32> vectors
vsetivli zero, a0, e32, m1, ta, ma
vadd.vv v8, v10, v12

So in LLVM we have a pass called RISCVInsertVSETVLI that inserts the necessary vset[i]vlis so that vl and vtype are setup correctly for each pseudo instruction:

%a = PseudoVADD_VV_M2 %passthru, %rs2, %rs1, %avl, 5 /*e32*/, 3 /*ta, ma*/

--[RISCVInsertVSETVLI]-->

$x0 = PseudoVSETVLI %avl, 209 /*e32, m2, ta, ma*/, implicit-def $vl, implicit-def $vtype
%rd = PseudoVADD_VV_M2 %passthru, %rs2, %rs1, implicit $vl, implicit $vtype

Previously this was run before register allocation on virtual registers, but it had some drawbacks:

  1. PseudoVSETVLI implicitly defines $vl and $vtype whilst the pseudos implicitly use them, which turns every PseudoVSETVLI into a scheduling boundary. For example, we can’t move this high-latency vle32.v before the vadd.vv because they have different vtypes:

    $x0 = PseudoVSETVLI ... /*e32, m2, ta, ma*/, implicit-def $vl, implicit-def $vtype
    %x = PseudoVADD_VV_M2 ..., implicit $vl, implicit $vtype
    /* Can't move the vle32.v up past here! */
    $x0 = PseudoVSETVLI ... /*e32, m8, ta, ma*/, implicit-def $vl, implicit-def $vtype
    %y = PseudoVLE32_V_M8 ..., implicit $vl, implicit $vtype
    
  2. It’s not practically possible to emit further vector pseudos during or after regsiter allocation, since we would need to keep track of the current vl and vtype and potentially emit new vsetvlis. This is a blocker for vector rematerialization and partial spilling: we’re stuck using whole register loads and stores when spilling since they don’t use vl or vtype, even though we might not need an entire vector register for a small vector.

The solution we ended up with was to insert the vsetvlis after register allocation instead:

%rd = PseudoVADD_VV_M2 %passthru, %rs2, %rs1, %avl, 5 /*e32*/, 3 /*ta, ma*/

--[Allocate vector registers]-->

$v8m2 = PseudoVADD_VV_M2 $v8m2, $v8m2, $v10m2, %avl, 5 /*e32*/, 3 /*ta, ma*/

--[RISCVInsertVSETVLI]-->

$x0 = PseudoVSETVLI %avl, 209 /*e32, m2, ta, ma*/, implicit-def $vl, implicit-def $vtype
$v8m2 = PseudoVADD_VV_M2 $v8m2, $v8m2, $v10m2, implicit $vl, implicit $vtype

--[Allocate scalar registers]-->

...

The trick here is that we split register allocation into two, doing one pass for vectors first and then a second one for scalars. This means that RISCVInsertVSETVLI can be run in between the two, operating on physical vector registers but whilst still able to create virtual scalar registers (It might need to write to a register to set the AVL to VLMAX: vsetvli a0, x0, ...).

The other big change needed was to update RISCVInsertVSETVLI to operate on LiveIntervals now that it was out of SSA form. This isn’t so much an issue with the physical vector registers, but doing the dataflow analysis on the scalar AVL registers proved much trickier.

These changes are interesting in their own right, but in this blog post I want to focus on the logistics that went into landing them instead. For reference, there were only around 500 lines in the initial version of the patch:

$ git diff --stat --relative -- lib
 lib/CodeGen/LiveDebugVariables.cpp      |   2 +-
 lib/CodeGen/LiveDebugVariables.h        |  68 ------
 lib/CodeGen/RegAllocBasic.cpp           |   2 +-
 lib/CodeGen/RegAllocGreedy.cpp          |   2 +-
 lib/CodeGen/VirtRegMap.cpp              |   2 +-
 lib/Target/RISCV/RISCVInsertVSETVLI.cpp | 394 ++++++++++++++++++++++++--------
 lib/Target/RISCV/RISCVTargetMachine.cpp |  14 +-
 7 files changed, 318 insertions(+), 166 deletions(-)

But getting this into the main branch took 20 something patches: easily one of the longest running efforts I’ve been a part of during my time at Igalia.

Putting the test diffs front and centre

As you might expect with a significant change like this, it was initially put behind a flag that was disabled by default, hence why the initial patch was so small. Enabling it by default revealed how nearly every bit of vector code under the sun ended up being affected:

$ git diff --stat --relative -- test
test/CodeGen/RISCV/O0-pipeline.ll                  |    2 +-
 test/CodeGen/RISCV/O3-pipeline.ll                  |    2 +-
 .../early-clobber-tied-def-subreg-liveness.ll      |   18 +-
 test/CodeGen/RISCV/intrinsic-cttz-elts-vscale.ll   |   26 +-
 
 ...
 
 test/CodeGen/RISCV/rvv/vssubu-vp.ll                |    4 +-
 test/CodeGen/RISCV/rvv/vtrunc-vp.ll                |   22 +-
 test/CodeGen/RISCV/rvv/vuitofp-vp.ll               |   28 +-
 test/CodeGen/RISCV/rvv/vxrm-insert.ll              |   24 +-
 test/CodeGen/RISCV/rvv/vxrm.mir                    |    6 +-
 test/CodeGen/RISCV/rvv/vzext-vp.ll                 |    2 +-
 test/CodeGen/RISCV/spill-fpr-scalar.ll             |   24 +-
 test/CodeGen/RISCV/srem-seteq-illegal-types.ll     |    4 +-
 331 files changed, 14868 insertions(+), 14079 deletions(-)

Since we ultimately wanted post-register allocation RISCVInsertVSETVLI to be the only configuration going forward, we would have to confront these test diffs at some point. Disabling it behind a flag would have just kicked the massive diff further down the road to a separate patch that enables it by default.

Landing the changes enabled by default also keeps the code changes and test diffs in sync, which makes it easier to correlate the two when reading through the commit history. And in our case it was better to review the test diffs whilst the mental model of the code was still fresh in our heads.

We still kept the flag since it’s useful to have a fail-safe with risky changes like these, but we switched the flag over to be enabled by default and were now left with a 30k-ish line test diff.

A large test diff isn’t necessarily unreviewable provided the changes are boring, but unfortunately for us our changes were actually pretty interesting:

diff --git a/test/CodeGen/RISCV/rvv/vsetvli-insert.ll b/test/CodeGen/RISCV/rvv/vsetvli-insert.ll
index 29ce7c52e8fd..6fc3e3917a5c 100644
--- a/test/CodeGen/RISCV/rvv/vsetvli-insert.ll
+++ b/test/CodeGen/RISCV/rvv/vsetvli-insert.ll
@@ -352,15 +352,13 @@ entry:
 define <vscale x 1 x double> @test18(<vscale x 1 x double> %a, double %b) nounwind {
 ; CHECK-LABEL: test18:
 ; CHECK:       # %bb.0: # %entry
-; CHECK-NEXT:    vsetivli zero, 6, e64, m1, tu, ma
-; CHECK-NEXT:    vmv1r.v v9, v8
-; CHECK-NEXT:    vfmv.s.f v9, fa0
-; CHECK-NEXT:    vsetvli zero, zero, e64, m1, ta, ma
-; CHECK-NEXT:    vfadd.vv v8, v8, v8
+; CHECK-NEXT:    vsetivli a0, 6, e64, m1, ta, ma
+; CHECK-NEXT:    vfadd.vv v9, v8, v8
 ; CHECK-NEXT:    vsetvli zero, zero, e64, m1, tu, ma
 ; CHECK-NEXT:    vfmv.s.f v8, fa0
+; CHECK-NEXT:    vfmv.s.f v9, fa0
 ; CHECK-NEXT:    vsetvli a0, zero, e64, m1, ta, ma
-; CHECK-NEXT:    vfadd.vv v8, v9, v8
+; CHECK-NEXT:    vfadd.vv v8, v8, v9
 ; CHECK-NEXT:    ret

In this one diff we have registers being renamed, instructions being shuffled around different vsetvlis being inserted. In other places we were seeing changes to the number of spills, sometimes more and sometimes less.

All of these observable side effects are intertwined with the multiple things we needed to change in our patch. And when there’s multiple functional changes in a patch, it’s not as obvious as to what line of code caused what line of the test diff to change. Compound that over 300 something test files and you have a patch that no one will be able to realistically review.

Splitting

The main tool for dealing with these massive patches is to split it up so you’re only changing one thing at a time. The first such change split off was the register allocation split:

  • [RISCV] Split regalloc between RVV and other (#72096)
  • [RISCV] default enable splitting regalloc between RVV and other (#72950)

This was nice and self contained, and the test diff was only around 130 lines. It was a natural first choice to split out from the patch since it was tangential to RISCVInsertVSETVLI itself, and even though this didn’t make a dent in the overall test diff we could rule out the register allocation split as one of the forces at play. Getting it out of the way early allowed us to stop thinking about it and save our mental effort for other things.

NFCs

The next step was to go ahead and carve out the NFC changes: No Functional Change changes. This is a common practice in LLVM and other projects where changes that don’t (intentionally) affect any observable behaviour are landed as patches labelled as NFC: things like refactoring or adding test cases.

Knowing that a patch isn’t supposed to have functional changes helps reasoning about it, and if it did end up accidentally changing some behaviour (which happens all the time) then it’s easier to bisect. We were able to find a good few NFC bits from the initial patch:

  • [NFC][LLVM][CodeGen] Move LiveDebugVariables.h into llvm/include/llvm/CodeGen (#88374)
  • [RISCV] Add statistic support for VSETVL insertion pass (#78543)
  • [RISCV] Check dead flag on VL def op in RISCVCoalesceVSETVLI. NFC (#91168)
  • [RISCV] Use an enum for demanded LMUL in RISCVInsertVSETVLI. NFC (#92513)
  • [NFC][RISCV] Keep AVLReg define instr inside VSETVLInfo (#89180)

Some of these NFC patches were just a matter of tightening existing assumptions about the code by adding invariants:

  • [RISCV] Add invariants that registers always have definitions. NFC (#90587)

Whilst others were about removing configurations to reduce the number of code paths:

  • [RISCV] Remove -riscv-insert-vsetvl-strict-asserts flag (#90171)

And as the code diff became smaller, some of the edge cases became more apparent. We refactored things to make them more explicit, which made it more clear when we were dealing with them in the main patch:

  • [RISCV] Split out VSETVLIInfo AVL states to be more explicit (#89964)

All of the above had the same common goal, doing the thinking up front to simplify things further down the line. But whilst it did make the code changes easier to reason about, we still had the very large, very complicated test diff to deal with.

Absorbing test diffs

At some point the NFC patches started to dry up, and we had to start tweezing apart the tangle of changes in the test diff.

Ideally each patch would contain only functional change, but that’s easier said than done, especially on a project like LLVM with so many moving parts. Often you’ll need to touch multiple passes to make one codegen change, and a lot of the time these changes will be interlinked: you need to change something in pass A, but on its own it results in regressions until you land a change in pass B.

We began by ignoring those tricky changes for now, and instead looked for things that could be landed as a isolated, incremental, net-positive (or at worst neutral) changes.

  • [RISCV] Convert implicit_def tuples to noreg in post-isel peephole (#91173)
  • [RISCV] Move RISCVDeadRegisterDefinitions to post vector regalloc (#90636)

Some of the test diffs were due to deficiencies in the existing code that just happened to surface when combined with the changes in our patch. For example, changes in the scheduling showed how we sometimes failed to merge vsetvlis when they were in a different order.

  • [RISCV] Unify getDemanded between forward and backwards passes in RISCVInsertVSETVLI (#92860)

Shuffling up the pipeline

A significant chunk of the test diffs were due to RISCVInsertVSETVLI being moved after other passes. For example, moving it past RISCVInsertReadWriteCSR and RISCVInsertWriteVXRM meant that vsetvlis were now inserted after csrwis and fsrmis.

diff --git a/llvm/test/CodeGen/RISCV/rvv/ceil-vp.ll b/llvm/test/CodeGen/RISCV/rvv/ceil-vp.ll
index 5b271606f08ab..aa11e012af201 100644
--- a/llvm/test/CodeGen/RISCV/rvv/ceil-vp.ll
+++ b/llvm/test/CodeGen/RISCV/rvv/ceil-vp.ll
@@ -15,8 +15,8 @@ define <vscale x 1 x half> @vp_ceil_vv_nxv1f16(<vscale x 1 x half> %va, <vscale
 ; CHECK-NEXT:    vfabs.v v9, v8, v0.t
 ; CHECK-NEXT:    vsetvli zero, zero, e16, mf4, ta, mu
 ; CHECK-NEXT:    vmflt.vf v0, v9, fa5, v0.t
-; CHECK-NEXT:    vsetvli zero, zero, e16, mf4, ta, ma
 ; CHECK-NEXT:    fsrmi a0, 3
+; CHECK-NEXT:    vsetvli zero, zero, e16, mf4, ta, ma
 ; CHECK-NEXT:    vfcvt.x.f.v v9, v8, v0.t
 ; CHECK-NEXT:    fsrm a0
 ; CHECK-NEXT:    vfcvt.f.x.v v9, v9, v0.t

We could take these relatively boring 9000-ish lines of test diff separately by moving RISCVInsertVSETVLI just after these passes.

  • [RISCV] Move RISCVInsertVSETVLI after CSR/VXRM passes (#91701)

This was one of the patches where the test diff wasn’t actually a win: It was just important to be able to understand the changes in isolation.

Splitting the pass itself

Another trick we used to isolate the diffs was to split the pass itself and move part of it past register allocation.

  • [RISCV] Separate doLocalPostpass into new pass and move to post vector regalloc (#88295)

This let us absorb the diffs for a specific function, doLocalPostpass, which were small enough to be individually reviewable.

This was unfortunately one of the functional changes that introduced a temporary regression, which wouldn’t be fixed until the rest of the work was landed. We justified it though since we were able to understand where it was coming from and how the final patch would fix it.

Big change, small diff/Small change, big diff

At this stage, the last remaining change was to do the actual move to after register allocation, which meant changing the pass to work on LiveIntervals since we would be out of SSA.

At first glance this appeared to be one indivisible chunk of work, but it turned out we could defer the changes from the pass reordering till later by only moving it up past phi elimination. That was enough to get us out of SSA and into LiveIntervals, which was ultimately the tricky bit that we wanted to be sure we got right.

  • [RISCV] Move RISCVInsertVSETVLI to after phi elimination (#91440)

This patch had most of the code changes, but a minimal test diff of around 300 lines. Being able to land the riskiest part with a small test diff allowed us to see the code that ended up being affected, which would have been completely hidden otherwise in the original 30k line diff.

After this, we were just left with the original patch which was now just a matter of moving RISCVInsertVSETVLI past register allocation:

  • [RISCV] Support postRA vsetvl insertion pass (#70549)
diff --git a/llvm/lib/Target/RISCV/RISCVTargetMachine.cpp b/llvm/lib/Target/RISCV/RISCVTargetMachine.cpp
index 5aab138dae408..d9f8222669cab 100644
--- a/llvm/lib/Target/RISCV/RISCVTargetMachine.cpp
+++ b/llvm/lib/Target/RISCV/RISCVTargetMachine.cpp
@@ -399,6 +406,8 @@ bool RISCVPassConfig::addRegAssignAndRewriteFast() {
 bool RISCVPassConfig::addRegAssignAndRewriteOptimized() {
   addPass(createRVVRegAllocPass(true));
   addPass(createVirtRegRewriter(false));
+  if (EnableVSETVLIAfterRVVRegAlloc)
+    addPass(createRISCVInsertVSETVLIPass());

Looking the numbers we were now were down to a 19k line test diff, still a considerable size but now the code changes were tiny.

This is the converse to #91440: we have a patch with minimal code changes but a large test diff, and this is also much easier to review. We can have confidence that all changes in the test diff come from a single functional change, which in our case this was the interaction of moving RISCVInsertVSETVLI past register allocation and unblocking the machine scheduler.

When and where to split

None of the above splits were obvious to us at first. For a change like this, the only realistic starting point was to just implement it all at first. Once we had a picture of what the destination should look like, we could start to identify some initial things to split off.

From then on out it’s mostly a matter of splitting, rebasing and repeating the cycle. New incision points will become clearer as both the code and test diffs are whittled down, and hopefully the cycle will eventually lead you to a patch small enough that you can digest everything that’s going on inside of it.

Splitting up big patches is not a new or rare idea, and it’s done all the time in LLVM. Take for example the ongoing work to replace getelementptr with ptradd, being split up into several smaller changes to canonicalize and massage the GEPs into shape before a new instruction is added:

  • [InstCombine] Canonicalize constant GEPs to i8 source element type (#68882)
  • [IR] Change representation of getelementptr inrange (#84341)
  • [ConstantFolding] Canonicalize constexpr GEPs to i8 (#89872)

Likewise the epic journey to remove debug intrinsics was carried out over multiple RFCs:

Being able to submit reviewable and landable patches is half the battle of collaborative open source software development. There’s no one-size-fits-all strategy and it often ends up being much more difficult than writing the actual code. But if you’re working on something big, hopefully some of the ideas in this post will be of inspiration to you.

Many thanks to Piyou and the reviewers who helped drive this forward, unblocking the way for further enhancements to vector codegen in the RISC-V backend.