Custom RISC-V instructions in Verilog

Post backdated to the time of the project.

This was a course project for my Master's Degree in Computer Engineering at the University of Maryland.

For my Verilog course project, I chose a recent paper and built on it. The paper was "Extending the RISC-V ISA for exploring advanced reconfigurable SIMD instructions".

The paper

The authors' big idea is roughly that vector instructions are great, but only if the ISA contains one that is useful for your workload. There are lots of different variations on vector instructions that try to meet different needs. The authors reference the (then in-progress) RISC-V V standard (the vector extension), which at the time had about 750 different instructions in its GNU toolchain implementation.

The authors' idea is that a CPU could have some FPGA fabric next to the ALU, and the user could load their own custom instructions into the FPGA. This would allow the user to define their own SIMD instructions that are tailored to their workload.

In order to demonstrate this approach, the authors created a new RISC-V softcore in Verilog, added a new non-standard vector instruction format, and created some custom SIMD instructions. They then benchmarked their custom instructions against pure C software implementations. Their example custom instructions were a merge sort instruction and a prefix sum instruction.

In order to benchmark, they first had to patch the riscv-gnu-toolchain to accept their custom instructions. They then wrote some C++ with inline assembly using the custom instructions and compiled it with the patched toolchain to get a binary. Finally, they wrote a Verilog testbench that loads the binary into simulated memory and executes it on their custom RISC-V softcore. They measured the cycle count and instruction count of the custom instructions and compared them against a pure C software implementation running on the same softcore.
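The simulation side of that flow is conceptually simple. Below is a minimal sketch of the idea, assuming the compiled binary has been converted to a hex file; the names riscv_softcore and program.hex are placeholders of mine, not names from the authors' repo.

`timescale 1ns / 1ps

module tb_softcore;
  reg clk = 0;
  reg rst = 1;

  // Simulated memory, preloaded from the compiled program
  // (converted to a hex file, e.g. with objcopy).
  reg [31:0] mem [0:16383];
  initial $readmemh("program.hex", mem);

  // Placeholder core instance; the real core exposes memory ports plus the
  // cycle and retired-instruction counters used for the benchmarks.
  // riscv_softcore dut (.clk(clk), .rst(rst) /* , memory ports */);

  always #5 clk = ~clk;   // free-running simulation clock

  initial begin
    #20 rst = 0;          // release reset after a couple of cycles
    #1000000 $finish;     // stop after a fixed simulation budget
  end
endmodule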

All of this is described in their paper, and the source code is published on GitHub.

The authors promise that “efficient FPGA-based instructions can be developed with few low-level lines of code”. I found that it’s a bit more than a few lines, but it is surprisingly approachable!

My work

I extended the paper by adding two new instructions in the same style as the authors' and benchmarking them against pure C software implementations in the same manner as the original paper.

I made a selection sort instruction (the professor suggested that one) and an MD5 hash instruction (I wanted to do something more interesting).

I presented the design of my custom instructions and demonstrated the benchmarks live for my final presentation in the class.

All of my code is on GitHub. Note that I took the paper code and modified it for my experiments, so much of the code is taken directly from the paper authors' repo.

Here are the slides used in the presentation and the benchmark results. I've split the slides into several sets. Each can be clicked through interactively.

Selection sort instruction

The authors implemented their merge sort in two custom instructions. One sorts 8 ints and the other merges two sets of 8 ints. I replaced the first instruction with a selection sort instruction that sorts 8 ints.

This is obviously going to be worse than merge sort, but it's easy to understand and implement, and the professor insisted I start with something easy.


This series of slides demonstrates how to implement selection sort using compare-and-swap modules. Note that I'm not talking about the atomic compare-and-swap instruction, but about a simple hardware module that swaps two ints if the first is greater than the second.

As you click through the slides, you can see that the ints surrounded by red boxes are the ones being compared and swapped. At the end of these slides, the lowest int is in the top box.

The grid illusion is a bonus side effect of my questionable slide design choices.
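In Verilog, a compare-and-swap module like the one in the slides is only a few lines. Here is my own minimal sketch (the parameter and port names are illustrative, not taken from the repo):

module compare_and_swap #(
  parameter WIDTH = 32
) (
  input  wire [WIDTH-1:0] a_in,
  input  wire [WIDTH-1:0] b_in,
  output wire [WIDTH-1:0] lo_out,  // smaller of the two inputs
  output wire [WIDTH-1:0] hi_out   // larger of the two inputs
);
  // Swap only when the inputs are out of order.
  wire swap = a_in > b_in;
  assign lo_out = swap ? b_in : a_in;
  assign hi_out = swap ? a_in : b_in;
endmodule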


Read the slide, then click to the next one to see the segue into the pipelined implementation of the selection sort instruction.


These slides visualize how the instruction can start searching for the items that belong in the lower positions before it has finished finding the smallest item. Each color of box represents a set of compare-and-swaps that can happen in parallel.


Note that this still isn't as good as the merge sort implemented by the paper authors. Their instruction uses 11 cycles to sort 8 ints, while this selection sort uses 13 cycles.
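The pipelining idea is that compare-and-swaps on disjoint pairs can share a clock cycle: a stage performs several of them combinationally and registers the results. A rough illustrative sketch of one such stage is below (my own simplification; the real instruction spreads the comparisons for all eight positions across its 13 cycles).

module cas_stage #(
  parameter WIDTH = 32
) (
  input  wire             clk,
  input  wire [WIDTH-1:0] a_in, b_in,   // first independent pair
  input  wire [WIDTH-1:0] c_in, d_in,   // second independent pair
  output reg  [WIDTH-1:0] a_out, b_out,
  output reg  [WIDTH-1:0] c_out, d_out
);
  always @(posedge clk) begin
    // The pairs are disjoint, so both compare-and-swaps fit in one cycle.
    {a_out, b_out} <= (a_in > b_in) ? {b_in, a_in} : {a_in, b_in};
    {c_out, d_out} <= (c_in > d_in) ? {d_in, c_in} : {c_in, d_in};
  end
endmodule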


Here are the results of the benchmark. The selection sort instruction is compared against the merge sort instruction from the paper and against the C standard library's qsort.

N: 16
MergeSIMDSort results:     cycle count: 281,  instruction count: 160,  CPI: 1.75, 34.16 MB/s@150MHz
SelectionSIMDSort results: cycle count: 286,  instruction count: 160,  CPI: 1.78, 33.56 MB/s@150MHz
QSort results:             cycle count: 3434, instruction count: 2087, CPI: 1.64,  2.79 MB/s@150MHz
Speedup from MergeSIMDSort over QSort:     12.22
Speedup from SelectionSIMDSort over QSort: 12.00
Error code: 0!

N: 128
MergeSIMDSort results:     cycle count: 5445,  instruction count: 2697,  CPI: 2.01, 14.10 MB/s@150MHz
SelectionSIMDSort results: cycle count: 5666,  instruction count: 2697,  CPI: 2.10, 13.55 MB/s@150MHz
QSort results:             cycle count: 48596, instruction count: 29839, CPI: 1.62,  1.58 MB/s@150MHz
Speedup from MergeSIMDSort over QSort:     8.92
Speedup from SelectionSIMDSort over QSort: 8.57
Error code: 0!

N: 1024
MergeSIMDSort results:     cycle count: 60723,  instruction count: 29111,  CPI: 2.08, 10.11 MB/s@150MHz
SelectionSIMDSort results: cycle count: 60832,  instruction count: 29111,  CPI: 2.08, 10.09 MB/s@150MHz
QSort results:             cycle count: 543405, instruction count: 335074, CPI: 1.62,  1.13 MB/s@150MHz
Speedup from MergeSIMDSort over QSort:     8.94
Speedup from SelectionSIMDSort over QSort: 8.93
Error code: 0!

My selection sort is much faster than qsort in the benchmarks, but it's still slower than the merge sort from the paper, as expected.

MD5 instruction

After finishing the selection sort experiment, I wanted to do something a bit more interesting, but still approachable for a novice like myself. I chose to implement an MD5 hash instruction. I know that MD5 is not secure for cryptographic purposes, but it's still useful for error checking.

These two slides visualize which part of the MD5 algorithm I implemented in hardware. I specifically implemented the processing for a single 512-bit chunk of a message.


This series of slides breaks down the processing for one chunk into the specified 64 rounds and shows how the rounds are each just slight variations on the same theme. I essentially only need to implement the four functions shown in the slides and then wire them up so that they are executed in the correct order.

Each round is implemented as taking one cycle, so a single chunk takes 64 cycles to process in total. This might be a bit optimistic, since the work in a round is more complex than your average single-cycle instruction.
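The four round functions come straight from the MD5 specification (RFC 1321) and map naturally onto combinational logic. A sketch of how they might look in Verilog (my own illustrative module, not the exact code from my repo):

module md5_round_fn (
  input  wire [1:0]  round_type,  // 0: rounds 0-15, 1: 16-31, 2: 32-47, 3: 48-63
  input  wire [31:0] b, c, d,     // the three state words the function mixes
  output reg  [31:0] f
);
  always @* begin
    case (round_type)
      2'd0: f = (b & c) | (~b & d);   // F
      2'd1: f = (d & b) | (~d & c);   // G
      2'd2: f = b ^ c ^ d;            // H
      2'd3: f = c ^ (b | ~d);         // I
    endcase
  end
endmodule

The rest of the per-round work (adding the message word and round constant, the left rotate, and shuffling the state words) wraps around this and is what gets sequenced over the 64 cycles.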


Here are the results of the benchmark. The MD5 instruction is compared to a pure C software implementation for processing a single chunk.

Calculating the MD5 hash of "Hello ENPM808! This is my MD5 implementation in verilog"
Pre-padded message block:
6c6c6548 4e45206f 30384d50 54202138 20736968 6d207369 444d2079 6d692035 6d656c70 61746e65 6e6f6974 206e6920 69726576 80676f6c 000001b8 00000000

MD5 normal C result:  3bb07e3f d905e922 f4eb2dad f41c8866
MD5 normal C took  2359 cycles and 2162 instructions. CPI: 1.09

MD5 with SIMD result: 3bb07e3f d905e922 f4eb2dad f41c8866
MD5 with SIMD took   78 cycles and   14 instructions. CPI: 5.57

The MD5 instruction vastly outperforms the pure C software implementation.


This was a learning project for me and I'm happy with the results. I learned a lot about Verilog, RISC-V, and SIMD instructions. I also learned a lot about the MD5 algorithm and how it works. I know that my particular instructions may not be useful in practice, but the process of learning how to create them and how they fit within the rest of the CPU was valuable to me.