# Vecim: A 289.13GOPS/W RISC-V Vector Co-Processor with Compute-in-Memory Vector Register File for Efficient High-Performance Computing

Yipeng Wang, Mengtian Yang, Chieh-pu Lo, Jaydeep P. Kulkarni
University of Texas at Austin, Circuit Research Lab







#### **Outline**

- Motivation
  - Efficiency gap of HPC
  - SRAM Compute-in-memory (CIM) application challenges
- Proposed Vecim Architecture
  - Overall architecture
  - CIM vector register file (VRF) and multiplication scheme
  - Data path and data flow
- Silicon prototype measurements
- Summary

# **Motivation 1: Efficiency Gap of HPC**



$$Efficiency = OPS/W = \frac{OP}{E_{compute} + E_{others}}$$

$$= \frac{E_{compute}}{E_{compute} + E_{others}} \uparrow \Big/ \frac{E_{compute}}{OP} \downarrow$$

Increase the proportion of Compute's energy

- Specialized instruction
- Domain specific hardware
- Datapath and memory optimization

- ...

Reduce the average energy consumed per compute operation ($E_{compute}/OP$)

- Technology scaling
- Lower precision
- Sparsity
- ...
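To make the decomposition concrete, the sketch below evaluates both forms of the efficiency expression and shows how each lever moves the result. The energy and operation counts are hypothetical illustration values, not measurements from this work, and the function names are made up for this example.

```python
import math


def efficiency(ops: float, e_compute: float, e_others: float) -> float:
    """Operations per joule, which is the same quantity as OPS/W."""
    return ops / (e_compute + e_others)


def efficiency_decomposed(ops: float, e_compute: float, e_others: float) -> float:
    """Same quantity, written as (compute share) / (energy per operation)."""
    compute_share = e_compute / (e_compute + e_others)  # lever 1: raise this
    energy_per_op = e_compute / ops                     # lever 2: lower this
    return compute_share / energy_per_op


# Hypothetical workload: 1000 operations, 2 J spent in compute, 8 J elsewhere.
base = efficiency(1000, 2.0, 8.0)
assert math.isclose(base, efficiency_decomposed(1000, 2.0, 8.0))

# Lever 1: cut non-compute energy, so compute takes a larger share.
more_compute_share = efficiency(1000, 2.0, 3.0)

# Lever 2: halve the energy of each compute operation.
cheaper_ops = efficiency(1000, 1.0, 8.0)

print(base, more_compute_share, cheaper_ops)
```

Both levers raise the result, which is why the slide lists them as separate directions.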

# **Motivation 1: Efficiency Gap of HPC**

$$Efficiency = OPS/W = \frac{OP}{E_{compute} + E_{others}}$$

$$= \frac{E_{compute}}{E_{compute} + E_{others}} \uparrow \Big/ \frac{E_{compute}}{OP} \downarrow$$

Increase the proportion of Compute's energy

- Specialized instruction
- Domain specific hardware
- Datapath and memory optimization
- In-memory vector processing (this work)

Reduce the average energy consumed per compute operation

- Technology scaling
- Lower precision
- Sparsity
- Reusing SRAM for compute (this work)

Explore the architectural opportunities that CIM provides for general-purpose hardware.

#### Motivation 2: SRAM CIM application challenges

Challenges:

- Large area footprint
- Low robustness: PVT variation of custom cells, IR drop
- Low frequency: longer WL/BL, large adder tree
- Accuracy loss: ADC, FP conversion/limited window
- Low programmability

This work:

- Use foundry 8T cell (modified layout shown only for illustration purposes)
- Intact cell array
- Pipelined digital CIM
- Near-memory FP support
- Embed CIM in a general-purpose architecture / ISA; show efficiency improvement
  - → RISC-V vector processor
  - → Large memory capacity in register files
  - → Target matrix multiplication
  - → Our key contribution

#### **Outline**

- Motivation
  - Efficiency gap of HPC
  - SRAM Compute-in-memory (CIM) application challenges
- Proposed Vecim Architecture
  - Overall architecture
  - CIM vector register file (VRF) and multiplication scheme
  - Data path and data flow
- Silicon prototype measurements
- Summary


#### **Vecim Overall Architecture**

- Based on the open-source Ara vector processor [Ara, TVLSI, 2020]
- Instructions come from the scalar CPU
- DRAM bandwidth of 64 bits per lane per cycle
- Vector Load-Store Unit (LSU)
- Vector Slide Unit



#### **Vecim Overall Architecture**

#### **Our Innovations**

- A 1R1W SRAM Vector Register File (VRF)
  - INT8/BF16/FP16 all-digital in-memory multiplication and near-memory addition
  - Double-rate-bit-parallel multiplication
  - Specialized instruction extension
- A dedicated vector sequencer
  - Light-weight out-of-order execution





- Decoupled read/write ports for higher throughput
- Lower Vmin to be compatible with core logic



Traditional CIM dataflow for neural networks vs. this work

- Traditional: activations are bit-decomposed across WL/BL/LCU/banks, either bit-serially or bit-parallel.
- This work: a vector processor only supports a register file with a bit-aligned data layout, so a new dataflow is needed.

Double-rate-bit-parallel multiplication

- Bit-wise parallel multiplication
  - Bitwise AND: BL multiplication
- In-memory shift
- Needs 8 cycles; 4 lanes, 8 banks each, 32 op/cycle peak
- Same throughput as a 64b FPU per lane

[Figure: partial-product array, one row $P_{7j}\,P_{6j}\,\cdots\,P_{0j}$ per multiplier bit $j$]

International Solid-State Circuits Conference

Double-rate-bit-parallel multiplication

- Accumulate two consecutive results in one cycle.
- 4 cycles in total; 64 op/cycle peak, 2X.

[Figure: Cycle 1 and Cycle 2 of the double-rate scheme. Operand bits $X_7 X_6 \cdots X_0$ meet the shifted $Y$ bits through a bitwise AND, and the partial-product rows $P_{7j} \cdots P_{0j}$ are accumulated. Labels: 13-bit adder; 14-bit and 16-bit registers.]
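The cycle counts above can be illustrated with a small behavioural model. This is a sketch under stated assumptions, not the circuit: it assumes unsigned 8-bit operands, takes $P_{ij}$ to be the AND of $X_i$ and $Y_j$, and models the in-memory shift as a left shift of each partial-product row. The function names and the `rows_per_cycle` parameter are hypothetical.

```python
def partial_product_row(x: int, y: int, j: int, width: int = 8) -> int:
    """Row j of the partial-product array: bit i equals X_i AND Y_j."""
    mask = (1 << width) - 1
    y_bit = (y >> j) & 1
    return (x & mask) if y_bit else 0


def cim_multiply(x: int, y: int, rows_per_cycle: int = 1, width: int = 8):
    """Shift-and-add multiply that consumes `rows_per_cycle` rows each cycle.

    rows_per_cycle=1 models the plain bit-parallel scheme,
    rows_per_cycle=2 models the double-rate scheme.
    Returns (product, cycles).
    """
    acc = 0
    cycles = 0
    for first_row in range(0, width, rows_per_cycle):
        last_row = min(first_row + rows_per_cycle, width)
        for j in range(first_row, last_row):
            # Shift row j into place, then accumulate it.
            acc += partial_product_row(x, y, j, width) << j
        cycles += 1
    return acc, cycles


single_rate = cim_multiply(200, 123, rows_per_cycle=1)
double_rate = cim_multiply(200, 123, rows_per_cycle=2)
print(single_rate, double_rate)
```

Accumulating two rows per cycle leaves the product unchanged and halves the cycle count from 8 to 4, which is the source of the 2X peak throughput.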

CIM VRF Circuit and dataflow

- Copy operand A to CIM BIT 0
  - In-memory inverted copy operation
- Keep operand B at knode
- Double-rate-bit-parallel multiplication
  - In-memory shift
  - BL multiplication

The critical delay path is similar to that of an SRAM read, with similar IR drop.



Floating point support with near memory adders
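The slides state that multiplication happens in memory and addition near memory. One plausible reading, assumed here and not confirmed by the slides, is that a BF16 multiply sends the two 8-bit significands through the same integer multiplier used for INT8 while the exponents are summed by a near-memory adder. The sketch below shows that split in software. It handles only normal numbers, truncates instead of rounding, and ignores overflow and underflow; all function names are hypothetical.

```python
import struct


def bf16_bits(x: float) -> int:
    """Top 16 bits of the float32 encoding (BF16 by truncation)."""
    return struct.unpack(">I", struct.pack(">f", x))[0] >> 16


def bf16_to_float(bits: int) -> float:
    """Expand a BF16 bit pattern back to a Python float."""
    return struct.unpack(">f", struct.pack(">I", bits << 16))[0]


def bf16_mul(a: int, b: int) -> int:
    """Multiply two normal BF16 values given as 16-bit patterns."""
    sign = ((a >> 15) ^ (b >> 15)) & 1
    exp_a, exp_b = (a >> 7) & 0xFF, (b >> 7) & 0xFF
    # 8-bit significands: hidden one plus 7 stored fraction bits.
    sig_a, sig_b = (a & 0x7F) | 0x80, (b & 0x7F) | 0x80

    product = sig_a * sig_b        # integer multiply (the in-memory part here)
    exponent = exp_a + exp_b - 127  # exponent add (the near-memory part here)

    if product & 0x8000:            # significand product is >= 2.0
        product >>= 1
        exponent += 1

    fraction = (product >> 7) & 0x7F  # drop the hidden one, truncate low bits
    return (sign << 15) | (exponent << 7) | fraction


result = bf16_to_float(bf16_mul(bf16_bits(1.5), bf16_bits(2.5)))
print(result)
```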



PPA and area overhead

~40% delay overhead. For the solution, see the backup slides.





## **Vector Sequencer**



#### asm capture for conv2d



- Three queues: memory load/store, CIM-related, and other arithmetic instructions
- Light-weight out-of-order execution
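The sketch below is a toy model of how three queues can give light-weight out-of-order issue. It rests on assumptions that the slides do not state: each queue issues in order, instructions in different queues may pass each other, every instruction completes in one cycle, and a register conflict (RAW, WAW, or WAR) with any older pending instruction blocks issue. The `Instr` type and register names are hypothetical.

```python
from collections import deque
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass(frozen=True)
class Instr:
    seq: int                      # position in program order
    kind: str                     # "mem", "cim", or "alu"
    dst: Optional[str]            # destination register, None for stores
    srcs: Tuple[str, ...] = ()    # source registers


def conflicts(older: Instr, younger: Instr) -> bool:
    """True if `younger` must wait for `older` because of a register hazard."""
    raw = older.dst is not None and older.dst in younger.srcs
    waw = older.dst is not None and older.dst == younger.dst
    war = younger.dst is not None and younger.dst in older.srcs
    return raw or waw or war


def schedule(program):
    """Return, per cycle, the list of instruction sequence numbers issued."""
    queues = {kind: deque() for kind in ("mem", "cim", "alu")}
    for ins in program:
        queues[ins.kind].append(ins)

    issue_log = []
    while any(queues.values()):
        pending = [ins for q in queues.values() for ins in q]
        issued = []
        for q in queues.values():
            if not q:
                continue
            head = q[0]
            blocked = any(o.seq < head.seq and conflicts(o, head) for o in pending)
            if not blocked:
                issued.append(head)
        for ins in issued:
            queues[ins.kind].popleft()
        issue_log.append([ins.seq for ins in issued])
    return issue_log


program = [
    Instr(0, "mem", "v1"),                 # load v1
    Instr(1, "mem", "v2"),                 # load v2
    Instr(2, "cim", "v3", ("v1", "v2")),   # CIM multiply-accumulate
    Instr(3, "alu", "v4", ("v5", "v6")),   # independent arithmetic
    Instr(4, "mem", None, ("v3",)),        # store v3
]
issue_log = schedule(program)
print(issue_log)
```

In this model the independent arithmetic instruction issues alongside the first load instead of waiting behind the CIM instruction, which is the kind of overlap that separate queues allow.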

#### **Outline**

- Motivation
  - Efficiency gap of HPC
  - SRAM Compute-in-memory (CIM) application challenges
- Proposed Vecim Architecture
  - Overall architecture
  - CIM vector register file (VRF) and multiplication scheme
  - Data path and data flow
- Silicon prototype measurements
- Summary

#### **Emerging Matrix Multiplication Application**

- Deep learning: CNN, Transformer
- Combinatorial optimization: SAT, Ising, ILP
  - Solve Max-SAT using matrix multiplication [David Warde-Farley, DeepMind, arXiv, 2023]
- Security: Kyber, CKKS, TFHE
  - Vector-matrix multiplication in RLWE; TFHE key switching
- Graphics: NeRF, 3DGS
  - Matrix multiplication in NeRF: MLP; 3DGS: view transformation

#### **GEMM / MVM algorithm and Instruction extension**





Matrix multiplication:



Reuse A: Instructions bottleneck


#### **GEMM / MVM algorithm and Instruction extension**

#### Instruction extension

FP16/BF16:



INT8:



**Next MAC ins:** 

#### Throughput measurement



Average throughput measurements running matrix multiplication tasks.

#### Efficiency measurement with average power



\*The min and max points correspond to power ∝ tech² (pessimistic) and power ∝ tech (optimistic) normalization. Power is measured as the average of the power curve while running matrix multiplication tasks. This work does not count CPU power.
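The two normalization rules in the footnote can be written as a single scaling function. This is a minimal sketch: the function name, the parameter names, and the example input of 300 GOPS/W at 22nm are hypothetical and are not taken from any of the compared designs.

```python
def normalize_efficiency(eff_gops_per_w: float,
                         node_nm: float,
                         target_nm: float = 65.0,
                         exponent: float = 2.0) -> float:
    """Rescale energy efficiency from `node_nm` to `target_nm`.

    Assumes power scales as tech**exponent at fixed throughput, so
    efficiency scales by the inverse factor.
    exponent=2 is the pessimistic rule, exponent=1 the optimistic one.
    """
    power_scale = (target_nm / node_nm) ** exponent
    return eff_gops_per_w / power_scale


# Hypothetical design reporting 300 GOPS/W in a 22nm process.
pessimistic = normalize_efficiency(300.0, 22.0, exponent=2.0)
optimistic = normalize_efficiency(300.0, 22.0, exponent=1.0)
print(pessimistic, optimistic)
```

For a design in a more advanced node than 65nm, the quadratic rule gives the lower normalized efficiency, so the two results bracket the range plotted between the min and max points.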

#### Die shot and chip summary



| Chip summary      |                                |
|-------------------|--------------------------------|
| Technology        | 65nm                           |
| Supply voltage    | 1V                             |
| Die size          | 2x2 mm²                        |
| Frequency         | 250MHz                         |
| Precision         | All (INT8/BF16/FP16 in memory) |
| VRF size          | 4 lanes x 8 banks x 4kb        |
| Bit cell area     | 1.658 um²                      |
| Performance       | 31.8 / 25.3 GOPS               |
| Energy efficiency | 289.13 / 230.10 G(FL)OPS/W     |
| Area efficiency   | 7.95 / 6.33 GOPS/mm²           |
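The entries in the chip summary can be cross-checked against each other. The sketch below recomputes throughput per MHz and area efficiency from the reported performance, frequency, and die size. It also derives an implied average power, which the slides do not state; that figure assumes performance and energy efficiency were taken at the same operating point. Variable and function names are hypothetical.

```python
FREQ_MHZ = 250.0
DIE_AREA_MM2 = 2.0 * 2.0
PERF_GOPS = {"INT8": 31.8, "FP16": 25.3}
EFF_GOPS_PER_W = {"INT8": 289.13, "FP16": 230.10}


def derived_metrics(precision: str) -> dict:
    """Recompute secondary metrics from the primary chip-summary entries."""
    perf = PERF_GOPS[precision]
    return {
        "gops_per_mhz": perf / FREQ_MHZ,
        "gops_per_mm2": perf / DIE_AREA_MM2,
        # Assumption: same operating point for performance and efficiency.
        "implied_power_mw": 1000.0 * perf / EFF_GOPS_PER_W[precision],
    }


int8 = derived_metrics("INT8")
fp16 = derived_metrics("FP16")
for name, metrics in (("INT8", int8), ("FP16", fp16)):
    print(name, {k: round(v, 4) for k, v in metrics.items()})
```

Dividing performance by the 4 mm² die area reproduces the reported area efficiency, and dividing by 250 MHz reproduces the GOPS/MHz throughput listed in the comparison table.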

#### **Comparison table**

|                                                                       | Ara [2]                   | ISSCC 2019 [5]                                    | VLSI 2023 [6]                                     | This work                     |
|-----------------------------------------------------------------------|---------------------------|---------------------------------------------------|---------------------------------------------------|-------------------------------|
| Technology                                                            | GF 22nm                   | TSMC 28nm                                         | TSMC 65nm                                         | TSMC 65nm                     |
| ISA                                                                   | RISC-V                    | Custom                                            | Custom                                            | RISC-V                        |
| Category                                                              | General purpose processor | Custom CIM design with general purpose operations | Custom CIM design with general purpose operations | General purpose processor     |
| CIM type                                                              | -                         | Custom 8T                                         | Custom 8T/9T                                      | Foundry 8T                    |
| Bit precision                                                         | All                       | All (potentially)                                 | INT8/32                                           | All (INT8/BF16/FP16 enhanced) |
| Design level                                                          | Processor (simulation)    | Macro + ctrl                                      | Macro + ctrl                                      | Co-processor                  |
| Peak performance INT8/FP16<sup>\*1</sup> (CLK frequency is different) | 78.4 / 39.2 G(FL)OPS      | 40.5 / ~0.5 G(FL)OPS                              | 2.3<sup>\*4</sup> / X GOPS                        | 31.8 / 25.3 G(FL)OPS          |
| Throughput (GOPS/MHz)                                                 | 0.063 / 0.031             | 0.085 / ~                                         | 0.012 / X                                         | 0.127 / 0.101                 |
| Energy efficiency INT8/FP16 (power normalized to 65nm<sup>\*2</sup>)  | 34.64 / 17.32 G(FL)OPS/W  | 439.78 / ~5 G(FL)OPS/W                            | 473.35 (GP mode), 7620 (DNN mode) / X GOPS/W      | 289.13 / 230.10 G(FL)OPS/W    |

\*1: 256x256 matrix multiplication, unless noted. \*2: Uses power ∝ tech²; this may be pessimistic for advanced nodes. \*3: This is optimistic since there are no pads. \*4: Calculated as 1:1 mul/add efficiency x reported CPU mode power.

#### **Outline**

- Motivation
  - Efficiency gap of HPC
  - SRAM Compute-in-memory (CIM) application challenges
- Proposed Vecim Architecture
  - Overall architecture
  - CIM vector register file (VRF) and multiplication scheme
  - Data path and data flow
- Silicon prototype measurements
- Summary

#### **Summary**

- Demonstrates SRAM compute-in-memory in the register file of a RISC-V vector processor.
- The 1R1W 8T SRAM register file uses a foundry cell with digital CIM and a near-memory compute unit.
- Achieves 289.13 GOPS/W and 7.95 GOPS/mm² for INT8, and 230.10 GFLOPS/W and 6.33 GOPS/mm² for FP16.

## Acknowledgments

- TSMC university shuttle support
- UT ECE iMAGINE consortium

# Thank you for your attention!

#### **Demo system**









#### Solution for slide 24

- The write to the CIM bit happens in the next cycle.
- An inverted copy is guaranteed not to happen in two consecutive cycles.
- No timing overhead now.
- Some area overhead.



## **Output CLKed latch**



#### **BL/shifter MUX pattern for INT8/BF16/FP16**

FP16 requires reconfiguring the 8-bit ring shifter to 10 bits. The MUX pattern is shown here (GBL is similar).



BF16: S1 closed S2 open, C[8] out disabled;

FP16: S1 open S2 closed, E read out for add.

# Reuse CIM Bits as renaming physical register

The CIM bits can be used as physical registers for renaming. The normal read/write function is shown here.



- 1. Assert mrst, reset to 1
- 2. Discharge GBL based on data and open mwe

- 1. Discharge GBL and open cimGate
- 2. Toggle cimCLK

Reading and writing the CIM bits has lower energy and higher performance. This design does not implement register renaming.

#### Performance scaling simulation

We show the scalability of this design by increasing the number of lanes and the VRF capacity. Evaluation of a multicore design is out of scope.





This design scales well with the number of lanes for both large and small matrix sizes.

Given enough memory bandwidth, this design scales straightforwardly with VRF size.