# All-Digital Time-Domain CNN engine using Bidirectional Memory Delay Lines for Energy Efficient Edge Computing

<u>Aseem Sayal</u>, Shirin Fathima, S.S. Teja Nibhanupudi, Jaydeep P. Kulkarni

Department of Electrical and Computer Engineering, The University of Texas at Austin, TX



#### Outline

- Motivation
- Concept of Time-domain MAC
- Proposed CNN Engine Architecture
- Chip Implementation and Measurements
- Comparison with Prior Work
- Summary

# **Need: Energy Efficient Edge Computing**



#### **Applications**

- Image Classification
- Speech Recognition
- Object Detection

Energy efficient edge processing required for privacy and minimal data transmission

# **Convolutional Neural Network (CNN)**

- State-of-the-art classification accuracy for image/speech recognition
- Multiple filters to extract specific features
- Memory dominant and compute intensive
- Power hungry MAC operations in convolution

Multiply-Accumulate-Average (MAV)





Ref: LeCun et al., "Gradient-based learning applied to document recognition," Proc. of the IEEE, 1998




**Time domain:** multi-bit digital bit stream encoded as a PWM signal – smaller C<sub>DYN</sub> and toggle activity, and ultra-low-voltage operation, resulting in low power; no need for multiple clocks or frequency modulators

- Digital domain: multi-bit digital vector – high C<sub>DYN</sub> and toggle activity, and consequently high switching power
- Analog voltage domain: continuously varying voltage signal – limited voltage scalability and need for power- and area-intensive ADCs/DACs
- Frequency domain: frequency-varying signal – need for accurate frequency generators/modulators
  - → limits performance scalability

# Time Domain Approach vs. Digital Domain



PPA – Power Performance Area Product

PNR – Place and Route

## **Prior Work: Analog Voltage Approaches**



# **Prior Work: Digital, Time and Frequency domain**






MAC = 123(1) + 60(0) + 43(1) + 128(0) + 255(1) + 16(1) + 89(0) + 209(0) + 76(1)




DTC – Digital to Time Converter
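As a rough illustration of the MAC example above, the sketch below models each 8-bit input as a pulse whose width is the input value times a unit delay, and lets the 1-bit weight decide whether that pulse is accumulated. The names `T0_NS`, `digital_to_time`, and `time_domain_mac` are hypothetical, and the unit delay value is an arbitrary placeholder rather than a measured figure.

```python
# Behavioral sketch of a time-domain MAC with 0/1 weights (illustrative only).
T0_NS = 1.0  # assumed unit delay t0 in nanoseconds (placeholder value)


def digital_to_time(value, t0=T0_NS):
    """DTC model: map an 8-bit input to a pulse width of value * t0."""
    if not 0 <= value <= 255:
        raise ValueError("input must fit in 8 bits")
    return value * t0


def time_domain_mac(inputs, weights, t0=T0_NS):
    """Accumulate the pulse widths selected by 1-bit weights.

    Returns the accumulated delay and the MAC value recovered from it.
    """
    total_delay = 0.0
    for x, w in zip(inputs, weights):
        if w:  # weight 1 passes the pulse on, weight 0 blocks it
            total_delay += digital_to_time(x, t0)
    return total_delay, round(total_delay / t0)


inputs = [123, 60, 43, 128, 255, 16, 89, 209, 76]
weights = [1, 0, 1, 0, 1, 1, 0, 0, 1]
delay_ns, mac = time_domain_mac(inputs, weights)
print(mac)  # 513
```

In this model the MAC result is simply the accumulated delay expressed in units of t0, so the nine-term example evaluates to 123 + 43 + 255 + 16 + 76 = 513.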

© 2019 IEEE International Solid-State Circuits Conference

14.4: All-Digital Time-Domain CNN engine using Bi-directional Memory Delay Lines for Energy Efficient Edge Computing





















#### **PWM Pulse Generation & Selection**






### **Bi-directional Memory Delay Line**






# MDL Approach inspired by Time Register






#### **Bi-directional Memory Delay Line Unit**



#### MDL Unit – Delay Line (Positive weight)



#### MDL Unit – Delay Line (Negative weight)



#### **MDL Unit – Memory Line**



No tristate cells -> State of the delay line is retained in memory phase




# **Bi-directional MDL Implementation**















## **Mitigating Process Variations**



Switch S9 value == SIGN, Switch S10 value == (SIGN)'

A calibration unit is added to all 4 MDLs for each filter. Delay mismatch among the 4 MDLs of a filter, caused by process variations, is compensated by tuning the cal\_bit[0:2] and cal\_enable signals.

## Increasing Throughput: Speedup modes

[Figure: PWM pulse taps T0 to T15 for the MSB phase (4 MSBs, multiples of 16t<sub>0</sub> up to 240t<sub>0</sub>) and the LSB phase (4 LSBs, up to 15t<sub>0</sub>), shown for precision (speedup) modes 1x, 4x, 8x, and 16x]

| Speedup Mode | 1x | 4x | 8x | 16x |
|--------------|----|----|----|-----|
| Input clock period | 2t<sub>0</sub> | 8t<sub>0</sub> | 16t<sub>0</sub> | 32t<sub>0</sub> |
| MAC clock period | 256t<sub>0</sub> | 256t<sub>0</sub> | 256t<sub>0</sub> | 256t<sub>0</sub> |
| # input clock cycles per MAC clock cycle | 128 | 32 | 16 | 8 |
| Quantization error (input: 0-255) | 0 | ±2 | ±4 | ±8 |
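The cycle counts in the speedup-mode table follow from dividing the MAC clock period by the input clock period. The short sketch below reproduces that arithmetic; the dictionary and function names are hypothetical, and the 24 MHz input clock is the value listed in the performance summary later in the deck.

```python
# Speedup-mode arithmetic, with periods expressed in units of t0.
MAC_CLOCK_PERIOD = 256  # MAC clock period in t0 units, identical for all modes
INPUT_CLOCK_PERIOD = {"1x": 2, "4x": 8, "8x": 16, "16x": 32}  # from the table
INPUT_CLOCK_MHZ = 24.0  # input clock frequency from the performance summary


def input_cycles_per_mac(mode):
    """Number of input clock cycles that fit into one MAC clock cycle."""
    return MAC_CLOCK_PERIOD // INPUT_CLOCK_PERIOD[mode]


cycles = {mode: input_cycles_per_mac(mode) for mode in INPUT_CLOCK_PERIOD}
mac_clock_mhz = {mode: INPUT_CLOCK_MHZ / n for mode, n in cycles.items()}

print(cycles)         # {'1x': 128, '4x': 32, '8x': 16, '16x': 8}
print(mac_clock_mhz)  # {'1x': 0.1875, '4x': 0.75, '8x': 1.5, '16x': 3.0}
```

Fewer input clock cycles per MAC cycle means a coarser pulse-width grid, which is consistent with the quantization error in the table growing from 0 at 1x to ±8 at 16x.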

## **Proposed CNN Engine Implementation**




#### **Overall Dataflow Diagram**






## **LeNet-5 Architecture & Parameters**



Ref: LeCun et al., "Gradient-based learning applied to document recognition," Proc. of the IEEE, 1998

| Parameters for LeNet-5    | C1           | C3           |
|---------------------------|--------------|--------------|
| Filter Size               | 5×5×1×6      | 5×5×6×16     |
| Input/Filter bit width    | 8 bits/1 bit | 8 bits/1 bit |
| Input Size                | 32×32×1      | 14×14×6      |
| Output Size               | 14×14×6      | 5×5×16       |
| #Filters                  | 6            | 16           |
| #Operations/convolution\* | (25×4×6)×2   | (150×4×16)×2 |

\*Assuming 1 Multiply-Accumulate-Average, and 1 Pooling = 2 operations
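The operation counts in the last table row can be evaluated directly. The sketch below is only a worked calculation: the function and parameter names are hypothetical, and reading the factor of 4 as the four MDLs per filter is an assumption that the slides do not state explicitly.

```python
# Operation counts per convolution for LeNet-5 layers C1 and C3.
OPS_PER_MAV_AND_POOL = 2  # footnote: 1 Multiply-Accumulate-Average + 1 pooling


def ops_per_convolution(filter_elements, mdls_per_filter, num_filters):
    """Evaluate (filter_elements * mdls_per_filter * num_filters) * 2."""
    return filter_elements * mdls_per_filter * num_filters * OPS_PER_MAV_AND_POOL


# The factor 4 is taken as given from the table (assumed: 4 MDLs per filter).
c1_ops = ops_per_convolution(filter_elements=5 * 5 * 1, mdls_per_filter=4, num_filters=6)
c3_ops = ops_per_convolution(filter_elements=5 * 5 * 6, mdls_per_filter=4, num_filters=16)
print(c1_ops, c3_ops)  # 1200 19200
```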


- Hardware computations: convolution, averaging, and pooling (fixed point)
- Software computations: FCN (16-bit floating point)

## 40nm test-chip die micrograph



### **MDL Functionality Demonstration**



➤ EN == 1 → MDL acts as a delay line

➤ EN == 0 → MDL acts as a memory line
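The two EN modes, together with the positive-weight and negative-weight delay-line configurations described earlier, can be summarized in a toy behavioral model. This is a sketch under stated assumptions, not the circuit: the class `BidirectionalMDL` and its method are hypothetical, and the stored state is abstracted as a net edge position in units of t0.

```python
class BidirectionalMDL:
    """Toy model of a bi-directional memory delay line (illustrative only)."""

    def __init__(self):
        self.position = 0  # net accumulated delay, in units of t0 (assumed abstraction)

    def apply(self, pulse_width, sign, en=1):
        """Update the stored position for one input pulse.

        en == 1: delay-line mode, move by pulse_width in the direction of sign.
        en == 0: memory mode, the stored position is retained.
        """
        if sign not in (1, -1):
            raise ValueError("sign must be +1 or -1")
        if en == 1:
            self.position += sign * pulse_width
        return self.position


mdl = BidirectionalMDL()
mdl.apply(123, sign=+1)        # positive weight: advance by 123 t0
mdl.apply(999, sign=+1, en=0)  # memory phase: state is held, input ignored
mdl.apply(60, sign=-1)         # negative weight: move back by 60 t0
mdl.apply(43, sign=+1)
print(mdl.position)  # 106
```

Holding the position while EN is 0 is what lets successive pulses accumulate into one signed sum, here 123 - 60 + 43 = 106.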

## **Pulse Generator Functionality Demonstration**




### Speedup (1x-16x) Modes Demonstration






# Measured Accuracy @LeNet-5








- Convolution Layers: 8-bit fixed point inputs (Hardware/Test-chip)
- FCN and Software implementation: 16-bit floating point inputs and weights

#### Classification Accuracy

#### Weights (+1/-1):

- Software: 98.81%
- Test-chip:
  - ✓ 98% for speedup modes 1x-8x
  - ✓ 97% for speedup mode 16x

#### Weights (0/+1):

- Software: 97.62%
- Test-chip:
  - ✓ 97% for speedup modes 1x-8x
  - ✓ 96% for speedup mode 16x

Hardware results for 100 test images

## Measured Accuracy vs. Voltage



## Measured Throughput vs. Voltage



## Measured Energy Efficiency vs. Voltage



## Simulation Results for AlexNet



- Dataset: subset of the ImageNet dataset (classes: cats and dogs)
- Convolution and pooling: 8-bit fixed-point inputs and weights
- Fully connected network layers: 16-bit floating-point inputs and weights
- > 13% accuracy drop observed in simulation
  - Multiple MDLs are used for the multi-bit-weight dot product
  - Residue from all MDLs → higher accuracy loss

## **Performance Summary**

| LeNet-5 Results/Metrics     | C1, 1x | C1, 4x | C1, 8x | C1, 16x | C3, 1x | C3, 4x | C3, 8x | C3, 16x |
|-----------------------------|--------|--------|--------|---------|--------|--------|--------|---------|
| Input clock frequency (MHz) | 24.0   | 24.0   | 24.0   | 24.0    | 24.0   | 24.0   | 24.0   | 24.0    |
| MAC clock frequency (MHz)   | 0.19   | 0.75   | 1.50   | 3.00    | 0.19   | 0.75   | 1.50   | 3.00    |
| Convolution Cycle Time (µs) | 149.3  | 37.33  | 18.67  | 9.33    | 842.67 | 210.67 | 105.33 | 52.67   |
| Operating Voltage (V)       | 0.537  | 0.537  | 0.537  | 0.537   | 0.537  | 0.537  | 0.537  | 0.537   |
| Power (µW)                  | 28.67  | 28.67  | 28.67  | 28.67   | 30.17  | 30.17  | 30.17  | 30.17   |
| Throughput (GOPS)           | 0.008  | 0.032  | 0.064  | 0.128   | 0.023  | 0.091  | 0.183  | 0.365   |
| Energy Efficiency (TOPS/W)  | 0.29   | 1.16   | 2.33   | 4.65    | 0.76   | 3.02   | 6.04   | 12.08   |

C1 and C3 denote the two convolution layers; 1x to 16x are the speedup modes.

Peak Energy Efficiency: 12.08 (4.65) TOPS/W for the C3 (C1) layer at 537 mV

Peak Throughput: 0.365 (0.128) GOPS for the C3 (C1) layer at 537 mV
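The peak C3 figures can be cross-checked from the other table entries. The sketch below assumes that throughput is the number of operations per convolution divided by the convolution cycle time, and that energy efficiency is throughput divided by power; the slides do not spell out these formulas, so treat this as one plausible reading. Variable names are hypothetical.

```python
# Cross-check of the C3 column at 16x speedup, using values from the tables.
ops_per_convolution = (150 * 4 * 16) * 2  # 19200 operations for layer C3
cycle_time_s = 52.67e-6                   # convolution cycle time at 16x
power_w = 30.17e-6                        # power at 0.537 V

# Assumed definitions: throughput = ops / time, efficiency = throughput / power.
throughput_ops_per_s = ops_per_convolution / cycle_time_s
throughput_gops = throughput_ops_per_s / 1e9
efficiency_tops_per_w = throughput_ops_per_s / power_w / 1e12

print(round(throughput_gops, 3), round(efficiency_tops_per_w, 2))  # 0.365 12.08
```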


## **Comparison with prior approaches**

| Reference           | Tech. (nm) | Circuit Type | Input/Weight | Chip Size (mm²) | Pooling | VCC | Cap. or ADCs | Accuracy | Throughput (GOPS) | Power (µW) | Energy Efficiency (TOPS/W) |
|---------------------|------------|--------------|--------------|-----------------|---------|-----|--------------|----------|-------------------|------------|----------------------------|
| ISSCC'18 [1]        | 65         | Analog       | 6/1bits      | 0.067           | No      | No  | Yes          | 96.0%    | 10.70             | 380.7      | 28.10                      |
| ISSCC'18 [2]        | 65         | Analog       | 8bits        | 1.44            | No      | No  | Yes          | 96.0%    | -                 | -          | 3.125                      |
| ISSCC'16 [3]        | 65         | Digital      | 16bits       | 16.00           | Yes     | No  | No           | 98.3%    | 64                | 4.51E+4    | 1.42                       |
| VLSI'16 [4]         | 40         | Digital      | 6/4bits      | 2.40            | No      | Yes | No           | 98.0%    | 102               | 3.90E+4    | 2.60                       |
| CICC'17 [5]         | 65         | Time         | 8/3bits      | 0.24            | No      | Yes | Yes          | 91.0%    | 0.396             | 2.05E+4    | 0.019                      |
| ISSCC'18 [6]        | 55         | Time         | 6/6bits      | 3.125           | No      | Yes | No           | -        | 2.152             | 690        | 3.12                       |
| This work (MDL CNN) | 40         | Time         | 8bits/1bit\* | 0.124           | Yes     | Yes | No           | 97.0%    | 0.365             | 30.17      | 12.08                      |

\*Scalable to multi-bit weights

References [1]-[6] are listed in the paper.

### Conclusions

- Proposed a Bidirectional Memory Delay Line for energy-efficient time-domain MAC computation
- All-digital, compact, and technology-scaling-friendly design (no ADCs, DACs, or frequency modulators)
- Low-power design supporting near-threshold voltage operation and up to 16x performance boost with 4 input encoding modes
- Configurable MDL lengths with on-chip pooling and averaging operations
- Demonstrated the proposed time-domain CNN engine in a 40nm CMOS node, achieving 12.08 TOPS/W energy efficiency and 0.365 GOPS throughput at 537mV

### Acknowledgements

- The authors would like to thank the TSMC University Shuttle Program for the test-chip fabrication support.
- The authors would like to thank AMD for the financial support.
- The authors also thank Vignesh Radhakrishnan and Jacob Rohan (graduate students in ECE, UT Austin) for helping with the test-chip measurements.

