Assembly or Optimized C for Lightweight Cryptography on ... · Saturnin-Hash 49433 28199 ( 43%) -...

Assembly or Optimized C for Lightweight

Cryptography on RISC-V?

CANS 2020

19th International Conference on Cryptology and Network Security

Fabio Campos1, Lars Jellema2, Mauk Lemmen2, Lars Muller1,

Daan Sprenkels2, Benoit Viguier2

1RheinMain University of Applied Sciences, Germany2Radboud University, The Netherlands

Outline

2 / 100

Basic idea

� analysis of strategies for optimizing lightweight cryptography

� several (round 2) candidates of NIST’s lightweight

cryptography standardization project

� on a RISC-V architecture

� assembly or Optimized C?

3 / 100

Preliminaries

RISC-V

� a new open reduced instruction set architecture (ISA)

� research project that started at UC Berkely in 2010

� a serious competitor to ARM?

� frozen 32-bit base ISA (RV32I) and 64-bit (RV64I)

� general implementation strategies are derived and discussed

� 32 32-bit registers x0–x31

� Basic three-operand arithmetic, bitwise, basic shift &

load/store instructions

� No: rotate instructions, carry flag, nice bit operation

instruction

� Compensated by extensions: M, A, F, D, B, V, C, ...

4 / 100

Environment

SiFive HiFive1 Board

� 5-stage single-issue in-order pipelined RV32IMAC E31 CPU

� < 384 MHz, 64 KiB RAM, 16 MiB flash

� 16 KiB instruction cache

5 / 100

Optimized C

� optimized implementation usually written directly in assembly

� considering the small size of the RISC-V ISA

� ensuring no branch on secret data, code mimics assembly

instructions

� translation of an assembly implementation back into C

� compiler further optimize and take care of register allocation

6 / 100

Optimized Algorithms

Gimli 1/2

� NIST lightweight round 2 candidate

� lightweight scheme (Gimli-Hash & Gimli-Cipher)

� 384-bit (3 x 4 matrix of 32-bit words) permutation

7 / 100

Gimli - optimization 2/2

� careful scheduling of instructions

� saving only 4 callee into the stack

� unrolling in C over 8 rounds

� renaming variable to avoid move operations

� speed-up up to 19%

type C-ref Assembly Optimized C

Gimli 2178 2092 (−4%) 1900 (−13%)

Gimli-Hash 23120 20812 (−10%) 18678 (−19%)

Gimli-Cipher 44423 39583 (−10%) 35853 (−19%)

Table 1: Cycle counts on the SiFive board

8 / 100

Sparkle 1/2

� family of permutations (Sparkle) based on the block cipher

� Schwaemm (AEAD) and Esch (hash function)

� Schwaemm works on 384 bits (Sparkle384)

� Esch works on 256 bits (Sparkle256)

9 / 100

Sparkle - optimization 2/2

� loop unrolling

� avoid loading of round constants

mode C-Ref. Optimized C

Esch256 58193 33331 (−42%)

Schwaemm256-128 71271 42634 (−40%)

Table 2: Cycle counts for Esch256 and Schwaemm256-128

10 / 100

Saturnin 1/2

� based on a 256-bit block cipher with a 256-bit key

� Saturnin-Hash (hash function) and Saturnin-Cipher (AEAD)

11 / 100

Saturnin - optimization 2/2

� two bitsliced variants analyzed:

bs32 bitslices inside of block

bs32x bitslices across blocks

� speed-up the loading of constants

� greedy unrolling and inlining

mode C-Ref. bs32 bs32x

Saturnin-Cipher 121651 59368 (−51%) 68792 (−43%)

Saturnin-Hash 49433 28199 (−43%) -

Table 3: Cycle counts for Saturnin-Cipher (128 AD bytes and 128

message bytes) and Saturnin-Hash

12 / 100

Ascon 1/2

� 320-bit state

� Ascon-Hash based on eXtendable output function Ascon-Xof

� Ascon (AEAD) based on duplex construction

13 / 100

Ascon - optimization 2/2

� reduce the number of required instructions in the S-Box

� improved S-box in a 6-round unrolled permutation

14 / 100

Elephant 1/2

� family of lightweight authenticated encryption schemes

� Delirium = Elephant-Keccak-f[200]

� 200-bit state

15 / 100

Elephant - optimization 2/2

� parallelization through bit interleaving

� combining 4 blocks into 1 block of 4-byte elements

� state 25 32-bit words (5-by-5-by-32) with size of 800 bits

message/data length C-ref bit interleaved

16/16 66541 73989 (+11%)

64/64 143181 74890 (−47%)

128/128 241975 145936 (−53%)

Table 4: Cycle counts on the SiFive board

16 / 100

Xoodyak 1/2

� scheme suitable for several symmetric-key function

� hashing, encryption, MAC computation and authenticated

encryption

� 384-bit state

� based on the Xoodoo permutation

17 / 100

Xoodyak - optimization 2/2

� lane Complementing: reduce number of NOT instructions

� loop unrolling

mode Ref unrolled & lane comp.

hash 82741 17963 (−79%)

AEAD 103522 23238 (−18%)

Table 5: Cycle counts for Xoodyak on the SiFive board

18 / 100

� based on RISC-V available assembly implementation1

� combined multiple steps of the round function in a lookup

table (T-table)

� ”safe”, since none of our benchmarking platforms have data

� using a bitsliced approach, multiple blocks processed in parallel

� speed-up up to 4% compared to the assembly version

1https://github.com/Ko-/riscvcrypto19 / 100

Comparison & Benchmark

Comparison

Algorithm Weatherley2 our results

Gimli 38530 35853 (−7%)

Schwaemm256-128 72286 43877 (−40%)

Saturnin 152803 59368 (−61%)

Ascon 42562 27271 (−36%)

Delirium 765235 145936 (−81%)

Xoodyak 64869 26246 (−60%)

Table 6: Cycle counts for AEAD mode on SiFive (128 bytes of message

and 128 bytes of associated data)

2https://github.com/rweather/lightweight-crypto, commit 52c828120 / 100

Conclusion

� translating assembly implementation back into C leads to

further speed-ups

� approach applicable to existing code bases: AES & Keccak

assembly implementations

� fully unrolled loops may fail on physical devices (instruction

cache)

� applying the bit manipulation extension3 (B) leads to a

reduction of instructions up to 66%

3https://github.com/riscv/riscv-bitmanip, commit a05231d21 / 100

Thank you for your attention!

Paper: https://ia.cr/2020/836

Code: https://github.com/AsmOptC-RiscV/Assembly-Optimized-C-RiscV

22 / 100

Assembly or Optimized C for Lightweight Cryptography on ... · Saturnin-Hash 49433 28199 ( 43%) -...

Documents

SinOne SC92F7252/7251/7250 · 2021. 1. 4. · 高速1T 8051 内核Flash MCU ，256 bytes SRAM ，4 Kbytes Flash，128 bytes 独立EEPROM，12 位ADC，6 路8 位PWM，3 个定时器，UART

Technology Bytes

ATtiny22(L) Preliminary - 8-bit AVR Microcontroller with ... · 3 ATtiny22/22L The ATtiny22/L provides the following features: 2K bytes of In-System Programmable Flash, 128 bytes

Grand Bytes

ROEVER ENGINEERING COLLEGE...There are 128 bytes of RAM in the 8051 and the assigned addresses are from 00 to 7FH. The 128 bytes are divided into three different groups as follows:

Media Bytes

ECE 646 Lecture 11 Hash functions & MACsece.gmu.edu/.../ECE646_lecture11_hash_MAC_2.pdf8 bytes n ≥ 128 16 bytes n ≥ 80 10 bytes n ≥ 160 20 bytes Current standards (e.g., SHA-2,

OpticalCommunicationSystem QSFP+ · 5 SerialID:DataFields(Page00) Address Size(Bytes) Name Description of Base IDField OpticalModul e 128 1 Identifier IdentifierTypeofserial Module

Chapter 05€¦ · making a good case for an L3 block size of at least 128 bytes. The L3 cache is 2 MB, two-way set associative

IT Bytes or B2B Bytes?

What am I working on and why? What did I do last year?riju/aps2_ppt.pdf · 128 point FFT needs 128 samples Each sample 2 bytes 125 windows of 128 samples at 16 Khz Each window has

1 GENERAL DESCRIPTION › ... › SC92F7323_7322_7321_7320en.pdf · 2020-03-11 · High-speed 1T 8051-based Flash MCU,512 bytes SRAM, 8 Kbytes Flash, 128 bytes independent EEPROM,

8-bit 8K/4K/2K Bytes EEPROM Microcontroller with Low ... 8393C-MCU Wireless-09/14 ATmega256/128/64RFR2 •

Middlesex University Research Repositoryeprints.mdx.ac.uk › 13750 › 7 › merged_document.pdf · SJMP (‘short jump’) 2 2 -128 to 127 bytes AJMP (‘absolute jump’) 2 2 one

ARM - Dataman · – 64 kbytes, organized in 512 Pages of 128 Bytes (AT91SAM7S64) – 32 kbytes, organized in 256 Pages of 128 Bytes ... – Software Power Optimization Capabilities,

Sports Loisirs Nature Saint Saturnin (63450)

The Domain Name SystemIP Addresses An end system is identiﬁed and addressed by its IP address 32 bits (4 bytes) in IPv4 e.g., 195.176.181.10 128 bits (16 bytes) in IPv6 e.g., fe80::211:43ﬀ:fecd:30f5/64

M24SR04 – NFC Type 4 Dynamic Tag with 512 bytes of · PDF file• Write up to 246 bytes in a single command • 7 bytes unique identifier (UID) • 128 bits passwords protection

UNIT-IV MEMORY AND INPUT/OUTPUT ORGANIZATION · The RAM chips have 128 bytes and need seven address lines. The ROM chip has 512 bytes and needs 9 address lines. It is now necessary

CHAPTER 3 6LoWPAN · function of this layer is the TCP/IP header compression. The IEEE 802.15.4 frame has a maximum packet size of 128 bytes, whereas IPv6 header size is 40 bytes,