Articles for Developer by Paolo Medici [PMX.it]. See License on bottom of this page.

Introduction

C Structure

Intrinsic format

v[q]<operator>[l|n|w][q]_<data format>

[q] (before the operator) saturating
[l] long
[n] narrow
[w] wide

[q] (after the operator) 128 bit register instead of 64 bit.

Data Format: u8, s8, u16, s16, u32, s32, u64, s64, f16, f32

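Many NEON instructions are provided in normal, long, wide, narrow and saturating variants.

Normal operates on 64 bit or 128 bit registers, and the output elements have the same size as the input elements.
Long operates on 64 bit registers and produces a 128 bit result whose elements are twice as wide.
Wide operates on one 128 bit and one 64 bit register and produces a 128 bit result.
Narrow operates on 128 bit registers and produces a 64 bit result whose elements are half as wide.
Saturating clamps each result to the range of the data type instead of wrapping around.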
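The difference between the variants is easiest to see on a single addition. The following is a minimal sketch, not a reference implementation: `add_variants` is a hypothetical helper, and the `HAVE_NEON` macro with its scalar fallback is an assumption added only so that the file also builds on non-ARM machines. The fallback branch doubles as a description of what each intrinsic computes.

```c
#include <assert.h>
#include <stdint.h>
#if defined(__ARM_NEON) || defined(__ARM_NEON__)
#include <arm_neon.h>
#define HAVE_NEON 1
#else
#define HAVE_NEON 0
#endif

/* Hypothetical helper: add 8 unsigned bytes in three flavours. */
void add_variants(const uint8_t a[8], const uint8_t b[8],
                  uint8_t normal[8], uint8_t saturated[8],
                  uint16_t longsum[8])
{
    assert(a && b && normal && saturated && longsum);
#if HAVE_NEON
    uint8x8_t va = vld1_u8(a);
    uint8x8_t vb = vld1_u8(b);
    vst1_u8(normal, vadd_u8(va, vb));      /* normal: wraps modulo 256     */
    vst1_u8(saturated, vqadd_u8(va, vb));  /* saturating: clamps at 255    */
    vst1q_u16(longsum, vaddl_u8(va, vb));  /* long: 8 x u8 in, 8 x u16 out */
#else
    /* Scalar fallback (assumption) with the same per-element results. */
    for (int i = 0; i < 8; ++i) {
        unsigned sum = (unsigned)a[i] + b[i];
        normal[i]    = (uint8_t)sum;
        saturated[i] = (uint8_t)(sum > 255u ? 255u : sum);
        longsum[i]   = (uint16_t)sum;
    }
#endif
}
```

With inputs 200 and 100 in the same lane, the normal sum is 44, the saturating sum is 255 and the long sum is 300.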

Here are some examples:

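REINTERPRET: Reinterpret Cast

vreinterpret[q]_<output>_<input>

The bits in the register are left unchanged; only the vector type changes. For example, to view an int16x8_t as a uint16x8_t:
vreinterpretq_u16_s16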
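A short sketch of the reinterpret cast in use. `view_s16_as_u16` is a hypothetical helper, and the `HAVE_NEON` macro and the `memcpy` fallback are assumptions added so the example also builds without NEON; the fallback makes the point that no bit is modified.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>
#if defined(__ARM_NEON) || defined(__ARM_NEON__)
#include <arm_neon.h>
#define HAVE_NEON 1
#else
#define HAVE_NEON 0
#endif

/* Hypothetical helper: view 8 signed 16 bit values as unsigned. */
void view_s16_as_u16(const int16_t in[8], uint16_t out[8])
{
    assert(in && out);
#if HAVE_NEON
    int16x8_t  s = vld1q_s16(in);
    uint16x8_t u = vreinterpretq_u16_s16(s);  /* same 128 bits, new type */
    vst1q_u16(out, u);
#else
    /* Scalar fallback (assumption): a plain byte copy. */
    memcpy(out, in, 8 * sizeof(uint16_t));
#endif
}
```

A lane holding -1 is read back as 65535, because the bit pattern 0xFFFF is kept as it is.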

COMBINE: Combine Vector

These intrinsics join two 64 bit vectors into a single 128bit vector.

int8x16_t   vcombine_s8(int8x8_t low, int8x8_t high);        // VMOV d0,d0
int16x8_t   vcombine_s16(int16x4_t low, int16x4_t high);     // VMOV d0,d0
int32x4_t   vcombine_s32(int32x2_t low, int32x2_t high);     // VMOV d0,d0
int64x2_t   vcombine_s64(int64x1_t low, int64x1_t high);     // VMOV d0,d0
float16x8_t vcombine_f16(float16x4_t low, float16x4_t high); // VMOV d0,d0
float32x4_t vcombine_f32(float32x2_t low, float32x2_t high); // VMOV d0,d0
uint8x16_t  vcombine_u8(uint8x8_t low, uint8x8_t high);      // VMOV d0,d0
uint16x8_t  vcombine_u16(uint16x4_t low, uint16x4_t high);   // VMOV d0,d0
uint32x4_t  vcombine_u32(uint32x2_t low, uint32x2_t high);   // VMOV d0,d0
uint64x2_t  vcombine_u64(uint64x1_t low, uint64x1_t high);   // VMOV d0,d0
poly8x16_t  vcombine_p8(poly8x8_t low, poly8x8_t high);      // VMOV d0,d0
poly16x8_t  vcombine_p16(poly16x4_t low, poly16x4_t high);   // VMOV d0,d0
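As an illustration, the sketch below joins two 8 byte halves into one 16 byte vector. `combine_halves` is a hypothetical helper; the `HAVE_NEON` macro and the scalar fallback are assumptions added so the example also compiles off ARM.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>
#if defined(__ARM_NEON) || defined(__ARM_NEON__)
#include <arm_neon.h>
#define HAVE_NEON 1
#else
#define HAVE_NEON 0
#endif

/* Hypothetical helper: out[0..7] = low, out[8..15] = high. */
void combine_halves(const uint8_t low[8], const uint8_t high[8],
                    uint8_t out[16])
{
    assert(low && high && out);
#if HAVE_NEON
    uint8x16_t v = vcombine_u8(vld1_u8(low), vld1_u8(high));
    vst1q_u8(out, v);
#else
    /* Scalar fallback (assumption): the low half comes first in memory. */
    memcpy(out, low, 8);
    memcpy(out + 8, high, 8);
#endif
}
```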

MOV: Move or Convert

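MOV copies the contents of one register into another. The long and narrow forms also convert each element to a wider or narrower type.

Convert 8 elements of unsigned16 into 8 elements of unsigned8, keeping only the low 8 bits of each element: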

        vmovn_u16(r)
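Because vmovn_u16 simply drops the high byte, values above 255 wrap. The sketch below compares it with the saturating narrow vqmovn_u16, which is not mentioned above and is shown only for contrast. `narrow_u16_to_u8` is a hypothetical helper; the `HAVE_NEON` macro and the scalar fallback are assumptions for building without NEON.

```c
#include <assert.h>
#include <stdint.h>
#if defined(__ARM_NEON) || defined(__ARM_NEON__)
#include <arm_neon.h>
#define HAVE_NEON 1
#else
#define HAVE_NEON 0
#endif

/* Hypothetical helper: narrow 8 x u16 to 8 x u8 in two ways. */
void narrow_u16_to_u8(const uint16_t in[8], uint8_t truncated[8],
                      uint8_t saturated[8])
{
    assert(in && truncated && saturated);
#if HAVE_NEON
    uint16x8_t v = vld1q_u16(in);
    vst1_u8(truncated, vmovn_u16(v));   /* keeps the low 8 bits */
    vst1_u8(saturated, vqmovn_u16(v));  /* clamps to 0..255     */
#else
    /* Scalar fallback (assumption) with the same per-element results. */
    for (int i = 0; i < 8; ++i) {
        truncated[i] = (uint8_t)(in[i] & 0xFFu);
        saturated[i] = (uint8_t)(in[i] > 255u ? 255u : in[i]);
    }
#endif
}
```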

Pack/UnPack/Zip/Reverse

VREV: reverses the order of 8, 16 or 32 bit elements within each group of 16, 32 or 64 bits.
VZIP: interleaves two vectors, taking one element from the first and one element from the second in turn.
VUZP: de-interleaves the vectors, putting the even-indexed elements in one vector and the odd-indexed elements in the other.
VSWP: swaps two 64 or 128 bit registers.
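Zip and unzip are inverse operations, which the following sketch shows on bytes using vzip_u8 and vuzp_u8. `interleave8` and `deinterleave8` are hypothetical helpers; the `HAVE_NEON` macro and the scalar fallbacks are assumptions added so the example also builds off ARM.

```c
#include <assert.h>
#include <stdint.h>
#if defined(__ARM_NEON) || defined(__ARM_NEON__)
#include <arm_neon.h>
#define HAVE_NEON 1
#else
#define HAVE_NEON 0
#endif

/* Hypothetical helper: out = a0 b0 a1 b1 ... a7 b7. */
void interleave8(const uint8_t a[8], const uint8_t b[8], uint8_t out[16])
{
    assert(a && b && out);
#if HAVE_NEON
    uint8x8x2_t z = vzip_u8(vld1_u8(a), vld1_u8(b));
    vst1_u8(out, z.val[0]);      /* a0 b0 a1 b1 a2 b2 a3 b3 */
    vst1_u8(out + 8, z.val[1]);  /* a4 b4 a5 b5 a6 b6 a7 b7 */
#else
    for (int i = 0; i < 8; ++i) {
        out[2 * i]     = a[i];
        out[2 * i + 1] = b[i];
    }
#endif
}

/* Hypothetical helper: split 16 bytes into even and odd positions. */
void deinterleave8(const uint8_t in[16], uint8_t even[8], uint8_t odd[8])
{
    assert(in && even && odd);
#if HAVE_NEON
    uint8x8x2_t u = vuzp_u8(vld1_u8(in), vld1_u8(in + 8));
    vst1_u8(even, u.val[0]);  /* in0 in2 ... in14 */
    vst1_u8(odd, u.val[1]);   /* in1 in3 ... in15 */
#else
    for (int i = 0; i < 8; ++i) {
        even[i] = in[2 * i];
        odd[i]  = in[2 * i + 1];
    }
#endif
}
```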

ADD: Add

Add two registers of 8 elements of 16 bit each, element by element:

vaddq_u16(ra, rb)
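In practice the addition sits inside a loop that loads, adds and stores 8 elements at a time. A minimal sketch follows, assuming the array length is a multiple of 8. `add_u16` is a hypothetical helper, and the `HAVE_NEON` macro and scalar fallback are assumptions for building without NEON.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#if defined(__ARM_NEON) || defined(__ARM_NEON__)
#include <arm_neon.h>
#define HAVE_NEON 1
#else
#define HAVE_NEON 0
#endif

/* Hypothetical helper: out[i] = a[i] + b[i], modulo 65536. */
void add_u16(const uint16_t *a, const uint16_t *b, uint16_t *out, size_t n)
{
    assert(n % 8 == 0);  /* assumption: no tail handling in this sketch */
    for (size_t i = 0; i < n; i += 8) {
#if HAVE_NEON
        uint16x8_t ra = vld1q_u16(a + i);
        uint16x8_t rb = vld1q_u16(b + i);
        vst1q_u16(out + i, vaddq_u16(ra, rb));
#else
        for (size_t k = 0; k < 8; ++k)
            out[i + k] = (uint16_t)(a[i + k] + b[i + k]);
#endif
    }
}
```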

PADD: Horizontal (Pairwise) Add

Pairwise add with promotion from 16 elements of unsigned8 to 8 elements of unsigned16:

vpaddlq_u8(r)
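The sketch below shows which elements are paired: output lane i holds the sum of input lanes 2i and 2i+1, and the promotion to 16 bit means the sum cannot overflow. `pairwise_sum_u8` is a hypothetical helper; the `HAVE_NEON` macro and the scalar fallback are assumptions for building without NEON.

```c
#include <assert.h>
#include <stdint.h>
#if defined(__ARM_NEON) || defined(__ARM_NEON__)
#include <arm_neon.h>
#define HAVE_NEON 1
#else
#define HAVE_NEON 0
#endif

/* Hypothetical helper: out[i] = in[2*i] + in[2*i + 1], widened to u16. */
void pairwise_sum_u8(const uint8_t in[16], uint16_t out[8])
{
    assert(in && out);
#if HAVE_NEON
    vst1q_u16(out, vpaddlq_u8(vld1q_u8(in)));
#else
    for (int i = 0; i < 8; ++i)
        out[i] = (uint16_t)(in[2 * i] + in[2 * i + 1]);
#endif
}
```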

SHR: Shift Right

Shift each of the 8 elements of unsigned16 right by n bits (n must be a compile-time constant from 1 to 16):

vshrq_n_u16(r, n)
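Since the shift amount has to be a constant expression, it is written as a literal in the sketch below. `shift_right_4` is a hypothetical helper that divides each element by 16; the `HAVE_NEON` macro and the scalar fallback are assumptions for building without NEON.

```c
#include <assert.h>
#include <stdint.h>
#if defined(__ARM_NEON) || defined(__ARM_NEON__)
#include <arm_neon.h>
#define HAVE_NEON 1
#else
#define HAVE_NEON 0
#endif

/* Hypothetical helper: out[i] = in[i] >> 4 for 8 unsigned 16 bit values. */
void shift_right_4(const uint16_t in[8], uint16_t out[8])
{
    assert(in && out);
#if HAVE_NEON
    vst1q_u16(out, vshrq_n_u16(vld1q_u16(in), 4));  /* 4 is a literal */
#else
    for (int i = 0; i < 8; ++i)
        out[i] = (uint16_t)(in[i] >> 4);
#endif
}
```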

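PMAX: Pairwise Max

Maximum of adjacent pairs of elements taken from two registers: the result is {max(ra[0], ra[1]), max(rb[0], rb[1])}.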

vpmax_f32(ra,rb)
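vpmax_f32 compares adjacent elements, so applying it twice reduces four floats to their maximum. The following is a sketch of that horizontal reduction, not the only way to do it. `max_of_4` is a hypothetical helper; the `HAVE_NEON` macro and the scalar fallback are assumptions for building without NEON, and NaN inputs are not considered.

```c
#include <assert.h>
#if defined(__ARM_NEON) || defined(__ARM_NEON__)
#include <arm_neon.h>
#define HAVE_NEON 1
#else
#define HAVE_NEON 0
#endif

/* Hypothetical helper: largest of 4 floats. */
float max_of_4(const float in[4])
{
    assert(in);
#if HAVE_NEON
    float32x4_t v = vld1q_f32(in);
    /* m = { max(in0, in1), max(in2, in3) } */
    float32x2_t m = vpmax_f32(vget_low_f32(v), vget_high_f32(v));
    /* both lanes now hold the overall maximum */
    m = vpmax_f32(m, m);
    return vget_lane_f32(m, 0);
#else
    float best = in[0];
    for (int i = 1; i < 4; ++i)
        if (in[i] > best)
            best = in[i];
    return best;
#endif
}
```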

LD1: Load

Load 16 bytes (128 bit) as 16 elements of uint8

r = vld1q_u8(ptr)

Load 16 bytes (128 bit) as 4 elements of int32

r = vld1q_s32(ptr)

ST1: Store

Store 8 bytes (64 bit)

        vst1_u8(ptr, r)
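Load and store are normally used as a pair around the actual computation. The sketch below only copies data, 8 bytes per step, and finishes the remaining bytes one at a time, which is a common way to handle lengths that are not a multiple of the register size. `copy_bytes` is a hypothetical helper; the `HAVE_NEON` macro and the `memcpy` fallback are assumptions for building without NEON.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>
#if defined(__ARM_NEON) || defined(__ARM_NEON__)
#include <arm_neon.h>
#define HAVE_NEON 1
#else
#define HAVE_NEON 0
#endif

/* Hypothetical helper: copy n bytes through a 64 bit register. */
void copy_bytes(uint8_t *dst, const uint8_t *src, size_t n)
{
    assert(dst && src);
    size_t i = 0;
    for (; i + 8 <= n; i += 8) {
#if HAVE_NEON
        uint8x8_t r = vld1_u8(src + i);  /* load 8 bytes  */
        vst1_u8(dst + i, r);             /* store 8 bytes */
#else
        memcpy(dst + i, src + i, 8);
#endif
    }
    for (; i < n; ++i)                   /* scalar tail, fewer than 8 bytes */
        dst[i] = src[i];
}
```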

DUP: Duplicate value

Create 4 elements of float32 by copying one scalar value into every lane:

        vdupq_n_f32(n)

Create 4 elements of int32 by copying one scalar value into every lane:

        vdupq_n_s32(n)
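A typical use is to build the constant once, outside the loop, and reuse it for every block of data. A minimal sketch follows, assuming the array length is a multiple of 4. `fill_f32` is a hypothetical helper; the `HAVE_NEON` macro and the scalar fallback are assumptions for building without NEON.

```c
#include <assert.h>
#include <stddef.h>
#if defined(__ARM_NEON) || defined(__ARM_NEON__)
#include <arm_neon.h>
#define HAVE_NEON 1
#else
#define HAVE_NEON 0
#endif

/* Hypothetical helper: set every element of out[0..n-1] to value. */
void fill_f32(float *out, size_t n, float value)
{
    assert(out && n % 4 == 0);  /* assumption: no tail handling here */
#if HAVE_NEON
    float32x4_t v = vdupq_n_f32(value);  /* value copied into all 4 lanes */
    for (size_t i = 0; i < n; i += 4)
        vst1q_f32(out + i, v);
#else
    for (size_t i = 0; i < n; ++i)
        out[i] = value;
#endif
}
```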