Introduction
C Structure
Intrinsic format
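v[q]<operator>[l|n|w][q][_n]_<data format>
[q] right after v: saturating, e.g. vqadd_s8
[l] long, e.g. vaddl_u8
[n] narrow, e.g. vmovn_u16
[w] wide, e.g. vaddw_u8
[q] before the type suffix: 128-bit register instead of 64-bit, e.g. vaddq_u8
[_n] one operand is a scalar or immediate, e.g. vdupq_n_f32, vshrq_n_u16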
Data Format: u8, s8, u16, s16, u32, s32, u64, s64, f16, f32, p8, p16 (polynomial)
Many NEON instructions are provided in normal, long, wide, narrow, and saturating variants.
Normal: operates on 64-bit or 128-bit vectors; the output elements have the same width as the inputs.
Long: operates on 64-bit vectors and produces a 128-bit result, doubling the element width.
Wide: operates on one 128-bit and one 64-bit vector and produces a 128-bit result.
Narrow: operates on 128-bit vectors and produces a 64-bit result, halving the element width.
Saturating: clamps each result to the range of the element type instead of wrapping around.
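A minimal sketch comparing a normal, a long and a saturating add on the same inputs, assuming a compiler that ships arm_neon.h. The function name add_variants_u8 is hypothetical, and the #else branch is a plain-C stand-in (not part of NEON) so the sketch also builds on non-ARM hosts.

```c
#include <stdint.h>
#if defined(__ARM_NEON)
#include <arm_neon.h>
#endif

/* Hypothetical helper: adds two 8-lane uint8 vectors three ways. */
void add_variants_u8(const uint8_t a[8], const uint8_t b[8],
                     uint8_t normal[8], uint16_t long_sum[8], uint8_t sat[8]) {
#if defined(__ARM_NEON)
    uint8x8_t va = vld1_u8(a), vb = vld1_u8(b);
    vst1_u8(normal, vadd_u8(va, vb));      /* normal: u8 + u8 -> u8, wraps modulo 256 */
    vst1q_u16(long_sum, vaddl_u8(va, vb)); /* long: u8 + u8 -> u16, cannot overflow */
    vst1_u8(sat, vqadd_u8(va, vb));        /* saturating: clamps at 255 */
#else
    /* Scalar stand-in computing the same lanes. */
    for (int i = 0; i < 8; i++) {
        unsigned s = (unsigned)a[i] + b[i];
        normal[i] = (uint8_t)s;
        long_sum[i] = (uint16_t)s;
        sat[i] = (uint8_t)(s > 255 ? 255 : s);
    }
#endif
}
```

With a = 200 and b = 100 in one lane, the normal add gives 44 (300 - 256), the long add gives 300, and the saturating add gives 255.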
Here are some examples:
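REINTERPRET: Reinterpret Cast
Reinterpret the bits of a vector as another type of the same total size (no value conversion):
vreinterpret[q]_<output>_<input>
For example, to view an int16x8_t as a uint16x8_t:
vreinterpretq_u16_s16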
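A minimal sketch of vreinterpretq_u16_s16, assuming arm_neon.h is available; the helper name reinterpret_s16_as_u16 is hypothetical. The non-NEON branch is an added stand-in that performs the same bit-level copy with memcpy.

```c
#include <stdint.h>
#if defined(__ARM_NEON)
#include <arm_neon.h>
#else
#include <string.h>
#endif

/* Hypothetical helper: view 8 int16 lanes as uint16; the bit patterns are unchanged. */
void reinterpret_s16_as_u16(const int16_t in[8], uint16_t out[8]) {
#if defined(__ARM_NEON)
    int16x8_t v = vld1q_s16(in);
    uint16x8_t u = vreinterpretq_u16_s16(v); /* type change only, typically no instruction */
    vst1q_u16(out, u);
#else
    memcpy(out, in, 8 * sizeof(int16_t));    /* same bits, reinterpreted as uint16 */
#endif
}
```

A lane holding -1 (0xFFFF) reads back as 65535, and -32768 (0x8000) reads back as 32768.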
COMBINE: Combine Vector
These intrinsics join two 64-bit vectors into a single 128-bit vector: low fills the lower half and high fills the upper half.
int8x16_t vcombine_s8(int8x8_t low, int8x8_t high); // VMOV d0,d0
int16x8_t vcombine_s16(int16x4_t low, int16x4_t high); // VMOV d0,d0
int32x4_t vcombine_s32(int32x2_t low, int32x2_t high); // VMOV d0,d0
int64x2_t vcombine_s64(int64x1_t low, int64x1_t high); // VMOV d0,d0
float16x8_t vcombine_f16(float16x4_t low, float16x4_t high); // VMOV d0,d0
float32x4_t vcombine_f32(float32x2_t low, float32x2_t high); // VMOV d0,d0
uint8x16_t vcombine_u8(uint8x8_t low, uint8x8_t high); // VMOV d0,d0
uint16x8_t vcombine_u16(uint16x4_t low, uint16x4_t high); // VMOV d0,d0
uint32x4_t vcombine_u32(uint32x2_t low, uint32x2_t high); // VMOV d0,d0
uint64x2_t vcombine_u64(uint64x1_t low, uint64x1_t high); // VMOV d0,d0
poly8x16_t vcombine_p8(poly8x8_t low, poly8x8_t high); // VMOV d0,d0
poly16x8_t vcombine_p16(poly16x4_t low, poly16x4_t high); // VMOV d0,d0
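A minimal sketch of vcombine_u8 showing which half each input lands in, assuming arm_neon.h; combine_u8 is a hypothetical name and the #else branch is an added scalar stand-in for non-NEON builds.

```c
#include <stdint.h>
#if defined(__ARM_NEON)
#include <arm_neon.h>
#endif

/* Hypothetical helper: low -> lanes 0..7, high -> lanes 8..15. */
void combine_u8(const uint8_t low[8], const uint8_t high[8], uint8_t out[16]) {
#if defined(__ARM_NEON)
    uint8x16_t q = vcombine_u8(vld1_u8(low), vld1_u8(high));
    vst1q_u8(out, q);
#else
    for (int i = 0; i < 8; i++) {
        out[i] = low[i];
        out[i + 8] = high[i];
    }
#endif
}
```

The reverse split is done with vget_low_u8 and vget_high_u8, which return the lower and upper 64-bit halves of a uint8x16_t.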
MOV: Move or Convert
VMOV copies the contents of one register into another.
Narrow 8 uint16 elements to 8 uint8 elements (each result keeps the low 8 bits of its source element):
vmovn_u16(r)
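A minimal sketch contrasting the truncating narrow (vmovn_u16) with its saturating counterpart (vqmovn_u16), assuming arm_neon.h; narrow_u16 is a hypothetical name and the #else branch is an added scalar stand-in.

```c
#include <stdint.h>
#if defined(__ARM_NEON)
#include <arm_neon.h>
#endif

/* Hypothetical helper: narrow 8 uint16 lanes to uint8 two ways. */
void narrow_u16(const uint16_t in[8], uint8_t trunc[8], uint8_t sat[8]) {
#if defined(__ARM_NEON)
    uint16x8_t v = vld1q_u16(in);
    vst1_u8(trunc, vmovn_u16(v)); /* keeps the low 8 bits of each lane */
    vst1_u8(sat, vqmovn_u16(v));  /* clamps values above 255 to 255 */
#else
    for (int i = 0; i < 8; i++) {
        trunc[i] = (uint8_t)(in[i] & 0xFF);
        sat[i] = (uint8_t)(in[i] > 255 ? 255 : in[i]);
    }
#endif
}
```

For an input of 300, vmovn_u16 yields 44 while vqmovn_u16 yields 255, which is usually what pixel code wants.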
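Pack/Unpack/Zip/Reverse
VREV: reverses the order of 8-, 16- or 32-bit elements within each 16-, 32- or 64-bit group (for example, VREV16 swaps adjacent bytes).
VZIP: interleaves two vectors, taking one element from the first and one from the second in turn.
VUZP: de-interleaves two vectors, putting the even-indexed elements in one vector and the odd-indexed elements in the other.
VSWP: swaps the contents of two 64-bit or 128-bit registers.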
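A minimal sketch of vzip_u8, vuzp_u8 and vrev64_u8 on 8-lane byte vectors, assuming arm_neon.h; the helper name is hypothetical and the #else branch is an added scalar stand-in that reproduces the same lane order.

```c
#include <stdint.h>
#if defined(__ARM_NEON)
#include <arm_neon.h>
#endif

/* Hypothetical helper: interleave, de-interleave and reverse 8-lane uint8 vectors. */
void zip_unzip_rev_u8(const uint8_t a[8], const uint8_t b[8],
                      uint8_t zipped[16], uint8_t evens[8], uint8_t odds[8],
                      uint8_t reversed[8]) {
#if defined(__ARM_NEON)
    uint8x8_t va = vld1_u8(a), vb = vld1_u8(b);
    uint8x8x2_t z = vzip_u8(va, vb);  /* a0 b0 a1 b1 ... a7 b7 across val[0], val[1] */
    vst1_u8(zipped, z.val[0]);
    vst1_u8(zipped + 8, z.val[1]);
    uint8x8x2_t u = vuzp_u8(va, vb);  /* val[0]: even lanes of a:b, val[1]: odd lanes */
    vst1_u8(evens, u.val[0]);
    vst1_u8(odds, u.val[1]);
    vst1_u8(reversed, vrev64_u8(va)); /* reverse the 8 bytes of the 64-bit group */
#else
    for (int i = 0; i < 8; i++) {
        zipped[2 * i] = a[i];
        zipped[2 * i + 1] = b[i];
        reversed[i] = a[7 - i];
    }
    for (int i = 0; i < 4; i++) {
        evens[i] = a[2 * i];
        evens[i + 4] = b[2 * i];
        odds[i] = a[2 * i + 1];
        odds[i + 4] = b[2 * i + 1];
    }
#endif
}
```

VSWP has no dedicated intrinsic; in C the same effect is obtained by swapping two vector variables and letting the compiler allocate registers.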
ADD: Add
Add two 128-bit registers of 8 uint16 elements each (each sum wraps on overflow):
vaddq_u16(ra, rb)
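A minimal sketch of vaddq_u16 applied to whole arrays, 8 lanes per step with a scalar loop for the leftover elements, assuming arm_neon.h; add_arrays_u16 is a hypothetical name. Without NEON the scalar loop handles every element.

```c
#include <stddef.h>
#include <stdint.h>
#if defined(__ARM_NEON)
#include <arm_neon.h>
#endif

/* Hypothetical helper: out[i] = a[i] + b[i] (mod 65536) for i < n. */
void add_arrays_u16(const uint16_t *a, const uint16_t *b, uint16_t *out, size_t n) {
    size_t i = 0;
#if defined(__ARM_NEON)
    for (; i + 8 <= n; i += 8)
        vst1q_u16(out + i, vaddq_u16(vld1q_u16(a + i), vld1q_u16(b + i)));
#endif
    for (; i < n; i++) /* scalar tail (the whole array when NEON is absent) */
        out[i] = (uint16_t)(a[i] + b[i]);
}
```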
PADD: Horizontal (Pairwise) Add
Add adjacent pairs of the 16 uint8 elements, widening the result to 8 uint16 elements:
vpaddlq_u8(r)
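A minimal sketch of vpaddlq_u8, assuming arm_neon.h; pairwise_widen_add_u8 is a hypothetical name and the #else branch is an added scalar stand-in. Because the result lanes are 16 bits wide, two bytes of 255 sum to 510 without overflowing.

```c
#include <stdint.h>
#if defined(__ARM_NEON)
#include <arm_neon.h>
#endif

/* Hypothetical helper: out[i] = in[2i] + in[2i+1], computed in 16 bits. */
void pairwise_widen_add_u8(const uint8_t in[16], uint16_t out[8]) {
#if defined(__ARM_NEON)
    vst1q_u16(out, vpaddlq_u8(vld1q_u8(in)));
#else
    for (int i = 0; i < 8; i++)
        out[i] = (uint16_t)(in[2 * i] + in[2 * i + 1]);
#endif
}
```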
SHR: Shift Right
Shift each of the 8 uint16 elements right by n, where n must be a compile-time constant from 1 to 16:
vshrq_n_u16(r, n)
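A minimal sketch of vshrq_n_u16 with a constant shift of 4, assuming arm_neon.h; shr4_u16 is a hypothetical name and the #else branch is an added scalar stand-in. The shift is logical, so zeros enter from the left.

```c
#include <stdint.h>
#if defined(__ARM_NEON)
#include <arm_neon.h>
#endif

/* Hypothetical helper: out[i] = in[i] >> 4 for 8 uint16 lanes. */
void shr4_u16(const uint16_t in[8], uint16_t out[8]) {
#if defined(__ARM_NEON)
    vst1q_u16(out, vshrq_n_u16(vld1q_u16(in), 4)); /* 4 must be an immediate */
#else
    for (int i = 0; i < 8; i++) out[i] = (uint16_t)(in[i] >> 4);
#endif
}
```

Passing a runtime variable as n does not compile. For a variable shift, vshlq_u16 with a negative count, e.g. vshlq_u16(v, vdupq_n_s16(-k)), shifts right instead.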
PMAX: Pairwise Max
Pairwise maximum of two 2-lane registers: the result is {max(ra[0], ra[1]), max(rb[0], rb[1])}.
vpmax_f32(ra,rb)
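A minimal sketch that uses vpmax_f32 twice to reduce four floats to their maximum, assuming arm_neon.h; max4_f32 is a hypothetical name, NaN inputs are not considered, and the #else branch is an added scalar stand-in.

```c
#include <stdint.h>
#if defined(__ARM_NEON)
#include <arm_neon.h>
#endif

/* Hypothetical helper: maximum of in[0..3]. */
float max4_f32(const float in[4]) {
#if defined(__ARM_NEON)
    float32x4_t v = vld1q_f32(in);
    /* {max(in0, in1), max(in2, in3)} */
    float32x2_t m = vpmax_f32(vget_low_f32(v), vget_high_f32(v));
    m = vpmax_f32(m, m); /* both lanes now hold the overall maximum */
    return vget_lane_f32(m, 0);
#else
    float m = in[0];
    for (int i = 1; i < 4; i++)
        if (in[i] > m) m = in[i];
    return m;
#endif
}
```

For an element-wise (non-pairwise) maximum of two vectors, vmaxq_f32 is the intrinsic to use instead.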
LD1: Load
Load 16 bytes (128 bits) as 16 uint8 elements:
r = vld1q_u8(ptr)
Load 16 bytes (128 bits) as 4 int32 elements:
r = vld1q_s32(ptr)
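A minimal sketch that sums an int32 array by loading 4 elements (16 bytes) per step with vld1q_s32, assuming arm_neon.h and that the total fits in int32; sum_s32 is a hypothetical name. Leftover elements, or the whole array without NEON, go through the scalar loop.

```c
#include <stddef.h>
#include <stdint.h>
#if defined(__ARM_NEON)
#include <arm_neon.h>
#endif

/* Hypothetical helper: sum of p[0..n-1], assuming no int32 overflow. */
int32_t sum_s32(const int32_t *p, size_t n) {
    int32_t total = 0;
    size_t i = 0;
#if defined(__ARM_NEON)
    int32x4_t acc = vdupq_n_s32(0);
    for (; i + 4 <= n; i += 4)
        acc = vaddq_s32(acc, vld1q_s32(p + i)); /* 4 int32 lanes per load */
    total = vgetq_lane_s32(acc, 0) + vgetq_lane_s32(acc, 1)
          + vgetq_lane_s32(acc, 2) + vgetq_lane_s32(acc, 3);
#endif
    for (; i < n; i++) total += p[i];
    return total;
}
```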
ST1: Store
Store 8 bytes (64 bits) from a uint8x8_t register:
vst1_u8(ptr, r)
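A minimal load-modify-store sketch around vst1_u8, assuming arm_neon.h; invert8_u8 is a hypothetical name and the #else branch is an added scalar stand-in. It inverts 8 bytes in place with a bitwise NOT, which for uint8 equals 255 - x.

```c
#include <stdint.h>
#if defined(__ARM_NEON)
#include <arm_neon.h>
#endif

/* Hypothetical helper: buf[i] = 255 - buf[i] for 8 bytes, in place. */
void invert8_u8(uint8_t buf[8]) {
#if defined(__ARM_NEON)
    uint8x8_t v = vld1_u8(buf);  /* load 64 bits */
    vst1_u8(buf, vmvn_u8(v));    /* bitwise NOT, then store 64 bits back */
#else
    for (int i = 0; i < 8; i++) buf[i] = (uint8_t)~buf[i];
#endif
}
```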
DUP: Duplicate value
Broadcast one float32 value into all 4 lanes:
vdupq_n_f32(n)
Broadcast one int32 value into all 4 lanes:
vdupq_n_s32(n)
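A minimal sketch that broadcasts a scale factor with vdupq_n_f32 and multiplies an array by it 4 lanes at a time, assuming arm_neon.h; scale_f32 is a hypothetical name, and the scalar loop covers the tail (or everything without NEON).

```c
#include <stddef.h>
#include <stdint.h>
#if defined(__ARM_NEON)
#include <arm_neon.h>
#endif

/* Hypothetical helper: data[i] *= s for i < n. */
void scale_f32(float *data, size_t n, float s) {
    size_t i = 0;
#if defined(__ARM_NEON)
    float32x4_t vs = vdupq_n_f32(s); /* s copied into all 4 lanes */
    for (; i + 4 <= n; i += 4)
        vst1q_f32(data + i, vmulq_f32(vld1q_f32(data + i), vs));
#endif
    for (; i < n; i++) data[i] *= s;
}
```

vmulq_n_f32(v, s) performs the same vector-by-scalar multiply without an explicit duplicate.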