Articles for Developer by Paolo Medici [PMX.it]. See License on bottom of this page.


Intel now hosts a web page that lists all the intrinsics:

http://software.intel.com/sites/landingpage/IntrinsicsGuide/

That page is the best place to start.

Data Types

The __m64 data type can hold eight 8-bit values, four 16-bit values, two 32-bit values, or one 64-bit value.

The __m128 data type is used to represent the contents of a Streaming SIMD Extension register used by the Streaming SIMD Extension intrinsics. Conventionally, the __m128 data type can hold four 32-bit floating-point values, while the __m128d data type can hold two 64-bit floating-point values, and the __m128i data type can hold sixteen 8-bit, eight 16-bit, four 32-bit, or two 64-bit integer values.

The __m256 data type is used to represent the contents of the extended SSE register - the YMM register, used by the Intel® AVX intrinsics. The __m256 data type can hold eight 32-bit floating-point values, while the __m256d data type can hold four 64-bit double precision floating-point values, and the __m256i data type can hold thirty-two 8-bit, sixteen 16-bit, eight 32-bit, or four 64-bit integer values [1]

Syntax for SSE*

_mm_<intrin_op>_<suffix>

The suffix is composed of a prefix, p (packed: all the elements), ep (extended packed) or s (scalar: only the first element of the vector), followed by a type string:
s (float), d (double), i128 (128-bit integer), i64 (64-bit integer), u64 (unsigned 64-bit), i32 (32-bit integer), u32 (unsigned 32-bit), i16 (16-bit integer), u16 (unsigned 16-bit), i8 (8-bit integer), u8 (unsigned 8-bit)

Valid combinations are (- marks an empty cell):

    s      d      i128   i64    u64    i32    u32    i16    u16    i8     u8
p   ps     pd     -      -      -      pi32   pu32   pi16   pu16   pi8    pu8
ep  -      -      -      epi64  -      epi32  epu32  epi16  epu16  epi8   epu8
s   ss     sd     si128  si64   -      -      -      -      -      -      -

This document covers intrinsics from the following instruction sets:

SSE
SSE2
SSE3
SSSE3   
SSE4 (SSE4.1)
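
As an illustration of the naming scheme above, the following sketch applies the same add operation with three different suffixes. The function names (add_ps_demo and so on) are hypothetical and exist only for this example; unaligned loads and stores are used so that any array can be passed in.

```c
#include <xmmintrin.h>  /* SSE  */
#include <emmintrin.h>  /* SSE2 */

/* "ps" = packed single: all four float lanes are added. */
void add_ps_demo(const float a[4], const float b[4], float out[4])
{
    _mm_storeu_ps(out, _mm_add_ps(_mm_loadu_ps(a), _mm_loadu_ps(b)));
}

/* "ss" = scalar single: only lane 0 is added,
   lanes 1..3 are copied from the first operand. */
void add_ss_demo(const float a[4], const float b[4], float out[4])
{
    _mm_storeu_ps(out, _mm_add_ss(_mm_loadu_ps(a), _mm_loadu_ps(b)));
}

/* "epi32" = extended packed 32-bit integers (128-bit register). */
void add_epi32_demo(const int a[4], const int b[4], int out[4])
{
    __m128i va = _mm_loadu_si128((const __m128i *)a);
    __m128i vb = _mm_loadu_si128((const __m128i *)b);
    _mm_storeu_si128((__m128i *)out, _mm_add_epi32(va, vb));
}
```

With a = {1, 2, 3, 4} and b = {10, 20, 30, 40}, the packed version adds every lane, while the scalar version changes only the first element and leaves the other three equal to a.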

Syntax for AVX*

_mm256_<intrin_op>_<suffix>


Intrinsics

Bit Operations

SSE (packed floating point), SSE2 (128bit operation)

_mm_and_ps 	(SSE)
_mm_and_pd (SSE2)
_mm_and_si128 (SSE2)
_mm256_and_ps (AVX)
_mm256_and_pd (AVX)

_mm_andnot_ps (SSE)
_mm_andnot_pd (SSE2)
_mm_andnot_si128 (SSE2)
_mm256_andnot_ps (AVX)
_mm256_andnot_pd (AVX)

_mm_or_ps (SSE)
_mm_or_pd (SSE2)
_mm_or_si128 (SSE2)
_mm256_or_ps (AVX)
_mm256_or_pd (AVX)


_mm_xor_ps (SSE)
_mm_xor_pd (SSE2)
_mm_xor_si128 (SSE2)
_mm256_xor_ps (AVX)
_mm256_xor_pd (AVX)
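
A common use of the bitwise intrinsics on floating-point data is sign manipulation. The sketch below (abs4_ps is a hypothetical name, not part of any library) clears the sign bit of four floats with _mm_andnot_ps. Note the operand order: _mm_andnot_ps(a, b) computes (NOT a) AND b.

```c
#include <xmmintrin.h>  /* SSE */

/* Absolute value of four floats: clear bit 31 of every lane. */
void abs4_ps(const float in[4], float out[4])
{
    /* -0.0f has only the sign bit set, so it serves as a sign mask. */
    const __m128 sign_mask = _mm_set1_ps(-0.0f);
    __m128 v = _mm_loadu_ps(in);

    /* (~sign_mask) & v keeps everything except the sign bit. */
    _mm_storeu_ps(out, _mm_andnot_ps(sign_mask, v));
}
```

Replacing _mm_andnot_ps with _mm_xor_ps in the same sketch would flip the sign of every lane instead of clearing it.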

Comparison

SSE, SSE2

The comparison outputs a register in which each element is set to all ones (0xffffffff for a 32-bit element) if the comparison succeeded, or to 0x0 otherwise.

_mm_cmpeq_ss	(SSE)
_mm_cmpeq_ps	(SSE)
_mm_cmpeq_pd	(SSE2)
_mm_cmpeq_epi8 (SSE2)
_mm_cmpeq_epi16 (SSE2)
_mm_cmpeq_epi32 (SSE2)

In place of <eq>, one of the many other comparison operators can be used.
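
Because the result is a mask of all ones or all zeros per element, it can be combined with the bitwise intrinsics to select values without branching. The following is a minimal sketch using the gt operator; select_greater_epi32 is a hypothetical helper name.

```c
#include <emmintrin.h>  /* SSE2 */

/* out[i] = (a[i] > b[i]) ? a[i] : b[i], built from a comparison mask. */
void select_greater_epi32(const int a[4], const int b[4], int out[4])
{
    __m128i va = _mm_loadu_si128((const __m128i *)a);
    __m128i vb = _mm_loadu_si128((const __m128i *)b);

    /* 0xffffffff in the lanes where a > b, 0x0 elsewhere */
    __m128i mask = _mm_cmpgt_epi32(va, vb);

    /* (mask & a) | (~mask & b) */
    __m128i r = _mm_or_si128(_mm_and_si128(mask, va),
                             _mm_andnot_si128(mask, vb));
    _mm_storeu_si128((__m128i *)out, r);
}
```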

The comparison returns the integer 0x1 if the comparison succeeded, or 0x0 otherwise.

_mm_comieq_sd (SSE2)
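
Unlike the mask-producing comparisons, the comi family returns a plain int, so it can be used directly in a condition. A small sketch, with low_doubles_equal as a hypothetical wrapper name:

```c
#include <emmintrin.h>  /* SSE2 */

/* Compares the low double of two registers; returns 1 if equal, 0 otherwise. */
int low_doubles_equal(double x, double y)
{
    /* _mm_set_sd puts the value in the low element and zeroes the high one. */
    return _mm_comieq_sd(_mm_set_sd(x), _mm_set_sd(y));
}
```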

Sum

SSE2, SSE3, AVX

_mm_add_sd (SSE2)
_mm_add_pd (SSE2)

_mm256_add_ps (AVX)
_mm256_add_pd (AVX)
_mm_add_epi8 (SSE2)
_mm_add_epi16 (SSE2)
_mm_add_si64 (SSE2)
_mm_add_epi64 (SSE2)

Sum With Saturation

_mm_adds_...
_mm_adds_epi8 (SSE2)
_mm_adds_epi16 (SSE2)
_mm_adds_epu8 (SSE2)
_mm_adds_epu16 (SSE2)
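
To show what saturation changes in practice, the sketch below adds the same unsigned bytes with the wrapping _mm_add_epi8 and with the saturating _mm_adds_epu8. The two function names are hypothetical.

```c
#include <emmintrin.h>  /* SSE2 */
#include <stdint.h>

/* Plain add: results wrap around modulo 256. */
void add_wrap_u8(const uint8_t a[16], const uint8_t b[16], uint8_t out[16])
{
    __m128i va = _mm_loadu_si128((const __m128i *)a);
    __m128i vb = _mm_loadu_si128((const __m128i *)b);
    _mm_storeu_si128((__m128i *)out, _mm_add_epi8(va, vb));
}

/* Saturating add: results are clamped to 255. */
void add_sat_u8(const uint8_t a[16], const uint8_t b[16], uint8_t out[16])
{
    __m128i va = _mm_loadu_si128((const __m128i *)a);
    __m128i vb = _mm_loadu_si128((const __m128i *)b);
    _mm_storeu_si128((__m128i *)out, _mm_adds_epu8(va, vb));
}
```

For 200 + 100 the wrapping version yields 44 (300 modulo 256), while the saturating version yields 255.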

Subtraction

SSE2, SSE3, AVX

_mm_sub_...      
_mm_sub_sd (SSE2)
_mm_sub_pd (SSE2)

_mm256_sub_pd (AVX)
_mm256_sub_ps (AVX)

Subtraction With Saturation

_mm_subs_...
_mm_subs_epi8 (SSE2)
_mm_subs_epi16 (SSE2)
_mm_subs_epu8 (SSE2)
_mm_subs_epu16 (SSE2)

Sum and Subtraction

SSE3, AVX

Subtracts the even-indexed elements and adds the odd-indexed ones, counting from 0 (out_0 = a_0 - b_0, out_1 = a_1 + b_1, ...) (SSE3):

_mm_addsub_ps (SSE3)
_mm_addsub_pd (SSE3)

_mm256_addsub_ps (AVX)
_mm256_addsub_pd (AVX)

Horizontal Sum

SSE3, SSSE3, AVX

out_i = a_{2*i} + a_{2*i+1}

64 bit register:

_mm_hadd_pi32 (SSSE3)
_mm_hadd_pi16 (SSSE3)

128 bit register:

_mm_hadd_epi16 (SSSE3)      
_mm_hadd_epi32 (SSSE3)

_mm_hadd_ps (SSE3)

256 bit register:
_mm256_hadd_pd 	(AVX)
_mm256_hadd_ps (AVX)

Horizontal Sum with Saturation

_mm_hadds_epi16 (SSSE3)
_mm_hadds_pi16 (SSSE3)

Horizontal Subtraction

SSE3, SSSE3, AVX

out_i = a_{2*i} - a_{2*i+1}

64 bit register:

_mm_hsub_pi32 (SSSE3)
_mm_hsub_pi16 (SSSE3)

128 bit register:

_mm_hsub_epi16 (SSSE3)
_mm_hsub_epi32 (SSSE3)

_mm_hsub_ps (SSE3)

256 bit register:

_mm256_hsub_pd (AVX)
_mm256_hsub_ps (AVX)

Horizontal Subtraction with Saturation

_mm_hsubs_epi16 (SSSE3)
_mm_hsubs_pi16 (SSSE3)

Multiplication

SSE2, SSE4

_mm_mul_su32 (SSE2)
_mm_mul_epu32 (SSE2)
_mm_mul_epi32 (SSE4)

Multiplication that keeps only the low bits of each product (the high bits are truncated):

_mm_mullo_epi16 (SSE2)
_mm_mullo_epi32 (SSE4)
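
The two kinds of multiplication differ in what happens to the product width. In the sketch below (hypothetical function names), _mm_mul_epu32 multiplies only 32-bit elements 0 and 2 and keeps the full 64-bit products, while _mm_mullo_epi16 multiplies all eight 16-bit elements and keeps only the low 16 bits of each product.

```c
#include <emmintrin.h>  /* SSE2 */
#include <stdint.h>

/* Widening multiply: out[0] = a[0]*b[0], out[1] = a[2]*b[2] (64-bit results). */
void mul_even_u32(const uint32_t a[4], const uint32_t b[4], uint64_t out[2])
{
    __m128i va = _mm_loadu_si128((const __m128i *)a);
    __m128i vb = _mm_loadu_si128((const __m128i *)b);
    _mm_storeu_si128((__m128i *)out, _mm_mul_epu32(va, vb));
}

/* Truncating multiply: out[i] = low 16 bits of a[i]*b[i]. */
void mullo_i16(const int16_t a[8], const int16_t b[8], int16_t out[8])
{
    __m128i va = _mm_loadu_si128((const __m128i *)a);
    __m128i vb = _mm_loadu_si128((const __m128i *)b);
    _mm_storeu_si128((__m128i *)out, _mm_mullo_epi16(va, vb));
}
```

For example, 300 * 300 = 90000 does not fit in 16 bits, so mullo_i16 stores 24464 (90000 modulo 65536).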

Multiplication and Horizontal Sum

SSE2, SSSE3

out_i = a_{2*i}*b_{2*i} + a_{2*i+1}*b_{2*i+1}

Multiplies the 8 signed 16-bit integers from a by the 8 signed 16-bit integers from b and adds adjacent pairs of products. The output is 4 signed 32-bit integers (128-bit registers).

_mm_madd_epi16 	(a,b)	(SSE2)

out_i = SATURATE ( a_{2*i}*b_{2*i} + a_{2*i+1}*b_{2*i+1} )

Input: 16 8-bit elements (a: unsigned, b: signed); output: eight 16-bit signed integers (128-bit registers)

_mm_maddubs_epi16 (a,b) (SSSE3)

Input: 8 8-bit elements (a: unsigned, b: signed); output: four 16-bit signed integers (64-bit registers)

_mm_maddubs_pi16 (a,b)	(SSSE3)
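
The _mm_madd_epi16 formula above is the core of an integer dot product. A minimal sketch for eight 16-bit elements follows; dot8_i16 is a hypothetical name, and the final reduction is done in scalar code for clarity rather than speed.

```c
#include <emmintrin.h>  /* SSE2 */
#include <stdint.h>

/* Dot product of two vectors of eight signed 16-bit integers. */
int32_t dot8_i16(const int16_t a[8], const int16_t b[8])
{
    __m128i va = _mm_loadu_si128((const __m128i *)a);
    __m128i vb = _mm_loadu_si128((const __m128i *)b);

    /* Four 32-bit partial sums: a0*b0+a1*b1, a2*b2+a3*b3, ... */
    __m128i partial = _mm_madd_epi16(va, vb);

    int32_t lanes[4];
    _mm_storeu_si128((__m128i *)lanes, partial);
    return lanes[0] + lanes[1] + lanes[2] + lanes[3];
}
```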


Division

_mm256_div_pd (AVX)
_mm256_div_ps (AVX)

Reciprocal

_mm256_rcp_ps (AVX)

Dot Product

_mm256_dp_ps (AVX)

Square Root

_mm256_sqrt_pd (AVX)
_mm256_sqrt_ps (AVX)

1/Square Root

_mm256_rsqrt_ps (AVX)

Average

SSE, SSE2

out_i = (a_i + b_i + 1)>>1;

64bit register

_mm_avg_pu8 (SSE)
_mm_avg_pu16 (SSE)

128 bit register

_mm_avg_epu8 (SSE2)
_mm_avg_epu16 (SSE2)
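
The +1 in the formula means the average rounds up rather than truncating. A short sketch (avg_u8 is a hypothetical name) that averages two 16-byte blocks, as one might do when blending two rows of 8-bit pixels:

```c
#include <emmintrin.h>  /* SSE2 */
#include <stdint.h>

/* out[i] = (a[i] + b[i] + 1) >> 1, computed without overflow. */
void avg_u8(const uint8_t a[16], const uint8_t b[16], uint8_t out[16])
{
    __m128i va = _mm_loadu_si128((const __m128i *)a);
    __m128i vb = _mm_loadu_si128((const __m128i *)b);
    _mm_storeu_si128((__m128i *)out, _mm_avg_epu8(va, vb));
}
```

With a_i = 1 and b_i = 2 the result is 2, not 1, because of the rounding term.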

Sum of Absolute Differences

SSE, SSE2

out_0 = abs(a_0 - b_0) + ... + abs(a_n - b_n)

_mm_sad_pu8 (SSE)
_mm_sad_epu8 (SSE2)

In the 128-bit case two 8-element SADs are computed and the results are stored in R0 and R4 (16-bit words 0 and 4 of the output register).
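
Since _mm_sad_epu8 leaves two partial sums in the register, a full 16-element SAD needs one extra addition. The sketch below shows one way to combine them; sad16_u8 is a hypothetical name.

```c
#include <emmintrin.h>  /* SSE2 */
#include <stdint.h>

/* Sum of absolute differences over 16 unsigned bytes. */
uint32_t sad16_u8(const uint8_t a[16], const uint8_t b[16])
{
    __m128i va = _mm_loadu_si128((const __m128i *)a);
    __m128i vb = _mm_loadu_si128((const __m128i *)b);

    /* One partial sum in the low 64 bits, one in the high 64 bits. */
    __m128i s = _mm_sad_epu8(va, vb);

    uint32_t low  = (uint32_t)_mm_cvtsi128_si32(s);
    /* Shift the register right by 8 bytes to bring the high sum down. */
    uint32_t high = (uint32_t)_mm_cvtsi128_si32(_mm_srli_si128(s, 8));
    return low + high;
}
```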

Absolute Value

SSSE3

out_i = abs(in_i)

128 bit register:

_mm_abs_epi8 (SSSE3)
_mm_abs_epi16 (SSSE3)
_mm_abs_epi32 (SSSE3)

64 bit register:

_mm_abs_pi8 (SSSE3)
_mm_abs_pi16 (SSSE3)
_mm_abs_pi32 (SSSE3)

Negation

SSSE3

out_i = a_i if b_i > 0, 0 if b_i = 0, -a_i if b_i < 0 (that is, out = -in wherever b is negative)

_mm_sign_epi8
_mm_sign_epi16
_mm_sign_epi32
_mm_sign_pi8
_mm_sign_pi16
_mm_sign_pi32

Rounding

SSE4.1

out = ceil(in)

_mm_ceil_pd
_mm_ceil_ps
_mm_ceil_sd
_mm_ceil_ss

out = floor(in)

_mm_floor_pd
_mm_floor_ps
_mm_floor_sd
_mm_floor_ss


out = round(in)

_mm_round_pd
_mm_round_ps
_mm_round_sd
_mm_round_ss

Max

SSE, SSE2, SSE4.1

out_i = max(a_i, b_i)

_mm_max_sd (SSE2)
_mm_max_pd (SSE2)
_mm_max_pi16 (SSE)
_mm_max_pu8 (SSE)
_mm_max_epi16 (SSE2)
_mm_max_epu8 (SSE2)
_mm_max_epi8 (SSE4.1)
_mm_max_epi32 (SSE4.1)
_mm_max_epu16 (SSE4.1)
_mm_max_epu32 (SSE4.1)

Min

out_i = min(a_i, b_i)

_mm_min_sd (SSE2)
_mm_min_pd (SSE2)

_mm_min_pi16 (SSE)
_mm_min_pu8 (SSE)
_mm_min_epi16 (SSE2)
_mm_min_epu8 (SSE2)
_mm_min_epi8 (SSE4.1)
_mm_min_epi32 (SSE4.1)
_mm_min_epu16 (SSE4.1)
_mm_min_epu32 (SSE4.1)

_mm_minpos_epu16 (SSE4.1)
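
Max and min are often used together to clamp values to a range. A minimal sketch using only the SSE2 16-bit variants follows; clamp8_i16 is a hypothetical name and the caller is assumed to pass lo <= hi.

```c
#include <emmintrin.h>  /* SSE2 */
#include <stdint.h>

/* out[i] = min(max(in[i], lo), hi) for eight signed 16-bit values. */
void clamp8_i16(const int16_t in[8], int16_t lo, int16_t hi, int16_t out[8])
{
    __m128i v = _mm_loadu_si128((const __m128i *)in);
    v = _mm_max_epi16(v, _mm_set1_epi16(lo));  /* raise values below lo */
    v = _mm_min_epi16(v, _mm_set1_epi16(hi));  /* lower values above hi */
    _mm_storeu_si128((__m128i *)out, v);
}
```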

Loading

SSE, SSE2

Loads one value and replicates it in all the other words:

_mm_load1_ps (SSE)

_mm_load1_pd (SSE2)

Loads 4 values (2 for _pd) from aligned memory:

_mm_load_ps (SSE)
_mm_load_pd (SSE2)

Loads 4 values (2 for _pd) from unaligned memory:

_mm_loadu_ps (SSE)
_mm_loadu_pd (SSE2)

Loads 4 values and reverses their order:

_mm_loadr_ps (SSE)

Loads 64 bits into the low half of a 128-bit register (SSE2):

_mm_loadl_epi64

Loads 128 bits from aligned memory (SSE2):

_mm_load_si128

Sets one value in all the words of the register:

_mm_set1_ps (SSE) *
_mm_set1_pd (SSE2) *

Creates a register of all zeros:

_mm_setzero_si64 (MMX)
_mm_setzero_si128 (SSE2)
_mm_setzero_ps (SSE)
_mm_setzero_pd (SSE2)
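
The load and set intrinsics typically appear together in a loop over an array. The sketch below (scale_ps is a hypothetical name) multiplies n floats by a constant; it uses the unaligned variants because nothing is assumed about the alignment of the caller's buffer, and handles the leftover elements in scalar code.

```c
#include <xmmintrin.h>  /* SSE */

/* Multiplies data[0..n-1] by k in place, four floats at a time. */
void scale_ps(float *data, int n, float k)
{
    const __m128 vk = _mm_set1_ps(k);  /* k replicated in all 4 words */
    int i = 0;

    for (; i + 4 <= n; i += 4) {
        __m128 v = _mm_loadu_ps(data + i);           /* unaligned load  */
        _mm_storeu_ps(data + i, _mm_mul_ps(v, vk));  /* unaligned store */
    }
    for (; i < n; ++i)  /* remaining 0..3 elements */
        data[i] *= k;
}
```

If the buffer is known to be 16-byte aligned, _mm_load_ps and _mm_store_ps can be used in place of the unaligned variants.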

Packing and Unpacking

SSE2

Interleaves the low (high) 8 bytes of a with the low (high) 8 bytes of b (SSE2):
_mm_unpacklo_epi8(a,b)
_mm_unpackhi_epi8(a,b)
Interleaves the 4 low (high) 16-bit words of a with the 4 low (high) 16-bit words of b (SSE2):
_mm_unpacklo_epi16(a,b)
_mm_unpackhi_epi16(a,b)
Interleaves the 2 low (high) 32-bit words of a with the 2 low (high) 32-bit words of b (SSE2):
_mm_unpacklo_epi32(a,b)
_mm_unpackhi_epi32(a,b)
Combines the low (high) 64 bits of a with the low (high) 64 bits of b (SSE2):
_mm_unpacklo_epi64(a,b)
_mm_unpackhi_epi64(a,b)

Packs 2 registers of signed 16-bit values (16 x 16 bit in total) into one register of 16 x 8 bit, with signed saturation (SSE2):
_mm_packs_epi16
Packs 2 registers of signed 32-bit values (8 x 32 bit in total) into one register of 8 x 16 bit, with signed saturation (SSE2):
_mm_packs_epi32

Packs 2 registers of signed 16-bit values (16 x 16 bit in total) into one register of 16 x 8 bit, saturating to the unsigned 8-bit range (SSE2):
_mm_packus_epi16
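
Unpack and pack are commonly paired: bytes are widened to 16 bits by interleaving with zero, processed, and then packed back with saturation. The following sketch adds a signed offset to 16 unsigned bytes; add_offset_u8 is a hypothetical name and the offset is assumed to be small enough (for example between -255 and 255) that the 16-bit sums cannot overflow.

```c
#include <emmintrin.h>  /* SSE2 */
#include <stdint.h>

/* out[i] = clamp(in[i] + offset, 0, 255) for 16 unsigned bytes. */
void add_offset_u8(const uint8_t in[16], int16_t offset, uint8_t out[16])
{
    const __m128i zero = _mm_setzero_si128();
    const __m128i voff = _mm_set1_epi16(offset);
    __m128i v = _mm_loadu_si128((const __m128i *)in);

    /* Interleaving with zero widens each byte to a 16-bit word. */
    __m128i lo = _mm_unpacklo_epi8(v, zero);  /* bytes 0..7  */
    __m128i hi = _mm_unpackhi_epi8(v, zero);  /* bytes 8..15 */

    lo = _mm_add_epi16(lo, voff);
    hi = _mm_add_epi16(hi, voff);

    /* Pack back to bytes; values outside 0..255 are saturated. */
    _mm_storeu_si128((__m128i *)out, _mm_packus_epi16(lo, hi));
}
```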






      

* these functions are composite: they do not correspond to a single assembly instruction but to a sequence of instructions.


Last Update: 15 NOV 2013