Articles for Developer by Paolo Medici [PMX.it]. See License on bottom of this page.

Intel now hosts a web page that lists all the intrinsics:

http://software.intel.com/sites/landingpage/IntrinsicsGuide/

That page is the best place to start.

Data Types

The __m64 data type can hold eight 8-bit values, four 16-bit values, two 32-bit values, or one 64-bit value.

The __m128 data type is used to represent the contents of a Streaming SIMD Extension register used by the Streaming SIMD Extension intrinsics. Conventionally, the __m128 data type can hold four 32-bit floating-point values, while the __m128d data type can hold two 64-bit floating-point values, and the __m128i data type can hold sixteen 8-bit, eight 16-bit, four 32-bit, or two 64-bit integer values.

The __m256 data type is used to represent the contents of the extended SSE register - the YMM register, used by the Intel® AVX intrinsics. The __m256 data type can hold eight 32-bit floating-point values, while the __m256d data type can hold four 64-bit double precision floating-point values, and the __m256i data type can hold thirty-two 8-bit, sixteen 16-bit, eight 32-bit, or four 64-bit integer values [1]

Syntax for SSE*

_mm_<intrin op>_<suffix>

The suffix is composed of a prefix, one of p (packed, all the elements), ep (extended packed) or s (scalar, only the first element of the vector), followed by a type string:
s (float), d (double), i128 (integer 128bit), i64 (integer 64bit), u64 (unsigned 64bit), i32 (integer 32bit), u32 (unsigned 32bit), i16 (integer 16bit), u16 (unsigned 16bit), i8 (integer 8bit), u8 (unsigned 8bit)

Valid combinations are

	s	d	i128	i64	u64	i32	u32	i16	u16	i8	u8
p	ps	pd				pi32	pu32	pi16	pu16	pi8	pu8
ep				epi64		epi32	epu32	epi16	epu16	epi8	epu8
s	ss	sd	si128	si64
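As an illustration of the naming scheme above, the following sketch applies the same operation (add) with three different suffixes. The function names are hypothetical helpers, not part of any library, and only SSE/SSE2 intrinsics are used.

```c
#include <emmintrin.h>  /* SSE2; also pulls in the SSE declarations */

/* Suffix ps: packed single precision, all four floats are added. */
void add_ps_demo(const float a[4], const float b[4], float out[4])
{
    _mm_storeu_ps(out, _mm_add_ps(_mm_loadu_ps(a), _mm_loadu_ps(b)));
}

/* Suffix ss: scalar single precision, only element 0 is added;
   elements 1..3 of the result are copied from a. */
void add_ss_demo(const float a[4], const float b[4], float out[4])
{
    _mm_storeu_ps(out, _mm_add_ss(_mm_loadu_ps(a), _mm_loadu_ps(b)));
}

/* Suffix epi32: extended packed, four 32-bit integers are added. */
void add_epi32_demo(const int a[4], const int b[4], int out[4])
{
    __m128i va = _mm_loadu_si128((const __m128i *)a);
    __m128i vb = _mm_loadu_si128((const __m128i *)b);
    _mm_storeu_si128((__m128i *)out, _mm_add_epi32(va, vb));
}
```

The unaligned load and store intrinsics are used here only so that the helpers accept any pointer.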

This document covers intrinsics from the following instruction sets:

SSE
SSE2
SSE3
SSSE3
SSE4 (SSE4.1)

Syntax for AVX*

_mm256_<intrin_op>_<suffix>

Intrinsics

Bit Operations | Operazioni su bit

SSE (packed floating point), SSE2 (128bit operation)

_mm_and_ps 	(SSE)
_mm_and_pd 	(SSE2)
_mm_and_si128 	(SSE2)
_mm256_and_ps 	(AVX)
_mm256_and_pd 	(AVX) 

_mm_andnot_ps 	(SSE)
_mm_andnot_pd 	(SSE2)
_mm_andnot_si128 	(SSE2)
_mm256_andnot_ps 	(AVX)
_mm256_andnot_pd 	(AVX)


_mm_or_ps 	(SSE)
_mm_or_pd 	(SSE2)
_mm_or_si128 	(SSE2)
_mm256_or_ps 	(AVX)
_mm256_or_pd 	(AVX)


_mm_xor_ps 	(SSE)
_mm_xor_pd 	(SSE2) 
_mm_xor_si128 	(SSE2)
_mm256_xor_ps 	(AVX)
_mm256_xor_pd 	(AVX)
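A common use of the bit operations above on floating-point data is sign manipulation. The sketch below computes the absolute value of four floats by clearing the sign bits with _mm_andnot_ps; the helper name abs4_ps is an assumption made for this example.

```c
#include <xmmintrin.h>  /* SSE */

/* Hypothetical helper: |x| for four floats.
   _mm_andnot_ps(m, x) computes (~m) & x, and -0.0f has only the
   sign bit set, so the sign bit of every element is cleared. */
void abs4_ps(const float in[4], float out[4])
{
    const __m128 sign_mask = _mm_set1_ps(-0.0f);
    _mm_storeu_ps(out, _mm_andnot_ps(sign_mask, _mm_loadu_ps(in)));
}
```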

Comparison | Confronto

SSE, SSE2

The comparison produces a register in which each element is set to all ones (0xffffffff for 32-bit elements) if the comparison succeeded, or to 0x0 otherwise.

_mm_cmpeq_ss	(SSE)
_mm_cmpeq_ps	(SSE)

_mm_cmpeq_pd	(SSE2)

_mm_cmpeq_epi8	(SSE2)
_mm_cmpeq_epi16	(SSE2)
_mm_cmpeq_epi32	(SSE2)

In place of <eq>, one of the other comparison operators can be used (for example lt or gt).

The following comparison returns an integer instead: 0x1 if the comparison succeeded, or 0x0 otherwise.

_mm_comieq_sd (SSE2)
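The two result styles described above can be sketched as follows. The first helper uses the all-ones mask of a packed compare to pick elements without branching; the second uses a comi intrinsic, which returns a plain int. Both helper names are hypothetical.

```c
#include <emmintrin.h>  /* SSE2 */

/* out_i = (a_i < b_i) ? a_i : b_i, built from a compare mask. */
void select_smaller_ps(const float a[4], const float b[4], float out[4])
{
    __m128 va = _mm_loadu_ps(a);
    __m128 vb = _mm_loadu_ps(b);
    __m128 mask = _mm_cmplt_ps(va, vb);            /* all ones where a_i < b_i */
    __m128 r = _mm_or_ps(_mm_and_ps(mask, va),     /* a_i where the mask is set */
                         _mm_andnot_ps(mask, vb)); /* b_i elsewhere */
    _mm_storeu_ps(out, r);
}

/* Scalar compare of the low doubles: the result is an int, 1 or 0. */
int doubles_equal(double x, double y)
{
    return _mm_comieq_sd(_mm_set_sd(x), _mm_set_sd(y));
}
```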

Sum | Somma

SSE2, SSE3, AVX

_mm_add_sd (SSE2)
_mm_add_pd (SSE2)

_mm256_add_ps (AVX)
_mm256_add_pd (AVX)

_mm_add_epi8 (SSE2)

_mm_add_epi16 (SSE2)
_mm_add_si64 (SSE2)
_mm_add_epi64 (SSE2)

Sum With Saturation | Somma Con Saturazione

_mm_adds_...
_mm_adds_epi8 (SSE2)
_mm_adds_epi16 (SSE2)
_mm_adds_epu8 (SSE2)
_mm_adds_epu16 (SSE2)
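To show the difference between the plain and the saturating sum, this sketch adds sixteen unsigned bytes both ways. The helper names are assumptions made for the example.

```c
#include <emmintrin.h>  /* SSE2 */
#include <stdint.h>

/* Plain add: results wrap around modulo 256. */
void add_wrap_u8(const uint8_t a[16], const uint8_t b[16], uint8_t out[16])
{
    __m128i va = _mm_loadu_si128((const __m128i *)a);
    __m128i vb = _mm_loadu_si128((const __m128i *)b);
    _mm_storeu_si128((__m128i *)out, _mm_add_epi8(va, vb));
}

/* Saturating add: results above 255 are clamped to 255. */
void add_sat_u8(const uint8_t a[16], const uint8_t b[16], uint8_t out[16])
{
    __m128i va = _mm_loadu_si128((const __m128i *)a);
    __m128i vb = _mm_loadu_si128((const __m128i *)b);
    _mm_storeu_si128((__m128i *)out, _mm_adds_epu8(va, vb));
}
```

With inputs 200 and 100, the plain version yields 44 (300 modulo 256) while the saturating version yields 255.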

Subtraction | Sottrazione

SSE2, SSE3, AVX

_mm_sub_...      
_mm_sub_sd (SSE2)
_mm_sub_pd (SSE2)

_mm256_sub_pd (AVX)
_mm256_sub_ps (AVX)

Subtraction With Saturation | Sottrazione Con Saturazione

_mm_subs_...
_mm_subs_epi8 (SSE2)
_mm_subs_epi16 (SSE2)
_mm_subs_epu8 (SSE2)
_mm_subs_epu16 (SSE2)
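A minimal sketch of the saturating subtraction on signed 16-bit values, using _mm_subs_epi16: results outside the range [-32768, 32767] are clamped to the nearest limit instead of wrapping. The helper name sub_sat_i16 is hypothetical.

```c
#include <emmintrin.h>  /* SSE2 */
#include <stdint.h>

/* out_i = clamp(a_i - b_i, -32768, 32767) for eight signed 16-bit values. */
void sub_sat_i16(const int16_t a[8], const int16_t b[8], int16_t out[8])
{
    __m128i va = _mm_loadu_si128((const __m128i *)a);
    __m128i vb = _mm_loadu_si128((const __m128i *)b);
    _mm_storeu_si128((__m128i *)out, _mm_subs_epi16(va, vb));
}
```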

Sum and Subtraction | Somma e Sottrazione

SSE3, AVX

Subtraction on the even-indexed elements and sum on the odd-indexed elements, counting from 0: out_0 = a_0 - b_0, out_1 = a_1 + b_1, ... (SSE3):

_mm_addsub_ps (SSE3)
_mm_addsub_pd (SSE3) 

_mm256_addsub_ps (AVX)
_mm256_addsub_pd (AVX)

Horizontal Sum | Somma Orizzontale

SSE3, SSSE3, AVX

out_i = a_{2*i} + a_{2*i+1}

64 bit register:

_mm_hadd_pi32 (SSSE3)
_mm_hadd_pi16 (SSSE3)

128 bit register:

_mm_hadd_epi16 (SSSE3)      
_mm_hadd_epi32 (SSSE3)

_mm_hadd_ps (SSE3)

256 bit register:

_mm256_hadd_pd 	(AVX)
_mm256_hadd_ps 	(AVX)

Horizontal Sum with Saturation | Somma Orizzontale con Saturazione

_mm_hadds_epi16 (SSSE3)

_mm_hsubs_epi16 (SSSE3)

Horizontal Subtraction | Sottrazione Orizzontale

SSE3, SSSE3

out_i = a_{2*i} - a_{2*i+1}

64 bit register:

_mm_hsub_pi32 (SSSE3)
_mm_hsub_pi16 (SSSE3)

128 bit register:

_mm_hsub_epi16 (SSSE3)

_mm_hsub_epi32 (SSSE3)

_mm_hsub_ps (SSE3)

256 bit register:

_mm256_hsub_pd	(AVX)
_mm256_hsub_ps	(AVX)

Horizontal Subtraction with Saturation | Sottrazione Orizzontale con Saturazione

 
_mm_hsubs_pi16 (SSSE3)
_mm_hadds_pi16 (SSSE3)

Multiplication | Moltiplicazione

SSE2, SSE4


_mm_mul_su32 (SSE2)
_mm_mul_epu32 (SSE2)

_mm_mul_epi32 (SSE4)

Multiplication keeping only the low bits of each product (the high bits are truncated):

_mm_mullo_epi16 (SSE2)

_mm_mullo_epi32 (SSE4)

Multiplication and Horizontal Sum | Moltiplicazione e Somma orizzontale

SSE2, SSSE3

out_i = a_{2*i}*b_{2*i} + a_{2*i+1}*b_{2*i+1}

Multiplies the 8 signed 16-bit integers of a by the 8 signed 16-bit integers of b and sums each adjacent pair of products, producing 4 signed 32-bit integer results (128-bit registers).

_mm_madd_epi16 	(a,b)	(SSE2)
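The formula above makes _mm_madd_epi16 a natural building block for integer dot products. The sketch below computes the dot product of two vectors of eight 16-bit integers; the helper name dot8_i16 is an assumption, and the final reduction is done in scalar code for clarity.

```c
#include <emmintrin.h>  /* SSE2 */
#include <stdint.h>

/* Dot product of two vectors of eight signed 16-bit integers. */
int32_t dot8_i16(const int16_t a[8], const int16_t b[8])
{
    __m128i va = _mm_loadu_si128((const __m128i *)a);
    __m128i vb = _mm_loadu_si128((const __m128i *)b);
    /* Four 32-bit values: a0*b0 + a1*b1, a2*b2 + a3*b3, ... */
    __m128i pairs = _mm_madd_epi16(va, vb);
    int32_t p[4];
    _mm_storeu_si128((__m128i *)p, pairs);
    return p[0] + p[1] + p[2] + p[3];
}
```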

out_i = SATURATE ( a_{2*i}*b_{2*i} + a_{2*i+1}*b_{2*i+1} )

Input: sixteen 8-bit elements (a: unsigned, b: signed); output: eight 16-bit signed integers (128-bit registers)

_mm_maddubs_epi16 (a,b)	(SSSE3)

Input: eight 8-bit elements (a: unsigned, b: signed); output: four 16-bit signed integers (64-bit registers)

_mm_maddubs_pi16 (a,b)	(SSSE3)

Division | Divisione

_mm256_div_pd (AVX)
_mm256_div_ps (AVX)

Reciprocal | Reciproco

_mm256_rcp_ps (AVX)

Dot Product | Prodotto Scalare

_mm256_dp_ps (AVX)

Square Root | Radice Quadrata

_mm256_sqrt_pd (AVX)
_mm256_sqrt_ps (AVX)

1/Square Root | Reciproco della Radice Quadrata

_mm256_rsqrt_ps (AVX)

Average | Media

SSE, SSE2

out_i = (a_i + b_i + 1)>>1;

64bit register

_mm_avg_pu8 (SSE)
_mm_avg_pu16 (SSE)

128 bit register

_mm_avg_epu8 (SSE2)
_mm_avg_epu16 (SSE2)
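As a small check of the rounding in the average formula above, this sketch averages sixteen unsigned bytes with _mm_avg_epu8. The helper name avg_u8 is hypothetical.

```c
#include <emmintrin.h>  /* SSE2 */
#include <stdint.h>

/* out_i = (a_i + b_i + 1) >> 1, computed without intermediate overflow. */
void avg_u8(const uint8_t a[16], const uint8_t b[16], uint8_t out[16])
{
    __m128i va = _mm_loadu_si128((const __m128i *)a);
    __m128i vb = _mm_loadu_si128((const __m128i *)b);
    _mm_storeu_si128((__m128i *)out, _mm_avg_epu8(va, vb));
}
```

Because of the +1 term the result rounds up, so the average of 1 and 2 is 2, and the average of 255 and 255 is still 255.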

Sum of Absolute Value | Somma in valore assoluto

SSE, SSE2

out_0 = abs(a_0 - b_0) + ... + abs(a_n - b_n)

_mm_sad_pu8 (SSE)

_mm_sad_epu8 (SSE2)

In this case two SADs of 8 elements each are computed, and the results are stored in the 16-bit words R0 and R4 of the output.
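A minimal sketch of a full 16-byte SAD built on _mm_sad_epu8: the two partial sums are read from 16-bit words 0 and 4 of the result and added together. The helper name sad16_u8 is an assumption made for this example.

```c
#include <emmintrin.h>  /* SSE2 */
#include <stdint.h>

/* Sum of absolute differences over sixteen unsigned bytes. */
unsigned sad16_u8(const uint8_t a[16], const uint8_t b[16])
{
    __m128i va = _mm_loadu_si128((const __m128i *)a);
    __m128i vb = _mm_loadu_si128((const __m128i *)b);
    __m128i s = _mm_sad_epu8(va, vb);
    uint16_t r[8];
    _mm_storeu_si128((__m128i *)r, s);
    /* r[0]: SAD of bytes 0..7, r[4]: SAD of bytes 8..15 */
    return (unsigned)r[0] + (unsigned)r[4];
}
```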

Absolute Value | Valore Assoluto

SSSE3

out_i = abs(in_i)

128 bit register:

_mm_abs_epi8 (SSSE3)
_mm_abs_epi16 (SSSE3)
_mm_abs_epi32 (SSSE3)

64 bit register:

_mm_abs_pi8 (SSSE3)
_mm_abs_pi16 (SSSE3)
_mm_abs_pi32 (SSSE3)

Negation | Negazione

SSSE3

out_i = -a_i if b_i < 0, 0 if b_i == 0, a_i if b_i > 0

_mm_sign_epi8
_mm_sign_epi16
_mm_sign_epi32
_mm_sign_pi8
_mm_sign_pi16
_mm_sign_pi32

Rounding | Arrotondamento

SSE 4.1

out = ceil(in)

_mm_ceil_pd
_mm_ceil_ps
_mm_ceil_sd
_mm_ceil_ss

out = floor(in)

_mm_floor_pd
_mm_floor_ps
_mm_floor_sd
_mm_floor_ss

out = round(in)

_mm_round_pd
_mm_round_ps
_mm_round_sd
_mm_round_ss

Max | Massimo

SSE, SSE2, SSE4.1

out_i = max(a_i, b_i)

_mm_max_sd (SSE2)
_mm_max_pd (SSE2)

_mm_max_pi16 (SSE)
_mm_max_pu8 (SSE)

_mm_max_epi16 (SSE2)
_mm_max_epu8 (SSE2)

_mm_max_epi8 (SSE4.1)
_mm_max_epi32 (SSE4.1)
_mm_max_epu16 (SSE4.1)
_mm_max_epu32 (SSE4.1)

Min | Minimo

out_i = min(a_i, b_i)

_mm_min_sd (SSE2)
_mm_min_pd (SSE2)

_mm_min_pi16 (SSE)
_mm_min_pu8 (SSE)

_mm_min_epi16 (SSE2)
_mm_min_epu8 (SSE2)

_mm_min_epi8 (SSE4.1)
_mm_min_epi32 (SSE4.1)
_mm_min_epu16 (SSE4.1)
_mm_min_epu32 (SSE4.1)

_mm_minpos_epu16 (SSE4.1)
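Max and min are often combined to clamp values to a range. The sketch below clamps eight signed 16-bit integers to [lo, hi] using only SSE2 intrinsics; the helper name clamp_i16 is hypothetical.

```c
#include <emmintrin.h>  /* SSE2 */
#include <stdint.h>

/* out_i = min(max(in_i, lo), hi) for eight signed 16-bit values. */
void clamp_i16(const int16_t in[8], int16_t lo, int16_t hi, int16_t out[8])
{
    __m128i v = _mm_loadu_si128((const __m128i *)in);
    v = _mm_max_epi16(v, _mm_set1_epi16(lo));  /* raise values below lo */
    v = _mm_min_epi16(v, _mm_set1_epi16(hi));  /* lower values above hi */
    _mm_storeu_si128((__m128i *)out, v);
}
```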

Loading

SSE, SSE2

Loads one value and replicates it into all the other elements

_mm_load1_ps (SSE)

_mm_load1_pd (SSE2)

Loads 4 floats (2 doubles for _pd) from aligned memory

_mm_load_ps (SSE)

_mm_load_pd (SSE2)

Loads 4 floats (2 doubles for _pd) from unaligned memory

_mm_loadu_ps (SSE)

_mm_loadu_pd (SSE2)

Loads 4 values and reverses their order

_mm_loadr_ps (SSE)

Loads 64 bits into the low half of a 128-bit register, zeroing the high half (SSE2)

_mm_loadl_epi64

Loads 128 bits (aligned memory) (SSE2)

_mm_load_si128

Sets all 4 elements (2 for _pd) to the same value

_mm_set1_ps (SSE) *
_mm_set1_pd (SSE2) *

Creates a register of all zeros:

_mm_setzero_si64 (MMX)
_mm_setzero_si128 (SSE2)
_mm_setzero_ps (SSE)
_mm_setzero_pd (SSE2)
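A typical pattern that combines the loading and set intrinsics above is a loop that processes four floats per iteration. The sketch below multiplies an array by a constant; the helper name scale_ps is hypothetical, and _mm_mul_ps and _mm_storeu_ps are SSE intrinsics not listed in this section. Unaligned loads are used so that no alignment is assumed for the input.

```c
#include <stddef.h>
#include <xmmintrin.h>  /* SSE */

/* Multiplies n floats in place by k, four at a time. */
void scale_ps(float *data, size_t n, float k)
{
    const __m128 vk = _mm_set1_ps(k);  /* k replicated in all 4 elements */
    size_t i = 0;
    for (; i + 4 <= n; i += 4) {
        __m128 v = _mm_loadu_ps(data + i);
        _mm_storeu_ps(data + i, _mm_mul_ps(v, vk));
    }
    for (; i < n; ++i)                 /* scalar tail, fewer than 4 left */
        data[i] *= k;
}
```

If the buffer is known to be 16-byte aligned, _mm_load_ps can replace _mm_loadu_ps in the main loop.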

Packing and Unpacking

SSE2

Interleaves the low (high) 8 bytes of a with the low (high) 8 bytes of b (SSE2)

_mm_unpacklo_epi8(a,b)
_mm_unpackhi_epi8(a,b)

Interleaves the 4 low (high) 16-bit words of a with the 4 low (high) 16-bit words of b (SSE2)

_mm_unpacklo_epi16(a,b)
_mm_unpackhi_epi16(a,b)

Interleaves the 2 low (high) 32-bit words of a with the 2 low (high) 32-bit words of b (SSE2)

_mm_unpacklo_epi32(a,b)
_mm_unpackhi_epi32(a,b)

Combines the low (high) 64 bits of a with the low (high) 64 bits of b (SSE2)

_mm_unpacklo_epi64(a,b)
_mm_unpackhi_epi64(a,b)
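Interleaving with a register of zeros is a common way to widen unsigned data. The sketch below converts sixteen unsigned bytes to sixteen unsigned 16-bit words; the helper name widen_u8_to_u16 is hypothetical, and the byte order relies on x86 being little-endian.

```c
#include <emmintrin.h>  /* SSE2 */
#include <stdint.h>

/* Zero-extends sixteen 8-bit values to sixteen 16-bit values. */
void widen_u8_to_u16(const uint8_t in[16], uint16_t out[16])
{
    __m128i v = _mm_loadu_si128((const __m128i *)in);
    __m128i zero = _mm_setzero_si128();
    /* Each input byte becomes the low byte of a word whose high byte is 0. */
    _mm_storeu_si128((__m128i *)out, _mm_unpacklo_epi8(v, zero));
    _mm_storeu_si128((__m128i *)(out + 8), _mm_unpackhi_epi8(v, zero));
}
```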

Packs 2 registers of signed 16-bit values (16 x 16-bit in total) into one register of 16 x 8-bit signed values, with saturation (SSE2)

_mm_packs_epi16

Packs 2 registers of signed 32-bit values (8 x 32-bit in total) into one register of 8 x 16-bit signed values, with saturation (SSE2)

_mm_packs_epi32

Packs 2 registers of signed 16-bit values (16 x 16-bit in total) into one register of 16 x 8-bit unsigned values, with unsigned saturation (SSE2)

_mm_packus_epi16
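The saturation performed while packing can be sketched as follows: sixteen signed 16-bit values are narrowed to unsigned bytes with _mm_packus_epi16, so negative inputs become 0 and inputs above 255 become 255. The helper name narrow_i16_to_u8 is an assumption made for this example.

```c
#include <emmintrin.h>  /* SSE2 */
#include <stdint.h>

/* Narrows sixteen signed 16-bit values to unsigned bytes, clamping to [0, 255]. */
void narrow_i16_to_u8(const int16_t in[16], uint8_t out[16])
{
    __m128i lo = _mm_loadu_si128((const __m128i *)in);        /* values 0..7  */
    __m128i hi = _mm_loadu_si128((const __m128i *)(in + 8));  /* values 8..15 */
    _mm_storeu_si128((__m128i *)out, _mm_packus_epi16(lo, hi));
}
```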

* These functions are composite: they do not correspond to a single assembly instruction but to a sequence of instructions.