Intel now hosts a web page that lists all the intrinsics:
http://software.intel.com/sites/landingpage/IntrinsicsGuide/
That page is the best place to start.
Data Types
The __m64 data type can hold eight 8-bit values, four 16-bit values, two 32-bit values, or one 64-bit value.
The __m128 data type is used to represent the contents of a Streaming SIMD Extension register used by the Streaming SIMD Extension intrinsics. Conventionally, the __m128 data type can hold four 32-bit floating-point values, while the __m128d data type can hold two 64-bit floating-point values, and the __m128i data type can hold sixteen 8-bit, eight 16-bit, four 32-bit, or two 64-bit integer values.
The __m256 data type is used to represent the contents of the extended SSE register, the YMM register, used by the Intel® AVX intrinsics. The __m256 data type can hold eight 32-bit floating-point values, while the __m256d data type can hold four 64-bit double precision floating-point values, and the __m256i data type can hold thirty-two 8-bit, sixteen 16-bit, eight 32-bit, or four 64-bit integer values [1].
Syntax for SSE*
_mm_<intrin_op>_<suffix>
The suffix is composed of a prefix, p (packed: all the elements), ep (extended packed), or s (scalar: only the first element of the vector), followed by a type string:
s (float), d (double), i128 (integer 128bit), i64 (integer 64bit), u64 (unsigned 64bit), i32 (integer 32bit), u32 (unsigned 32bit), i16 (integer 16bit), u16 (unsigned 16bit), i8 (integer 8bit), u8 (unsigned 8bit)
   | s  | d  | i128  | i64   | u64 | i32   | u32   | i16   | u16   | i8   | u8
p  | ps | pd |       |       |     | pi32  | pu32  | pi16  | pu16  | pi8  | pu8
ep |    |    |       | epi64 |     | epi32 | epu32 | epi16 | epu16 | epi8 | epu8
s  | ss |    | si128 | si64  |     |       |       |       |       |      |
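As a rough illustration of the naming scheme, the sketch below contrasts a packed (ps), a scalar (ss) and an extended packed integer (epi32) addition. The wrapper names add_packed, add_scalar and add_epi32 are hypothetical and only serve this example; it assumes an x86 compiler with SSE2 available.

```c
#include <xmmintrin.h>  /* SSE:  __m128, _mm_add_ps, _mm_add_ss */
#include <emmintrin.h>  /* SSE2: __m128i, _mm_add_epi32 */

/* ps = packed single: all four float elements are added. */
void add_packed(const float a[4], const float b[4], float out[4])
{
    _mm_storeu_ps(out, _mm_add_ps(_mm_loadu_ps(a), _mm_loadu_ps(b)));
}

/* ss = scalar single: only element 0 is added, elements 1..3 are copied from a. */
void add_scalar(const float a[4], const float b[4], float out[4])
{
    _mm_storeu_ps(out, _mm_add_ss(_mm_loadu_ps(a), _mm_loadu_ps(b)));
}

/* epi32 = extended packed 32-bit integers: four int elements in a 128-bit register. */
void add_epi32(const int a[4], const int b[4], int out[4])
{
    __m128i va = _mm_loadu_si128((const __m128i *)a);
    __m128i vb = _mm_loadu_si128((const __m128i *)b);
    _mm_storeu_si128((__m128i *)out, _mm_add_epi32(va, vb));
}
```

With a = {1, 2, 3, 4} and b = {10, 20, 30, 40}, the packed version changes every element, while the scalar version changes only the first one.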
This document covers the intrinsics of the following instruction sets:
SSE
SSE2
SSE3
SSSE3
SSE4 (SSE4.1)
Syntax for AVX*
_mm256_<intrin_op>_<suffix>
Intrinsics
Bit Operations
SSE (packed floating point), SSE2 (128bit operation)
_mm_and_ps (SSE)
_mm_and_pd (SSE2)
_mm_and_si128 (SSE2)
_mm256_and_ps (AVX)
_mm256_and_pd (AVX)
_mm_andnot_ps (SSE)
_mm_andnot_pd (SSE2)
_mm_andnot_si128 (SSE2)
_mm256_andnot_ps (AVX)
_mm256_andnot_pd (AVX)
_mm_or_ps (SSE)
_mm_or_pd (SSE2)
_mm_or_si128 (SSE2)
_mm256_or_ps (AVX)
_mm256_or_pd (AVX)
_mm_xor_ps (SSE)
_mm_xor_pd (SSE2)
_mm_xor_si128 (SSE2)
_mm256_xor_ps (AVX)
_mm256_xor_pd (AVX)
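One common use of the bitwise intrinsics on floating-point registers is clearing the sign bit. The following is a minimal sketch, assuming SSE is available; the function name abs4_ps is hypothetical and not part of any library.

```c
#include <xmmintrin.h>  /* SSE */

/* out[i] = |in[i]| for four floats, obtained by clearing the sign bit. */
void abs4_ps(const float in[4], float out[4])
{
    /* -0.0f has only the sign bit (bit 31) set in every element. */
    const __m128 sign_mask = _mm_set1_ps(-0.0f);
    __m128 x = _mm_loadu_ps(in);
    /* _mm_andnot_ps(m, x) computes (~m) & x, so the sign bit is dropped. */
    _mm_storeu_ps(out, _mm_andnot_ps(sign_mask, x));
}
```

Note the argument order of the andnot family: the first operand is the one that gets inverted.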
Comparison
SSE, SSE2
The comparison produces an output register in which each element has all bits set (0xffffffff for a 32-bit element) if the comparison succeeded, or 0x0 otherwise.
_mm_cmpeq_ss (SSE)
_mm_cmpeq_ps (SSE)
_mm_cmpeq_pd (SSE2)
_mm_cmpeq_epi8 (SSE2)
_mm_cmpeq_epi16 (SSE2)
_mm_cmpeq_epi32 (SSE2)
In place of <eq>, one of the many other comparison operators can be used.
The following comparison returns 0x1 if the comparison succeeded, or 0x0 otherwise.
_mm_comieq_sd (SSE2)
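Because the packed comparisons return a mask rather than a boolean, they are typically combined with the bitwise intrinsics to select values without branching. The sketch below shows one way to do this, together with the integer result of the comi family; select_min4 and low_doubles_equal are hypothetical names, and SSE2 is assumed.

```c
#include <xmmintrin.h>  /* SSE */
#include <emmintrin.h>  /* SSE2 */

/* out[i] = (a[i] < b[i]) ? a[i] : b[i], built from a comparison mask. */
void select_min4(const float a[4], const float b[4], float out[4])
{
    __m128 va = _mm_loadu_ps(a);
    __m128 vb = _mm_loadu_ps(b);
    __m128 mask = _mm_cmplt_ps(va, vb);             /* all ones where a < b */
    __m128 r = _mm_or_ps(_mm_and_ps(mask, va),      /* keep a where the mask is set */
                         _mm_andnot_ps(mask, vb));  /* keep b elsewhere */
    _mm_storeu_ps(out, r);
}

/* The comi family returns a plain int (1 or 0) for the lowest element only. */
int low_doubles_equal(double x, double y)
{
    return _mm_comieq_sd(_mm_set_sd(x), _mm_set_sd(y));
}
```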
Sum
SSE2, SSE3, AVX
_mm_add_sd (SSE2)
_mm_add_pd (SSE2)
_mm256_add_ps (AVX)
_mm256_add_pd (AVX)
_mm_add_epi8 (SSE2)
_mm_add_epi16 (SSE2)
_mm_add_si64 (SSE2)
_mm_add_epi64 (SSE2)
Sum With Saturation
_mm_adds_...
_mm_adds_epi8 (SSE2)
_mm_adds_epi16 (SSE2)
_mm_adds_epu8 (SSE2)
_mm_adds_epu16 (SSE2)
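The difference between the plain and the saturating addition is easiest to see on unsigned bytes that overflow. This is a small sketch assuming SSE2; the wrapper names add_wrap_u8 and add_sat_u8 are hypothetical.

```c
#include <emmintrin.h>  /* SSE2 */
#include <stdint.h>

/* Plain addition of 16 bytes: the result wraps around modulo 256. */
void add_wrap_u8(const uint8_t a[16], const uint8_t b[16], uint8_t out[16])
{
    __m128i va = _mm_loadu_si128((const __m128i *)a);
    __m128i vb = _mm_loadu_si128((const __m128i *)b);
    _mm_storeu_si128((__m128i *)out, _mm_add_epi8(va, vb));
}

/* Saturating unsigned addition of 16 bytes: the result is clamped to 255. */
void add_sat_u8(const uint8_t a[16], const uint8_t b[16], uint8_t out[16])
{
    __m128i va = _mm_loadu_si128((const __m128i *)a);
    __m128i vb = _mm_loadu_si128((const __m128i *)b);
    _mm_storeu_si128((__m128i *)out, _mm_adds_epu8(va, vb));
}
```

For example, 200 + 100 gives 44 with the wrapping version and 255 with the saturating one.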
Subtraction
SSE2, SSE3, AVX
_mm_sub_...
_mm_sub_sd (SSE2)
_mm_sub_pd (SSE2)
_mm256_sub_pd (AVX)
_mm256_sub_ps (AVX)
Subtraction With Saturation
_mm_subs_...
_mm_subs_epi8 (SSE2)
_mm_subs_epi16 (SSE2)
_mm_subs_epu8 (SSE2)
_mm_subs_epu16 (SSE2)
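Unsigned saturating subtraction clamps negative results to 0, which allows an absolute difference to be built from two subtractions. The following is a sketch of that idea, assuming SSE2; absdiff_u8 is a hypothetical name.

```c
#include <emmintrin.h>  /* SSE2 */
#include <stdint.h>

/* out[i] = |a[i] - b[i]| for 16 unsigned bytes. */
void absdiff_u8(const uint8_t a[16], const uint8_t b[16], uint8_t out[16])
{
    __m128i va = _mm_loadu_si128((const __m128i *)a);
    __m128i vb = _mm_loadu_si128((const __m128i *)b);
    __m128i d1 = _mm_subs_epu8(va, vb);  /* a - b where a > b, otherwise 0 */
    __m128i d2 = _mm_subs_epu8(vb, va);  /* b - a where b > a, otherwise 0 */
    /* At most one of d1 and d2 is non-zero per element, so OR merges them. */
    _mm_storeu_si128((__m128i *)out, _mm_or_si128(d1, d2));
}
```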
Sum and Subtraction
SSE3, AVX
Subtraction on the even-indexed elements, sum on the odd-indexed elements, with indices counted from 0 (out_0 = a_0 - b_0, out_1 = a_1 + b_1, ...):
_mm_addsub_ps (SSE3)
_mm_addsub_pd (SSE3)
_mm256_addsub_ps (AVX)
_mm256_addsub_pd (AVX)
Horizontal Sum
SSE3, SSSE3, AVX
out_i = a_{2*i} + a_{2*i+1}
64 bit register:
_mm_hadd_pi32 (SSSE3)
_mm_hadd_pi16 (SSSE3)
128 bit register:
_mm_hadd_epi16 (SSSE3)
_mm_hadd_epi32 (SSSE3)
_mm_hadd_ps (SSE3)
256 bit register:
_mm256_hadd_pd (AVX)
_mm256_hadd_ps (AVX)
Horizontal Sum with Saturation
_mm_hadds_epi16 (SSSE3)
_mm_hadds_pi16 (SSSE3)
Horizontal Subtraction
SSE3, SSSE3, AVX
out_i = a_{2*i} - a_{2*i+1}
64 bit register:
_mm_hsub_pi32 (SSSE3)
_mm_hsub_pi16 (SSSE3)
128 bit register:
_mm_hsub_epi16 (SSSE3)
_mm_hsub_epi32 (SSSE3)
_mm_hsub_ps (SSE3)
256 bit register:
_mm256_hsub_pd (AVX)
_mm256_hsub_ps (AVX)
Horizontal Subtraction with Saturation
_mm_hsubs_epi16 (SSSE3)
_mm_hsubs_pi16 (SSSE3)
Multiplication
SSE2, SSE4.1
_mm_mul_su32 (SSE2)
_mm_mul_epu32 (SSE2)
_mm_mul_epi32 (SSE4.1)
Multiplication that keeps only the low bits of each product (the high bits are truncated):
_mm_mullo_epi16 (SSE2)
_mm_mullo_epi32 (SSE4.1)
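The two kinds of multiplication behave quite differently: the mul variants widen the result, the mullo variants truncate it. The sketch below illustrates both with SSE2 only; the wrapper names mul_even_u32 and mul_low_i16 are hypothetical.

```c
#include <emmintrin.h>  /* SSE2 */
#include <stdint.h>

/* _mm_mul_epu32 multiplies only elements 0 and 2 of each input
   and keeps the two full 64-bit products. */
void mul_even_u32(const uint32_t a[4], const uint32_t b[4], uint64_t out[2])
{
    __m128i va = _mm_loadu_si128((const __m128i *)a);
    __m128i vb = _mm_loadu_si128((const __m128i *)b);
    _mm_storeu_si128((__m128i *)out, _mm_mul_epu32(va, vb));
}

/* _mm_mullo_epi16 multiplies all eight 16-bit elements
   and keeps only the low 16 bits of each product. */
void mul_low_i16(const int16_t a[8], const int16_t b[8], int16_t out[8])
{
    __m128i va = _mm_loadu_si128((const __m128i *)a);
    __m128i vb = _mm_loadu_si128((const __m128i *)b);
    _mm_storeu_si128((__m128i *)out, _mm_mullo_epi16(va, vb));
}
```

For instance, 300 * 300 = 90000 does not fit in 16 bits, so mul_low_i16 stores 90000 mod 65536 = 24464.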
Multiplication and Horizontal Sum
SSE2, SSSE3
out_i = a_{2*i}*b_{2*i} + a_{2*i+1}*b_{2*i+1}
Multiplies the 8 signed 16-bit integers from a by the 8 signed 16-bit integers from b. Outputs 4 signed 32-bit integer results (128 bit registers):
_mm_madd_epi16 (a,b) (SSE2)
out_i = SATURATE ( a_{2*i}*b_{2*i} + a_{2*i+1}*b_{2*i+1} )
Input: 16 8-bit elements (a: unsigned, b: signed); output: eight 16-bit signed integers (128 bit registers):
_mm_maddubs_epi16 (a,b) (SSSE3)
Input: 8 8-bit elements (a: unsigned, b: signed); output: four 16-bit signed integers (64 bit registers):
_mm_maddubs_pi16 (a,b) (SSSE3)
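The _mm_madd_epi16 formula above is the core of an integer dot product. The following is a minimal sketch of that use, assuming SSE2; madd_pairs_i16 and dot8_i16 are hypothetical names, and the final reduction is done in scalar code for clarity.

```c
#include <emmintrin.h>  /* SSE2 */
#include <stdint.h>

/* pairs[i] = a[2i]*b[2i] + a[2i+1]*b[2i+1], as four 32-bit integers. */
void madd_pairs_i16(const int16_t a[8], const int16_t b[8], int32_t pairs[4])
{
    __m128i va = _mm_loadu_si128((const __m128i *)a);
    __m128i vb = _mm_loadu_si128((const __m128i *)b);
    _mm_storeu_si128((__m128i *)pairs, _mm_madd_epi16(va, vb));
}

/* Dot product of two vectors of eight signed 16-bit integers. */
int32_t dot8_i16(const int16_t a[8], const int16_t b[8])
{
    int32_t pairs[4];
    madd_pairs_i16(a, b, pairs);
    return pairs[0] + pairs[1] + pairs[2] + pairs[3];
}
```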
Division
_mm256_div_pd (AVX)
_mm256_div_ps (AVX)
Reciprocal
_mm256_rcp_ps (AVX)
Dot Product
_mm256_dp_ps (AVX)
Square Root
_mm256_sqrt_pd (AVX)
_mm256_sqrt_ps (AVX)
1/Square Root
_mm256_rsqrt_ps (AVX)
Average
SSE, SSE2
out_i = (a_i + b_i + 1)>>1;
64 bit register:
_mm_avg_pu8 (SSE)
_mm_avg_pu16 (SSE)
128 bit register:
_mm_avg_epu8 (SSE2)
_mm_avg_epu16 (SSE2)
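The "+ 1" in the formula means the average rounds up, and the sum is computed without overflowing the 8-bit element. This is a small sketch assuming SSE2; avg_u8 is a hypothetical wrapper name.

```c
#include <emmintrin.h>  /* SSE2 */
#include <stdint.h>

/* out[i] = (a[i] + b[i] + 1) >> 1 for 16 unsigned bytes. */
void avg_u8(const uint8_t a[16], const uint8_t b[16], uint8_t out[16])
{
    __m128i va = _mm_loadu_si128((const __m128i *)a);
    __m128i vb = _mm_loadu_si128((const __m128i *)b);
    _mm_storeu_si128((__m128i *)out, _mm_avg_epu8(va, vb));
}
```

For example, the average of 1 and 2 is 2, and the average of 255 and 255 is still 255.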
Sum of Absolute Differences
SSE, SSE2
out_0 = abs(a_0 - b_0) + ... + abs(a_n - b_n)
_mm_sad_pu8 (SSE)
_mm_sad_epu8 (SSE2)
In this case two 8-element SADs are computed and the results are stored in R0 and R4.
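Since _mm_sad_epu8 leaves its two partial sums in 16-bit words 0 and 4 of the result, a full 16-byte SAD needs one extra addition. The sketch below shows one way to combine them, assuming SSE2; sad16_u8 is a hypothetical name.

```c
#include <emmintrin.h>  /* SSE2 */
#include <stdint.h>

/* Sum of absolute differences over 16 unsigned bytes. */
uint32_t sad16_u8(const uint8_t a[16], const uint8_t b[16])
{
    __m128i va = _mm_loadu_si128((const __m128i *)a);
    __m128i vb = _mm_loadu_si128((const __m128i *)b);
    __m128i s = _mm_sad_epu8(va, vb);  /* partial sums in words 0 and 4 */
    /* _mm_extract_epi16 returns the selected 16-bit word zero-extended. */
    return (uint32_t)_mm_extract_epi16(s, 0) + (uint32_t)_mm_extract_epi16(s, 4);
}
```

Passing an all-zero vector as b turns this into a plain sum of the 16 bytes of a.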
Absolute Value
SSSE3
out_i = abs(in_i)
128 bit register:
_mm_abs_epi8 (SSSE3)
_mm_abs_epi16 (SSSE3)
_mm_abs_epi32 (SSSE3)
64 bit register:
_mm_abs_pi8 (SSSE3)
_mm_abs_pi16 (SSSE3)
_mm_abs_pi32 (SSSE3)
Negation
SSSE3
out_i = -a_i if b_i < 0, 0 if b_i = 0, a_i if b_i > 0
_mm_sign_epi8
_mm_sign_epi16
_mm_sign_epi32
_mm_sign_pi8
_mm_sign_pi16
_mm_sign_pi32
Rounding
SSE4.1
out = ceil(in)
_mm_ceil_pd
_mm_ceil_ps
_mm_ceil_sd
_mm_ceil_ss
out = floor(in)
_mm_floor_pd
_mm_floor_ps
_mm_floor_sd
_mm_floor_ss
out = round(in)
_mm_round_pd
_mm_round_ps
_mm_round_sd
_mm_round_ss
Max
SSE, SSE2, SSE4.1
out_i = max(a_i, b_i)
_mm_max_sd (SSE2)
_mm_max_pd (SSE2)
_mm_max_pi16 (SSE)
_mm_max_pu8 (SSE)
_mm_max_epi16 (SSE2)
_mm_max_epu8 (SSE2)
_mm_max_epi8 (SSE4.1)
_mm_max_epi32 (SSE4.1)
_mm_max_epu16 (SSE4.1)
_mm_max_epu32 (SSE4.1)
Min
out_i = min(a_i, b_i)
_mm_min_sd (SSE2)
_mm_min_pd (SSE2)
_mm_min_pi16 (SSE)
_mm_min_pu8 (SSE)
_mm_min_epi16 (SSE2)
_mm_min_epu8 (SSE2)
_mm_min_epi8 (SSE4.1)
_mm_min_epi32 (SSE4.1)
_mm_min_epu16 (SSE4.1)
_mm_min_epu32 (SSE4.1)
_mm_minpos_epu16 (SSE4.1)
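Max and min are often used together to clamp values to a range. The following is a minimal sketch with the SSE2 16-bit variants; clamp_i16 is a hypothetical name.

```c
#include <emmintrin.h>  /* SSE2 */
#include <stdint.h>

/* out[i] = min(max(in[i], lo), hi) for eight signed 16-bit integers. */
void clamp_i16(const int16_t in[8], int16_t lo, int16_t hi, int16_t out[8])
{
    __m128i v = _mm_loadu_si128((const __m128i *)in);
    v = _mm_max_epi16(v, _mm_set1_epi16(lo));  /* raise values below lo */
    v = _mm_min_epi16(v, _mm_set1_epi16(hi));  /* lower values above hi */
    _mm_storeu_si128((__m128i *)out, v);
}
```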
Loading
SSE, SSE2
Loads 1 value and replicates it in all the other elements:
_mm_load1_ps (SSE)
_mm_load1_pd (SSE2)
Loads 4 values (ps) or 2 values (pd) from aligned memory:
_mm_load_ps (SSE)
_mm_load_pd (SSE2)
Loads 4 values (ps) or 2 values (pd) from unaligned memory:
_mm_loadu_ps (SSE)
_mm_loadu_pd (SSE2)
Loads 4 values and reverses their order:
_mm_loadr_ps (SSE)
Loads 64 bits into the low half of a 128-bit register (SSE2):
_mm_loadl_epi64
Loads 128 bits from aligned memory (SSE2):
_mm_load_si128
Sets one value in all the elements:
_mm_set1_ps (SSE) *
_mm_set1_pd (SSE2) *
Creates an all-zero register:
_mm_setzero_si64 (MMX)
_mm_setzero_si128 (SSE2)
_mm_setzero_ps (SSE)
_mm_setzero_pd (SSE2)
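The loading variants differ mainly in alignment requirements and element order. The sketch below exercises a few of them side by side, assuming SSE; load_demo is a hypothetical name, and the local __m128 object is used only as a convenient 16-byte aligned buffer for _mm_loadr_ps, which requires an aligned address.

```c
#include <xmmintrin.h>  /* SSE */
#include <string.h>

void load_demo(const float src[4], float bcast[4], float copy[4],
               float rev[4], float zero[4])
{
    __m128 aligned;                          /* an __m128 object is 16-byte aligned */
    memcpy(&aligned, src, sizeof aligned);   /* src itself may be unaligned */

    _mm_storeu_ps(bcast, _mm_load1_ps(src));                      /* src[0] in all four elements */
    _mm_storeu_ps(copy,  _mm_loadu_ps(src));                      /* no alignment requirement */
    _mm_storeu_ps(rev,   _mm_loadr_ps((const float *)&aligned));  /* aligned load, reversed order */
    _mm_storeu_ps(zero,  _mm_setzero_ps());                       /* all-zero register */
}
```

Using the aligned variants on an address that is not a multiple of 16 bytes generally causes a fault at run time, so the unaligned variants are the safer default when the origin of a pointer is unknown.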
Packing and Unpacking
SSE2
Interleaves the low (high) 8 bytes of a with the low (high) 8 bytes of b (SSE2):
_mm_unpacklo_epi8(a,b)
_mm_unpackhi_epi8(a,b)
Interleaves the low (high) 4 16-bit elements of a with the low (high) 4 16-bit elements of b (SSE2):
_mm_unpacklo_epi16(a,b)
_mm_unpackhi_epi16(a,b)
Interleaves the low (high) 2 32-bit elements of a with the low (high) 2 32-bit elements of b (SSE2):
_mm_unpacklo_epi32(a,b)
_mm_unpackhi_epi32(a,b)
Joins the low (high) 64 bits of a with the low (high) 64 bits of b (SSE2):
_mm_unpacklo_epi64(a,b)
_mm_unpackhi_epi64(a,b)
Packs 2 registers of signed 16-bit elements (16 x 16 bit in total) into one register of 16 x 8 bit signed elements, with saturation (SSE2):
_mm_packs_epi16
Packs 2 registers of signed 32-bit elements (8 x 32 bit in total) into one register of 8 x 16 bit signed elements, with saturation (SSE2):
_mm_packs_epi32
Packs 2 registers of signed 16-bit elements (16 x 16 bit in total) into one register of 16 x 8 bit unsigned elements, with unsigned saturation (SSE2):
_mm_packus_epi16
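A typical pattern is to unpack bytes against a zero register to widen them to 16 bits, do the arithmetic there, and pack back with saturation. This is a sketch of that pattern under the assumption of SSE2; add_bias_u8 is a hypothetical name.

```c
#include <emmintrin.h>  /* SSE2 */
#include <stdint.h>

/* out[i] = clamp(in[i] + bias, 0, 255), computed in 16-bit elements. */
void add_bias_u8(const uint8_t in[16], int16_t bias, uint8_t out[16])
{
    const __m128i zero = _mm_setzero_si128();
    __m128i v  = _mm_loadu_si128((const __m128i *)in);
    __m128i lo = _mm_unpacklo_epi8(v, zero);  /* bytes 0..7  as eight 16-bit values */
    __m128i hi = _mm_unpackhi_epi8(v, zero);  /* bytes 8..15 as eight 16-bit values */
    __m128i vb = _mm_set1_epi16(bias);
    lo = _mm_add_epi16(lo, vb);
    hi = _mm_add_epi16(hi, vb);
    /* Narrow back to bytes; values outside 0..255 are saturated. */
    _mm_storeu_si128((__m128i *)out, _mm_packus_epi16(lo, hi));
}
```

Interleaving with zero works as a zero-extension because the register is little-endian: each input byte becomes the low byte of a 16-bit element.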
* these functions are composite: they do not correspond to a single assembly instruction but to a sequence of instructions.