Name	Description
MK_INSERTPS_NDX(int, int, int)	Helper macro to create index-parameter value for insert_ps
blend_epi16(v128, v128, int)	Blend packed 16-bit integers from "a" and "b" using control mask "imm8", and store the results in "dst".
blend_pd(v128, v128, int)	Blend packed double-precision (64-bit) floating-point elements from "a" and "b" using control mask "imm8", and store the results in "dst".
blend_ps(v128, v128, int)	Blend packed single-precision (32-bit) floating-point elements from "a" and "b" using control mask "imm8", and store the results in "dst".
blendv_epi8(v128, v128, v128)	Blend packed 8-bit integers from "a" and "b" using "mask", and store the results in "dst".
blendv_pd(v128, v128, v128)	Blend packed double-precision (64-bit) floating-point elements from "a" and "b" using "mask", and store the results in "dst".
blendv_ps(v128, v128, v128)	Blend packed single-precision (32-bit) floating-point elements from "a" and "b" using "mask", and store the results in "dst".
ceil_pd(v128)	Round the packed double-precision (64-bit) floating-point elements in "a" up to an integer value, and store the results as packed double-precision floating-point elements in "dst".
ceil_ps(v128)	Round the packed single-precision (32-bit) floating-point elements in "a" up to an integer value, and store the results as packed single-precision floating-point elements in "dst".
ceil_sd(v128, v128)	Round the lower double-precision (64-bit) floating-point element in "b" up to an integer value, store the result as a double-precision floating-point element in the lower element of "dst", and copy the upper element from "a" to the upper element of "dst".
ceil_ss(v128, v128)	Round the lower single-precision (32-bit) floating-point element in "b" up to an integer value, store the result as a single-precision floating-point element in the lower element of "dst", and copy the upper 3 packed elements from "a" to the upper elements of "dst".
cmpeq_epi64(v128, v128)	Compare packed 64-bit integers in "a" and "b" for equality, and store the results in "dst".
cvtepi16_epi32(v128)	Sign extend packed 16-bit integers in "a" to packed 32-bit integers, and store the results in "dst".
cvtepi16_epi64(v128)	Sign extend packed 16-bit integers in "a" to packed 64-bit integers, and store the results in "dst".
cvtepi32_epi64(v128)	Sign extend packed 32-bit integers in "a" to packed 64-bit integers, and store the results in "dst".
cvtepi8_epi16(v128)	Sign extend packed 8-bit integers in "a" to packed 16-bit integers, and store the results in "dst".
cvtepi8_epi32(v128)	Sign extend packed 8-bit integers in "a" to packed 32-bit integers, and store the results in "dst".
cvtepi8_epi64(v128)	Sign extend packed 8-bit integers in the low 8 bytes of "a" to packed 64-bit integers, and store the results in "dst".
cvtepu16_epi32(v128)	Zero extend packed unsigned 16-bit integers in "a" to packed 32-bit integers, and store the results in "dst".
cvtepu16_epi64(v128)	Zero extend packed unsigned 16-bit integers in "a" to packed 64-bit integers, and store the results in "dst".
cvtepu32_epi64(v128)	Zero extend packed unsigned 32-bit integers in "a" to packed 64-bit integers, and store the results in "dst".
cvtepu8_epi16(v128)	Zero extend packed unsigned 8-bit integers in "a" to packed 16-bit integers, and store the results in "dst".
cvtepu8_epi32(v128)	Zero extend packed unsigned 8-bit integers in "a" to packed 32-bit integers, and store the results in "dst".
cvtepu8_epi64(v128)	Zero extend packed unsigned 8-bit integers in the low 8 bytes of "a" to packed 64-bit integers, and store the results in "dst".
dp_pd(v128, v128, int)	Conditionally multiply the packed double-precision (64-bit) floating-point elements in "a" and "b" using bits 4 and 5 of "imm8", sum the two products, and conditionally store the sum in "dst" using bits 0 and 1 of "imm8".
dp_ps(v128, v128, int)	Conditionally multiply the packed single-precision (32-bit) floating-point elements in "a" and "b" using the high 4 bits in "imm8", sum the four products, and conditionally store the sum in "dst" using the low 4 bits of "imm8".
extract_epi32(v128, int)	Extract a 32-bit integer from "a", selected with "imm8", and store the result in "dst".
extract_epi64(v128, int)	Extract a 64-bit integer from "a", selected with "imm8", and store the result in "dst".
extract_epi8(v128, int)	Extract an 8-bit integer from "a", selected with "imm8", and store the result in the lower element of "dst".
extract_ps(v128, int)	Extract a single-precision (32-bit) floating-point element from "a", selected with "imm8", and store the result in "dst".
extractf_ps(v128, int)	Extract a single-precision (32-bit) floating-point element from "a", selected with "imm8", and store the result in "dst" (as a float).
floor_pd(v128)	Round the packed double-precision (64-bit) floating-point elements in "a" down to an integer value, and store the results as packed double-precision floating-point elements in "dst".
floor_ps(v128)	Round the packed single-precision (32-bit) floating-point elements in "a" down to an integer value, and store the results as packed single-precision floating-point elements in "dst".
floor_sd(v128, v128)	Round the lower double-precision (64-bit) floating-point element in "b" down to an integer value, store the result as a double-precision floating-point element in the lower element of "dst", and copy the upper element from "a" to the upper element of "dst".
floor_ss(v128, v128)	Round the lower single-precision (32-bit) floating-point element in "b" down to an integer value, store the result as a single-precision floating-point element in the lower element of "dst", and copy the upper 3 packed elements from "a" to the upper elements of "dst".
insert_epi32(v128, int, int)	Copy "a" to "dst", and insert the 32-bit integer "i" into "dst" at the location specified by "imm8".
insert_epi64(v128, long, int)	Copy "a" to "dst", and insert the 64-bit integer "i" into "dst" at the location specified by "imm8".
insert_epi8(v128, byte, int)	Copy "a" to "dst", and insert the lower 8-bit integer from "i" into "dst" at the location specified by "imm8".
insert_ps(v128, v128, int)	Copy "a" to "tmp", then insert a single-precision (32-bit) floating-point element from "b" into "tmp" using the control in "imm8". Store "tmp" to "dst" using the mask in "imm8" (elements are zeroed out when the corresponding bit is set).
max_epi32(v128, v128)	Compare packed 32-bit integers in "a" and "b", and store packed maximum values in "dst".
max_epi8(v128, v128)	Compare packed 8-bit integers in "a" and "b", and store packed maximum values in "dst".
max_epu16(v128, v128)	Compare packed unsigned 16-bit integers in "a" and "b", and store packed maximum values in "dst".
max_epu32(v128, v128)	Compare packed unsigned 32-bit integers in "a" and "b", and store packed maximum values in "dst".
min_epi32(v128, v128)	Compare packed 32-bit integers in "a" and "b", and store packed minimum values in "dst".
min_epi8(v128, v128)	Compare packed 8-bit integers in "a" and "b", and store packed minimum values in "dst".
min_epu16(v128, v128)	Compare packed unsigned 16-bit integers in "a" and "b", and store packed minimum values in "dst".
min_epu32(v128, v128)	Compare packed unsigned 32-bit integers in "a" and "b", and store packed minimum values in "dst".
minpos_epu16(v128)	Horizontally compute the minimum amongst the packed unsigned 16-bit integers in "a", store the minimum and index in "dst", and zero the remaining bits in "dst".
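mpsadbw_epu8(v128, v128, int)	Compute the sum of absolute differences (SADs) of quadruplets of unsigned 8-bit integers in "a" compared to those in "b", and store the 16-bit results in "dst". Eight SADs are performed using one quadruplet from "b" and eight quadruplets from "a"; "imm8" selects the starting offsets of the quadruplets.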
mul_epi32(v128, v128)	Multiply the low 32-bit integers from each packed 64-bit element in "a" and "b", and store the signed 64-bit results in "dst".
mullo_epi32(v128, v128)	Multiply the packed 32-bit integers in "a" and "b", producing intermediate 64-bit integers, and store the low 32 bits of the intermediate integers in "dst".
packus_epi32(v128, v128)	Convert packed 32-bit integers from "a" and "b" to packed 16-bit integers using unsigned saturation, and store the results in "dst".
round_pd(v128, int)	Round the packed double-precision (64-bit) floating-point elements in "a" using the "rounding" parameter, and store the results as packed double-precision floating-point elements in "dst".
round_ps(v128, int)	Round the packed single-precision (32-bit) floating-point elements in "a" using the "rounding" parameter, and store the results as packed single-precision floating-point elements in "dst".
round_sd(v128, v128, int)	Round the lower double-precision (64-bit) floating-point element in "b" using the "rounding" parameter, store the result as a double-precision floating-point element in the lower element of "dst", and copy the upper element from "a" to the upper element of "dst".
round_ss(v128, v128, int)	Round the lower single-precision (32-bit) floating-point element in "b" using the "rounding" parameter, store the result as a single-precision floating-point element in the lower element of "dst", and copy the upper 3 packed elements from "a" to the upper elements of "dst".
stream_load_si128(void*)	Load 128 bits of integer data from memory into "dst" using a non-temporal memory hint. "mem_addr" must be aligned on a 16-byte boundary or a general-protection exception may be generated.
test_all_ones(v128)	Compute the bitwise NOT of "a" and then AND with a 128-bit vector containing all 1's, and return 1 if the result is zero, otherwise return 0.
test_all_zeros(v128, v128)	Compute the bitwise AND of 128 bits (representing integer data) in "a" and "mask", and return 1 if the result is zero, otherwise return 0.
test_mix_ones_zeroes(v128, v128)	Compute the bitwise AND of 128 bits (representing integer data) in "a" and "mask", and set "ZF" to 1 if the result is zero, otherwise set "ZF" to 0. Compute the bitwise NOT of "a" and then AND with "mask", and set "CF" to 1 if the result is zero, otherwise set "CF" to 0. Return 1 if both the "ZF" and "CF" values are zero, otherwise return 0.
testc_si128(v128, v128)	Compute the bitwise AND of 128 bits (representing integer data) in "a" and "b", and set "ZF" to 1 if the result is zero, otherwise set "ZF" to 0. Compute the bitwise NOT of "a" and then AND with "b", and set "CF" to 1 if the result is zero, otherwise set "CF" to 0. Return the "CF" value.
testnzc_si128(v128, v128)	Compute the bitwise AND of 128 bits (representing integer data) in "a" and "b", and set "ZF" to 1 if the result is zero, otherwise set "ZF" to 0. Compute the bitwise NOT of "a" and then AND with "b", and set "CF" to 1 if the result is zero, otherwise set "CF" to 0. Return 1 if both the "ZF" and "CF" values are zero, otherwise return 0.
testz_si128(v128, v128)	Compute the bitwise AND of 128 bits (representing integer data) in "a" and "b", and set "ZF" to 1 if the result is zero, otherwise set "ZF" to 0. Compute the bitwise NOT of "a" and then AND with "b", and set "CF" to 1 if the result is zero, otherwise set "CF" to 0. Return the "ZF" value.
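
The "ZF" and "CF" rules shared by the testz_si128, testc_si128 and testnzc_si128 rows can be easier to follow as scalar code. The following is a minimal sketch in portable C, not the actual intrinsics: the type v128_model and every model_* function are hypothetical names introduced here for illustration, and a 128-bit vector is assumed to be represented as two 64-bit halves.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical scalar stand-in for a 128-bit vector: two 64-bit halves. */
typedef struct { uint64_t lo, hi; } v128_model;

/* "ZF" as described in the table: 1 when (a AND b) has no bits set. */
static int model_zf(v128_model a, v128_model b) {
    return ((a.lo & b.lo) | (a.hi & b.hi)) == 0;
}

/* "CF" as described in the table: 1 when ((NOT a) AND b) has no bits set. */
static int model_cf(v128_model a, v128_model b) {
    return ((~a.lo & b.lo) | (~a.hi & b.hi)) == 0;
}

/* Scalar models of the three return values (illustrative names only). */
static int model_testz(v128_model a, v128_model b)   { return model_zf(a, b); }
static int model_testc(v128_model a, v128_model b)   { return model_cf(a, b); }
static int model_testnzc(v128_model a, v128_model b) {
    return !model_zf(a, b) && !model_cf(a, b);
}

/* Small self-check covering one case for each result. */
static void self_check(void) {
    v128_model a = { 0x0F, 0 };

    /* No bits shared with b: ZF = 1, CF = 0. */
    v128_model disjoint = { 0xF0, 0 };
    assert(model_testz(a, disjoint) == 1);
    assert(model_testc(a, disjoint) == 0);

    /* Every bit of b is also set in a: ZF = 0, CF = 1. */
    v128_model subset = { 0x03, 0 };
    assert(model_testz(a, subset) == 0);
    assert(model_testc(a, subset) == 1);

    /* b overlaps a only partially: ZF = 0 and CF = 0. */
    v128_model partial = { 0x3C, 0 };
    assert(model_testnzc(a, partial) == 1);
}
```

Read this way, testz_si128 answers "do "a" and "b" share no set bits?", testc_si128 answers "is every set bit of "b" also set in "a"?", and testnzc_si128 returns 1 only when neither holds.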