Topic 18: SIMD Instructions (SSE/AVX)

Overview

SIMD (Single Instruction, Multiple Data) allows processing multiple data elements simultaneously with a single instruction. Modern x86-64 processors support several SIMD instruction sets: SSE, SSE2, SSE3, SSSE3, SSE4, AVX, AVX2, and AVX-512.

// C equivalent - scalar processing:
for (int i = 0; i < 4; ++i) {
    result[i] = a[i] + b[i];  // 4 separate additions
}

// SIMD - vectorized processing:
// One instruction processes 4 floats at once!
// result_vector = a_vector + b_vector;

Performance Gain: Up to 4x-16x speedup for compatible workloads!

SIMD Registers

XMM Registers (SSE): 128-bit

16 registers: XMM0-XMM15

XMM0: [127 -------- 64][63 -------- 0]
      |    64-bit     ||    64-bit    |
      | 4x float      || 2x double    |
      | 4x int32      || 2x int64     |

Data Layouts:

Type	Count	Each Element
float	4	32-bit single-precision
double	2	64-bit double-precision
int8	16	8-bit integer
int16	8	16-bit integer
int32	4	32-bit integer
int64	2	64-bit integer

YMM Registers (AVX): 256-bit

16 registers: YMM0-YMM15 (extend XMM0-XMM15)

YMM0: [255----192][191----128][127----64][63-----0]
      |  64-bit  ||  64-bit  ||  64-bit ||  64-bit|
      |       8x float        ||   4x double      |

ZMM Registers (AVX-512): 512-bit

32 registers: ZMM0-ZMM31 (extend YMM0-YMM31)

Note: We’ll focus on SSE and AVX, as they’re most widely supported.

SSE Instructions

Moving Data

; C equivalent:
; float a[4] = {1.0, 2.0, 3.0, 4.0};
; float b[4];
; memcpy(b, a, sizeof(a));

section .data
    align 16
    floats dd 1.0, 2.0, 3.0, 4.0

section .bss
    align 16
    result resd 4

section .text
    ; Load 4 floats at once
    movaps xmm0, [floats]   ; Aligned load (faster)
    
    ; Store 4 floats at once
    movaps [result], xmm0   ; Aligned store
    
    ; Unaligned variants (slower but flexible)
    movups xmm1, [floats]   ; Unaligned load
    movups [result], xmm1   ; Unaligned store

Alignment: SSE requires 16-byte alignment for best performance. Use align 16 directive.

Arithmetic Operations

Packed Single-Precision (PS): 4x float

; C equivalent:
; for (int i = 0; i < 4; ++i) {
;     c[i] = a[i] + b[i];
;     d[i] = a[i] - b[i];
;     e[i] = a[i] * b[i];
;     f[i] = a[i] / b[i];
; }

section .data
    align 16
    a dd 1.0, 2.0, 3.0, 4.0
    b dd 5.0, 6.0, 7.0, 8.0

section .bss
    align 16
    c resd 4
    d resd 4
    e resd 4
    f resd 4

section .text
    movaps xmm0, [a]        ; Load a
    movaps xmm1, [b]        ; Load b
    
    movaps xmm2, xmm0
    addps xmm2, xmm1        ; xmm2 = a + b (4 additions)
    movaps [c], xmm2
    
    movaps xmm2, xmm0
    subps xmm2, xmm1        ; xmm2 = a - b (4 subtractions)
    movaps [d], xmm2
    
    movaps xmm2, xmm0
    mulps xmm2, xmm1        ; xmm2 = a * b (4 multiplications)
    movaps [e], xmm2
    
    movaps xmm2, xmm0
    divps xmm2, xmm1        ; xmm2 = a / b (4 divisions)
    movaps [f], xmm2

Packed Double-Precision (PD): 2x double

; C equivalent:
; for (int i = 0; i < 2; ++i) {
;     c[i] = a[i] + b[i];
; }

section .data
    align 16
    a dq 1.5, 2.5
    b dq 3.5, 4.5

section .text
    movapd xmm0, [a]        ; Load 2 doubles
    movapd xmm1, [b]
    addpd xmm0, xmm1        ; Add 2 doubles at once

Scalar Operations (SS/SD): Single element

; C equivalent:
; float result = a + b;

section .data
    a dd 1.5
    b dd 2.5

section .text
    movss xmm0, [a]         ; Load single float
    addss xmm0, [b]         ; Add single float
    ; xmm0[31:0] = result, xmm0[127:32] unchanged

Comparison Operations

; C equivalent:
; for (int i = 0; i < 4; ++i) {
;     mask[i] = (a[i] < b[i]) ? 0xFFFFFFFF : 0x00000000;
; }

movaps xmm0, [a]
movaps xmm1, [b]
cmpltps xmm0, xmm1          ; Compare less-than
; xmm0[i] = 0xFFFFFFFF if a[i] < b[i], else 0x00000000

Comparison Predicates:

Instruction	Condition	Description
`cmpeqps`	==	Equal
`cmpltps`	<	Less than
`cmpleps`	<=	Less or equal
`cmpneqps`	!=	Not equal
`cmpnltps`	>=	Not less than
`cmpnleps`	>	Not less or equal

Logical Operations

; C equivalent (bitwise):
; for (int i = 0; i < 4; ++i) {
;     result[i] = a[i] & b[i];  // Bitwise AND
; }

movaps xmm0, [a]
andps xmm0, [b]             ; Bitwise AND
orps xmm1, [c]              ; Bitwise OR
xorps xmm2, [d]             ; Bitwise XOR
andnps xmm3, [e]            ; Bitwise AND NOT

Practical Example: Array Addition

; add_arrays.asm - Add two float arrays using SSE

section .text
global add_arrays_simd

; C prototype:
; void add_arrays_simd(float *result, const float *a, const float *b, size_t count);
; Assumes count is multiple of 4, arrays are 16-byte aligned

add_arrays_simd:
    ; RDI = result, RSI = a, RDX = b, RCX = count
    
    xor rax, rax            ; i = 0
    
.loop:
    cmp rax, rcx
    jge .done
    
    ; Load 4 floats from a
    movaps xmm0, [rsi + rax*4]
    
    ; Add 4 floats from b
    addps xmm0, [rdx + rax*4]
    
    ; Store 4 results
    movaps [rdi + rax*4], xmm0
    
    add rax, 4              ; i += 4 (processed 4 elements)
    jmp .loop
    
.done:
    ret

// test.c

#include <stdio.h>
#include <stdlib.h>
#include <string.h>

extern void add_arrays_simd(float *result, const float *a, const float *b, size_t count);

// Scalar version for comparison
void add_arrays_scalar(float *result, const float *a, const float *b, size_t count) {
    for (size_t i = 0; i < count; ++i) {
        result[i] = a[i] + b[i];
    }
}

int main() {
    const size_t count = 8;
    
    // Allocate aligned memory
    float *a = aligned_alloc(16, count * sizeof(float));
    float *b = aligned_alloc(16, count * sizeof(float));
    float *result = aligned_alloc(16, count * sizeof(float));
    
    // Initialize
    for (size_t i = 0; i < count; ++i) {
        a[i] = i + 1.0f;
        b[i] = (i + 1.0f) * 10.0f;
    }
    
    // SIMD version
    add_arrays_simd(result, a, b, count);
    
    printf("SIMD results:\n");
    for (size_t i = 0; i < count; ++i) {
        printf("%.1f ", result[i]);
    }
    printf("\n");
    
    free(a);
    free(b);
    free(result);
    return 0;
}

Advanced SIMD Techniques

Horizontal Operations

; C equivalent:
; float sum = a[0] + a[1] + a[2] + a[3];

section .data
    align 16
    values dd 1.0, 2.0, 3.0, 4.0

section .text
    movaps xmm0, [values]   ; [1, 2, 3, 4]
    
    ; Method 1: Hadd (SSE3)
    haddps xmm0, xmm0       ; [1+2, 3+4, 1+2, 3+4] = [3, 7, 3, 7]
    haddps xmm0, xmm0       ; [3+7, 3+7, 3+7, 3+7] = [10, 10, 10, 10]
    ; xmm0[0] now contains sum
    
    ; Method 2: Manual shuffling (faster)
    movaps xmm1, xmm0       ; Copy
    shufps xmm1, xmm1, 0xB1 ; Swap adjacent pairs
    addps xmm0, xmm1        ; Add pairs
    movaps xmm1, xmm0
    shufps xmm1, xmm1, 0x4E ; Swap high/low halves
    addps xmm0, xmm1        ; Final sum in all elements

Complete Array Sum Example

; sum_array.asm - Sum array of floats using SSE

section .text
global sum_array_simd

; C prototype:
; float sum_array_simd(const float *array, size_t count);

sum_array_simd:
    ; RDI = array, RSI = count
    
    xorps xmm0, xmm0        ; accumulator = 0
    xor rax, rax            ; i = 0
    
    ; Process 4 elements at a time
    mov rcx, rsi
    shr rcx, 2              ; count / 4
    jz .remainder
    
.loop:
    addps xmm0, [rdi + rax*4]   ; Add 4 floats to accumulator
    add rax, 4
    dec rcx
    jnz .loop
    
.remainder:
    ; Handle leftover elements (count % 4)
    mov rcx, rsi
    and rcx, 3
    jz .horizontal_sum
    
.remainder_loop:
    addss xmm0, [rdi + rax*4]
    inc rax
    dec rcx
    jnz .remainder_loop
    
.horizontal_sum:
    ; Sum the 4 values in xmm0 into xmm0[0]
    movaps xmm1, xmm0
    shufps xmm1, xmm1, 0xB1     ; Swap adjacent
    addps xmm0, xmm1
    movaps xmm1, xmm0
    shufps xmm1, xmm1, 0x4E     ; Swap high/low
    addps xmm0, xmm1
    
    ; Result in xmm0[0] (XMM0 is return register for float)
    ret

Shuffling and Permutation

; C equivalent:
; float result[4] = {a[2], a[0], a[3], a[1]};

movaps xmm0, [a]            ; [a0, a1, a2, a3]
shufps xmm0, xmm0, 0x9C     ; Shuffle: [a2, a0, a3, a1]
; Immediate 0x9C = 10 01 11 00 (binary)
;                  a2 a0 a3 a1 (indices)

Shuffle Control Byte:

Bits [1:0]: Select from source for element 0
Bits [3:2]: Select from source for element 1
Bits [5:4]: Select from source for element 2
Bits [7:6]: Select from source for element 3

Broadcast (Splat)

; C equivalent:
; float result[4] = {x, x, x, x};

section .data
    x dd 5.0

section .text
    movss xmm0, [x]         ; Load single value
    shufps xmm0, xmm0, 0    ; Broadcast to all elements
    ; xmm0 = [5.0, 5.0, 5.0, 5.0]

AVX Instructions

AVX extends SSE with 256-bit operations and non-destructive operations.

Non-Destructive Operations

; SSE (destructive):
movaps xmm0, [a]
addps xmm0, [b]             ; xmm0 = xmm0 + [b] (xmm0 overwritten)

; AVX (non-destructive):
vmovaps ymm0, [a]
vaddps ymm2, ymm0, [b]      ; ymm2 = ymm0 + [b] (ymm0 preserved!)

256-bit Operations

; C equivalent:
; for (int i = 0; i < 8; ++i) {
;     result[i] = a[i] + b[i];  // 8 floats at once!
; }

section .data
    align 32                        ; AVX requires 32-byte alignment
    a dd 1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0
    b dd 9.0, 10.0, 11.0, 12.0, 13.0, 14.0, 15.0, 16.0

section .bss
    align 32
    result resd 8

section .text
    vmovaps ymm0, [a]           ; Load 8 floats
    vaddps ymm0, ymm0, [b]      ; Add 8 floats
    vmovaps [result], ymm0      ; Store 8 floats
    
    ; IMPORTANT: Clean up AVX state
    vzeroupper                  ; Clear upper 128 bits of YMM registers
    ret

Note: Always use vzeroupper or vzeroall before transitioning back to non-AVX code to avoid performance penalties.

Complete Real-World Example: Dot Product

; dot_product.asm - Compute dot product using SSE

section .text
global dot_product_simd

; C prototype:
; float dot_product_simd(const float *a, const float *b, size_t count);
; Assumes count is multiple of 4

dot_product_simd:
    ; RDI = a, RSI = b, RDX = count
    
    xorps xmm0, xmm0        ; sum = 0
    xor rax, rax            ; i = 0
    
.loop:
    cmp rax, rdx
    jge .horizontal_sum
    
    ; Load 4 elements from each array
    movaps xmm1, [rdi + rax*4]
    movaps xmm2, [rsi + rax*4]
    
    ; Multiply element-wise
    mulps xmm1, xmm2
    
    ; Add to accumulator
    addps xmm0, xmm1
    
    add rax, 4
    jmp .loop
    
.horizontal_sum:
    ; Sum the 4 partial sums in xmm0
    movaps xmm1, xmm0
    shufps xmm1, xmm1, 0xB1     ; [1, 0, 3, 2]
    addps xmm0, xmm1            ; [0+1, 1+0, 2+3, 3+2]
    
    movaps xmm1, xmm0
    shufps xmm1, xmm1, 0x4E     ; [2, 3, 0, 1]
    addps xmm0, xmm1            ; [sum, sum, sum, sum]
    
    ; Result in xmm0[0]
    ret

// test_dot.c

#include <stdio.h>
#include <stdlib.h>
#include <time.h>

extern float dot_product_simd(const float *a, const float *b, size_t count);

float dot_product_scalar(const float *a, const float *b, size_t count) {
    float sum = 0.0f;
    for (size_t i = 0; i < count; ++i) {
        sum += a[i] * b[i];
    }
    return sum;
}

int main() {
    const size_t count = 1024 * 1024;  // 1 million elements
    
    float *a = aligned_alloc(16, count * sizeof(float));
    float *b = aligned_alloc(16, count * sizeof(float));
    
    // Initialize
    for (size_t i = 0; i < count; ++i) {
        a[i] = 1.0f;
        b[i] = 2.0f;
    }
    
    // Benchmark scalar
    clock_t start = clock();
    float result_scalar = dot_product_scalar(a, b, count);
    clock_t end = clock();
    double time_scalar = (double)(end - start) / CLOCKS_PER_SEC;
    
    // Benchmark SIMD
    start = clock();
    float result_simd = dot_product_simd(a, b, count);
    end = clock();
    double time_simd = (double)(end - start) / CLOCKS_PER_SEC;
    
    printf("Scalar: %.2f (%.6f seconds)\n", result_scalar, time_scalar);
    printf("SIMD:   %.2f (%.6f seconds)\n", result_simd, time_simd);
    printf("Speedup: %.2fx\n", time_scalar / time_simd);
    
    free(a);
    free(b);
    return 0;
}

Integer SIMD Operations

; C equivalent:
; for (int i = 0; i < 4; ++i) {
;     result[i] = a[i] + b[i];  // 32-bit integers
; }

section .data
    align 16
    a dd 1, 2, 3, 4
    b dd 5, 6, 7, 8

section .bss
    align 16
    result resd 4

section .text
    movdqa xmm0, [a]        ; Load 4 int32s (aligned)
    paddd xmm0, [b]         ; Add 4 int32s
    movdqa [result], xmm0   ; Store 4 int32s

Integer Operations:

Instruction	Description
`paddb/w/d/q`	Add packed bytes/words/dwords/qwords
`psubb/w/d/q`	Subtract
`pmullw/d`	Multiply low (words/dwords)
`pmulhw/uw`	Multiply high (signed/unsigned words)
`pand/por/pxor`	Logical AND/OR/XOR
`psllw/d/q`	Shift left logical
`psrlw/d/q`	Shift right logical
`psraw/d`	Shift right arithmetic
`pcmpeqb/w/d`	Compare equal
`pcmpgtb/w/d`	Compare greater than

Performance Considerations

When to Use SIMD

Good candidates:

Array operations (add, multiply, etc.)
Vector math (graphics, physics)
Signal processing (audio, video)
Matrix operations
Image processing
Data parallel algorithms

Poor candidates:

Irregular control flow (many branches)
Data dependencies between iterations
Non-contiguous memory access
Small datasets (overhead dominates)

Alignment

; Aligned (fast):
section .data
    align 16
    data dd 1.0, 2.0, 3.0, 4.0
    
movaps xmm0, [data]     ; Fast aligned load

; Unaligned (slower):
movups xmm0, [data]     ; Slower unaligned load

Best Practice: Always align data to 16/32 bytes when possible.

Typical Speedups

Operation	SSE Speedup	AVX Speedup
Float add/sub	3-4x	6-8x
Float multiply	3-4x	6-8x
Float divide	2-3x	4-6x
Int32 add	3-4x	6-8x
Dot product	3-4x	6-8x

Actual speedup depends on:

Memory bandwidth
Data alignment
Loop overhead
CPU architecture

Common Patterns

Pattern 1: Conditional Selection

; C equivalent:
; for (int i = 0; i < 4; ++i) {
;     result[i] = (a[i] > b[i]) ? a[i] : b[i];  // Max
; }

movaps xmm0, [a]
maxps xmm0, [b]         ; Per-element maximum
movaps [result], xmm0

; Manual with blending:
movaps xmm0, [a]
movaps xmm1, [b]
cmpgtps xmm2, xmm0, xmm1    ; mask = (a > b)
andps xmm0, xmm2            ; a & mask
andnps xmm2, xmm1           ; b & ~mask
orps xmm0, xmm2             ; result = (a & mask) | (b & ~mask)

Pattern 2: AoS to SoA Conversion

// Array of Structures (AoS) - cache unfriendly for SIMD
struct Point { float x, y, z; };
Point points[1000];

// Structure of Arrays (SoA) - SIMD friendly
float x[1000], y[1000], z[1000];

; Process SoA data with SIMD
movaps xmm0, [x]        ; Load 4 x values
movaps xmm1, [y]        ; Load 4 y values
movaps xmm2, [z]        ; Load 4 z values
; Process 4 points in parallel!

Pattern 3: Fused Multiply-Add (FMA)

; C equivalent:
; for (int i = 0; i < 4; ++i) {
;     result[i] = a[i] * b[i] + c[i];
; }

; Without FMA (SSE):
movaps xmm0, [a]
mulps xmm0, [b]
addps xmm0, [c]

; With FMA (AVX2+):
vmovaps ymm0, [a]
vfmadd231ps ymm0, [b], [c]  ; ymm0 = (b * c) + ymm0

Summary

Key Concepts

SIMD: Process multiple data elements with one instruction
SSE: 128-bit (4x float, 2x double, 4x int32)
AVX: 256-bit (8x float, 4x double, 8x int32)
Alignment: 16-byte for SSE, 32-byte for AVX
Speedup: Typically 3-8x for vectorizable code

Common Instructions

Load/Store: movaps, movups, movdqa, movdqu
Arithmetic: addps, subps, mulps, divps
Comparison: cmpltps, cmpeqps, etc.
Logical: andps, orps, xorps
Shuffle: shufps, unpcklps, unpckhps

Best Practices

Align data to 16/32 bytes
Process 4/8 elements at a time
Handle remainder elements separately
Use non-destructive AVX when possible
Call vzeroupper after AVX code
Profile to verify actual speedup

Limitations

Requires contiguous memory
Benefits diminish with branching
Data dependencies reduce parallelism
Alignment requirements
CPU feature detection needed

Practice Exercises

Exercise 1: Max Element

Implement a function to find the maximum element in a float array using SSE.

Exercise 2: Vector Normalization

Normalize a 3D vector array (x, y, z) using SIMD.

Exercise 3: Matrix Multiplication

Implement 4x4 matrix multiplication using SSE.

Next Topic

In Topic 19: Performance & Optimization, we’ll explore advanced optimization techniques, profiling, and how to write the fastest possible assembly code.

Preview:

CPU pipeline and instruction-level parallelism
Cache optimization
Branch prediction
Loop unrolling
Register allocation strategies
Profiling and benchmarking