Topic 19: Performance and Optimization

Overview

Writing correct assembly is one thing; writing fast assembly is another. This topic covers techniques to maximize performance by understanding CPU architecture, instruction timing, memory hierarchy, and compiler-level optimizations.

// Correct but slow:
for (int i = 0; i < n; ++i) {
    result += array[i];  // Memory access every iteration
}

// Fast:
int temp = 0;
for (int i = 0; i < n; ++i) {
    temp += array[i];    // Register accumulation
}
result = temp;

CPU Architecture Fundamentals

Pipeline Stages

Modern CPUs use instruction pipelining to execute multiple instructions simultaneously:

Cycle: 1    2    3    4    5    6    7    8
Inst1: F    D    E    M    W
Inst2:      F    D    E    M    W
Inst3:           F    D    E    M    W
Inst4:                F    D    E    M    W

F = Fetch, D = Decode, E = Execute, M = Memory, W = Write-back

Throughput: Up to 1 instruction per cycle (ideal)
Latency: Individual instruction takes multiple cycles

Superscalar Execution

Modern CPUs can execute multiple instructions per cycle:

Port 0: ALU, FPU
Port 1: ALU, FPU
Port 2: Load (Memory Read)
Port 3: Load (Memory Read)
Port 4: Store (Memory Write)
Port 5: ALU, Branch

Example: These can execute in parallel:
mov rax, [data]     ; Port 2 (load)
add rbx, rcx        ; Port 0 or 1 (ALU)
imul rdx, rsi       ; Port 1 (multiply)

Instruction-Level Parallelism (ILP): CPU finds independent instructions and executes them simultaneously.

Instruction Timing

Latency vs Throughput

Latency: Cycles until result is available
Throughput: How often instruction can be issued

Example (typical Intel/AMD):

Instruction	Latency	Throughput	Ports
`mov r, r`	0-1	0.25/cycle	0,1,5,6
`add r, r`	1	0.25/cycle	0,1,5,6
`imul r, r`	3	1/cycle	1
`div r64`	36-95	21-74/cycle	0
`movaps xmm, m`	5-7	0.5/cycle	2,3
`addps xmm, xmm`	3-4	0.5/cycle	0,1

Key Insight: Focus on throughput for loops, latency for dependency chains.

Dependency Chains

; C equivalent:
; int a = 1;
; a = a + 1;  // Depends on previous
; a = a + 1;  // Depends on previous
; a = a + 1;  // Depends on previous

; BAD: Serial dependency chain (latency = 3 cycles)
mov eax, 1
add eax, 1          ; Waits for previous
add eax, 1          ; Waits for previous
add eax, 1          ; Waits for previous
; Total: ~3 cycles

; GOOD: Independent operations (throughput = 1 cycle)
mov eax, 1
mov ebx, 1
mov ecx, 1
add eax, 1          ; All execute in parallel
add ebx, 1
add ecx, 1
; Total: ~1 cycle

Memory Hierarchy

Cache Levels

CPU Registers:  < 1 cycle      ~1 KB
L1 Cache:       4-5 cycles     32-64 KB per core
L2 Cache:       12-15 cycles   256-512 KB per core
L3 Cache:       40-75 cycles   4-32 MB shared
RAM:            200+ cycles    Gigabytes

Performance Impact: Cache miss can be 50-200x slower than cache hit!

Cache Lines

Size: 64 bytes (typical)
Prefetching: CPU loads entire cache line, not just requested byte
Alignment: Misaligned access may span two cache lines

; C equivalent:
; struct Data { int value; char padding[60]; };  // 64 bytes
; Data array[100];  // Each element = one cache line

section .bss
    align 64
    data resb 6400      ; 100 x 64-byte structures

; Access aligned to cache line - FAST
mov eax, [data]
mov eax, [data + 64]
mov eax, [data + 128]

; Misaligned - may span cache lines - SLOWER
mov eax, [data + 1]

// BAD: Two threads writing to same cache line
struct {
    int counter1;  // Thread 1 writes here
    int counter2;  // Thread 2 writes here (same cache line!)
} shared;

// GOOD: Separate cache lines
struct {
    int counter1;
    char padding[60];
    int counter2;
} shared;

Optimization Techniques

1. Loop Unrolling

Process multiple iterations in one loop iteration to reduce overhead.

; C equivalent:
; int sum = 0;
; for (int i = 0; i < n; ++i) {
;     sum += array[i];
; }

; Original loop
xor eax, eax            ; sum = 0
xor ecx, ecx            ; i = 0
loop_original:
    add eax, [array + rcx*4]
    inc ecx
    cmp ecx, [n]
    jl loop_original
; Overhead: cmp, jl every iteration

; Unrolled by 4
xor eax, eax
xor ecx, ecx
mov edx, [n]
shr edx, 2              ; n / 4
loop_unrolled:
    add eax, [array + rcx*4]
    add eax, [array + rcx*4 + 4]
    add eax, [array + rcx*4 + 8]
    add eax, [array + rcx*4 + 12]
    add ecx, 4
    dec edx
    jnz loop_unrolled
; Overhead: dec, jnz every 4 iterations (4x less!)

Benefits:

Reduced branch overhead
More instruction-level parallelism
Better CPU utilization

Drawback: Increased code size

2. Multiple Accumulators

Break dependency chains by using multiple independent accumulators.

; C equivalent:
; int sum = 0;
; for (int i = 0; i < n; ++i) {
;     sum += array[i];
; }

; Single accumulator (dependency chain)
xor eax, eax
xor ecx, ecx
loop_single:
    add eax, [array + rcx*4]    ; Each add depends on previous
    inc ecx
    cmp ecx, [n]
    jl loop_single
; Latency-bound!

; Four accumulators (parallel)
xor eax, eax        ; sum0
xor ebx, ebx        ; sum1
xor edx, edx        ; sum2
xor esi, esi        ; sum3
xor ecx, ecx
loop_parallel:
    add eax, [array + rcx*4]
    add ebx, [array + rcx*4 + 4]
    add edx, [array + rcx*4 + 8]
    add esi, [array + rcx*4 + 12]
    add ecx, 4
    cmp ecx, [n]
    jl loop_parallel

; Combine accumulators
add eax, ebx
add edx, esi
add eax, edx
; Much faster! Throughput-bound instead of latency-bound

3. Software Pipelining

Overlap operations from different iterations.

; C equivalent:
; for (int i = 0; i < n; ++i) {
;     int temp = expensive_operation(array[i]);
;     result[i] = process(temp);
; }

; Original (serial)
loop_original:
    mov eax, [array + rcx*4]
    ; expensive_operation (10 cycles latency)
    imul eax, eax
    imul eax, eax
    ; process
    add eax, 1
    mov [result + rcx*4], eax
    inc rcx
    cmp ecx, [n]
    jl loop_original

; Pipelined (overlap operations)
mov eax, [array]            ; Preload first
imul eax, eax
loop_pipelined:
    imul eax, eax           ; Finish expensive op for i
    mov ebx, [array + rcx*4 + 4]  ; Start load for i+1
    add eax, 1              ; Process for i
    imul ebx, ebx           ; Start expensive op for i+1
    mov [result + rcx*4], eax
    mov eax, ebx            ; Move i+1 result to working register
    inc rcx
    cmp ecx, [n]
    jl loop_pipelined

4. Strength Reduction

Replace expensive operations with cheaper equivalents.

; C equivalent:
; for (int i = 0; i < n; ++i) {
;     result[i] = array[i] * 10;
; }

; SLOW: Multiplication every iteration
loop_mul:
    mov eax, [array + rcx*4]
    imul eax, 10                ; 3 cycles latency
    mov [result + rcx*4], eax
    inc rcx
    cmp ecx, [n]
    jl loop_mul

; FAST: LEA trick
loop_lea:
    mov eax, [array + rcx*4]
    lea eax, [rax + rax*4]      ; eax = rax + rax*4 = rax*5
    add eax, eax                ; eax = eax*2 = rax*10 (1 cycle!)
    mov [result + rcx*4], eax
    inc rcx
    cmp ecx, [n]
    jl loop_lea

Common Strength Reductions:

Original	Optimized	Savings
`x * 2`	`add x, x` or `shl x, 1`	2 cycles
`x / 2` (unsigned)	`shr x, 1`	20+ cycles
`x % powerof2`	`and x, (powerof2-1)`	20+ cycles
`x * 10`	`lea rax, [rax+rax*4]` + `add rax, rax`	2 cycles

5. Induction Variable Elimination

Compute addresses directly instead of using index.

; C equivalent:
; for (int i = 0; i < n; ++i) {
;     result[i] = array[i] + 1;
; }

; With index
xor ecx, ecx
loop_index:
    mov eax, [array + rcx*4]    ; Compute address every time
    add eax, 1
    mov [result + rcx*4], eax   ; Compute address every time
    inc ecx
    cmp ecx, [n]
    jl loop_index

; With pointer
lea rsi, [array]
lea rdi, [result]
mov ecx, [n]
loop_pointer:
    mov eax, [rsi]              ; Direct dereference
    add eax, 1
    mov [rdi], eax              ; Direct dereference
    add rsi, 4                  ; Increment pointer
    add rdi, 4
    dec ecx
    jnz loop_pointer
; Fewer address calculations!

Branch Optimization

Branch Prediction

Modern CPUs predict branch outcomes to avoid pipeline stalls.

Branch Misprediction Penalty: 15-20 cycles!

; C equivalent:
; int count = 0;
; for (int i = 0; i < n; ++i) {
;     if (array[i] > threshold) {  // Hard to predict!
;         count++;
;     }
; }

; BAD: Unpredictable branch
xor eax, eax            ; count
xor ecx, ecx
loop_branch:
    mov edx, [array + rcx*4]
    cmp edx, [threshold]
    jle skip                ; 50% misprediction if data is random
    inc eax
skip:
    inc ecx
    cmp ecx, [n]
    jl loop_branch

Branch-Free Code (Predication)

; C equivalent (branchless):
; count += (array[i] > threshold);

; GOOD: Branchless with conditional move
xor eax, eax
xor ecx, ecx
loop_branchless:
    mov edx, [array + rcx*4]
    xor ebx, ebx            ; temp = 0
    cmp edx, [threshold]
    setg bl                 ; bl = 1 if edx > threshold, else 0
    add eax, ebx            ; No branch!
    inc ecx
    cmp ecx, [n]
    jl loop_branchless

; Alternative with CMOV
loop_cmov:
    mov edx, [array + rcx*4]
    mov ebx, eax
    inc ebx                 ; ebx = count + 1
    cmp edx, [threshold]
    cmovg eax, ebx          ; count = (edx > threshold) ? ebx : count
    inc ecx
    cmp ecx, [n]
    jl loop_cmov

Loop-Invariant Code Motion

Move calculations that don’t change out of the loop.

; C equivalent:
; int factor = x * 2 + 5;
; for (int i = 0; i < n; ++i) {
;     result[i] = array[i] * factor;
; }

; BAD: Recomputing factor every iteration
loop_bad:
    mov eax, [x]
    shl eax, 1
    add eax, 5              ; Computed every iteration!
    imul eax, [array + rcx*4]
    mov [result + rcx*4], eax
    inc rcx
    cmp ecx, [n]
    jl loop_bad

; GOOD: Compute factor once
mov eax, [x]
shl eax, 1
add eax, 5
mov ebx, eax            ; Save factor

loop_good:
    mov eax, [array + rcx*4]
    imul eax, ebx       ; Use saved factor
    mov [result + rcx*4], eax
    inc rcx
    cmp ecx, [n]
    jl loop_good

Data Alignment

Alignment Requirements

; SLOW: Misaligned data
section .data
    x db 0
    data dq 0x123456789ABCDEF0  ; Misaligned!

; FAST: Aligned data
section .data
    align 8
    data dq 0x123456789ABCDEF0  ; 8-byte aligned

; CRITICAL for SIMD
section .data
    align 16
    floats dd 1.0, 2.0, 3.0, 4.0  ; Must be 16-byte aligned for movaps

Performance Impact:

Misaligned scalar access: 0-10% slower
Misaligned SIMD access: Can be 2-3x slower or crash!

Structure Padding

// BAD: Lots of padding, poor cache utilization
struct Data {
    char a;       // 1 byte
    // 7 bytes padding
    double b;     // 8 bytes (needs 8-byte alignment)
    char c;       // 1 byte
    // 7 bytes padding
};  // Total: 24 bytes (13 bytes wasted!)

// GOOD: Reordered to minimize padding
struct Data {
    double b;     // 8 bytes
    char a;       // 1 byte
    char c;       // 1 byte
    // 6 bytes padding (end)
};  // Total: 16 bytes (6 bytes wasted)

Complete Optimization Example

Problem: Compute average of large array

; Original - unoptimized
; C equivalent:
; double average = 0;
; for (int i = 0; i < n; ++i) {
;     average += array[i];
; }
; average /= n;

average_slow:
    xorpd xmm0, xmm0        ; sum = 0.0
    xor ecx, ecx            ; i = 0
.loop:
    addsd xmm0, [rdi + rcx*8]   ; Scalar add (2 doubles per cycle max)
    inc ecx
    cmp ecx, esi
    jl .loop
    
    cvtsi2sd xmm1, esi      ; Convert n to double
    divsd xmm0, xmm1        ; Divide (very slow!)
    ret

; Optimized version
; Uses: SIMD, unrolling, multiple accumulators

average_fast:
    ; Check if we have enough elements for SIMD
    cmp esi, 8
    jl .scalar_fallback
    
    ; Initialize 4 accumulators for parallel accumulation
    xorpd xmm0, xmm0        ; sum0
    xorpd xmm1, xmm1        ; sum1
    xorpd xmm2, xmm2        ; sum2
    xorpd xmm3, xmm3        ; sum3
    
    xor ecx, ecx
    mov edx, esi
    shr edx, 3              ; Process 8 doubles per iteration
    jz .remainder
    
.loop_simd:
    ; Process 8 doubles = 4 * 2 doubles
    addpd xmm0, [rdi + rcx*8]       ; Add 2 doubles
    addpd xmm1, [rdi + rcx*8 + 16]  ; Add 2 doubles
    addpd xmm2, [rdi + rcx*8 + 32]  ; Add 2 doubles
    addpd xmm3, [rdi + rcx*8 + 48]  ; Add 2 doubles
    add ecx, 8
    dec edx
    jnz .loop_simd
    
.remainder:
    ; Handle remaining elements (n % 8)
    mov edx, esi
    and edx, 7
    jz .horizontal_sum
    
.remainder_loop:
    addsd xmm0, [rdi + rcx*8]
    inc ecx
    dec edx
    jnz .remainder_loop
    
.horizontal_sum:
    ; Combine the 4 accumulators
    addpd xmm0, xmm1        ; xmm0 = sum0+sum1
    addpd xmm2, xmm3        ; xmm2 = sum2+sum3
    addpd xmm0, xmm2        ; xmm0 = all sums
    
    ; Horizontal add within xmm0
    movapd xmm1, xmm0
    unpckhpd xmm1, xmm0     ; Extract high double
    addsd xmm0, xmm1        ; Final sum
    
    ; Divide by n (use multiplication by reciprocal if n is constant!)
    cvtsi2sd xmm1, esi
    divsd xmm0, xmm1
    ret
    
.scalar_fallback:
    ; Handle small arrays
    xorpd xmm0, xmm0
    xor ecx, ecx
.scalar_loop:
    addsd xmm0, [rdi + rcx*8]
    inc ecx
    cmp ecx, esi
    jl .scalar_loop
    cvtsi2sd xmm1, esi
    divsd xmm0, xmm1
    ret

Optimizations Applied:

✓ SIMD (4x parallelism with packed doubles)
✓ Loop unrolling (8 elements per iteration)
✓ Multiple accumulators (break dependency chains)
✓ Separate remainder handling
✓ Scalar fallback for small arrays

Expected Speedup: 10-15x over naive version!

Profiling and Measurement

CPU Cycle Counting

section .text
global benchmark_function

benchmark_function:
    ; Save registers
    push rbx
    
    ; Read timestamp counter (before)
    rdtsc                   ; EDX:EAX = cycle count
    shl rdx, 32
    or rax, rdx             ; RAX = 64-bit cycle count
    mov rbx, rax            ; Save start time
    
    ; *** Code to benchmark ***
    ; ...
    
    ; Read timestamp counter (after)
    rdtsc
    shl rdx, 32
    or rax, rdx
    
    ; Calculate elapsed cycles
    sub rax, rbx            ; RAX = elapsed cycles
    
    pop rbx
    ret

Using `perf` (Linux)

# Count cycles, instructions, cache misses
perf stat -e cycles,instructions,cache-misses ./program

# Profile with sampling
perf record -g ./program
perf report

# Analyze specific events
perf stat -e L1-dcache-loads,L1-dcache-load-misses ./program

Optimization Checklist

Algorithm Level

Use optimal algorithm (O(n log n) vs O(n²))
Minimize memory allocations
Cache reusable computations

Data Structure Level

Align data structures
Use cache-friendly layouts (SoA vs AoS)
Pack structures to minimize padding
Consider false sharing in multithreaded code

Loop Level

Instruction Level

Use faster instructions (LEA, shifts vs multiply/divide)
Minimize dependency chains
Use conditional moves instead of branches
Align loop targets
Avoid partial register stalls

Memory Level

Prefetch data before use
Access memory sequentially
Minimize cache misses
Use appropriate data alignment

Common Pitfalls

1. Premature Optimization

// DON'T optimize before profiling!
// 90% of time is often in 10% of code

// Profile first:
// - Find the hot spots
// - Optimize those
// - Measure improvement

2. Over-Optimization

; Sometimes "optimized" code is slower!

; "Optimized" but slower (more instructions, worse for branch prediction)
xor eax, eax
cmp [value], 0
setg al
add [result], eax

; Simpler and faster (well-predicted branch)
cmp [value], 0
jle .skip
inc dword [result]
.skip:

3. Ignoring the Compiler

// Modern compilers are VERY good
// Sometimes C is faster than hand-written assembly!

// Let the compiler optimize:
int sum = 0;
for (int i = 0; i < n; ++i) {
    sum += array[i];  // Compiler may auto-vectorize this!
}

Summary

Key Principles

Profile first - Optimize hot paths only
Algorithm beats microoptimization - O(n) vs O(n²) matters more
Understand hardware - Cache, pipeline, branch prediction
Break dependencies - Use multiple accumulators
Use SIMD - 4-16x speedup for parallel data

Optimization Priorities

Algorithm (100x+ gains possible)
Data structures (10x+ gains)
Loop optimizations (2-5x gains)
SIMD (4-16x gains)
Instruction selection (1.5-2x gains)
Micro-optimizations (5-20% gains)

Tools

perf - CPU performance counters
valgrind/cachegrind - Cache profiling
gprof - Function-level profiling
RDTSC - Cycle-accurate timing
VTune - Intel’s profiler (commercial)

Practice Exercises

Exercise 1: Matrix Multiplication

Optimize 4x4 matrix multiplication using SIMD and unrolling.

Exercise 2: String Length

Write the fastest possible strlen implementation.

Exercise 3: Sorting

Implement optimized quicksort with:

Insertion sort for small arrays
Three-way partitioning
Tail recursion elimination

Next Topic

In Topic 20: Debugging & Tools, we’ll explore essential debugging techniques, disassembly analysis, and tools for working with assembly code.

Preview:

GDB debugging
Disassembly with objdump
strace for syscall tracing
Memory debugging with valgrind
Reading crash dumps
Reverse engineering basics