Game Developers Conference™ Europe 2011 August 15-17, 2011 | Cologne, Germany www.GDCEurope.com

**GDC** Europe

# Hotspots, FLOPS, and uOps: To-The-Metal CPU Optimization

Levent Akyil Intel Corp.






### Fast Code == More Stuff == More Fun



Before

After

Developers Software & Services Group, Developer Products Division

**Optimization Notice** 



Copyright © 2011, Intel Corporation. All rights reserved. \*Other brands and names are the property of their respective owners.

# Source Code Does Not Provide The Complete Story

### Example Optimization Tips:

- "Use the return value optimization"
- "Pass by reference instead of value"
- "Use ++i instead of i++"
- "Cache intermediate computations"
- "Unroll loops"



Has this guy studied <u>your</u> code?

Without context, ad hoc source code tips may result in random trial and error.






Intel<sup>®</sup> Microarchitecture Codename SandyBridge

### Take the Guesswork out of Optimization!




# Key x86 Architecture Features for Developers To Know



[Figure: Intel® Microarchitecture Codename SandyBridge block diagram showing the branch prediction unit (BPU), 32K instruction cache, legacy decode pipeline and MSROM, decoded ICache, micro-op queue, rename/retirement, integer scheduler, load and store ports, 32K data cache, 256K L2 cache, uncore, and the x87, MMX/SSE and AVX execution units. Callouts mark the four features covered in this talk: cache (data and instr), branch prediction, out-of-order uOp scheduling, and wide registers up to 256-bit.]


### **Cache Behavior Affects Performance**



- Cost of data access increases with distance from the CPU
- Programming tips:
  - Maximize work done on cached data
  - Work with hardware prefetch (arrays vs. linked lists)

### Cost of accessing data

| Where data is resident | Time to fetch data |
|------------------------|--------------------|
| Register               | 1 cycle            |
| L1 Cache               | 4 cycles           |
| L2 Cache               | 10 cycles          |
| L3 Cache               | 40-75 cycles       |
| Memory                 | 60-100 ns          |

http://software.intel.com/sites/products/collateral/hpc/vtune/performance_analysis_guide.pdf
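A small sketch of the arrays-versus-linked-lists tip. Both functions compute the same sum; the difference is only in how the data is laid out in memory. The function names are illustrative, and no timings are claimed here, so measure on your own data.

```cpp
#include <list>
#include <vector>

// Contiguous storage: neighbouring elements share cache lines, and the
// sequential access pattern is easy for the hardware prefetcher to follow.
float sum_array(const std::vector<float>& values) {
    float total = 0.0f;
    for (float v : values) total += v;
    return total;
}

// Node-based storage: each element lives in its own heap node, so every step
// follows a pointer to an address that is hard to predict ahead of time.
float sum_list(const std::list<float>& values) {
    float total = 0.0f;
    for (float v : values) total += v;
    return total;
}
```

The source code of the two loops is identical, which is the point: the cost difference comes from the data layout, not from the statements.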




# Key x86 Architecture Features for Developers To Know



- Cache (data and instr)
- Branch Prediction
- Out-of-order uOp Scheduling
- Wide Registers up to 256-bit



Intel<sup>®</sup> Microarchitecture Codename SandyBridge


# **BPU Helps (when predictable)**




`for(int i=0; i<4096; i++) m = max(m, x[i]);`

#### Compiled using branch code

Instruction sequence per iteration: `vmovss`, `vcomiss`, `jbe`, `vmovss`, `add`, `cmp`, `jl`

#### Compiled using max instruction

Instruction sequence per iteration: `vmovss`, `add`, `vshufps`, `vmaxss`, `cmp`, `jl`

Cycles per iteration (lower is better):

| Array Ordering | branch | max uOp |
|----------------|--------|---------|
| Monotonic      | 2.1    | 3.0     |
| Pathological   | 9.8    | 3.0     |
| Random         | 2.2    | 3.0     |
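A minimal C++ sketch of two source forms that can lead to the two code shapes compared above. It assumes, which the slide does not state, that an explicit `if` tends to produce the branch version and a conditional expression tends to produce the max instruction. The function names are illustrative, and the compiler is free to emit either form for both, so checking the disassembly is still necessary.

```cpp
#include <cassert>
#include <cstddef>

// Explicit branch: may compile to compare-and-jump, whose cost then depends
// on how predictable the data ordering is.
float max_branch(const float* x, std::size_t n) {
    assert(n > 0);
    float m = x[0];
    for (std::size_t i = 1; i < n; ++i) {
        if (x[i] > m) m = x[i];
    }
    return m;
}

// Select form: a value-selecting expression that compilers can often map to a
// max instruction, giving a cost that does not depend on the data ordering.
float max_select(const float* x, std::size_t n) {
    assert(n > 0);
    float m = x[0];
    for (std::size_t i = 1; i < n; ++i) {
        m = (x[i] > m) ? x[i] : m;
    }
    return m;
}
```

Both functions return the same value for the same input; only the generated code, and therefore the sensitivity to branch prediction, may differ.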


# Key x86 Architecture Features for Developers To Know



- Cache (data and instr)
- Branch Prediction
- Out-of-order uOp Scheduling
- Wide Registers up to 256-bit



Intel<sup>®</sup> Microarchitecture Codename SandyBridge




### Understanding Out-Of-Order Execution



### How would you expect these 3 loops to perform?

| Test  | Code                                                     | Measured\* CPU Cycles |
|-------|----------------------------------------------------------|-----------------------|
| SASPY | `for(int i=1; i<n; i++)`<br>`s[i] = a * s[i-1] + y[i];` |                       |
| SAXPS | `for(int i=1; i<n; i++)`<br>`s[i] = a * x[i] + s[i-1];` |                       |
| SAXPY | `for(int i=1; i<n; i++)`<br>`s[i] = a * x[i] + y[i];`   |                       |

Comparison of 3 near-identical loops with different data access patterns


# Similar code, yet **significant** differences in Performance



| Test  | Code                                                     | Measured\* CPU Cycles |
|-------|----------------------------------------------------------|-----------------------|
| SASPY | `for(int i=1; i<n; i++)`<br>`s[i] = a * s[i-1] + y[i];` | 14.0                  |
| SAXPS | `for(int i=1; i<n; i++)`<br>`s[i] = a * x[i] + s[i-1];` | 9.0                   |
| SAXPY | `for(int i=1; i<n; i++)`<br>`s[i] = a * x[i] + y[i];`   | 2.2                   |


### Assembly Code Comparison



```
for(int i=0;i<N;i++) // saxpy
    s[i] = a * x[i] + y[i];
```

```
for(int i=1;i<N;i++) // saspy
    s[i] = a * s[i-1] + y[i];
```

### Assembly code:

```
; saxpy
vmovss xmm0,dword ptr X[eax]
vmulss xmm0,xmm0,dword ptr [A]
vaddss xmm0,xmm0,dword ptr Y[eax]
vmovss dword ptr S[eax],xmm0
add    eax,4
cmp    eax,1000h
jl     saxpy+7 (13A1041h)

; saspy
vmovss xmm0,dword ptr S[eax]
vmulss xmm0,xmm0,dword ptr [A]
vaddss xmm0,xmm0,dword ptr Y+4[eax]
vmovss dword ptr S+4[eax],xmm0
add    eax,4
cmp    eax,0FFCh
jl     saspy+7 (13A10B5h)
```

### Same instruction sequence!




# Throughput != Latency



Instruction "cost" consists of two important timings:

- Instruction Latency: time to complete the operation and return the result
- Instruction Throughput: frequency at which a new operation can be issued

| Operation                         | Latency (cycles) | Throughput (cycles) |
|-----------------------------------|------------------|---------------------|
| + - \* rsqrt, rcp, hadd, min, max | 3-5              | 1                   |
| div, sqrt                         | 14               | 14                  |
| sin, cos                          | 160-200          | 130                 |
| move, load/store                  | >=1              | .5                  |
| dot product                       | 12               | 2                   |

E.g., while waiting 5 cycles for a multiplication to complete, we can begin 5 other multiplication operations.

Note: This table is an extremely condensed version of the Intel<sup>®</sup> Architecture Manual
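A worked version of the multiplication example, as an idealized model rather than a measurement. It assumes a latency of $L = 5$ cycles, one new multiplication issued per cycle, and $N = 6$ multiplications (the one being waited on plus the 5 others):

$$T_{\text{dependent}} = N \cdot L = 6 \cdot 5 = 30 \text{ cycles}, \qquad T_{\text{independent}} \approx L + (N - 1) = 5 + 5 = 10 \text{ cycles}$$

When each multiplication needs the result of the previous one, latency sets the pace. When they are independent, throughput does.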




# How x86 Instructions Get Processed

ASM instructions => uOps; asm registers => physical registers

- uOps execute when ready
- Up to 1 uOp per port per cycle
- Execution order depends on data dependencies, not on instruction order



| Port #               | 0  | 1   | 2    | 3    | 4     | 5       |
|----------------------|----|-----|------|------|-------|---------|
| Operations<br>(uOps) | \* / | + - | Load | Load | Store | Shuffle |


# The Pipeline Slot Methodology, Illustrated



Case 1: Front-End does not provide micro-operations for all 4 pipeline slots

### Front-End Bound


### The Pipeline Slot Methodology, Illustrated






Case 2: Back-End cannot accept micro-operations for all 4 pipeline slots

### Back-End Bound


### The Pipeline Slot Methodology, Illustrated





Case 3: Micro-operations make it to the Back-End, but then get removed from the pipeline

### Cancelled


### The Pipeline Slot Methodology, Illustrated





Case 4: Micro-operations make it to the Back-End, Execute, and then Retire

### Retired

# s=ax+y decomposed into uOps




### Output from Intel Architecture Code Analyzer (IACA)

[IACA report for the `s = a*x + y` loop body; per-port pressure table for ports 0-5 omitted]

- Loop Throughput: 2 cycles
- Loop Latency: 14 cycles
- Instructions analyzed: the `vmovss`, `vmulss`, `vaddss`, `vmovss` sequence of the SAXPY loop, followed by `add eax, 0x4`, `cmp eax, 0x8000`, and `jl`

### Note the range in throughput and latency


# **Data Dependencies** Explain the Disparity in Performance



| Test  | Code                                                     | Measured\* CPU Cycles |
|-------|----------------------------------------------------------|-----------------------|
| SASPY | `for(int i=1; i<n; i++)`<br>`s[i] = a * s[i-1] + y[i];` | 14.0                  |
| SAXPS | `for(int i=1; i<n; i++)`<br>`s[i] = a * x[i] + s[i-1];` | 9.0                   |
| SAXPY | `for(int i=1; i<n; i++)`<br>`s[i] = a * x[i] + y[i];`   | 2.2<br>(1.6 unrolled) |

Notes:

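- The measured range (14.0 down to 2.2 cycles) matches the predicted loop latency (14 cycles) and loop throughput (2 cycles).
- The 2<sup>nd</sup> loop (SAXPS) can begin its multiplication early, because `a * x[i]` does not depend on the previous iteration.
- The 3<sup>rd</sup> loop (SAXPY, no dependencies) benefits further from compiler-generated loop unrolling (1.6 cycles).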
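The three loops written out as C++ functions, as a sketch for reproducing the comparison. The function signatures are illustrative; the comments mark which operations sit on the loop-carried dependency chain.

```cpp
#include <cstddef>

// SASPY: the multiply and the add both wait for s[i-1] from the previous
// iteration, so each iteration pays the full multiply + add latency.
void saspy(float* s, const float* y, float a, std::size_t n) {
    for (std::size_t i = 1; i < n; ++i) s[i] = a * s[i - 1] + y[i];
}

// SAXPS: a * x[i] is independent of earlier iterations and can start early;
// only the add is on the dependency chain.
void saxps(float* s, const float* x, float a, std::size_t n) {
    for (std::size_t i = 1; i < n; ++i) s[i] = a * x[i] + s[i - 1];
}

// SAXPY: no loop-carried dependency, so iterations can overlap freely and
// the loop runs at throughput speed.
void saxpy(float* s, const float* x, const float* y, float a, std::size_t n) {
    for (std::size_t i = 1; i < n; ++i) s[i] = a * x[i] + y[i];
}
```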




# Reduce Dependencies to Maximize Throughput



| Find Array Maximum       | Code                                                                                                | Cycles |
|--------------------------|-----------------------------------------------------------------------------------------------------|--------|
| Standard Solution        | `for(int i=0; i<n; i++)`<br>`m = max(m,x[i]);`                                                      | 3.0    |
| Loop Unrolled            | `for(int i=0; i<n; i+=2) {`<br>`m = max(m,x[i]);`<br>`m = max(m,x[i+1]);`<br>`}`                    | 3.0    |
| Dependence Reduced       | `for(int i=0; i<n; i+=2) {`<br>`m0 = max(m0,x[i]);`<br>`m1 = max(m1,x[i+1]);`<br>`}`                | 1.6    |
| Dependence Reduced Twice | `for(int i=0; i<n; i+=4) {`<br>`m0 = max(m0,x[i]);`<br>`...`<br>`m3 = max(m3,x[i+3]);`<br>`}`       | 1.0    |

Note: x86 instruction VMAXSS has a latency of 3
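A compilable sketch of the single-chain and four-chain versions. It adds two things the slide leaves out: a tail loop for lengths that are not a multiple of 4, and the final merge of the partial maxima. The function names are illustrative.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>

// One accumulator: every max waits for the previous one (a single chain).
float max_one_chain(const float* x, std::size_t n) {
    assert(n > 0);
    float m = x[0];
    for (std::size_t i = 1; i < n; ++i) m = std::max(m, x[i]);
    return m;
}

// Four accumulators: four independent chains that can be in flight together.
float max_four_chains(const float* x, std::size_t n) {
    assert(n > 0);
    float m0 = x[0], m1 = x[0], m2 = x[0], m3 = x[0];
    std::size_t i = 0;
    for (; i + 4 <= n; i += 4) {
        m0 = std::max(m0, x[i]);
        m1 = std::max(m1, x[i + 1]);
        m2 = std::max(m2, x[i + 2]);
        m3 = std::max(m3, x[i + 3]);
    }
    for (; i < n; ++i) m0 = std::max(m0, x[i]);  // leftover elements
    return std::max(std::max(m0, m1), std::max(m2, m3));
}
```

As a rough bound, a 3-cycle `VMAXSS` chain split over k independent accumulators costs about 3/k cycles per element: 3.0, 1.5 and 0.75 for k = 1, 2 and 4, against the measured 3.0, 1.6 and 1.0. The gap at k = 4 suggests that some other limit takes over there.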


# Key x86 Architecture Features for Developers To Know



- Cache (data and instr)
- Branch Prediction
- Out-of-order uOp Scheduling
- Wide Registers up to 256-bit



Intel<sup>®</sup> Microarchitecture Codename SandyBridge




# SIMD with AVX - up to 256 bit (8 floats)



- New instructions with 2<sup>nd</sup> generation Intel Core CPUs
- Supports 128 and 256-bit SIMD
- Non-destructive instructions



### C/C++ Intrinsics

`__m256 a,b,c; ... c = _mm256_add_ps(a,b);`


### AVX SIMD applied to SAXPY



| Test                    | Code                                                                                      | Measured\* CPU Cycles / N |
|-------------------------|-------------------------------------------------------------------------------------------|---------------------------|
| SAXPY<br>1 at a time    | `// float *s,*x,*y,a;`<br>`for(int i=0; i<n; i++)`<br>`s[i] = a * x[i] + y[i];`          | 2.2                       |
| SAXPY128<br>4 at a time | `// __m128 *s, *x, *y, a;`<br>`for(int i=0; i<n/4; i++)`<br>`s[i] = a * x[i] + y[i];`    | 0.6                       |
| SAXPY256<br>8 at a time | `// __m256 *s, *x, *y, a;`<br>`for(int i=0; i<n/8; i++)`<br>`s[i] = a * x[i] + y[i];`    | 0.3                       |

\*Measured time is total time divided by N (N==2048)
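A sketch of SAXPY256 written with AVX intrinsics. It makes two choices that differ from the slide: it uses unaligned loads and stores (`_mm256_loadu_ps`, `_mm256_storeu_ps`) instead of casting the arrays to `__m256*`, and it falls back to a scalar loop when the compiler is not targeting AVX. The name `saxpy256` and its signature are illustrative.

```cpp
#include <cstddef>
#if defined(__AVX__)
#include <immintrin.h>
#endif

// s[i] = a * x[i] + y[i] for i in [0, n). Uses 256-bit AVX when the compiler
// targets AVX (for example -mavx or /arch:AVX), otherwise a scalar loop.
void saxpy256(float* s, const float* x, const float* y, float a, std::size_t n) {
    std::size_t i = 0;
#if defined(__AVX__)
    const __m256 va = _mm256_set1_ps(a);           // broadcast a to all 8 lanes
    for (; i + 8 <= n; i += 8) {
        const __m256 vx = _mm256_loadu_ps(x + i);  // no alignment assumed
        const __m256 vy = _mm256_loadu_ps(y + i);
        _mm256_storeu_ps(s + i, _mm256_add_ps(_mm256_mul_ps(va, vx), vy));
    }
#endif
    for (; i < n; ++i) s[i] = a * x[i] + y[i];     // tail, or everything without AVX
}
```

The scalar tail lets the function accept any length, whereas the slide version assumes n is a multiple of 8.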




# Array Maximum Example with AVX SIMD



| Array Maximum           | Code                                                                                                             | Cycles: Serial | Cycles: SIMD4 version | Cycles: SIMD8 version |
|-------------------------|------------------------------------------------------------------------------------------------------------------|----------------|-----------------------|-----------------------|
| Standard Solution       | `for(int i=0; i<n; i+=8)`<br>`m = _mm256_max_ps(m, *(__m256*)(x+i));`<br>`// m = max(m,x[i]);`                   | 3.0            | 0.73                  | 0.36                  |
| Dependence Reduced      | `for(int i=0; i<n; i+=2) {`<br>`m0 = max(m0,x[i]);`<br>`m1 = max(m1,x[i+1]);`<br>`}`                             | 1.6            | 0.38                  | 0.18                  |
| Dependence Reduced More | `for(int i=0; i<n; i+=4) {`<br>`m0 = max(m0,x[i]);`<br>`...`<br>`m3 = max(m3,x[i+3]);`<br>`}`                    | 1.0            | 0.26                  | 0.13                  |

### Exploiting both SIMD and Instruction parallelism
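A sketch that combines both ideas: two independent 8-wide accumulators, reduced to a single float at the end. The horizontal reduction through a small array and the scalar fallback are additions for completeness, not taken from the slides, and the function name is illustrative.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#if defined(__AVX__)
#include <immintrin.h>
#endif

float max256_two_chains(const float* x, std::size_t n) {
    assert(n > 0);
    float m = x[0];
    std::size_t i = 0;
#if defined(__AVX__)
    if (n >= 16) {
        __m256 m0 = _mm256_loadu_ps(x);      // chain 0 starts with elements 0..7
        __m256 m1 = _mm256_loadu_ps(x + 8);  // chain 1 starts with elements 8..15
        for (i = 16; i + 16 <= n; i += 16) {
            m0 = _mm256_max_ps(m0, _mm256_loadu_ps(x + i));
            m1 = _mm256_max_ps(m1, _mm256_loadu_ps(x + i + 8));
        }
        float lanes[8];
        _mm256_storeu_ps(lanes, _mm256_max_ps(m0, m1));  // merge the two chains
        for (float v : lanes) m = std::max(m, v);        // reduce 8 lanes to 1
    }
#endif
    for (; i < n; ++i) m = std::max(m, x[i]);  // tail, or everything without AVX
    return m;
}
```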


# SIMD Programming Patterns in Games



### Typical "SSE" 4D Vector Class:

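```
#include <xmmintrin.h>  // __m128, _mm_add_ps

class Vec4
{
  public:
    union {
      struct { float x, y, z, w; };  // anonymous struct: a common compiler extension
      __m128 v;
    };
    Vec4(__m128 m) : v(m) {}         // wrap a raw SSE register
};
inline Vec4 operator+(const Vec4 &a, const Vec4 &b)
{
  return Vec4(_mm_add_ps(a.v, b.v)); // one instruction adds all four components
}
```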

`Vec4 v = u + w; // add two 4D vectors ...`


# Scaling with 4D xyzw SIMD pattern



Real results are typically ~2X:

- Not using all 4 flops at each operation
- Often used for 3D data
- Shuffle overheads
- Instruction parallelism sometimes lost
- Pattern doesn't scale to 256-bit (8 float) SIMD

[Figures: 3D dot product, serial; 3D/4D dot product, 128-bit SIMD]


# Another Way To Use SIMD is SOA

### Array of Structures in Memory (AOS)

| x0 | y0 | z0 | w0 | x1 | y1 | z1 | w1 | x2 | ... |
|----|----|----|----|----|----|----|----|----|-----|

### Structure of Arrays in Memory (SOA)

| x0 | x1 | x2 | x3 | x4 |     |
|----|----|----|----|----|-----|
| y0 | y1 | y2 | y3 | y4 |     |
| z0 | z1 | z2 | z3 | z4 |     |
| w0 | w1 | w2 | w3 | w4 | ... |

# C++ programming patterns for SOA

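```
#include <immintrin.h>  // __m256, _mm256_add_ps

template <class T>
class Vec3
{
public:
   T x;
   T y;
   T z;
   Vec3(T x_, T y_, T z_) : x(x_), y(y_), z(z_) {}
};
template <class T>
Vec3<T> operator + (const Vec3<T> &a, const Vec3<T> &b)
{
          return Vec3<T>(a.x+b.x, a.y+b.y, a.z+b.z);
}
// Needed where __m256 is a plain struct type (e.g. MSVC);
// GCC and Clang already provide + for __m256.
__m256 operator + (const __m256 &a, const __m256 &b)
{
          return _mm256_add_ps(a,b);
}
// ...
Vec3<__m256> v = u + w; // 8 vector additions at a time
```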




# Gather/Scatter - To SOA and Back! Dealing With 'Real' Application Data





### AVX 256-bit Programming Patterns - Gather/Scatter

### Technique

- Linear Traversal of an Array
  - Exploit regular access patterns
  - Use x86 shuffle for Transpose

#### Example

```
Vec3<float> v[];
for(int i=0; i<N; i+=8) {
    Vec3<__m256> u = trans8x3(v+i);
    u = Normalize(u); // 8 at a time
    trans3x8(v+i,u);
}
```

- Indexing/Indirection (Gather 8)
  - Use 4 float xyzw data pattern
  - Align Data (pad if necessary)
  - 4x8 transpose
  - SOA code patterns

```
Vec4<float> v[];
for(int i=0; i<N; i+=8) {
    Vec8<__m128> g(v[k[i]],...,v[k[i+7]]);
    Vec4<__m256> u = trans8x4(g);
    r = MyCompute<__m256>(u); // do 8
    ...
}
```

- Indexing/Indirection (Gather 2)
  - Use 4 float xyzw data pattern
  - Use 256-bit as a way to Pair two 128-bit computations
  - AOS xyzw code patterns

```
Vec4<float> v[];
for(int i=0; i<N; i+=2) {
    __m256 u(v[k[i]],v[k[i+1]]);
    r = MyComputePair(u); // 2 at a time
    ...;
}
```
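A portable scalar sketch of the linear-traversal pattern, assuming `trans8x3` and `trans3x8` are the AOS-to-SOA and SOA-to-AOS transposes their names suggest. The slides implement them with x86 shuffles on `__m256`; here plain loops over 8-float arrays show only the data movement. `Vec3x8` is a hypothetical stand-in for `Vec3<__m256>`, and every name below is illustrative.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>

struct Vec3f  { float x, y, z; };            // AOS element
struct Vec3x8 { float x[8], y[8], z[8]; };   // 8 vectors in SOA form

// AOS -> SOA: gather 8 consecutive xyz structs into three 8-wide rows.
Vec3x8 trans8x3_scalar(const Vec3f* v) {
    Vec3x8 u;
    for (int k = 0; k < 8; ++k) { u.x[k] = v[k].x; u.y[k] = v[k].y; u.z[k] = v[k].z; }
    return u;
}

// Normalize all 8 vectors; each loop iteration corresponds to one SIMD lane.
// Zero-length vectors are not handled.
Vec3x8 normalize8(Vec3x8 u) {
    for (int k = 0; k < 8; ++k) {
        const float len = std::sqrt(u.x[k] * u.x[k] + u.y[k] * u.y[k] + u.z[k] * u.z[k]);
        u.x[k] /= len; u.y[k] /= len; u.z[k] /= len;
    }
    return u;
}

// SOA -> AOS: scatter the three rows back into 8 xyz structs.
void trans3x8_scalar(Vec3f* v, const Vec3x8& u) {
    for (int k = 0; k < 8; ++k) { v[k].x = u.x[k]; v[k].y = u.y[k]; v[k].z = u.z[k]; }
}

// Linear traversal, 8 vectors at a time; n must be a multiple of 8.
void normalize_all(Vec3f* v, std::size_t n) {
    assert(n % 8 == 0);
    for (std::size_t i = 0; i < n; i += 8) {
        trans3x8_scalar(v + i, normalize8(trans8x3_scalar(v + i)));
    }
}
```

The computation in the middle never sees the AOS layout, which is what lets the same SOA code patterns be reused for data that is stored as structs.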







### Pairing Two 128-bit Computations - Skinning Example




# **Performance tuning**

Agenda:

- x86 Architecture
  - Program Execution Flow
  - AVX SIMD
  - Easy, effective code patterns
- Performance Tuning Workflow
  - Hotspot profiling
  - Events and VTune™ performance guided analysis
- Walkthrough/Examples



Intel<sup>®</sup> Microarchitecture Codename SandyBridge


### Take the Guesswork out of Optimization!


# Code profiling and performance tuning



- Goal: Make programs run *faster*
- For video games, it's low latency
  - Finish drawing in a bounded amount of time
- Where do I start optimizing?
  - Limited time, maximize effort
- Solution: Use code profiling tools for performance tuning
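Before reaching for a profiler, a coarse timing check can confirm whether a piece of work fits the frame budget at all. A minimal sketch using `std::chrono`; `time_ms` and `within_budget` are hypothetical helpers, and the 60 frames-per-second target mentioned below is an assumption, not a figure from the talk. This does not replace a profiler: it reports how long something took, not where the time went.

```cpp
#include <chrono>

// Milliseconds taken by one call of `work`, measured with a monotonic clock.
template <class Work>
double time_ms(Work&& work) {
    const auto start = std::chrono::steady_clock::now();
    work();
    const auto stop = std::chrono::steady_clock::now();
    return std::chrono::duration<double, std::milli>(stop - start).count();
}

// True when the elapsed time fits within one frame at the given frame rate.
inline bool within_budget(double elapsed_ms, double frames_per_second) {
    return elapsed_ms <= 1000.0 / frames_per_second;
}
```

For example, `within_budget(time_ms(update_frame), 60.0)` would check a hypothetical `update_frame` function against a budget of roughly 16.7 ms.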


# 2 kinds of performance tuning



#### Algorithmic

- Applies to all architectures
- Generally improves code elegance and conciseness
- Also includes the quality of parallel decompositions, CPU usage, and other multithreading issues

#### Hardware

- Architecture-specific (though commonalities exist)
- Tends to obfuscate code (e.g. matrix blocking)
- Requires architectural understanding

Both are essential!


# Analysis and tuning workflow





#### Performance tuning is an iterative process




## Intel<sup>®</sup> VTune<sup>™</sup> Amplifier XE



- Helps analyze code performance
  - Multi-threaded and hardware bottlenecks
  - Find hotspots, analyze thread performance
  - Compare before and after performance
- Available for Windows and Linux
  - Integrates with Microsoft Visual Studio
  - Also standalone GUI for both Windows & Linux

#### http://software.intel.com/en-us/articles/intel-vtune-amplifier-xe/


### Types of analysis


## Intel VTune Amplifier XE Algorithmic Analysis: Hotspots



[Screenshot: Intel VTune Amplifier XE 2011, Concurrency analysis, Hotspots by Thread Concurrency viewpoint (Bottom-up pane). CPU time per function is broken down by utilization (Idle, Poor, Ok, Ideal, Over), with the hottest call stack shown alongside. Functions listed include `WARSCAPE::WS_RENDER_DEVICE::present` (31.445s), `WARSCAPE::FX_MANAGER_ENTRY::d3d_create_fx_inner` (18.868s), and `WARSCAPE::TEXTURE_MANAGER_ENTRY::update_compressed<struct CA::Pixel8888>` (1.965s).]

#### Quickly identify what is important


## Intel VTune Amplifier XE Algorithmic Analysis: Concurrency and Frame Analysis








#### Hardware event-based sampling



- Performance Monitoring Unit (PMU) counters
  - Offer a unique and powerful view into the CPU
    - Reveal architectural bottlenecks in uninstrumented code running at full speed
    - Hundreds of events offer insights into every part of the microarchitecture
  - A top-down optimization methodology exists for using event counters, but it is beyond the scope of this class
- At the level of functions and higher, raw events aren't that useful
  - Who cares if I experienced 2,166,000,000 DTLB misses? What matters is how much it cost me!


#### Hardware event-based sampling




- VTune<sup>™</sup> Amplifier XE helps make PMU-based performance tuning easier
  - Several predefined analysis types help you focus on specific problems
  - Even better, VTune<sup>™</sup> Amplifier XE shows *metrics* rather than raw PMU event counts. Instead of 2.17B DTLB misses, we can see what proportion of the time the app was dealing with DTLB overhead.
- What does that look like?
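As a worked example of turning counts into a metric, cycles per instruction (CPI) divides clockticks by instructions retired. Taking the `[vmlinux]` row of the example capture that follows, and assuming its 1,430,000,000 and 1,100,000,000 values are the unhalted thread clockticks and `INST_RETIRED.ANY` counts that the truncated column names suggest:

$$\text{CPI} = \frac{\text{clockticks}}{\text{instructions retired}} = \frac{1{,}430{,}000{,}000}{1{,}100{,}000{,}000} = 1.3$$

A ratio like this reads the same however long the run was, which a raw count does not.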


## Hardware event-based sampling -- Example

[Screenshot: Intel VTune Amplifier XE 2011, General Exploration analysis, Hardware Event Counts viewpoint; event names are truncated in the capture]

|           |   | CPU_CLK_U… THREAD by Package | CPU_CLK_U… REF_TSC by Package | OFFCORE_R… DATA_IN_SO… LLC_MISS_L… DRAM_0 by … | INST_RETIRED.ANY by Package | OFFCORE_R… DANY_REQUE… LLC_MISS_L… DRAM_0 by … | MEM_LOAD… XSNP_HIT_PS by Package | MEM_LOAD… LLC_HIT_PS by Package | MEM_LOAD… XSNP_HITM by Package | …REPLA… by P… |
|-----------|---|------------------------------|-------------------------------|------------------------------------------------|-----------------------------|------------------------------------------------|----------------------------------|---------------------------------|--------------------------------|---------------|
| [nfs]     | 0 | 0                            | 2,000,000                     | 0                                              | 0                           | 0                                              | 0                                | 0                               | 0                              |               |
| [sep3_4]  | 0 | 0                            | 0                             | 0                                              | 0                           | 0                                              | 0                                | 0                               | 0                              |               |
| [vmlinux] | 0 | 1,430,000,000                | 1,368,000,000                 | 37,600,000                                     | 1,100,000,000               | 32,800,000                                     | 0                                | 0                               | 0                              |               |
| checkSTREAM              | 1results 0         | 798,000,000                       | 740,000,000                        | 30,000,000                                      | 802,000,000                        | 29,600,000<br>979,200,000                       | 0                                     | 200,000                              | C                                   | )<br>) 1,0         |
|                          |                    |                                   |                                    |                                                 |                                    | R.                                              |                                       |                                      |                                     |                    |
|                          | Selected 1 row(s): | 18,722,000,000                    | 17,300,000,000                     | 966,800,000                                     | 20,326,000,000                     | 979,200,000                                     | 0                                     | 186,400,000                          | C                                   | ) 1,               |
|                          |                    |                                   |                                    |                                                 |                                    |                                                 |                                       |                                      |                                     |                    |
| Q\$ <b>Q+</b> Q=Q\$      | ⇔ 0.5s             | ls 1.5s                           | 2s                                 | 2.5s 3s                                         | 3.5s                               | 4s 4                                            | .5s 5s                                | 5.5s                                 | 6s <b>+ </b>                        |                    |
|                          |                    |                                   |                                    |                                                 |                                    |                                                 |                                       |                                      |                                     |                    |
| Thread (0x<br>Hardware E |                    |                                   | <u></u>                            | <u></u>                                         |                                    |                                                 |                                       |                                      | R<br>■ W ■ R<br>■ W ₩ Hardw         | unning<br>ardwa    |


## Hardware event-based sampling -- Example

*[Screenshot: Intel VTune Amplifier XE 2011, General Exploration analysis, Bottom-up tab. Metric columns visible in the grid: CPI, Retiring, Assists, Bad Speculation (Branch Mispredict), Back-end Bound (Memory Latency), LLC Miss, Split 4K accesses. For the function `main`: CPI 0.921, Retiring 0.298, LLC Miss 0.388. The timeline below the grid plots the hardware event CPU_CLK_UNHALTED.TH… over roughly 6 seconds.]*
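The CPI value in this view can be cross-checked against the raw counts from the Hardware Event Counts view: cycles per instruction is the unhalted clocktick count divided by the retired instruction count. The sketch below assumes the selected row in that view (18,722,000,000 for CPU_CLK_UNHALTED.THREAD and 20,326,000,000 for INST_RETIRED.ANY) is the same `main` row shown here; the function name `cycles_per_instruction` is illustrative and not part of VTune.

```cpp
#include <cassert>
#include <cstdint>

// Cycles per instruction from two hardware event counts.
//   clockticks   : CPU_CLK_UNHALTED.THREAD
//   instructions : INST_RETIRED.ANY
double cycles_per_instruction(std::uint64_t clockticks, std::uint64_t instructions) {
    assert(instructions != 0 && "CPI is undefined when nothing retired");
    return static_cast<double>(clockticks) / static_cast<double>(instructions);
}
```

Calling `cycles_per_instruction(18722000000ULL, 20326000000ULL)` gives about 0.921, which agrees with the CPI column for `main`.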

## **AVX Cloth Sample**

Maximizing Throughput and exploiting 8-wide SIMD in practice

- Cloth Simulation Background
- Distance Constraint Update (Key Hotspot)
- Aligned data and working-set sizes chosen to fit in cache
- Ordering the constraints to avoid data dependencies
- Mapping to 8-float SIMD with SOA
- AVX transpose to AOS vertex buffer
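As a rough illustration of the distance-constraint hotspot mapped to 8-float SIMD with SOA data, the sketch below relaxes eight constraints per iteration. It is not the code of the AVX Cloth sample: the `ConstraintBatch` layout and the `relax` function are hypothetical, and it assumes the constraints in a batch have already been ordered so that no two of them share a vertex and that no edge has zero length. The intrinsics path is used only when the compiler targets AVX; otherwise an equivalent scalar loop runs.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#if defined(__AVX__)
#include <immintrin.h>
#endif

// Hypothetical SOA batch: lane k of every array belongs to constraint k,
// so 8 constraints fill one 256-bit register per array.
struct ConstraintBatch {
    float *ax, *ay, *az;   // first endpoint of each constraint
    float *bx, *by, *bz;   // second endpoint of each constraint
    const float* rest;     // rest length of each constraint
    std::size_t count;     // number of constraints, a multiple of 8
};

// Moves both endpoints of every constraint halfway back toward its rest length.
void relax(const ConstraintBatch& c) {
    assert(c.count % 8 == 0);
    for (std::size_t i = 0; i < c.count; i += 8) {
#if defined(__AVX__)
        __m256 ax = _mm256_loadu_ps(c.ax + i), bx = _mm256_loadu_ps(c.bx + i);
        __m256 ay = _mm256_loadu_ps(c.ay + i), by = _mm256_loadu_ps(c.by + i);
        __m256 az = _mm256_loadu_ps(c.az + i), bz = _mm256_loadu_ps(c.bz + i);
        __m256 dx = _mm256_sub_ps(bx, ax);
        __m256 dy = _mm256_sub_ps(by, ay);
        __m256 dz = _mm256_sub_ps(bz, az);
        __m256 len = _mm256_sqrt_ps(_mm256_add_ps(
            _mm256_add_ps(_mm256_mul_ps(dx, dx), _mm256_mul_ps(dy, dy)),
            _mm256_mul_ps(dz, dz)));
        // s = 0.5 * (len - rest) / len, one value per constraint
        __m256 s = _mm256_mul_ps(_mm256_set1_ps(0.5f),
            _mm256_div_ps(_mm256_sub_ps(len, _mm256_loadu_ps(c.rest + i)), len));
        _mm256_storeu_ps(c.ax + i, _mm256_add_ps(ax, _mm256_mul_ps(s, dx)));
        _mm256_storeu_ps(c.ay + i, _mm256_add_ps(ay, _mm256_mul_ps(s, dy)));
        _mm256_storeu_ps(c.az + i, _mm256_add_ps(az, _mm256_mul_ps(s, dz)));
        _mm256_storeu_ps(c.bx + i, _mm256_sub_ps(bx, _mm256_mul_ps(s, dx)));
        _mm256_storeu_ps(c.by + i, _mm256_sub_ps(by, _mm256_mul_ps(s, dy)));
        _mm256_storeu_ps(c.bz + i, _mm256_sub_ps(bz, _mm256_mul_ps(s, dz)));
#else
        // Scalar equivalent of the 8-lane update above.
        for (std::size_t k = i; k < i + 8; ++k) {
            float dx = c.bx[k] - c.ax[k];
            float dy = c.by[k] - c.ay[k];
            float dz = c.bz[k] - c.az[k];
            float len = std::sqrt(dx * dx + dy * dy + dz * dz);
            float s = 0.5f * (len - c.rest[k]) / len;
            c.ax[k] += s * dx; c.ay[k] += s * dy; c.az[k] += s * dz;
            c.bx[k] -= s * dx; c.by[k] -= s * dy; c.bz[k] -= s * dz;
        }
#endif
    }
}
```

Because every lane does the same arithmetic on independent data, there are no branches and no cross-lane dependencies inside the loop body, which is the property the constraint ordering is meant to guarantee.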




## Demo of VTune Amplifier XE on AVX Cloth






Before optimization


After optimization


## Tools help in tuning performance



- Not automatically... yet!
  - This is a hard problem: what does "run faster" mean, and which behavior is correct and which is incorrect?
- They show where code is slow, and why it's slow
  - Hotspots, the fundamental unit of performance tuning
- Performance tuning is an iterative process
  - Phases of analysis, using tools like VTune Amplifier XE, alternate with phases of contemplation and code editing
- In the end, developers must decide their goal


# Fast Code == More Stuff == More Fun



- Optimize code to harness modern CPU power
  - Saturate execution ports with work every cycle
  - Do big pieces of work
- Consider an AVX build of your app
  - Do 8 at a time and fully utilize AVX SIMD
  - Use SOA if possible (static or on-the-fly data transpose)
  - Pair 4D SIMD patterns otherwise
- Sanity check the source code (and perhaps assembly) for obvious inefficiencies
  - With timing or VTune Amplifier analysis, verify program flow is optimal
  - Watch for cache misses, branch mispredictions, and port underutilization
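The on-the-fly data transpose mentioned above can be pictured with a scalar sketch: eight AOS vertices are gathered into one SOA block, processed, and scattered back. The types `Vertex` and `Vertex8` and both function names are hypothetical, and a tuned AVX version would do the same data movement with shuffle and permute instructions rather than element-by-element copies.

```cpp
#include <cassert>
#include <cstddef>

struct Vertex { float x, y, z; };   // AOS element, as stored in a vertex buffer

// SOA block of 8 vertices: one 32-byte-aligned array per component,
// so each array matches one 256-bit AVX register.
struct Vertex8 {
    alignas(32) float x[8];
    alignas(32) float y[8];
    alignas(32) float z[8];
};

// AOS -> SOA: gather 8 consecutive vertices into per-component arrays.
Vertex8 gather_soa(const Vertex* v) {
    assert(v != nullptr);
    Vertex8 s;
    for (std::size_t k = 0; k < 8; ++k) {
        s.x[k] = v[k].x;
        s.y[k] = v[k].y;
        s.z[k] = v[k].z;
    }
    return s;
}

// SOA -> AOS: write the 8 lanes back as interleaved vertices.
void scatter_aos(const Vertex8& s, Vertex* v) {
    assert(v != nullptr);
    for (std::size_t k = 0; k < 8; ++k) {
        v[k].x = s.x[k];
        v[k].y = s.y[k];
        v[k].z = s.z[k];
    }
}
```

A static SOA layout avoids these two passes entirely; the transpose is the fallback when another consumer, such as a renderer, requires the AOS format.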




# **Resources: Programming with AVX**



For AVX support

- Intel 2<sup>nd</sup> Generation Core family, AMD's upcoming CPU
- Windows 7 SP1
- Visual Studio 2010 SP1
- Intel<sup>®</sup> Composer XE (Intel Compiler 12.0)
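The operating system appears on this list because it must save and restore the 256-bit YMM registers across context switches, so an application should check both the CPU and the OS before taking an AVX code path. The sketch below shows one way to do that at run time; `avx_usable` and `detect_avx` are illustrative names, and the non-x86 fallback simply reports no AVX.

```cpp
#include <cassert>
#include <cstdint>

#if defined(_MSC_VER) && (defined(_M_IX86) || defined(_M_X64))
#include <intrin.h>
#include <immintrin.h>
#elif defined(__GNUC__) && (defined(__i386__) || defined(__x86_64__))
#include <cpuid.h>
#endif

// Pure decision logic. AVX is usable only if:
//   CPUID.1:ECX bit 28 is set (CPU implements AVX),
//   CPUID.1:ECX bit 27 is set (OS has enabled XSAVE/XGETBV),
//   XCR0 bits 1 and 2 are set (OS saves XMM and YMM state).
constexpr bool avx_usable(std::uint32_t cpuid1_ecx, std::uint64_t xcr0) {
    return (cpuid1_ecx & (1u << 27)) != 0 &&
           (cpuid1_ecx & (1u << 28)) != 0 &&
           (xcr0 & 0x6) == 0x6;
}

// Queries the running machine and applies the rule above.
bool detect_avx() {
#if defined(_MSC_VER) && (defined(_M_IX86) || defined(_M_X64))
    int regs[4];
    __cpuid(regs, 1);
    const std::uint32_t ecx = static_cast<std::uint32_t>(regs[2]);
    if ((ecx & (1u << 27)) == 0) return false;   // XGETBV would fault
    return avx_usable(ecx, _xgetbv(0));
#elif defined(__GNUC__) && (defined(__i386__) || defined(__x86_64__))
    unsigned eax, ebx, ecx, edx;
    if (!__get_cpuid(1, &eax, &ebx, &ecx, &edx)) return false;
    if ((ecx & (1u << 27)) == 0) return false;   // XGETBV would fault
    unsigned lo, hi;
    __asm__ volatile("xgetbv" : "=a"(lo), "=d"(hi) : "c"(0u));
    return avx_usable(ecx, (static_cast<std::uint64_t>(hi) << 32) | lo);
#else
    return false;                                // not an x86 build
#endif
}
```

A typical use is to call `detect_avx()` once at startup and select between an AVX build of the hot loops and an SSE or scalar fallback.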








- AVX Cloth demo
- Intel VTune Amplifier XE Performance Profiler




### **Legal Disclaimers**



INFORMATION IN THIS DOCUMENT IS PROVIDED IN CONNECTION WITH INTEL® PRODUCTS. EXCEPT AS PROVIDED IN INTEL'S TERMS AND CONDITIONS OF SALE FOR SUCH PRODUCTS, INTEL ASSUMES NO LIABILITY WHATSOEVER, AND INTEL DISCLAIMS ANY EXPRESS OR IMPLIED WARRANTY RELATING TO SALE AND/OR USE OF INTEL PRODUCTS, INCLUDING LIABILITY OR WARRANTIES RELATING TO FITNESS FOR A PARTICULAR PURPOSE, MERCHANTABILITY, OR INFRINGEMENT OF ANY PATENT, COPYRIGHT, OR OTHER INTELLECTUAL PROPERTY RIGHT. Intel products are not intended for use in medical, life saving, life sustaining, critical control or safety systems, or in nuclear facility applications. Intel Corporation may have patents or pending patent applications, trademarks, copyrights, or other intellectual property rights that relate to the presented subject matter. The furnishing of documents and other materials and information does not provide any license, express or implied, by estoppel or otherwise, to any such patents, trademarks, copyrights, or other intellectual property rights. Intel may make changes to specifications, product descriptions, and plans at any time, without notice. The Intel processor and/or chipset products referenced in this document may contain design defects or errors known as errata which may cause the product to deviate from published specifications. Current characterized errata are available on request. All dates provided are subject to change without notice. All dates specified are target dates, are provided for planning purposes only and are subject to change. Intel and the Intel logo are trademarks or registered trademarks of Intel Corporation or its subsidiaries in the United States and other countries.

\* Other names and brands may be claimed as the property of others.

Copyright © 2010, Intel Corporation. All rights reserved.




#### **Optimization Notice**

Intel compilers, associated libraries and associated development tools may include or utilize options that optimize for instruction sets that are available in both Intel and non-Intel microprocessors (for example SIMD instruction sets), but do not optimize equally for non-Intel microprocessors. In addition, certain compiler options for Intel compilers, including some that are not specific to Intel micro-architecture, are reserved for Intel microprocessors. For a detailed description of Intel compiler options, including the instruction sets and specific microprocessors they implicate, please refer to the "Intel Compiler User and Reference Guides" under "Compiler Options." Many library routines that are part of Intel compiler products are more highly optimized for Intel microprocessors than for other microprocessors. While the compilers and libraries in Intel compiler products offer optimizations for both Intel and Intel-compatible microprocessors, depending on the options you select, your code and other factors, you likely will get extra performance on Intel microprocessors.

Intel compilers, associated libraries and associated development tools may or may not optimize to the same degree for non-Intel microprocessors for optimizations that are not unique to Intel microprocessors. These optimizations include Intel Streaming SIMD Extensions 2 (Intel SSE2), Intel Streaming SIMD Extensions 3 (Intel SSE3), and Supplemental Streaming SIMD Extensions 3 (Intel SSSE3) instruction sets and other optimizations. Intel does not guarantee the availability, functionality, or effectiveness of any optimization on microprocessors not manufactured by Intel. Microprocessor-dependent optimizations in this product are intended for use with Intel microprocessors.

While Intel believes our compilers and libraries are excellent choices to assist in obtaining the best performance on Intel and non-Intel microprocessors, Intel recommends that you evaluate other compilers and libraries to determine which best meet your requirements. We hope to win your business by striving to offer the best performance of any compiler or library; please let us know if you find we do not.

Notice revision #20110228
