CMU 15-418 (Spring 2012) Final Project:
Improving ISPC compilation using profiling output
Nipunn Koorapati

Project Proposal

Checkpoint Report

Final Report

Working Schedule

Week            What I Plan To Do                                             What I've Actually Done
Apr 1-7         Run x86 code using performance counters                       Looked at Agner Fog's code. Ran it.
Apr 8-14        Run/modify aobench example in ISPC                            Got up and running on a Sandy Bridge box
Apr 15-21       Emit x86 code in ISPC in interesting spots. Output results    Ran and worked through ISPCInstrument code
Apr 22-28       Analyze results on various ISPC programs
Apr 29-May 5    Output results in a visual way
May 6-11        Write example ISPC programs to show features

Working Log

3/28: I successfully forked, compiled and built ISPC. That's a win! I also looked up information on how performance counters work. There's a lot of useful information in there.

3/31: Emailed Matt Pharr. Got an idea of the general steps involved. He linked me to Agner Fog's code example. Hopefully that should help.

4/12: Downloaded and dug around Agner Fog's code example for using perf counters. His code is pretty complicated, but it boils down to something simple. The RDPMC instruction in x86-64 reads performance monitoring counters. They can be reset with another instruction. Complexity arises when checking for what architecture is being used, but hopefully I won't need to worry about that.

4/20: Made a commit to ISPC that suggests corrections for typo'd labels. This was to get the hang of the code-emitting classes within ISPC. I used a string-distance function to find similar labels and suggest them (in goto statements). Honestly, I don't think it helped me that much, but it was fun!
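The label-suggestion idea can be sketched roughly as follows. The log does not say which string-distance metric the commit used, so this assumes plain Levenshtein edit distance; `editDistance`, `suggestLabels`, and the `maxDistance` cutoff are hypothetical names for illustration, not the ones in ISPC.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Levenshtein edit distance: the minimum number of single-character
// insertions, deletions, and substitutions needed to turn a into b.
// Assumption: the actual commit may have used a different metric.
std::size_t editDistance(const std::string &a, const std::string &b) {
    // prev holds the previous row of the DP table, cur the row being filled.
    std::vector<std::size_t> prev(b.size() + 1), cur(b.size() + 1);
    for (std::size_t j = 0; j <= b.size(); ++j) prev[j] = j;
    for (std::size_t i = 1; i <= a.size(); ++i) {
        cur[0] = i;
        for (std::size_t j = 1; j <= b.size(); ++j) {
            std::size_t subst = prev[j - 1] + (a[i - 1] == b[j - 1] ? 0 : 1);
            cur[j] = std::min({prev[j] + 1, cur[j - 1] + 1, subst});
        }
        std::swap(prev, cur);
    }
    // Sanity check: the distance never exceeds the longer string's length.
    assert(prev[b.size()] <= std::max(a.size(), b.size()));
    return prev[b.size()];
}

// Hypothetical helper: return every known label within maxDistance edits
// of the misspelled one, in the order the labels were given.
std::vector<std::string> suggestLabels(const std::string &typo,
                                       const std::vector<std::string> &known,
                                       std::size_t maxDistance) {
    std::vector<std::string> suggestions;
    for (const auto &label : known) {
        if (editDistance(typo, label) <= maxDistance) suggestions.push_back(label);
    }
    return suggestions;
}
```

For example, calling `suggestLabels("lop_end", {"loop_end", "cleanup", "loop_start"}, 2)` would keep only `loop_end`, which is one insertion away from the typo.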

4/21: Ran the aobench_instrumented code and looked at the underlying ISPC code. I also ran the sqrt code from asst1 with --instrumented. It runs WAY WAY slower with instrumentation on. Like 100x slower. Hardware performance counters aren't going to be particularly useful with slowdowns like that. It's possible that the slowdown is due to the function call, and once that is replaced with an inline hardware counter increment, it won't be nearly as slow. The current instrumentation code returns the mask, so we can already look at lane active percentage. IPC would be an interesting extra stat though.
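Here is a minimal sketch of how lane-active percentage could be derived from the mask. It assumes the instrumentation hook hands over the execution mask as an integer bitmask with one bit per lane; `LaneStats` and its members are hypothetical names, not part of ISPC.

```cpp
#include <bitset>
#include <cassert>
#include <cstdint>

// Hypothetical accumulator for execution masks seen at one program point.
// Assumption: bit i of the mask is set when SIMD lane i is active.
struct LaneStats {
    int width;                      // number of SIMD lanes (4 for SSE2, 8 for AVX)
    std::uint64_t calls = 0;        // how many masks were recorded
    std::uint64_t activeLanes = 0;  // total active lanes over all calls
    std::uint64_t allOnCalls = 0;   // calls where every lane was active

    explicit LaneStats(int w) : width(w) { assert(w > 0 && w <= 64); }

    void record(std::uint64_t mask) {
        const std::uint64_t full =
            (width == 64) ? ~std::uint64_t{0} : ((std::uint64_t{1} << width) - 1);
        mask &= full;  // ignore bits above the lane count
        ++calls;
        activeLanes += std::bitset<64>(mask).count();
        if (mask == full) ++allOnCalls;
    }

    // Percentage of lane slots that were doing useful work.
    double activePct() const {
        if (calls == 0) return 0.0;
        return 100.0 * static_cast<double>(activeLanes) /
               (static_cast<double>(calls) * width);
    }

    // Percentage of recorded masks in which all lanes were on.
    double allOnPct() const {
        if (calls == 0) return 0.0;
        return 100.0 * static_cast<double>(allOnCalls) / static_cast<double>(calls);
    }
};
```

With a 4-wide target, recording the masks `0xF`, `0x3`, `0x0`, `0xF` gives 10 active lanes out of 16 slots and 2 all-on masks out of 4. Keeping the two percentages separate matters because a loop can have high lane utilization overall while rarely being fully coherent.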

4/24: Just realized why Agner Fog's code was so complex. The RDPMC and RDMSR instructions (for reading performance counters and machine-specific registers) are privileged instructions! He wrote a driver for accessing these registers. Furthermore, if you put bad register values into the driver, the driver will crash (AKA blue screen), so he wrapped the driver in a safe way. My guess is that he didn't want to make the driver robust for performance reasons. I'll send him an email.

5/4: For the last week or so, I've been wrangling with the ideas from my optimization results and use of hardware counters. I've decided that getting hardware counters into ISPC is not really worth it. The important stats that I really want to see are accessible through ISPCInstrument without using hardware counters. Dealing with the drivers/elevated privilege instructions is just not worth it.

Here's my new strategy. I'm going to use instrumentation stats to measure the divergence in various parts of a program and output it to a file. Then, I'm going to add a compiler option that takes the profiling run as an additional input and uses it to optimize. This will use techniques similar to the cif, cwhile, and cfor constructs to provide hints to the compiler.
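The profile-then-recompile idea might look something like the sketch below. Everything here is an assumption made for illustration: the one-pair-per-line profile format, the loop identifiers, the `parseProfile` and `chooseLoop` names, and the threshold parameter are hypothetical and do not describe what ISPC or this project actually emits.

```cpp
#include <cassert>
#include <map>
#include <sstream>
#include <string>

enum class LoopKind { While, CWhile };

// Hypothetical profile format: one "<loop-id> <allOnPct>" pair per line,
// where allOnPct is the percentage of recorded masks with every lane active.
std::map<std::string, double> parseProfile(const std::string &text) {
    std::map<std::string, double> profile;
    std::istringstream in(text);
    std::string id;
    double pct;
    while (in >> id >> pct) profile[id] = pct;
    return profile;
}

// Pick the coherent variant only when the profiled loop was all-on often
// enough. thresholdPct is a tunable assumption, not a value from the project.
LoopKind chooseLoop(const std::map<std::string, double> &profile,
                    const std::string &loopId, double thresholdPct) {
    assert(thresholdPct >= 0.0 && thresholdPct <= 100.0);
    auto it = profile.find(loopId);
    // No profile data for this loop: keep the plain while the programmer wrote.
    if (it == profile.end()) return LoopKind::While;
    return it->second >= thresholdPct ? LoopKind::CWhile : LoopKind::While;
}
```

Falling back to the plain loop when a loop has no profile entry keeps the compiler's behavior unchanged for code the profiling run never reached.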

Using the sqrt program I measured the following room for improvement. These are measured on my laptop using SSE2 instructions (4-wide SIMD):
Random input + while: 2.20x speedup
Random input + cwhile: 1.90x speedup
Coherent input + while: 2.09x speedup
Coherent input + cwhile: 2.80x speedup
This looks like there's sufficient room for optimization! Now to get to it!

5/8: I've got a working cwhile/while profiler and added code to ISPC that uses the profiling output to help decide whether or not to use cwhile. I have some metrics on how useful cwhile is versus while, relative to the percentage of lane sets that are all-on (allOnPct). As the tables below show, cwhile is much more useful when allOnPct is high. The improvement column is the while time divided by the cwhile time, so values above 1 mean cwhile is faster.

On 4-lane SIMD using SSE2 (my Core 2 Duo laptop):

allOnPct    serial      while           cwhile          improvement
43.93%      84.1ms      38.6ms(2.18x)   44.6ms(1.88x)   .86 
58.24%      78.5ms      31.7ms(2.48x)   36.4ms(2.16x)   .87 
69.46%      61.7ms      24.6ms(2.51x)   25.4ms(2.44x)   .97 
82.88%      43.1ms      18.3ms(2.35x)   17.7ms(2.43x)   1.03
86.21%      31.7ms      17.0ms(1.87x)   14.2ms(2.23x)   1.19
95.18%      28.2ms      14.2ms(1.98x)   11.2ms(2.51x)   1.27
100.00%     26.7ms      13.1ms(2.04x)    9.6ms(2.78x)   1.36

On a Core i7 using AVX (cranked up the problem size to make the numbers reasonable):

allOnPct    serial      while           cwhile          improvement
49.25%      206.2ms     33.6ms(6.13x)   34.6ms(5.95x)   .97 
61.13%      155.9ms     24.2ms(6.44x)   25.6ms(6.09x)   .95 
73.98%       99.1ms     18.0ms(5.51x)   19.4ms(5.10x)   .93 
81.24%       71.1ms     16.0ms(4.44x)   15.8ms(4.51x)   1.02
90.91%       61.1ms     13.9ms(4.38x)   13.1ms(4.66x)   1.06
100.00%      57.0ms      9.9ms(5.76x)    9.1ms(6.28x)   1.09