|Week||What I Plan To Do||What I've Actually Done|
|Apr 1-7||Run x86 code using performance counters||Looked at Agner Fog's code. Ran it.|
|Apr 8-14||Run/modify aobench example in ISPC||Got up and running on a Sandy Bridge box|
|Apr 15-21||Emit x86 code in ISPC in interesting spots. Output results||Ran and worked through ISPCInstrument code|
|Apr 22-28||Analyze results on various ISPC programs|| |
|Apr 29-May 5||Output results in a visual way|| |
|May 6-11||Write example ISPC programs to show features|| |
3/28: I successfully forked, compiled, and built ISPC. That's a win! I also looked up how performance counters work; there's a lot of useful information out there.
3/31: Emailed Matt Pharr and got an idea of the general steps involved. He linked me to Agner Fog's code example; hopefully that will help.
4/12: Downloaded and dug around in Agner Fog's code example for using perf counters. His code is pretty complicated, but it boils down to something simple: the RDPMC instruction in x86-64 reads performance monitoring counters, and they can be reset with another instruction. Complexity arises from checking which architecture is being used, but hopefully I won't need to worry about that.
4/20: Made a commit to ISPC that adds suggestions for typo'd labels. This was to get the hang of the code-emitting classes within ISPC. I used a string-distance function to find similar labels and suggest them (in goto statements). Honestly, I don't think it helped me that much, but it was fun!
4/21: Ran the aobench_instrumented code and looked at the underlying ispc code. I also ran the sqrt code from asst1 with --instrumented. It runs WAY WAY slower with instrumentation on. Like 100x slower. Hardware performance counters aren't going to be particularly useful with slowdowns like that. It's possible that the slowdown is due to the function-call overhead, and that once it's replaced with an inline hardware counter increment it won't be nearly as slow. The current instrumentation code passes the mask, so we can already look at lane-active percentage. IPC would be an interesting extra stat, though.
4/24: Just realized why Agner Fog's code was so complex. The RDPMC and RDMSR instructions (for reading performance counters and model-specific registers) are privileged instructions! He wrote a driver for accessing these registers. Furthermore, if you feed bad register values to the driver, it will crash the machine (AKA blue screen), so he wrapped the driver in a safe way. My guess is that he didn't want to make the driver itself robust for performance reasons. I'll send him an email.
5/4: For the last week or so, I've been wrangling with the ideas from my optimization results and use of hardware counters. I've decided that getting hardware counters into ISPC is not really worth it. The important stats that I really want to see are accessible through ISPCInstrument without using hardware counters. Dealing with the drivers/elevated privilege instructions is just not worth it.
Here's my new strategy: use the instrumentation stats to measure divergence in various parts of a program and write it out to a file, then add a compiler option that takes that profiling run as an additional input to guide optimization. This will use techniques similar to ISPC's cif, cwhile, and cfor constructs to provide hints to the compiler.
Using the sqrt program, I measured the following room for improvement. These numbers are from my laptop using SSE2 instructions (4-wide SIMD):
Random input + while: 2.20x speedup
Random input + cwhile: 1.90x speedup
Coherent input + while: 2.09x speedup
Coherent input + cwhile: 2.80x speedup
It looks like there's plenty of room for optimization! Now to get to it!
5/8: I've got a working cwhile/while profiler and added code to ISPC that uses the profiling output when deciding whether or not to use cwhile. I have some metrics on how useful cwhile vs. while is relative to the percentage of lane sets that are all-on. As you can see, cwhile is much more useful when allOnPct is high.
On 4-lane SIMD using SSE2 (my Core 2 Duo laptop):

 allOnPct   serial     while          cwhile         improvement
 -------------------------------------------------------------------
  43.93%    84.1ms     38.6ms(2.18x)  44.6ms(1.88x)    .86
  58.24%    78.5ms     31.7ms(2.48x)  36.4ms(2.16x)    .87
  69.46%    61.7ms     24.6ms(2.51x)  25.4ms(2.44x)    .97
  82.88%    43.1ms     18.3ms(2.35x)  17.7ms(2.43x)   1.03
  86.21%    31.7ms     17.0ms(1.87x)  14.2ms(2.23x)   1.19
  95.18%    28.2ms     14.2ms(1.98x)  11.2ms(2.51x)   1.27
 100.00%    26.7ms     13.1ms(2.04x)   9.6ms(2.78x)   1.36

On a Core i7 using AVX (cranked up the problem size to make the numbers reasonable):

 allOnPct   serial     while          cwhile         improvement
 -------------------------------------------------------------------
  49.25%   206.2ms     33.6ms(6.13x)  34.6ms(5.95x)    .97
  61.13%   155.9ms     24.2ms(6.44x)  25.6ms(6.09x)    .95
  73.98%    99.1ms     18.0ms(5.51x)  19.4ms(5.10x)    .93
  81.24%    71.1ms     16.0ms(4.44x)  15.8ms(4.51x)   1.02
  90.91%    61.1ms     13.9ms(4.38x)  13.1ms(4.66x)   1.06
 100.00%    57.0ms      9.9ms(5.76x)   9.1ms(6.28x)   1.09