Popcount

I just finished working on "popcount", the classic problem of counting the number of 1 bits in a machine word. I tell the story of what I did and found at BartForge. Friend of Bart

Comment viewing options

Select your preferred way to display the comments and click "Save settings" to activate your changes.

Is simple faster?

Back when I was playing with this (at Reed!), the fastest approach I found was to create a 256-entry table of the number of ones in a byte, and then just add them all up. The shifts, masks, and loads can all be performed in parallel, the adding up of the ones for each byte is the only data dependency. 'course, I don't think the word "superscalar" existed when I was playing with this, so YMMV.

Table-driven popcount

I added 8 and 16 bit table driven popcount to the benchmarks. These perform the fastest in the benchmark test-stand, at about 3ns/iteration. They're about the same speed, so one would obviously use the 8-bit version.

The problem with this is that I don't believe it. It's amazing that Intel's L1 cache is so cheap. However, one can expect that the table lookup would occasionally miss this small cache during normal usage, which would be devastating to the performance of this approach. The tight loop here is guaranteeing us that the table becomes locked in cache. Maybe when I get some more energy I'll create a more realistic testbench, and we'll see how the table-driven algorithms compare to the pure-computational ones in 2008.

Thanks much for the suggestion!

Don't mask, cast

BTW, I didn't mask my table indices, I used a cast. This should be faster on the x86 with a bad compiler, and no slower anywhere.