Count 1-bits in an integer as fast as GCC __builtin_popcount(int)

#1
I wrote an algorithm (taken from "The C Programming Language") that counts the number of 1-bits very quickly:

int countBit1Fast(int n)
{
    int c = 0;
    for (; n; ++c)
        n &= n - 1;  /* clear the lowest set bit */
    return c;
}

But a friend told me that `__builtin_popcount(int)` is a lot faster, though less portable. I gave it a try and it was MANY times faster! Why is it so fast? I want to count bits as fast as possible, but without being tied to a particular compiler.

**EDIT:** I may use it on PIC microcontrollers, and maybe on non-Intel processors, so I need maximum portability.
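One common way to get both (a sketch, not from the thread; the `POPCOUNT` macro and `popcount_portable` name are my own) is to select the compiler builtin at compile time where it is known to exist, and fall back to the K&R loop everywhere else:

```c
/* Portable selection sketch: prefer the builtin on GCC/Clang,
 * otherwise fall back to the loop from the question. */
static int popcount_portable(unsigned int n)
{
    int c = 0;
    for (; n; ++c)
        n &= n - 1;   /* clears the lowest set bit each pass */
    return c;
}

#if defined(__GNUC__) || defined(__clang__)
#  define POPCOUNT(n) __builtin_popcount(n)
#else
#  define POPCOUNT(n) popcount_portable(n)
#endif
```

This keeps a single call site in your code while letting each compiler use its fastest available implementation.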
Reply

#2
`__builtin_popcount(unsigned int)` is so fast because it is a GCC extension that maps directly to a hardware instruction. If you are willing to trade architecture portability for compiler portability, look into the just-as-fast Intel intrinsic functions, specifically:

int _mm_popcnt_u32(unsigned __int32 a);
__int64 _mm_popcnt_u64(unsigned __int64 a);

You must then include the `<nmmintrin.h>` header to use these intrinsics, but they work with non-GCC compilers. You may also have to tell the compiler to target an architecture with the POPCNT instruction (for example `-march=native` or `-msse4.2`) for the intrinsic to compile down to that single instruction.
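A guarded wrapper makes this concrete (a sketch; the `popcnt32` name and the fallback loop are my additions, so the code still compiles when SSE4.2 is not enabled):

```c
#include <stdint.h>

#if defined(__SSE4_2__) || defined(__POPCNT__)
#include <nmmintrin.h>   /* declares _mm_popcnt_u32 (SSE4.2) */
static int popcnt32(uint32_t v)
{
    return _mm_popcnt_u32(v);   /* single POPCNT instruction */
}
#else
/* Fallback when POPCNT is not enabled at compile time. */
static int popcnt32(uint32_t v)
{
    int c = 0;
    for (; v; ++c)
        v &= v - 1;
    return c;
}
#endif
```

The `#ifdef` guard keys off macros the compiler defines when the target supports the instruction, so the same source builds everywhere.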
Reply

#3
As others have mentioned, `__builtin_popcount()` is fast because it compiles to a single x86 instruction.

If you want something faster than what you have that doesn't use anything processor- or compiler-specific, you can create a lookup table with 256 entries:

static const unsigned char bitcount[256] = {
0, 1, 1, 2, 1, 2, 2, 3, 1, 2, 2, 3, 2, 3, 3, 4,
1, 2, 2, 3, 2, 3, 3, 4, 2, 3, 3, 4, 3, 4, 4, 5,
1, 2, 2, 3, 2, 3, 3, 4, 2, 3, 3, 4, 3, 4, 4, 5,
2, 3, 3, 4, 3, 4, 4, 5, 3, 4, 4, 5, 4, 5, 5, 6,
1, 2, 2, 3, 2, 3, 3, 4, 2, 3, 3, 4, 3, 4, 4, 5,
2, 3, 3, 4, 3, 4, 4, 5, 3, 4, 4, 5, 4, 5, 5, 6,
2, 3, 3, 4, 3, 4, 4, 5, 3, 4, 4, 5, 4, 5, 5, 6,
3, 4, 4, 5, 4, 5, 5, 6, 4, 5, 5, 6, 5, 6, 6, 7,
1, 2, 2, 3, 2, 3, 3, 4, 2, 3, 3, 4, 3, 4, 4, 5,
2, 3, 3, 4, 3, 4, 4, 5, 3, 4, 4, 5, 4, 5, 5, 6,
2, 3, 3, 4, 3, 4, 4, 5, 3, 4, 4, 5, 4, 5, 5, 6,
3, 4, 4, 5, 4, 5, 5, 6, 4, 5, 5, 6, 5, 6, 6, 7,
2, 3, 3, 4, 3, 4, 4, 5, 3, 4, 4, 5, 4, 5, 5, 6,
3, 4, 4, 5, 4, 5, 5, 6, 4, 5, 5, 6, 5, 6, 6, 7,
3, 4, 4, 5, 4, 5, 5, 6, 4, 5, 5, 6, 5, 6, 6, 7,
4, 5, 5, 6, 5, 6, 6, 7, 5, 6, 6, 7, 6, 7, 7, 8,
};

Then use that to get the bit count of each byte:

int countBit1Fast(int n)
{
    size_t i;
    int count = 0;
    unsigned char *ptr = (unsigned char *)&n;
    for (i = 0; i < sizeof(int); i++)
        count += bitcount[ptr[i]];
    return count;
}
Reply

#4
As others have mentioned, on x86_64 you have a popcount CPU instruction that will thoroughly trounce any software implementation.

In the absence of a CPU popcount instruction, which method is fastest depends on word size, lookup speed (which may depend on CPU cache behaviour), and the effectiveness of super-scalar pipelining.

The simple approach of taking each byte, looking it up in a table, and summing the results is quite quick, taking about `ceil(num_bits/8)*3-1` operations, depending on how the "array fetch" works.

There's another less well known method that works by grouping bits into runs, then repeatedly creating half as many runs that are twice the size as before, where each run contains the sum of two previous runs.

This algorithm takes 4×log₂(num_bits)−1 steps, which means it performs comparatively poorly for small integer sizes but improves for larger ones:

| int size (bits) | ops (lookup) | ops (run-add) |
| - | - | - |
| 8 | 2 | 11 |
| 16 | 5 | 15 |
| 32 | 11 | 19 |
| 64 | 23 | 23 |
| 128 | 47 | 27 |
| 256 | 95 | 31 |

Initially every bit is its own run; then you take pairs of adjacent runs and add them together, so each result is a number from 0 to 2 inclusive, which conveniently fits in a 2-bit run:
```C
x = (x >> 1 & 0x55555555555555555555555555555555)
+(x & 0x55555555555555555555555555555555);
```
Now every pair of bits contains a number from 0 to 2, indicating how many bits used to be set in that pair.
The subsequent steps are then fairly straightforward: combine adjacent runs into new runs that are twice the width:
```C
x = (x >> 2 & 0x33333333333333333333333333333333)
+(x & 0x33333333333333333333333333333333);
```
Now each run of 4 bits contains a number from 0 to 4. Since those numbers fit in 3 bits, the top bit of each run will always be 0, and doesn't need to be included in the mask.
```C
x = (x >> 4 & 0x07070707070707070707070707070707)
+(x & 0x07070707070707070707070707070707);
```
Now each run of 8 bits contains a number from 0 to 8. Since those numbers fit in 4 bits, the top 12 bits of each run will always be 0, and don't need to be included in the mask.
```C
x = (x >> 8 & 0x000f000f000f000f000f000f000f000f)
+(x & 0x000f000f000f000f000f000f000f000f);
```
Now each run of 16 bits contains a number from 0 to 16. Since those numbers fit in 5 bits, the top 27 bits of each run will always be 0, and don't need to be included in the mask.
```C
x = (x >>16 & 0x0000001f0000001f0000001f0000001f)
+(x & 0x0000001f0000001f0000001f0000001f);
```
Now each run of 32 bits contains a number from 0 to 32. Since those numbers fit in 6 bits, the top 58 bits of each run will always be 0, and don't need to be included in the mask.
```C
x = (x >>32 & 0x000000000000003f000000000000003f)
+(x & 0x000000000000003f000000000000003f);
```
Now each run of 64 bits contains a number from 0 to 64. Since those numbers fit in 7 bits, the top 121 bits of each run will always be 0, and don't need to be included in the mask.
```C
x = (x >>64 & 0x0000000000000000000000000000007f)
+(x & 0x0000000000000000000000000000007f);
```

In general, for step `i`, pre-compute
```C
w0 = 1 << i;       /* number of bits per run for THIS cycle */
w1 = 1 << (i + 1); /* number of bits per run for NEXT cycle */
r1 = w1 - 1;       /* mask for a number from 0 .. w0 inclusive */

/* Create a pattern of bits with a 1 every w1 bits: */
m1 = 1 << w1;
m3 = UINTMAX / (m1 - 1); /* UINTMAX: the all-ones value of the word type */

m4 = m3 * r1;

shift[i] = w0;
mask[i] = m4;

/* for the variant below */
m0 = 1 << w0;
s_mult[i] = m0 - 1;
```
and then for each step use:
```C
x = (x >> shift[i] & mask[i])
+(x & mask[i]);
```
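For concreteness, here is a 32-bit instantiation of that table-driven loop (a sketch; the `popcount_runs` name is mine, and the array values follow the recipe above with UINTMAX taken as the all-ones 32-bit value):

```c
#include <stdint.h>

/* 32-bit run-adding popcount with precomputed shift/mask tables.
 * mask[i] was computed as (UINT32_MAX / ((1u << (1 << (i+1))) - 1))
 * times ((1 << (1 << i + 1)) - 1), per the recipe above. */
static int popcount_runs(uint32_t x)
{
    static const int      shift[5] = { 1, 2, 4, 8, 16 };
    static const uint32_t mask[5]  = {
        0x55555555u, 0x33333333u, 0x07070707u,
        0x000f000fu, 0x0000001fu
    };
    int i;
    for (i = 0; i < 5; i++)
        x = (x >> shift[i] & mask[i]) + (x & mask[i]);
    return (int)x;
}
```

Unrolling the loop (as in the steps above) lets the compiler fold each mask into an immediate operand rather than a table load.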

Depending on how fast your CPU can do multiplication, this might make better use of pipelining:
```C
x -= x >> 1 & 0x55555555555555555555555555555555;
x -= (x >> 2 & 0x33333333333333333333333333333333) * 3;
x -= (x >> 4 & 0x07070707070707070707070707070707) * 0xf;
x -= (x >> 8 & 0x000f000f000f000f000f000f000f000f) * 0xff;
x -= (x >>16 & 0x0000001f0000001f0000001f0000001f) * 0xffff;
x -= (x >>32 & 0x000000000000003f000000000000003f) * 0xffffffff;
x -= (x >>64 & 0x0000000000000000000000000000007f) * 0xffffffffffffffff;

x -= (x >> shift[i] & mask[i]) * s_mult[i]; /* generic step */
```
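For 32-bit words, this subtract/multiply idea is usually condensed into the classic three-step form below (standard bit-twiddling, not verbatim from this post; the `popcount32_mul` name is mine), where a final multiply sums the per-byte counts into the top byte:

```c
#include <stdint.h>

static int popcount32_mul(uint32_t x)
{
    x -= (x >> 1) & 0x55555555u;                       /* pairs hold 0..2   */
    x = (x & 0x33333333u) + ((x >> 2) & 0x33333333u);  /* nibbles hold 0..4 */
    x = (x + (x >> 4)) & 0x0f0f0f0fu;                  /* bytes hold 0..8   */
    return (int)((x * 0x01010101u) >> 24);             /* add the 4 bytes   */
}
```

The multiply replaces the last two shift-and-add rounds with one instruction, which is a win on any CPU with a fast multiplier.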
Reply

#5
> I wrote an algorithm (taken from "The C Programming Language") that counts the number of 1-bits very quickly:

I don't see why anyone would characterize your approach as "very fast". It's a bit clever, and it should be faster on average than naive alternatives. It also does not depend on the width of the representation of `int`, which is a plus. I observe that it has undefined behavior for negative arguments, but that's a common theme for bitwise operators and functions.

Let's analyze, supposing a non-negative argument:

int c = 0;
for (; n; ++c)
    n &= n - 1;

- How many loop iterations are performed?

1 for each 1 bit in the binary representation of the value, irrespective of *where* in the value each bit lies

- How much work is performed per iteration?

- one increment of `c`
- one comparison of `n` against zero (plus one more of these when breaking out of the loop)
- one decrement of `n` by 1
- one bitwise 'and'

That ignores reads and stores, which very likely can be made free or especially cheap by keeping the operands in registers. If we assume equal cost for each of those, that's four operations per iteration. For random 32-bit integers, there will be an average of 16 iterations, for a total of **65 operations on average**. (Best case is just one operation, but worst is 129, which is no better than a naive implementation).

`__builtin_popcount()`, on the other hand, uses **a single instruction** regardless of input on platforms that support it, such as yours very likely is. Even on those that don't have a for-purpose instruction, however, it can be done faster (on average).

@dbush has presented one such mechanism that has similar advantages to the one you present. In particular, it does not depend on a pre-chosen integer width, and although it does depend on *where* in the representation the 1 bits reside, it does run faster for some arguments (smaller ones) than others. If I'm counting right, that one will average **around 20 operations** on random 32-bit inputs: five in each of four loop iterations (only 0.4% of random inputs would require fewer than four iterations). I'm counting one table read per iteration there, which I assume can be served from cache, but which is probably still not as fast as an arithmetic operation on values already held in registers.

One that is strictly computational would be:

int countBit1Fast(uint32_t n) {
    n = (n & 0x55555555u) + ((n >> 1) & 0x55555555u);
    n = (n & 0x33333333u) + ((n >> 2) & 0x33333333u);
    n = (n & 0x0f0f0f0fu) + ((n >> 4) & 0x0f0f0f0fu);
    n = (n & 0x00ff00ffu) + ((n >> 8) & 0x00ff00ffu);
    n = (n & 0x0000ffffu) + ((n >> 16) & 0x0000ffffu);
    return (int)n;
}

That's pretty easy to count: five additions, five shifts, ten bitwise 'and' operations, and five constant loads, for a total of **25 operations** for every input (rising only to 30 for 64-bit inputs, though those become 64-bit operations instead of 32-bit ones). This version is, however, intrinsically tied to a particular width of input type.
Reply


