
The connection between bit operations and nginx performance

王林 (reposted) · 2021-01-12 09:53:29 · 2214 views

nginx is known for its high performance, and much of that comes down to how its source code is written. This article looks at the connection between bit operations and that performance.


Bit operations appear throughout the Nginx source code: in defining a directive's type (how many arguments it takes and which configuration blocks it may appear in), in flagging whether the current request still has unsent data, and in the event module, where the lowest bit of a pointer marks whether an event is stale. All of these show how useful bit operations can be.

This article analyzes some classic bit operations in the Nginx source code and then introduces a few other bit-operation techniques.

Alignment

When Nginx allocates memory internally, it pays close attention to the alignment of the starting address, that is, memory alignment, which can improve performance. This is related to how the processor addresses memory. For example, some processors address memory in 4-byte units. On such a machine, suppose we need to read the 4 bytes starting at 0x46b1e7. Because 0x46b1e7 is not on a 4-byte boundary (0x46b1e7 % 4 = 3), the read takes two accesses: the first reads the 4 bytes starting at 0x46b1e4 and keeps only the last byte (the one at 0x46b1e7); the second reads the 4 bytes starting at 0x46b1e8 and keeps the first 3 bytes. Main memory is much slower than the CPU, so two reads clearly cost more than one: they can stall instructions, raise CPI (cycles per instruction), and hurt application performance.
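The arithmetic in the example above can be checked with a small sketch. The helper name `words_touched` is hypothetical, and the model (one access per aligned word) is a simplification of what real hardware does.

```c
#include <assert.h>
#include <stdint.h>

/* Number of aligned `width`-byte words touched when reading `len` bytes
 * starting at `addr`. Assumption: `width` is a power of two. */
unsigned words_touched(uintptr_t addr, unsigned len, unsigned width)
{
    assert(len > 0 && width > 0 && (width & (width - 1)) == 0);

    uintptr_t mask  = ~(uintptr_t)(width - 1);
    uintptr_t first = addr & mask;              /* word holding the first byte */
    uintptr_t last  = (addr + len - 1) & mask;  /* word holding the last byte */

    return (unsigned)((last - first) / width) + 1;
}
```

Under this model, a 4-byte read at 0x46b1e7 touches two words (0x46b1e4 and 0x46b1e8), while the same read at 0x46b1e8 touches only one.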

So Nginx encapsulates a macro specifically for alignment operations.

#define ngx_align(d, a)     (((d) + (a - 1)) & ~(a - 1))

This macro rounds d up to a multiple of a, where a must be a power of 2.

For example, when d is 17 and a is 2, you get 18; when d is 15 and a is 4, you get 16; when d is 16 and a is 4, you get 16.

The macro finds the first multiple of a that is greater than or equal to d. Since a is a power of 2, its binary representation has the form 00...1...00, with exactly one 1, so a - 1 has the form 00...01...1. Then ~(a - 1) has its low n bits all 0, where n is the number of consecutive 0s in the low bits of a. A bitwise AND of d with ~(a - 1) therefore clears the low n bits of d, which rounds d down to a multiple of a. Because we want a number greater than or equal to d, we add a - 1 to d first, giving (d + (a - 1)) & ~(a - 1).
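To make the two steps concrete, here is a minimal sketch that splits the macro into separate statements. The function `align_up` is a hypothetical helper written for illustration, not part of Nginx; the macro is copied from the text above.

```c
#include <assert.h>
#include <stddef.h>

/* Macro as quoted in the article. */
#define ngx_align(d, a)     (((d) + (a - 1)) & ~(a - 1))

/* Hypothetical helper: the same computation, one step per line. */
size_t align_up(size_t d, size_t a)
{
    assert(a != 0 && (a & (a - 1)) == 0);   /* a must be a power of two */

    size_t mask   = a - 1;      /* e.g. a = 8 gives binary 0111 */
    size_t bumped = d + mask;   /* reaches or passes the next multiple of a,
                                   unless d is already a multiple */
    return bumped & ~mask;      /* clear the low bits to land on the multiple */
}
```

With the values used in the text, `align_up(17, 2)` gives 18, `align_up(15, 4)` gives 16, and `align_up(16, 4)` stays at 16.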

Bitmap

A bitmap is usually used to mark the state of things. The "bit" part means that each item is marked with a single bit, which saves memory and improves performance.

Nginx uses bitmaps in many places, such as its shared memory allocator (slab). Another example is URI (Uniform Resource Identifier) escaping, where Nginx must decide whether a character is reserved (or unsafe); such characters have to be escaped as %XX.

static uint32_t   uri_component[] = {
        0xffffffff, /* 1111 1111 1111 1111  1111 1111 1111 1111 */

/* ?>=< ;:98 7654 3210  /.-, +*)( '&%$ #"!  */
        0xfc009fff, /* 1111 1100 0000 0000  1001 1111 1111 1111 */

/* _^]\ [ZYX WVUT SRQP  ONML KJIH GFED CBA@ */
        0x78000001, /* 0111 1000 0000 0000  0000 0000 0000 0001 */

/*  ~}| {zyx wvut srqp  onml kjih gfed cba` */
        0xb8000001, /* 1011 1000 0000 0000  0000 0000 0000 0001 */

        0xffffffff, /* 1111 1111 1111 1111  1111 1111 1111 1111 */
        0xffffffff, /* 1111 1111 1111 1111  1111 1111 1111 1111 */
        0xffffffff, /* 1111 1111 1111 1111  1111 1111 1111 1111 */
        0xffffffff  /* 1111 1111 1111 1111  1111 1111 1111 1111 */
    };

As shown above, a simple array forms the bitmap. It contains 8 numbers, each holding 32 states, so the bitmap covers 256 characters (including extended ASCII). A 0 bit marks an ordinary character that needs no escaping; a 1 bit marks a character that must be escaped.
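One way to see what the table encodes is to rebuild it from a rule. The sketch below assumes that the only bytes left unescaped are ASCII letters, digits, and the four characters - . _ ~ (this rule is inferred from the table's bit patterns, not stated in the source); `build_escape_map` is a hypothetical helper.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Fill `map` (8 x 32 bits = 256 flags) so that a set bit means
 * "escape this byte". Assumption: only ASCII letters, digits and
 * - . _ ~ are left unescaped. */
void build_escape_map(uint32_t map[8])
{
    assert(map != NULL);

    for (int i = 0; i < 8; i++) {
        map[i] = 0xffffffffu;               /* start with every byte marked */
    }

    for (unsigned ch = 0; ch < 128; ch++) {
        int safe = (ch >= '0' && ch <= '9')
                || (ch >= 'A' && ch <= 'Z')
                || (ch >= 'a' && ch <= 'z')
                || ch == '-' || ch == '.' || ch == '_' || ch == '~';

        if (safe) {
            /* clear the bit for this character */
            map[ch >> 5] &= ~(UINT32_C(1) << (ch & 0x1f));
        }
    }
}
```

Under that assumption, the generated words match the eight constants in the table above.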

So how to use this bitmap? When Nginx traverses the uri, it makes a judgment through a simple statement.

uri_component[ch >> 5] & (1U << (ch & 0x1f))

Here ch is the current character. ch >> 5 shifts ch right by 5 bits, which is the same as dividing by 32; this selects which of the numbers in uri_component holds ch. On the right, (ch & 0x1f) extracts the low 5 bits of ch, which is the same as taking ch modulo 32; this gives the bit position of ch within that number (counting from the low bit). ANDing the two sides therefore extracts the bitmap state for ch. For example, if ch is '0' (decimal 48), it lives in the 2nd number of the bitmap (48 >> 5 = 1) at bit 16 of that number (0xfc009fff), so its state is 0xfc009fff & 0x10000 = 0. '0' is therefore an ordinary character and does not need to be escaped.
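The lookup can be wrapped in a small function and used, for instance, to size an output buffer before escaping. This is only a sketch: `needs_escape` and `escaped_len` are hypothetical names, and the table is the one shown earlier without its comments.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

static const uint32_t uri_component[8] = {
    0xffffffff, 0xfc009fff, 0x78000001, 0xb8000001,
    0xffffffff, 0xffffffff, 0xffffffff, 0xffffffff
};

/* Nonzero if `ch` must be written as %XX. */
int needs_escape(unsigned char ch)
{
    return (uri_component[ch >> 5] & (1U << (ch & 0x1f))) != 0;
}

/* Bytes needed for the escaped form of the `len` bytes at `s`:
 * each escaped byte grows from 1 byte to 3 ("%XX"). */
size_t escaped_len(const unsigned char *s, size_t len)
{
    assert(s != NULL || len == 0);

    size_t n = len;
    for (size_t i = 0; i < len; i++) {
        if (needs_escape(s[i])) {
            n += 2;
        }
    }
    return n;
}
```

For the 5-byte input "a b/c", the space and the slash are flagged, so the escaped form needs 9 bytes.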

The example above also shows another technique: division or modulo by a power of 2 can be done with bit operations, which is cheaper than the general division and modulo instructions. With a suitable optimization level, the compiler may well perform this optimization for us.
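As a small illustration of the equivalence, the sketch below splits a bitmap index into a word number and a bit number using a shift and a mask. The name `split_index` is hypothetical. The equivalence is shown for unsigned values only; for negative signed values, right shift is implementation-defined in C and division truncates toward zero, so the two forms need not agree.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Split an index into (word, bit) for a bitmap made of 32-bit words. */
void split_index(uint32_t i, uint32_t *word, uint32_t *bit)
{
    assert(word != NULL && bit != NULL);

    *word = i >> 5;     /* same as i / 32 for unsigned i */
    *bit  = i & 0x1f;   /* same as i % 32 for unsigned i */
}
```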

Finding the position of the lowest 1 bit

Next, let's look at a few other techniques.

To find the position of the lowest 1 in a number's binary representation, the intuitive approach is to scan it bit by bit. That algorithm takes O(n) time, which is not satisfying.

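If you have ever worked with a binary indexed tree (Fenwick tree), you may see this differently. A core concept of the binary indexed tree is computing lowbit, that is, the power of two corresponding to the lowest 1 in a number's binary representation. The structure achieves its good time complexity (O(log N)) precisely because this answer can be obtained in O(1), or constant, time.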

int lowbit(int x)
{
    return x & ~(x - 1);
}

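This trick is in fact similar to the alignment technique above. Say x looks like 00...111000; then x - 1 is 00...110111. Inverting that sets the positions of x's trailing 0s back to 0, while the position of the original lowest 1 stays 1. At every position other than the lowest 1, the inverted value is the opposite of x, so after a bitwise AND of the two the result can contain only a single 1: the lowest 1 of x.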
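For readers who have not met the binary indexed tree, the sketch below shows where lowbit fits in. It is a minimal textbook-style version with a fixed, hypothetical capacity `FENWICK_N`; the names `fenwick_add` and `fenwick_sum` are illustrative.

```c
#include <assert.h>

#define FENWICK_N 16                 /* hypothetical capacity */

/* 1-based; tree[i] holds the sum of the lowbit(i) elements ending at i. */
static long tree[FENWICK_N + 1];

int lowbit(int x)
{
    return x & ~(x - 1);
}

/* Add `delta` to element i (1 <= i <= FENWICK_N). */
void fenwick_add(int i, long delta)
{
    assert(i >= 1 && i <= FENWICK_N);
    for (; i <= FENWICK_N; i += lowbit(i)) {
        tree[i] += delta;
    }
}

/* Sum of elements 1..i (0 <= i <= FENWICK_N). */
long fenwick_sum(int i)
{
    assert(i >= 0 && i <= FENWICK_N);
    long s = 0;
    for (; i > 0; i -= lowbit(i)) {
        s += tree[i];
    }
    return s;
}
```

Each loop adds or removes the lowest 1 of the index, so both operations visit at most about log2(FENWICK_N) nodes.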

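Finding the position of the highest 1 bit

Now a different question: instead of the lowest 1, find the highest 1.

This problem has practical uses. For example, when designing a best-fit memory pool, we need to find the first power of 2 that is larger than the size the user asks for.

Again, you might think of scanning first.

In fact, the Intel CPU instruction set has an instruction for exactly this: computing the position of the highest 1 in a number's binary representation.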

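size_t bsr(size_t input)
{
    size_t pos;

    /* BSR (bit scan reverse) yields the index of the highest set bit;
     * BSF would yield the lowest. The result is undefined when input is 0. */
    __asm__("bsrq %1, %0" : "=r" (pos) : "rm" (input));

    return pos;
}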
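The inline assembly above only builds for x86-64. As a hedged alternative, assuming GCC or Clang, the compiler builtin `__builtin_clzll` (count leading zeros) can be used to get the index of the highest set bit. The function name `highest_bit_index` is hypothetical.

```c
#include <assert.h>
#include <limits.h>

/* Index of the highest set bit of x (0 = least significant bit).
 * Assumes GCC or Clang. __builtin_clzll(0) is undefined, so x must
 * be nonzero. */
unsigned highest_bit_index(unsigned long long x)
{
    assert(x != 0);

    unsigned width = (unsigned)(sizeof(x) * CHAR_BIT);
    return width - 1 - (unsigned)__builtin_clzll(x);
}
```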

That works well, but here we still want to locate this 1 using bit operations alone.

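size_t bsr(size_t input)
{
    /* Smear the highest 1 into every lower bit position. */
    input |= input >> 1;
    input |= input >> 2;
    input |= input >> 4;
    input |= input >> 8;
    input |= input >> 16;
    input |= input >> 32;    /* assumes a 64-bit size_t */

    /* Keep only the highest 1: this is its value (2^m), not its index m. */
    return input - (input >> 1);
}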

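This is the computation we were after. Let's analyze how it works.

Note that if the value is 32 bits wide, the last smearing step, input |= input >> 32, is unnecessary. How many times input |= input >> m runs depends on the bit width of input: 3 times for 8 bits, 4 times for 16 bits, and 5 times for 32 bits.

To keep the description short, we analyze an 8-bit number. Let A be a number whose binary representation is as follows.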

A[7] A[6] A[5] A[4] A[3] A[2] A[1] A[0]

The computation above proceeds as follows.

A[7] A[6] A[5] A[4] A[3] A[2] A[1] A[0]
0    A[7] A[6] A[5] A[4] A[3] A[2] A[1]
---------------------------------------
A[7] A[7]|A[6] A[6]|A[5] A[5]|A[4] A[4]|A[3] A[3]|A[2] A[2]|A[1] A[1]|A[0]
0    0         A[7]      A[7]|A[6] A[6]|A[5] A[5]|A[4] A[4]|A[3] A[3]|A[2]
--------------------------------------------------------------------------
A[7] A[7]|A[6] A[7]|A[6]|A[5] A[7]|A[6]|A[5]|A[4] A[6]|A[5]|A[4]|A[3] A[5]|A[4]|A[3]|A[2] A[4]|A[3]|A[2]|A[1] A[3]|A[2]|A[1]|A[0]
0    0         0              0                   A[7]                A[7]|A[6]           A[7]|A[6]|A[5]      A[7]|A[6]|A[5]|A[4]
---------------------------------------------------------------------------------------------------------------------------------
A[7] A[7]|A[6] A[7]|A[6]|A[5]  A[7]|A[6]|A[5]|A[4] A[7]|A[6]|A[5]|A[4]|A[3] A[7]|A[6]|A[5]|A[4]|A[3]|A[2] A[7]|A[6]|A[5]|A[4]|A[3]|A[2]|A[1] A[7]|A[6]|A[5]|A[4]|A[3]|A[2]|A[1]|A[0]

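We can see that in the end the highest bit of A is A[7], the second-highest is A[7]|A[6], the third is A[7]|A[6]|A[5], and the lowest is A[7]|A[6]|A[5]|A[4]|A[3]|A[2]|A[1]|A[0].

Suppose the highest 1 is at bit m (counting from the right, with the lowest bit being bit 0). Then bits 0 through m are now all 1 and every higher bit is 0; in other words, A is 2^(m+1) - 1. The purpose of the last step, input - (input >> 1), is then clear: it clears every 1 except the highest one, so the value returned is 2^m, the power of two corresponding to the highest 1 in the original input.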
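The memory-pool use case mentioned earlier can be sketched on top of the 32-bit form of this routine (five smearing steps). Both function names are hypothetical, and the rounding rule chosen here, the smallest power of two not less than the requested size, is an assumption about how such a pool might pick its size class.

```c
#include <assert.h>
#include <stdint.h>

/* Value of the highest set bit of a 32-bit x (0 if x == 0). */
uint32_t highest_bit32(uint32_t x)
{
    x |= x >> 1;
    x |= x >> 2;
    x |= x >> 4;
    x |= x >> 8;
    x |= x >> 16;             /* five steps are enough for 32 bits */
    return x - (x >> 1);
}

/* Hypothetical size-class helper: smallest power of two >= size. */
uint32_t round_up_pow2(uint32_t size)
{
    assert(size != 0 && size <= UINT32_C(0x80000000));

    uint32_t top = highest_bit32(size);
    return (top == size) ? size : top << 1;
}
```

For example, a request of 100 bytes has its highest bit at 64, so it is rounded up to 128, while a request of exactly 64 bytes stays at 64.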

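Counting the number of 1 bits

How do we count the 1s in a number's binary representation?

Intuition probably points to scanning again (scanning really is handy). Let's work out the cost: one byte is O(8), 4 bytes is O(32), and 8 bytes is O(64).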
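For reference, the bit-by-bit baseline looks like the sketch below, with one loop iteration per bit inspected. The name `popcount_naive` and the explicit `bits` parameter are illustrative choices, used here so the 8-, 32- and 64-bit cases share one function.

```c
#include <assert.h>
#include <limits.h>

/* Count the 1s among the low `bits` bits of x, one iteration per bit. */
int popcount_naive(unsigned long long x, unsigned bits)
{
    assert(bits <= sizeof(x) * CHAR_BIT);

    int n = 0;
    for (unsigned i = 0; i < bits; i++) {
        n += (int)((x >> i) & 1u);
    }
    return n;
}
```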

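If this computation shows up frequently in your program, it may catch your attention when you profile the application with a tool such as perf, and you will have to find a way to optimize it.

In fact, the book Computer Systems: A Programmer's Perspective contains exactly this problem. It asks you to count the 1s in the binary representation of an unsigned long and to use the best algorithm you can; the resulting algorithm has a complexity of O(8).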

long fun_c(unsigned long x)
{
    long val = 0;
    int i;
    for (i = 0; i < 8; i++) {
        val += x & 0x0101010101010101L;
        x >>= 1;
    }

    val += val >> 32;
    val += val >> 16;
    val += val >> 8;

    return val & 0xFF;
}

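I analyzed this algorithm in another article of mine.

Look at the number 0x0101010101010101: in every group of 8 bits, only the last bit is 1. A bitwise AND of x with it gives the following results: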

Let A[i] denote the value (0 or 1) of bit i in the binary representation of x.
First iteration:
A[0] + (A[8] << 8) + (A[16] << 16) + (A[24] << 24) + (A[32] << 32) + (A[40] << 40) + (A[48] << 48) + (A[56] << 56)
Second iteration:
A[1] + (A[9] << 8) + (A[17] << 16) + (A[25] << 24) + (A[33] << 32) + (A[41] << 40) + (A[49] << 48) + (A[57] << 56)
......
Eighth iteration:
A[7] + (A[15] << 8) + (A[23] << 16) + (A[31] << 24) + (A[39] << 32) + (A[47] << 40) + (A[55] << 48) + (A[63] << 56)
Adding these together gives:
(A[63] + A[62] + A[61] + A[60] + A[59] + A[58] + A[57] + A[56]) << 56 +
(A[55] + A[54] + A[53] + A[52] + A[51] + A[50] + A[49] + A[48]) << 48 +
(A[47] + A[46] + A[45] + A[44] + A[43] + A[42] + A[41] + A[40]) << 40 +
(A[39] + A[38] + A[37] + A[36] + A[35] + A[34] + A[33] + A[32]) << 32 +
(A[31] + A[30] + A[29] + A[28] + A[27] + A[26] + A[25] + A[24]) << 24 +
(A[23] + A[22] + A[21] + A[20] + A[19] + A[18] + A[17] + A[16]) << 16 +
(A[15] + A[14] + A[13] + A[12] + A[11] + A[10] + A[9]  + A[8])  << 8  +
(A[7]  + A[6]  + A[5]  + A[4]  + A[3]  + A[2]  + A[1]  + A[0])

The three operations that follow:

val += val >> 32;
val += val >> 16;
val += val >> 8;

Each one folds val in half and adds the halves together.

After the first fold (val += val >> 32), the low 32 bits of val are:

(A[31] + A[30] + A[29] + A[28] + A[27] + A[26] + A[25] + A[24] + A[63] + A[62] + A[61] + A[60] + A[59] + A[58] + A[57] + A[56]) << 24 +
(A[23] + A[22] + A[21] + A[20] + A[19] + A[18] + A[17] + A[16] + A[55] + A[54] + A[53] + A[52] + A[51] + A[50] + A[49] + A[48]) << 16 +
(A[15] + A[14] + A[13] + A[12] + A[11] + A[10] + A[9]  + A[8] + A[47] + A[46] + A[45] + A[44] + A[43] + A[42] + A[41] + A[40])  << 8  +
(A[7]  + A[6]  + A[5]  + A[4]  + A[3]  + A[2]  + A[1]  + A[0] + A[39] + A[38] + A[37] + A[36] + A[35] + A[34] + A[33] + A[32])

After the second fold (val += val >> 16), the low 16 bits of val are:

(A[15] + A[14] + A[13] + A[12] + A[11] + A[10] + A[9]  + A[8] + A[47] + A[46] + A[45] + A[44] + A[43] + A[42] + A[41] + A[40] + A[31] + A[30] + A[29] + A[28] + A[27] + A[26] + A[25] + A[24] + A[63] + A[62] + A[61] + A[60] + A[59] + A[58] + A[57] + A[56])  << 8  +
(A[7]  + A[6]  + A[5]  + A[4]  + A[3]  + A[2]  + A[1]  + A[0] + A[39] + A[38] + A[37] + A[36] + A[35] + A[34] + A[33] + A[32] + A[23] + A[22] + A[21] + A[20] + A[19] + A[18] + A[17] + A[16] + A[55] + A[54] + A[53] + A[52] + A[51] + A[50] + A[49] + A[48])

After the third fold (val += val >> 8), the low 8 bits of val are:

(A[7]  + A[6]  + A[5]  + A[4]  + A[3]  + A[2]  + A[1]  + A[0] + A[39] + A[38] + A[37] + A[36] + A[35] + A[34] + A[33] + A[32] + A[23] + A[22] + A[21] + A[20] + A[19] + A[18] + A[17] + A[16] + A[55] + A[54] + A[53] + A[52] + A[51] + A[50] + A[49] + A[48] + A[15] + A[14] + A[13] + A[12] + A[11] + A[10] + A[9]  + A[8] + A[47] + A[46] + A[45] + A[44] + A[43] + A[42] + A[41] + A[40] + A[31] + A[30] + A[29] + A[28] + A[27] + A[26] + A[25] + A[24] + A[63] + A[62] + A[61] + A[60] + A[59] + A[58] + A[57] + A[56])

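As we can see, after three folds the values of all 64 bits have been accumulated into the low 8 bits. Taking the low 8 bits gives the number of 1s in the binary representation of x. In mathematics this problem is known as computing the "Hamming weight".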
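A quick way to gain confidence in the derivation is to compare the byte-lane method against an independent one. The sketch below restates the method with fixed-width types (an adaptation, so it does not depend on the width of unsigned long) and checks it against a different, well-known technique: `x & (x - 1)` clears the lowest set bit, so the loop runs once per 1 bit. All three function names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* The byte-lane method from the text, restated with uint64_t. */
unsigned popcount_lanes(uint64_t x)
{
    uint64_t val = 0;

    for (int i = 0; i < 8; i++) {
        val += x & UINT64_C(0x0101010101010101);   /* bit i of every byte */
        x >>= 1;
    }

    val += val >> 32;    /* fold the eight per-byte counts together */
    val += val >> 16;
    val += val >> 8;

    return (unsigned)(val & 0xFF);
}

/* A different technique: clear the lowest set bit until none remain. */
unsigned popcount_clear_lowest(uint64_t x)
{
    unsigned n = 0;

    while (x != 0) {
        x &= x - 1;
        n++;
    }
    return n;
}

/* Self-check: both methods must agree on x. */
void check_agree(uint64_t x)
{
    assert(popcount_lanes(x) == popcount_clear_lowest(x));
}
```

The second method takes as many iterations as there are 1 bits, so which one is faster depends on the input; the byte-lane version always does a fixed amount of work.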

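Bit operations attract programmers with their particular strengths (concise and fast). LuaJIT, for example, has a built-in bit module that lets programmers use bit operations in Lua programs. Learning to use bit operations is a step forward for any programmer and is well worth continued study.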
