A detailed explanation of memory barriers in the Linux kernel
I previously read an article discussing sequential consistency and cache coherence (cache consistency), which gave me a clearer understanding of how these two concepts differ and how they are connected. The Linux kernel contains many synchronization and barrier mechanisms, and I would like to summarize them here.
I used to think that many mechanisms in Linux existed to ensure cache coherence, but in fact cache coherence is mostly achieved by hardware. Only instructions with a lock prefix have anything to do with the cache (this statement is not strictly accurate, but it holds in most cases today). Most of the time, what we want to ensure is sequential consistency.
Cache coherence concerns multiprocessor systems in which each CPU has its own L1 cache. Because the contents of the same piece of memory may be cached in the L1 caches of several CPUs, when one CPU changes its cached copy, another CPU that reads the same data must still see the latest content. There is no need to worry about this, because the complex work is done entirely by hardware. By implementing the MESI protocol, the hardware handles cache coherence for you, and even simultaneous writes from multiple CPUs cause no problem. Whether the latest data is in its own cache, in another CPU's cache, or in memory, a CPU can always read it. That is cache coherence.
Sequential consistency is a completely different concept from cache coherence, although both are products of processor development. As compiler technology evolves, the compiler may change the order of certain operations to optimize your code. Processors, for their part, have long had multi-issue and out-of-order execution. The result is that the order in which instructions actually execute differs slightly from the order written in the program. On a single processor this does not matter: the compiler and the processor only reorder execution in ways that your own code cannot detect, so nobody notices. On a multiprocessor it is different. The order in which instructions complete on one processor may strongly affect code running on other processors. Hence the concept of sequential consistency: the execution order of a thread on one processor should appear the same, that is, in program order, from the perspective of threads on other processors. This problem cannot be solved by the processor or the compiler alone; it requires software intervention.
The method of software intervention is simple: insert a memory barrier. The term memory barrier was coined by processor developers, which makes it hard for software people to understand. It easily leads us to think of cache coherence, and even to wonder whether a barrier is what makes modified cache contents visible to other CPUs. That is wrong. From the processor's perspective, a memory barrier serializes read and write operations. From the software's perspective, it solves the problem of sequential consistency. The compiler wants to reorder your code, and the processor wants to execute it out of order. Inserting a memory barrier tells the compiler that instructions before and after the barrier must not be swapped across it, and tells the processor that instructions after the barrier may begin only once the instructions before the barrier have finished.

A memory barrier can fully stop the compiler from reordering, but the processor still has some leeway. Processors are multi-issue and execute out of order while completing in order, so at a barrier the processor only needs to guarantee that the reads and writes of earlier instructions complete before the reads and writes of later instructions complete. This is why there are three kinds of memory barrier: read barriers, write barriers, and read-write barriers. For example, x86 used to guarantee that writes completed in order, so write barriers were not needed. Some IA-32 processors now complete writes out of order, so write barriers are needed as well.
In addition to the dedicated barrier instructions, many other instructions act as read-write barriers when executed, such as instructions with a lock prefix. Before the dedicated barrier instructions appeared, Linux got by on the lock prefix alone.
Where to insert barriers depends on the needs of the software. Barriers cannot deliver full sequential consistency, but threads on other processors are not watching your execution order all the time. It is enough that whenever one of them does look, what it sees is consistent with sequential consistency; then your code will not run into unexpected situations. An unexpected situation would be, for example, that your thread first assigns a value to variable a and then to variable b, yet a thread on another processor looks over and finds that b has been assigned while a has not. (Note that this is not caused by cache incoherence, but by the processor completing its writes in a different order.) In this case, a write barrier must be placed between the assignment to a and the assignment to b.
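The a/b scenario above can be sketched in user space. This is only an illustration under stated assumptions: it uses C11 fences as rough stand-ins for the kernel's smp_wmb() and smp_rmb() (a release fence is somewhat stronger than a pure write barrier), and the names `writer`, `reader`, and `run_message_passing` are hypothetical, not kernel code.

```c
#include <assert.h>
#include <pthread.h>
#include <stdatomic.h>

/* Hypothetical shared variables, named after the a/b example in the text. */
static int a;          /* payload, assigned first */
static atomic_int b;   /* flag, assigned second   */

static void *writer(void *arg)
{
    (void)arg;
    a = 42;
    /* Rough user-space analogue of smp_wmb(): the write to a must be
     * visible before the write to b. */
    atomic_thread_fence(memory_order_release);
    atomic_store_explicit(&b, 1, memory_order_relaxed);
    return NULL;
}

static void *reader(void *arg)
{
    int *seen = arg;
    /* Spin until the writer has assigned b. */
    while (atomic_load_explicit(&b, memory_order_relaxed) == 0)
        ;
    /* Rough analogue of smp_rmb(): the read of a must not be moved
     * ahead of the read of b. */
    atomic_thread_fence(memory_order_acquire);
    *seen = a;
    return NULL;
}

/* Runs one writer and one reader; returns the value of a the reader saw. */
int run_message_passing(void)
{
    pthread_t w, r;
    int seen = 0;
    int rc;

    a = 0;
    atomic_store(&b, 0);
    rc = pthread_create(&r, NULL, reader, &seen);
    assert(rc == 0);
    rc = pthread_create(&w, NULL, writer, NULL);
    assert(rc == 0);
    (void)rc;
    pthread_join(w, NULL);
    pthread_join(r, NULL);
    return seen;
}
```

With both fences in place, a reader that observes b == 1 is guaranteed to read a == 42. Removing either fence makes the program a data race, and on weakly ordered hardware the reader could then see b set while a still holds its old value.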
With SMP, threads run on multiple processors at the same time, and wherever there are threads there is a need for communication and synchronization. Fortunately, an SMP system uses shared memory, which means all processors see the same memory contents. Although each processor has its own L1 cache, cache coherence is still handled by the hardware. If threads on different processors want to access the same data, they need critical sections and synchronization. What does synchronization rely on? On earlier UP systems, we relied on semaphores at the top level, and on disabling interrupts and read-modify-write instructions at the bottom. On SMP systems, disabling interrupts is no longer sufficient: it is still needed to synchronize threads on the same processor, but it is not enough on its own. What about read-modify-write instructions? They no longer work either. After the read part of your instruction has completed but before the write has been performed, another processor may read or write the same location. The cache coherence protocol is sophisticated, but not sophisticated enough to predict what kind of instruction issued the read.

So x86 introduced instructions with a lock prefix. When such an instruction executes, all cache lines containing the addresses it reads or writes are invalidated and the memory bus is locked. If another processor then wants to read or write the same address, or an address on the same cache line, it cannot do so from its cache (the relevant line has been invalidated) and cannot do so over the memory bus (the whole bus is locked). The instruction therefore executes atomically.
Starting with the P6 processor, if the address accessed by a lock-prefixed instruction is already in the cache, the atomic operation can complete without locking the memory bus (I suspect this is due to the addition of an L2 cache shared inside the multiprocessor).
Because the memory bus is locked, any outstanding read and write operations are completed before the lock-prefixed instruction executes, so the instruction also acts as a memory barrier.
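As a small user-space illustration of an atomic read-modify-write, the sketch below increments a shared counter from several threads with C11 `atomic_fetch_add`. On x86, compilers typically implement this call with a lock-prefixed instruction, but that mapping is an assumption about the toolchain rather than something the code enforces. The names `counter_worker` and `run_atomic_counter` and the thread and iteration counts are hypothetical.

```c
#include <assert.h>
#include <pthread.h>
#include <stdatomic.h>

#define NTHREADS 4
#define NITERS   100000

static atomic_long counter;

static void *counter_worker(void *arg)
{
    (void)arg;
    for (int i = 0; i < NITERS; i++) {
        /* Atomic read-modify-write: no other thread can slip in between
         * the read and the write of this increment. */
        atomic_fetch_add(&counter, 1);
    }
    return NULL;
}

/* Starts NTHREADS workers and returns the final counter value. */
long run_atomic_counter(void)
{
    pthread_t t[NTHREADS];

    atomic_store(&counter, 0);
    for (int i = 0; i < NTHREADS; i++) {
        int rc = pthread_create(&t[i], NULL, counter_worker, NULL);
        assert(rc == 0);
        (void)rc;
    }
    for (int i = 0; i < NTHREADS; i++)
        pthread_join(t[i], NULL);
    return atomic_load(&counter);
}
```

If the increment were a plain `counter++` on a non-atomic variable, two processors could both read the old value and both write back the same result, and the final total would come out short.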
Nowadays, synchronization between threads on multiple processors relies on spin locks at the top level and lock-prefixed read-modify-write instructions at the bottom. In practice, some variants additionally disable task scheduling on the local processor, some disable interrupts, and some are wrapped in a semaphore on the outside. The spin lock implementation in Linux has gone through four generations of development, becoming more efficient and more powerful each time.
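To make the idea of a spin lock built on an atomic read-modify-write concrete, here is a toy test-and-set lock in user-space C11. It is a sketch only and is not the kernel's implementation: it ignores preemption, interrupts, and fairness, and the names `toy_spinlock`, `toy_spin_lock`, `toy_spin_unlock`, and `run_spinlock_demo` are hypothetical.

```c
#include <assert.h>
#include <pthread.h>
#include <stdatomic.h>

/* Toy spin lock: a single flag set by an atomic test-and-set. */
typedef struct {
    atomic_flag locked;
} toy_spinlock;

static toy_spinlock demo_lock = { ATOMIC_FLAG_INIT };
static long shared_total;   /* plain variable, protected by demo_lock */

static void toy_spin_lock(toy_spinlock *l)
{
    /* Spin until the previous value of the flag was clear. Acquire
     * ordering keeps the critical section from moving above the lock. */
    while (atomic_flag_test_and_set_explicit(&l->locked, memory_order_acquire))
        ;
}

static void toy_spin_unlock(toy_spinlock *l)
{
    /* Release ordering keeps the critical section from moving below
     * the unlock. */
    atomic_flag_clear_explicit(&l->locked, memory_order_release);
}

static void *spin_worker(void *arg)
{
    (void)arg;
    for (int i = 0; i < 100000; i++) {
        toy_spin_lock(&demo_lock);
        shared_total++;              /* critical section */
        toy_spin_unlock(&demo_lock);
    }
    return NULL;
}

/* Runs two workers and returns the total they accumulated. */
long run_spinlock_demo(void)
{
    pthread_t t[2];

    shared_total = 0;
    for (int i = 0; i < 2; i++) {
        int rc = pthread_create(&t[i], NULL, spin_worker, NULL);
        assert(rc == 0);
        (void)rc;
    }
    for (int i = 0; i < 2; i++)
        pthread_join(t[i], NULL);
    return shared_total;
}
```

The acquire and release orderings play the role that the barrier side effect of the lock-prefixed instruction plays in the text: they stop accesses inside the critical section from leaking outside it.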
    #ifdef CONFIG_SMP
    #define smp_mb()  mb()
    #define smp_rmb() rmb()
    #define smp_wmb() wmb()
    #else
    #define smp_mb()  barrier()
    #define smp_rmb() barrier()
    #define smp_wmb() barrier()
    #endif
CONFIG_SMP is the option that enables multiprocessor support. On a UP (uniprocessor) system, these macros all expand to barrier().
    #define barrier() asm volatile("" ::: "memory")
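barrier() tells the compiler that the values of variables in memory have changed, that any copies of variables previously held in registers are no longer valid, and that accessing a variable again requires going back to memory. This is sufficient to satisfy every memory barrier requirement on UP.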
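The effect of a compiler barrier can be tried in user space with GCC or Clang, which accept the same empty inline-assembly statement. The sketch below is an illustration under that assumption; `flag_value` and `read_twice_sum` are hypothetical names, and the macro is redefined locally rather than taken from kernel headers.

```c
#include <assert.h>

/* Same form as the kernel macro: an empty asm statement with a "memory"
 * clobber. It emits no instruction; it only constrains the compiler. */
#define barrier() __asm__ __volatile__("" ::: "memory")

static int flag_value;

/* Stores v, then reads flag_value twice with a compiler barrier between
 * the two reads, and returns the sum of the two reads. */
int read_twice_sum(int v)
{
    flag_value = v;

    int first = flag_value;
    /* Without the barrier the compiler may keep flag_value in a register
     * and reuse it; with the barrier it must reload from memory. */
    barrier();
    int second = flag_value;

    return first + second;
}
```

Inspecting the generated assembly (for example with `-O2 -S`) should show two separate loads of `flag_value` around the barrier, though the exact output depends on the compiler and its version.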
    #ifdef CONFIG_X86_32
    /*
     * Some non-Intel clones support out of order store. wmb() ceases to be a
     * nop for these.
     */
    #define mb()  alternative("lock; addl $0,0(%%esp)", "mfence", X86_FEATURE_XMM2)
    #define rmb() alternative("lock; addl $0,0(%%esp)", "lfence", X86_FEATURE_XMM2)
    #define wmb() alternative("lock; addl $0,0(%%esp)", "sfence", X86_FEATURE_XMM)
    #else
    #define mb()  asm volatile("mfence" ::: "memory")
    #define rmb() asm volatile("lfence" ::: "memory")
    #define wmb() asm volatile("sfence" ::: "memory")
    #endif
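On an SMP system, the memory barriers expand to the corresponding mb(), rmb(), and wmb(). Here CONFIG_X86_32 means a 32-bit x86 system; otherwise the system is 64-bit x86. The current Linux kernel merges 32-bit and 64-bit x86 into the same x86 directory, which is why this configuration option is needed.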
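As you can see, a 64-bit x86 system is guaranteed to have the mfence, lfence, and sfence instructions, whereas a 32-bit x86 system may not. The kernel therefore has to check whether the CPU supports these three newer instructions, and if it does not, it falls back to a lock-prefixed instruction to provide the memory barrier.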
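The SFENCE, LFENCE, and MFENCE instructions provide an efficient way to guarantee the ordering of memory reads and writes between a program that produces weakly ordered data and a program that reads that data.
SFENCE: serializes the write operations issued before the SFENCE instruction, but does not affect read operations.
LFENCE: serializes the read operations issued before the LFENCE instruction, but does not affect write operations.
MFENCE: serializes both the read and the write operations issued before the MFENCE instruction.
In other words:
sfence: write operations before the sfence instruction must complete before write operations after it.
lfence: read operations before the lfence instruction must complete before read operations after it.
mfence: read and write operations before the mfence instruction must complete before read and write operations after it.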
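As for lock-prefixed memory operations, they finish all earlier read and write operations before locking the memory bus, so they are functionally equivalent to mfence, although they are less efficient.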
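Come to think of it, writing low-level code these days is really not easy: you have to watch out for SMP issues, for out-of-order reads and writes by the CPU, for cache issues, for device DMA issues, and so on.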
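Implementation of synchronization between multiple processors
The spin lock implementation used for synchronization between multiple processors is already covered in a dedicated article.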