6 What Programmers Can Do · Software Development

https://lwn.net/Articles/255364/

# 6 What Programmers Can Do

After the descriptions in the previous sections it is clear that there are many opportunities for programmers to influence a program's performance, positively or negatively. And this is for memory-related operations only. We will cover the opportunities from the ground up, starting with the lowest levels of physical RAM access and L1 caches, up to and including OS functionality which influences memory handling.

Note: the coverage runs from direct physical memory access and the L1 caches up to the OS functionality that affects memory handling.

## 6.1 Bypassing the Cache

When data is produced and not (immediately) consumed again, the fact that memory store operations read a full cache line first and then modify the cached data is detrimental to performance. This operation pushes data out of the caches which might be needed again in favor of data which will not be used soon. This is especially true for large data structures, like matrices, which are filled and then used later. Before the last element of the matrix is filled, the sheer size evicts the first elements, making caching of the writes ineffective.

Note: when produced data will not be used again right away, storing all of it in the cache is not the best choice. Doing so evicts data that may be needed again in favor of data that will not be used soon. This matters most for large data sets, such as a matrix that is filled first and used later.

*****

For this and similar situations, processors provide support for non-temporal write operations. Non-temporal in this context means the data will not be reused soon, so there is no reason to cache it. These non-temporal write operations do not read a cache line and then modify it; instead, the new content is directly written to memory.

Note: the write bypasses the cache and goes directly to memory.

*****

This might sound expensive but it does not have to be. The processor will try to use write-combining (see Section 3.3.3) to fill entire cache lines. If this succeeds no memory read operation is needed at all.
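To make the matrix-fill case concrete, the following is a minimal sketch, not a definitive implementation. It assumes an x86-64 target with SSE2, a 64-byte cache line, and a C11 `aligned_alloc`; the helper names `alloc_aligned_doubles` and `fill_matrix_nt` are hypothetical. It uses the `_mm_stream_pd` non-temporal store intrinsic from `<emmintrin.h>` and writes each cache line with four consecutive 16-byte stores, so that write-combining has a chance to emit whole lines.

```c
#include <assert.h>
#include <emmintrin.h>  /* SSE2: _mm_set1_pd, _mm_stream_pd */
#include <xmmintrin.h>  /* SSE: _mm_sfence */
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

#define CACHE_LINE 64   /* assumed cache line size in bytes */

/* Hypothetical helper: allocate n doubles aligned to a cache line.
   n must be a nonzero multiple of 8 so the size is a multiple of 64. */
double *alloc_aligned_doubles(size_t n)
{
    assert(n != 0 && n % 8 == 0);
    return aligned_alloc(CACHE_LINE, n * sizeof(double));
}

/* Hypothetical helper: set n doubles at m to v with non-temporal stores.
   m must be 64-byte aligned and n a multiple of 8. */
void fill_matrix_nt(double *m, size_t n, double v)
{
    assert((uintptr_t)m % CACHE_LINE == 0);
    assert(n % 8 == 0);

    __m128d x = _mm_set1_pd(v);   /* two copies of v in one 16-byte vector */
    for (size_t i = 0; i < n; i += 8) {
        /* Four back-to-back 16-byte stores cover one full 64-byte line,
           which gives write-combining the chance to avoid a memory read. */
        _mm_stream_pd(&m[i + 0], x);
        _mm_stream_pd(&m[i + 2], x);
        _mm_stream_pd(&m[i + 4], x);
        _mm_stream_pd(&m[i + 6], x);
    }
    /* Non-temporal stores are weakly ordered: fence so that later stores
       (for example a flag publishing the buffer) are ordered after them. */
    _mm_sfence();
}
```

A caller might allocate with `alloc_aligned_doubles(1024)`, call `fill_matrix_nt(m, 1024, 0.0)`, and later release the buffer with `free(m)`. Whether this beats ordinary stores depends on the data size relative to the caches and should be measured.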
Note: write-combining is what keeps these direct writes fast.

For the x86 and x86-64 architectures a number of intrinsics are provided by gcc:

```
#include <emmintrin.h>   /* SSE2 */
void _mm_stream_si32(int *p, int a);
void _mm_stream_si128(__m128i *p, __m128i a);
void _mm_stream_pd(double *p, __m128d a);
#include <xmmintrin.h>   /* SSE */
void _mm_stream_pi(__m64 *p, __m64 a);
void _mm_stream_ps(float *p, __m128 a);
#include <ammintrin.h>   /* SSE4A (AMD) */
void _mm_stream_sd(double *p, __m128d a);
void _mm_stream_ss(float *p, __m128 a);
```

These instructions are used most efficiently if they process large amounts of data in one go. Data is loaded from memory, processed in one or more steps, and then written back to memory. The data "streams" through the processor, hence the names of the intrinsics.

Note: these are especially efficient when processing large amounts of data. The data is loaded from memory, processed, and written back to memory, which is why the functions all carry "stream" in their names.

*****

The memory address must be aligned to 8 or 16 bytes, respectively. In code using the multimedia extensions it is possible to replace the normal `_mm_store_*` intrinsics with these non-temporal versions. In the matrix multiplication code in Section 9.1 we do not do this since the written values are reused shortly afterwards. This is an example where using the stream instructions is not useful. More on this code in Section 6.2.1.

Note: addresses must be 8- or 16-byte aligned. In code using the multimedia extensions, the `_mm_store_*` operations can be replaced with these versions. The matrix multiplication code in Section 9.1 does not use them because the written values are reused very soon; it is a case where they do not fit. More on that code in Section 6.2.1.

*****

The processor's write-combining buffer can hold requests for partial writing to a cache line for only so long. It is generally necessary to issue all the instructions which modify a single cache line one after another so that the write-combining can actually take place.
Note: the processor's write-combining buffer can hold partial writes to a cache line only for a limited time.

An example of how to do this is as follows:

~~~
#include <emmintrin.h>

void setbytes(char *p, int c)
{
  /* Broadcast the byte value c into all 16 lanes of the vector. */
  __m128i i = _mm_set_epi8(c, c, c, c, c, c, c, c,
                           c, c, c, c, c, c, c, c);
  /* Four consecutive 16-byte non-temporal stores cover 64 bytes. */
  _mm_stream_si128((__m128i *)&p[0], i);
  _mm_stream_si128((__m128i *)&p[16], i);
  _mm_stream_si128((__m128i *)&p[32], i);
  _mm_stream_si128((__m128i *)&p[48], i);
}
~~~

*****

Assuming the pointer `p` is appropriately aligned, a call to this function will set all bytes of the addressed (64-byte) cache line to `c`. The write-combining logic will see the four generated `movntdq` instructions and only issue the write command for the memory once the last instruction has been executed. To summarize, this code sequence not only avoids reading the cache line before it is written, it also avoids polluting the cache with data which might not be needed soon. This can have huge benefits in certain situations. An example of everyday code using this technique is the `memset` function in the C runtime, which should use a code sequence like the above for large blocks.

Note: assuming the pointer `p` is aligned, calling this function sets the whole cache line at that address to `c`. The write-combining logic sees the four generated `movntdq` instructions and issues the write command to memory only after the last one has executed. 总之，这个代码顺序不嫩