https://lwn.net/Articles/252125/
## 3.3.3 Write Behavior
Before we start looking at the cache behavior when multiple execution contexts (threads or processes) use the same memory we have to explore a detail of cache implementations. Caches are supposed to be coherent and this coherency is supposed to be completely transparent for the userlevel code. Kernel code is a different story; it occasionally requires cache flushes.
在我们开始研究多个线程或进程同时使用相同内存之前,先来看一下缓存实现的一些细节。我们要求缓存是一致的,而且这种一致性必须对用户级代码完全透明。而内核代码则有所不同,它有时候需要对缓存进行转储(flush)。
*****
This specifically means that, if a cache line is modified, the result for the system after this point in time is the same as if there were no cache at all and the main memory location itself had been modified. This can be implemented in two ways or policies:
这意味着,如果对缓存线进行了修改,那么在这个时间点之后,系统的结果应该是与没有缓存的情况下是相同的,即主存的对应位置也已经被修改的状态。这种要求可以通过两种方式或策略实现:
*****
* write-through cache implementation;
* write-back cache implementation.
* 写通(write-through)
* 写回(write-back)
*****
The write-through cache is the simplest way to implement cache coherency. If the cache line is written to, the processor immediately also writes the cache line into main memory. This ensures that, at all times, the main memory and cache are in sync. The cache content could simply be discarded whenever a cache line is replaced. This cache policy is simple but not very fast. A program which, for instance, modifies a local variable over and over again would create a lot of traffic on the FSB even though the data is likely not used anywhere else and might be short-lived.
写通比较简单。当修改缓存线时,处理器立即将它写入主存。这样可以保证主存与缓存的内容永远保持一致。当缓存线被替代时,只需要简单地将它丢弃即可。这种策略很简单,但是速度比较慢。如果某个程序反复修改一个本地变量,可能导致FSB上产生大量数据流,而不管这个变量是不是有人在用,或者是不是短期变量。
*****
The write-back policy is more sophisticated. Here the processor does not immediately write the modified cache line back to main memory. Instead, the cache line is only marked as dirty. When the cache line is dropped from the cache at some point in the future the dirty bit will instruct the processor to write the data back at that time instead of just discarding the content.
写回比较复杂。当修改缓存线时,处理器不再马上将它写入主存,而是打上已弄脏(dirty)的标记。当以后某个时间点缓存线被丢弃时,这个已弄脏标记会通知处理器把数据写回到主存中,而不是简单地扔掉。
*****
Write-back caches have the chance to be significantly better performing, which is why most memory in a system with a decent processor is cached this way. The processor can even take advantage of free capacity on the FSB to store the content of a cache line before the line has to be evacuated. This allows the dirty bit to be cleared and the processor can just drop the cache line when the room in the cache is needed.
写回有时候会有非常不错的性能,因此较好的系统大多采用这种方式。采用写回时,处理器们甚至可以利用FSB的空闲容量来存储缓存线。这样一来,当需要缓存空间时,处理器只需清除脏标记,丢弃缓存线即可。
*****
But there is a significant problem with the write-back implementation. When more than one processor (or core or hyper-thread) is available and accessing the same memory it must still be assured that both processors see the same memory content at all times. If a cache line is dirty on one processor (i.e., it has not been written back yet) and a second processor tries to read the same memory location, the read operation cannot just go out to the main memory. Instead the content of the first processor's cache line is needed. In the next section we will see how this is currently implemented.
但写回也有一个很大的问题。当有多个处理器(或核心、超线程)访问同一块内存时,必须确保它们在任何时候看到的都是相同的内容。如果缓存线在其中一个处理器上弄脏了(修改了,但还没写回主存),而第二个处理器刚好要读取同一个内存地址,那么这个读操作不能去读主存,而需要读第一个处理器的缓存线。在下一节中,我们将研究如何实现这种需求。
*****
Before we get to this there are two more cache policies to mention:
* write-combining; and
* uncacheable.
在此之前,还有其它两种缓存策略需要提一下:
* 写入合并
* 不可缓存
*****
Both these policies are used for special regions of the address space which are not backed by real RAM. The kernel sets up these policies for the address ranges (on x86 processors using the Memory Type Range Registers, MTRRs) and the rest happens automatically. The MTRRs are also usable to select between write-through and write-back policies.
这两种策略用于真实内存不支持的特殊地址区,内核为地址区设置这些策略(x86处理器利用内存类型范围寄存器MTRR),余下的部分自动进行。MTRR还可用于写通和写回策略的选择。
*****
Write-combining is a limited caching optimization more often used for RAM on devices such as graphics cards. Since the transfer costs to the devices are much higher than the local RAM access it is even more important to avoid doing too many transfers. Transferring an entire cache line just because a word in the line has been written is wasteful if the next operation modifies the next word. One can easily imagine that this is a common occurrence, the memory for horizontal neighboring pixels on a screen are in most cases neighbors, too. As the name suggests, write-combining combines multiple write accesses before the cache line is written out. In ideal cases the entire cache line is modified word by word and, only after the last word is written, the cache line is written to the device. This can speed up access to RAM on devices significantly.
写入合并是一种有限的缓存优化策略,更多地用于显卡等设备之上的内存。由于设备的传输开销比本地内存要高的多,因此避免进行过多的传输显得尤为重要。如果仅仅因为修改了缓存线上的一个字,就传输整条线,而下个操作刚好是修改线上的下一个字,那么这次传输就过于浪费了。而这恰恰对于显卡来说是比较常见的情形——屏幕上水平邻接的像素往往在内存中也是靠在一起的。顾名思义,写入合并是在写出缓存线前,先将多个写入访问合并起来。在理想的情况下,缓存线被逐字逐字地修改,只有当写入最后一个字时,才将整条线写入内存,从而极大地加速内存的访问。
*****
Finally there is uncacheable memory. This usually means the memory location is not backed by RAM at all. It might be a special address which is hardcoded to have some functionality outside the CPU. For commodity hardware this most often is the case for memory mapped address ranges which translate to accesses to cards and devices attached to a bus (PCIe etc). On embedded boards one sometimes finds such a memory address which can be used to turn an LED on and off. Caching such an address would obviously be a bad idea. LEDs in this context are used for debugging or status reports and one wants to see this as soon as possible. The memory on PCIe cards can change without the CPU's interaction, so this memory should not be cached.
最后来讲一下不可缓存的内存。一般指的是不被RAM支持的内存位置,它可以是硬编码的特殊地址,承担CPU以外的某些功能。对于商用硬件来说,比较常见的是映射到外部卡或设备的地址。在嵌入式主板上,有时也有类似的地址,用来开关LED。对这些地址进行缓存显然没有什么意义。比如上述的LED,一般是用来调试或报告状态,显然应该尽快点亮或关闭。而对于那些PCI卡上的内存,由于不需要CPU的干涉即可更改,也不该缓存。
- 程序优化
- vtune
- linux性能监控软件Perf
- 系统级性能分析工具perf的介绍与使用
- perf的二级命令
- 全局性概况
- 全局细节
- 最常用功能perf record
- 可视化工具perf timechart
- perf引入的overhead
- perf stat
- gprof
- 三种Linux性能分析工具的比较
- perf+gprof+gprof2dot+graphviz进行性能分析热点
- 英特尔多核平台编程优化大赛报告
- 内存操作
- mmap
- mmap的分类
- 深入理解内存映射mmap
- 计算机底层知识拾遗(九)深入理解内存映射mmap
- 内核驱动mmap Handler利用技术(一)
- Windows内存管理机制及C++内存分配实例
- Linux内存管理初探
- Windows CPU信息查看
- Linux CPU信息查看
- 预留大内存
- Linux下试验大页面映射
- /dev/mem
- Linux中通过/dev/mem操控物理地址
- /dev/mem分析
- 用法举例
- Linux下直接读写物理地址内存
- 查看内存信息
- Cache Memory
- 页面缓存
- 查看各级cache信息的方法
- dmidecode命令查看cache size
- CPU Cache 机制以及 Cache miss
- ARM体系关闭mmu和cache
- CR0-4寄存器介绍
- 查看CR0,CR2,CR3的值
- Linux 下如何禁用CPU cache
- 7个示例科普CPU Cache
- 第一个例子的C代码
- 其中之一
- Linux 从虚拟地址到物理地址
- 内存测试例子
- 每个程序员都应该了解的内存
- Part 1
- 程序员能够做什么
- 3 CPU caches
- 6 What Programmers Can Do
- VirtualAlloc
- Large-Page Support
- Some remarks on VirtualAlloc and MEM_LARGE_PAGES
- DMA
- MOV和MOVS的效率问题?如何高效的拷贝内存 中的数据
- how to use movntdqa to avoid cache pollution
- 计算机底层知识拾遗(一)理解虚拟内存机制
- How to access the control registers cr0,cr2,cr3 from a program
- 细说Cache-L1/L2/L3/TLB
- what-is-the-meaning-of-non-temporal-memory-accesses-in-x86
- How can the L1, L2, L3 CPU caches be turned off on modern x86/amd64 chips?
- UA list
- GDB
- 程序运行参数
- Linux下GDB的多线程调试
- CMake
- CMake快速入门教程:实战
- cmake打印变量值
- function
- source_group
- cmake_parse_arguments
- 编译.S文件
- add_definitions
- CMake添加-g编译选项
- Debug模式下启动
- Mysql
- Mysql联合查询union和union all的使用介绍
- MySQL数据库导入错误:ERROR 1064 (42000) 和 ERROR at line xx: Unknown command '\Z'.
- 解决MYSQL数据库 Table ‘xxx’ is marked as crashed and should be repaired 145错误
- C/C++
- c语言中static的作用
- strlen和sizeof有什么区别?
- printf
- Libuv中文文档之线程
- RapidJSON
- gcc/g++ 实战之编译的四个过程
- __thread
- TARGET_LINK_LIBRARIES
- MAP_HUGETLB
- 使用Intel格式的汇编
- __m128i
- emmintrin.h
- _mm_stream_si128
- _mm_stream_load_si128
- _mm_load_si128
- _mm_xor_si128
- _mm_store_si128
- _mm_cvtsi128_si64
- Intel SSE指令集
- _mm_set_epi64x
- _mm_aesenc_si128
- _umul128
- _mm_malloc
- reinterpret_cast
- strlen
- 读取UTF-8的txt文件发现开头的多三个字节的问题
- PHP
- php计算函数执行时间的方法
- 框架
- Json Rpc远程调用框架
- PHP多进程
- PHP CLI模式下的多进程应用
- php多进程总结
- 优化
- PHP7 优化
- 让你的PHP7更快(GCC PGO)
- PHP的性能演进(从PHP5.0到PHP7.1的性能全评测)
- PHP字符串全排列算法
- 获取服务器基本信息
- cookie
- phpstudy2018 安装xdebug扩展
- 软件下载
- PHP mysqli_error() 函数
- PHP Session 变量
- curl
- curl_getinfo
- 获取请求头
- PHP使用CURL获取302跳转后的地址实例
- PHP基于cURL实现自动模拟登录
- PHP获取远程图片大小(CURL实现)
- CURL模拟登录
- curl模拟登录提交(从目录中获取文件)
- CURL HTTPS
- curl帮v
- rename
- copy
- JSON
- json_encode
- json_decode
- json_last_error_msg
- json_last_error
- PHP json_encode中文乱码解决方法
- var_dump
- PHPStorm与Xdebug设置
- Xdebug原理以notepad为例
- str_pad
- pack
- PHP二进制与字符串之间的相互转换
- PHP执行系统命令(简介及方法)
- 函数
- 十进制转二进制
- 字符串到ASSCI
- 字符串转二进制
- 合并两个表
- 图像识别
- Tesseract
- 虚拟机
- vmware下Kali 2.0安装VMware Tools
- 安装 VMware tools出现“正在进行简易安装时,无法手动启动VMware tools安装”
- 爬虫
- 有哪些好的数据来源或者大数据平台?
- Cygwin
- Git 常用命令
- 排列组合
- 含重复元素序列的全排列
- 全排列的非递归和递归实现(含重复元素)
- GitBook
- 编辑环境
- visual studio code
- 2名数学家或发现史上最快超大乘法运算法,欲破解困扰人类近半个世纪的问题
- 系统预定义常量
- 指令集
- SSE
- _MSC_VER
- msys2
- 安装cmake
- MSYS2 更新源
- 讲Cmake msys32使用问题解答 CXX CMAKE_C_COMPILER配置详解
- VirtualBox
- 解决virtualbox只能安装32位系统的问题
- Ubuntu
- 使用AES-NI的编译参数
- debian下安装内核源码的方法
- tar.xz结尾的文件的解压方法
- Linux命令
- insmod
- fatal error: openssl/bio.h
- 准备module的编译环境(kali)
- Ubuntu/Debian 之内核模块开发准备
- dmesg的详细用法
- Linux系统开机自动加载驱动module
- linux /Module 浅析(转载)
- Kali
- 找回gpedit
- Enable the Lock Pages in Memory Option (Windows)
- TLA
- 双系统
- 显卡
- 显示no CUDA的解决过程
