Redis latency spikes and the Linux kernel: a few m-mysql教程-PHP中文網

首頁

資料庫

mysql教程

Redis latency spikes and the Linux kernel: a few m

WBOYWBOYWBOYWBOYWBOYWBOYWBOYWBOYWBOYWBOYWBOYWBOYWB

Jun 07, 2016 pm 04:41 PM

andlredisthe

Today I was testing Redis latency using m3.medium EC2 instances. I was able to replicate the usual latency spikes during BGSAVE, when the process forks, and the child starts saving the dataset on disk. However something was not as expected. The spike did not happened because of disk I/O, nor during the fork() call itself.

The test was performed with a 1GB of data in memory, with 150k writes per second originating from a different EC2 instance, targeting 5 million keys (evenly distributed). The pipeline was set to 4 commands. This translates to the following command line of redis-benchmark:

./redis-benchmark -P 4 -t set -r 5000000 -n 1000000000

Every time BGSAVE was triggered, I could see ~300 milliseconds latency spikes of unknown origin, since fork was taking 6 milliseconds. Fortunately Redis has a software watchdog feature, that is able to produce a stack trace of the process during a latency event. It’s quite a simple trick but works great: we setup a SIGALRM to be delivered by the kernel. Each time the serverCron() function is called, the scheduled signal is cleared, so actually Redis never receives it if the control returns fast enough to the Redis process. If instead there is a blocking condition, the signal is delivered by the kernel, and the signal handler prints the stack trace.

Instead of getting stack traces with the fork call, the process was always blocked near MOV* operations happening in the context of the parent process just after the fork. I started to develop the theory that Linux was “lazy forking” in some way, and the actual heavy stuff was happening later when memory was accessed and pages had to be copy-on-write-ed.

Next step was to read the fork() implementation of the Linux kernel. What the system call does is indeed to copy all the mapped regions (vm_area_struct structures). However a traditional implementation would also duplicate the PTEs at this point, and this was traditionally performed by copy_page_range(). However something changed… as an optimization years ago: now Linux does not just performs lazy page copying, as most modern kernels. The PTEs are also copied in a lazy way on faults. Here is the top comment of copy_range_range():

* Don't copy ptes where a page fault will fill them correctly.
* Fork becomes much lighter when there are big shared or private
* readonly mappings. The tradeoff is that copy_page_range is more
* efficient than faulting.

Basically as soon as the parent process performs an access in the shared regions with the child process, during the page fault Linux does the big amount of work skipped by fork, and this is why I could see always a MOV instruction in the stack trace.

While this behavior is not good for Redis, since to copy all the PTEs in a single operation is more efficient, it is much better for the traditional use case of fork() on POSIX systems, which is, fork()+exec*() in order to spawn a new process.

This issue is not EC2 specific, however virtualized instances are slower at copying PTEs, so the problem is less noticeable with physical servers.

However this is definitely not the full story. While I was testing this stuff in my Linux box, I remembered that using the libc malloc, instead of jemalloc, in certain conditions I was able to measure less latency spikes in the past. So I tried to check if there was some relation with that.

Indeed compiling with MALLOC=libc I was not able to measure any latency in the physical server, while with jemalloc I could observe the same behavior observed with the EC2 instance. To understand better the difference I setup a test with 15 million keys and a larger pipeline in order to stress more the system and make more likely that page faults of all the mmaped regions could happen in a very small interval of time. Then I repeated the same test with jemalloc and libc malloc:

bare metal, 675k/sec writes to 15 million keys, jemalloc: max spike 339 milliseconds.
bare metal, 675k/sec writes to 15 million keys, malloc: max spike 21 milliseconds.

I quickly tried to replicate the same result with EC2, same stuff, the spike was a fraction with malloc.

The next logical thing after this findings is to inspect what is the difference in the memory layout of a Redis system running with libc malloc VS one running with jemalloc. The Linux proc filesystem is handy to investigate the process internals (in this case I used /proc//smaps file).

Jemalloc memory is allocated in this region:

7f8002c00000-7f8062400000 rw-p 00000000 00:00 0
Size: 1564672 kB
Rss: 1564672 kB
Pss: 1564672 kB
Shared_Clean: 0 kB
Shared_Dirty: 0 kB
Private_Clean: 0 kB
Private_Dirty: 1564672 kB
Referenced: 1564672 kB
Anonymous: 1564672 kB
AnonHugePages: 1564672 kB
Swap: 0 kB
KernelPageSize: 4 kB
MMUPageSize: 4 kB
Locked: 0 kB
VmFlags: rd wr mr mw me ac sd

While libc big region looks like this:

0082f000-8141c000 rw-p 00000000 00:00 0 [heap]
Size: 2109364 kB
Rss: 2109276 kB
Pss: 2109276 kB
Shared_Clean: 0 kB
Shared_Dirty: 0 kB
Private_Clean: 0 kB
Private_Dirty: 2109276 kB
Referenced: 2109276 kB
Anonymous: 2109276 kB
AnonHugePages: 0 kB
Swap: 0 kB
KernelPageSize: 4 kB
MMUPageSize: 4 kB
Locked: 0 kB
VmFlags: rd wr mr mw me ac sd

Looks like here there are a couple different things.

1) There is [heap] in the first line only for libc malloc.
2) AnonHugePages field is zero for libc malloc but is set to the size of the region in the case of jemalloc.

Basically, the difference in latency appears to be due to the fact that malloc is using transparent huge pages, a kernel feature that allows to transparently glue multiple normal 4k pages into a few huge pages, which are 2048k each. This in turn means that copying the PTEs for this regions is much faster.

EDIT: Unfortunately I just spotted that I'm totally wrong, the huge pages apparently are only used by jemalloc: I just mis-read the outputs since this seemed so obvious. So on the contrary, it appears that the high latency is due to the huge pages thing for some unknown reason. So actually it is malloc that, while NOT using huge pages, is going much faster. I've no idea about what is happening here, so please disregard the above conclusions.

Meanwhile for low latency applications you may want to build Redis with “make MALLOC=libc”, however make sure to use “make distclean” before, and be aware that depending on the work load, libc malloc suffers fragmentation more than jemalloc.

More news soon…

EDIT2: Oh wait... since the problem is huge pages, this is MUCH better, since we can disable it. And I just verified that it works:

echo never > /sys/kernel/mm/transparent_hugepage/enabled

This is the new Redis mantra apparently.

UPDATE: While this seemed unrealistic to me, I experimentally verified that the huge pages memory spike is due to the fact that with 50 clients writing at the same time, with N queued requests each, the Redis process can touch in the space of *a single event loop iteration* all the process pages, so its copy-on-writing the entire process address space. This means that not only huge pages are horrible for latency, but that are also horrible for memory usage. Comments

陳述

本文內容由網友自願投稿，版權歸原作者所有。本站不承擔相應的法律責任。如發現涉嫌抄襲或侵權的內容，請聯絡admin@php.cn

MySQL的位置：數據庫和編程Apr 13, 2025 am 12:18 AM

MySQL在數據庫和編程中的地位非常重要，它是一個開源的關係型數據庫管理系統，廣泛應用於各種應用場景。 1）MySQL提供高效的數據存儲、組織和檢索功能，支持Web、移動和企業級系統。 2）它使用客戶端-服務器架構，支持多種存儲引擎和索引優化。 3）基本用法包括創建表和插入數據，高級用法涉及多表JOIN和復雜查詢。 4）常見問題如SQL語法錯誤和性能問題可以通過EXPLAIN命令和慢查詢日誌調試。 5）性能優化方法包括合理使用索引、優化查詢和使用緩存，最佳實踐包括使用事務和PreparedStatemen

MySQL：從小型企業到大型企業Apr 13, 2025 am 12:17 AM

MySQL適合小型和大型企業。 1)小型企業可使用MySQL進行基本數據管理，如存儲客戶信息。 2)大型企業可利用MySQL處理海量數據和復雜業務邏輯，優化查詢性能和事務處理。

幻影是什麼讀取的，InnoDB如何阻止它們（下一個鍵鎖定）？Apr 13, 2025 am 12:16 AM

InnoDB通過Next-KeyLocking機制有效防止幻讀。 1）Next-KeyLocking結合行鎖和間隙鎖，鎖定記錄及其間隙，防止新記錄插入。 2）在實際應用中，通過優化查詢和調整隔離級別，可以減少鎖競爭，提高並發性能。

mysql：不是編程語言，而是...Apr 13, 2025 am 12:03 AM

MySQL不是一門編程語言，但其查詢語言SQL具備編程語言的特性：1.SQL支持條件判斷、循環和變量操作；2.通過存儲過程、觸發器和函數，用戶可以在數據庫中執行複雜邏輯操作。

MySQL：世界上最受歡迎的數據庫的簡介Apr 12, 2025 am 12:18 AM

MySQL是一種開源的關係型數據庫管理系統，主要用於快速、可靠地存儲和檢索數據。其工作原理包括客戶端請求、查詢解析、執行查詢和返回結果。使用示例包括創建表、插入和查詢數據，以及高級功能如JOIN操作。常見錯誤涉及SQL語法、數據類型和權限問題，優化建議包括使用索引、優化查詢和分錶分區。

MySQL的重要性：數據存儲和管理Apr 12, 2025 am 12:18 AM

MySQL是一個開源的關係型數據庫管理系統，適用於數據存儲、管理、查詢和安全。 1.它支持多種操作系統，廣泛應用於Web應用等領域。 2.通過客戶端-服務器架構和不同存儲引擎，MySQL高效處理數據。 3.基本用法包括創建數據庫和表，插入、查詢和更新數據。 4.高級用法涉及復雜查詢和存儲過程。 5.常見錯誤可通過EXPLAIN語句調試。 6.性能優化包括合理使用索引和優化查詢語句。

為什麼要使用mysql？利益和優勢Apr 12, 2025 am 12:17 AM

選擇MySQL的原因是其性能、可靠性、易用性和社區支持。 1.MySQL提供高效的數據存儲和檢索功能，支持多種數據類型和高級查詢操作。 2.採用客戶端-服務器架構和多種存儲引擎，支持事務和查詢優化。 3.易於使用，支持多種操作系統和編程語言。 4.擁有強大的社區支持，提供豐富的資源和解決方案。

描述InnoDB鎖定機制（共享鎖，獨家鎖，意向鎖，記錄鎖，間隙鎖，下一鍵鎖）。Apr 12, 2025 am 12:16 AM

InnoDB的鎖機制包括共享鎖、排他鎖、意向鎖、記錄鎖、間隙鎖和下一個鍵鎖。 1.共享鎖允許事務讀取數據而不阻止其他事務讀取。 2.排他鎖阻止其他事務讀取和修改數據。 3.意向鎖優化鎖效率。 4.記錄鎖鎖定索引記錄。 5.間隙鎖鎖定索引記錄間隙。 6.下一個鍵鎖是記錄鎖和間隙鎖的組合，確保數據一致性。

See all articles