Time | Source | Dest | Description |
---|---|---|---|
12:01:54.6536050 | Redis-Server | Clients | TCP:Flags=...AP... |
12:01:54.6538580 | Redis-Server | Clients | TCP:Flags=...AP... |
12:01:54.6539770 | Redis-Server | Clients | TCP:Flags=...AP... |
12:01:54.6720580 | Redis-Server | Clients | TCP:Flags=...A..S.. |
12:01:54.6727200 | Redis-Server | Clients | TCP:Flags=...A...... |
12:01:54.6808480 | Redis-Server | Clients | TCP:Flags=...AP..... |
12:01:54.6910840 | Redis-Server | Clients | TCP:Flags=...A...S.. |
12:01:54.6911950 | Redis-Server | Clients | TCP:Flags=...A...... |
… | … | … | … |
12:01:56.1181350 | Redis-Server | Clients | TCP:Flags=...AP.... |
At 12:01:54.6808480 the Redis server sent a Push packet to a client, i.e. the reply to a query. Every packet after that is connection handling: ACKs, ACK confirmations, and RST resets. The next Push packet did not go out until 12:01:56.1181350, a gap of 1.4372870 seconds. In other words, during those 1.44 seconds the Redis server answered exactly one query; everything else it did was either setting up connections or rejecting them.
The sequence of events behind the client errors is now clear: redis-server stalled for about 1.43 seconds, the client's connection pool filled up and it frantically opened new connections, the server's accept queue overflowed and it refused service outright, and the clients reported errors. We first suspected the client had sent some special command, so we needed to see what its last few commands were. We located the first packet before redis-server froze, installed a Redis dissector plugin for Wireshark, and found that the last few commands were plain GETs with small keys and values, nothing that should take 1.43 seconds. There was no slow log on the server either, and the investigation stalled again.
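As a side note on why a stalled server turns into outright connection refusals: the kernel only queues a bounded number of completed handshakes per listening socket. Below is a minimal sketch of that effect (our own illustration, not taken from the incident; the port and backlog values are arbitrary), assuming a Linux host. Redis itself requests its tcp-backlog setting (511 by default), further capped by net.core.somaxconn.

```c
/* Minimal sketch (not from the incident): a server that is alive but never
 * calls accept(). Once the tiny accept queue fills, further handshakes from
 * clients are dropped or reset by the kernel, which is what the Redis clients
 * experienced during the 1.4s stall. */
#include <netinet/in.h>
#include <stdio.h>
#include <string.h>
#include <sys/socket.h>
#include <unistd.h>

int main(void) {
    int fd = socket(AF_INET, SOCK_STREAM, 0);
    if (fd < 0) return 1;

    struct sockaddr_in addr;
    memset(&addr, 0, sizeof(addr));
    addr.sin_family      = AF_INET;
    addr.sin_addr.s_addr = htonl(INADDR_LOOPBACK);
    addr.sin_port        = htons(16379);   /* arbitrary demo port */

    if (bind(fd, (struct sockaddr *)&addr, sizeof(addr)) < 0) return 1;
    listen(fd, 4);                          /* deliberately tiny accept queue */

    /* Simulate the stall: stay alive but never accept(); connections beyond
     * the backlog now pile up and eventually fail on the client side. */
    printf("listening on 127.0.0.1:16379 but never accepting...\n");
    sleep(30);

    close(fd);
    return 0;
}
```

Whether the overflowing connections are silently dropped or reset with RST depends on net.ipv4.tcp_abort_on_overflow.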
【Further Analysis】
To find out what the Redis server was doing during those 1.43 seconds, we used pstack to capture stack traces. pstack is essentially a gdb attach, and sampling at high frequency hurts Redis throughput, so we simply ran it in a loop, one capture every 0.5 seconds. When redis-server froze, we caught stacks like the following (uninteresting frames filtered out):
```
Thu May 31 11:29:18 CST 2018
Thread 1 (Thread 0x7ff2db6de720 (LWP 8378)):
#0 0x000000000048cec4 in ?? ()
#1 0x00000000004914a4 in je_arena_ralloc ()
#2 0x00000000004836a1 in je_realloc ()
#3 0x0000000000422cc5 in zrealloc ()
#4 0x00000000004213d7 in sdsRemoveFreeSpace ()
#5 0x000000000041ef3c in clientsCronResizeQueryBuffer ()
#6 0x00000000004205de in clientsCron ()
#7 0x0000000000420784 in serverCron ()
#8 0x0000000000418542 in aeProcessEvents ()
#9 0x000000000041873b in aeMain ()
#10 0x0000000000420fce in main ()
Thu May 31 11:29:19 CST 2018
Thread 1 (Thread 0x7ff2db6de720 (LWP 8378)):
#0 0x0000003729ee5407 in madvise () from /lib64/libc.so.6
#1 0x0000000000493a4e in je_pages_purge ()
#2 0x000000000048cf70 in ?? ()
#3 0x00000000004914a4 in je_arena_ralloc ()
#4 0x00000000004836a1 in je_realloc ()
#5 0x0000000000422cc5 in zrealloc ()
#6 0x00000000004213d7 in sdsRemoveFreeSpace ()
#7 0x000000000041ef3c in clientsCronResizeQueryBuffer ()
#8 0x00000000004205de in clientsCron ()
#9 0x0000000000420784 in serverCron ()
#10 0x0000000000418542 in aeProcessEvents ()
#11 0x000000000041873b in aeMain ()
#12 0x0000000000420fce in main ()
Thu May 31 11:29:19 CST 2018
Thread 1 (Thread 0x7ff2db6de720 (LWP 8378)):
#0 0x000000000048108c in je_malloc_usable_size ()
#1 0x0000000000422be6 in zmalloc ()
#2 0x00000000004220bc in sdsnewlen ()
#3 0x000000000042c409 in createStringObject ()
#4 0x000000000042918e in processMultibulkBuffer ()
#5 0x0000000000429662 in processInputBuffer ()
#6 0x0000000000429762 in readQueryFromClient ()
#7 0x000000000041847c in aeProcessEvents ()
#8 0x000000000041873b in aeMain ()
#9 0x0000000000420fce in main ()
Thu May 31 11:29:20 CST 2018
Thread 1 (Thread 0x7ff2db6de720 (LWP 8378)):
#0 0x000000372a60e7cd in write () from /lib64/libpthread.so.0
#1 0x0000000000428833 in sendReplyToClient ()
#2 0x0000000000418435 in aeProcessEvents ()
#3 0x000000000041873b in aeMain ()
#4 0x0000000000420fce in main ()
```
After repeating the capture many times, the suspicious frame that keeps showing up is clientsCronResizeQueryBuffer, sitting under serverCron(). serverCron() is redis-server's internal periodic task; it does not run in the command path executed on behalf of a client, which explains why no slow-query entries appeared while the server was stuck.
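To make the single-threaded aspect concrete, here is a minimal sketch (our own illustration, not Redis source) of how an aeMain()-style loop interleaves client I/O with timer callbacks such as serverCron(): both run back to back on one thread, so a cron pass that takes 1.4 seconds freezes every client for 1.4 seconds, and because the slow log only times command execution inside the command path, the stall never appears there.

```c
/* A minimal sketch (not Redis source) of a single-threaded event loop in the
 * style of aeMain()/aeProcessEvents(): file events (client reads/replies) and
 * time events (serverCron) share one thread, so a slow cron pass blocks all
 * clients even though no individual command is slow. */
#include <stdio.h>
#include <time.h>

static void process_file_events(void) {
    /* read queries, execute commands, queue replies -- this is the only part
     * whose duration a slow log would measure */
}

static void process_time_events(void) {
    /* pretend clientsCron hit ~50 oversized/idle query buffers and spent
     * about 1.4s resizing them, as in the packet capture */
    struct timespec ts = { .tv_sec = 1, .tv_nsec = 400000000 };
    nanosleep(&ts, NULL);
}

int main(void) {
    for (int tick = 0; tick < 3; tick++) {
        process_file_events();   /* no client is served while the next call runs */
        process_time_events();
        printf("loop iteration %d finished\n", tick);
    }
    return 0;
}
```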
Let's check the Redis source to confirm what redis-server was actually doing:
clientsCron (server.c):

```c
#define CLIENTS_CRON_MIN_ITERATIONS 5

void clientsCron(void) {
    /* Make sure to process at least numclients/server.hz of clients
     * per call. Since this function is called server.hz times per second
     * we are sure that in the worst case we process all the clients in 1
     * second. */
    int numclients = listLength(server.clients);
    int iterations = numclients/server.hz;
    mstime_t now = mstime();

    /* Process at least a few clients while we are at it, even if we need
     * to process less than CLIENTS_CRON_MIN_ITERATIONS to meet our contract
     * of processing each client once per second. */
    if (iterations < CLIENTS_CRON_MIN_ITERATIONS)
        iterations = (numclients < CLIENTS_CRON_MIN_ITERATIONS) ?
                     numclients : CLIENTS_CRON_MIN_ITERATIONS;

    while(listLength(server.clients) && iterations--) {
        client *c;
        listNode *head;

        /* Rotate the list, take the current head, process.
         * This way if the client must be removed from the list it's the
         * first element and we don't incur into O(N) computation. */
        listRotate(server.clients);
        head = listFirst(server.clients);
        c = listNodeValue(head);
        /* The following functions do different service checks on the client.
         * The protocol is that they return non-zero if the client was
         * terminated. */
        if (clientsCronHandleTimeout(c,now)) continue;
        if (clientsCronResizeQueryBuffer(c)) continue;
    }
}
```
clientsCron first looks at the current number of clients, which bounds how many connections are processed in one pass. Our production instances each carry just under 5,000 connections, which works out to roughly 50 connections handled per pass.
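As a quick sanity check on that figure, the sketch below (our own illustration, not Redis source) evaluates the same iterations formula. The ~5,000 client count is from above; the hz values are assumptions for illustration: roughly 50 per pass corresponds to server.hz being set to 100, while the default hz of 10 would give about 500.

```c
/* A standalone sketch (not Redis source) of the clientsCron iteration count.
 * numclients ~5,000 is from the article; the hz values are illustrative
 * assumptions. */
#include <stdio.h>

#define CLIENTS_CRON_MIN_ITERATIONS 5

static int cron_iterations(int numclients, int hz) {
    int iterations = numclients / hz;
    if (iterations < CLIENTS_CRON_MIN_ITERATIONS)
        iterations = (numclients < CLIENTS_CRON_MIN_ITERATIONS) ?
                     numclients : CLIENTS_CRON_MIN_ITERATIONS;
    return iterations;
}

int main(void) {
    printf("hz=10  -> %d clients per cron pass\n", cron_iterations(5000, 10));
    printf("hz=100 -> %d clients per cron pass\n", cron_iterations(5000, 100));
    return 0;
}
```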
clientsCronResizeQueryBuffer (server.c):

```c
/* The client query buffer is an sds.c string that can end with a lot of
 * free space not used, this function reclaims space if needed.
 *
 * The function always returns 0 as it never terminates the client. */
int clientsCronResizeQueryBuffer(client *c) {
    size_t querybuf_size = sdsAllocSize(c->querybuf);
    time_t idletime = server.unixtime - c->lastinteraction;

    /* The query buffer is resized only in the following two cases:
     * 1) The buffer is > BIG_ARG (PROTO_MBULK_BIG_ARG, defined in server.h
     *    as 1024*32) and is more than twice the peak usage this client
     *    reached recently.
     * 2) The client has been idle for more than 2s and the buffer is
     *    larger than 1k. */
    if (((querybuf_size > PROTO_MBULK_BIG_ARG) &&
         (querybuf_size/(c->querybuf_peak+1)) > 2) ||
        (querybuf_size > 1024 && idletime > 2))
    {
        /* Only resize the query buffer if it is actually wasting space. */
        if (sdsavail(c->querybuf) > 1024) {
            c->querybuf = sdsRemoveFreeSpace(c->querybuf);
        }
    }
    /* Reset the peak again to capture the peak memory usage in the next
     * cycle. */
    c->querybuf_peak = 0;
    return 0;
}
```
If a redisClient's query buffer meets the conditions, it is simply resized. Eligible connections fall into two groups: buffers that are genuinely large, more than twice the peak this client has recently used, and buffers on idle connections (idle > 2s); both additionally require that the free part of the buffer exceeds 1k. So the reason redis-server got stuck is that one cron pass happened to hit about 50 connections that were either large or idle, each with more than 1k of free space, and resized them all in one loop. Since Redis does all of its work on a single thread, that loop blocked every client. The direction of a fix is therefore clear: make the resize happen less often, or make it run faster.
Since the problem is in the query buffer, let's first look at where it gets modified:
createClient (networking.c):

```c
redisClient *createClient(int fd) {
    redisClient *c = zmalloc(sizeof(redisClient));

    /* passing -1 as fd it is possible to create a non connected client.
     * This is useful since all the Redis commands needs to be executed
     * in the context of a client. When commands are executed in other
     * contexts (for instance a Lua script) we need a non connected client. */
    if (fd != -1) {
        anetNonBlock(NULL,fd);
        anetEnableTcpNoDelay(NULL,fd);
        if (server.tcpkeepalive)
            anetKeepAlive(NULL,fd,server.tcpkeepalive);
        if (aeCreateFileEvent(server.el,fd,AE_READABLE,
            readQueryFromClient, c) == AE_ERR)
        {
            close(fd);
            zfree(c);
            return NULL;
        }
    }

    selectDb(c,0);
    c->id = server.next_client_id++;
    c->fd = fd;
    c->name = NULL;
    c->bufpos = 0;
    c->querybuf = sdsempty();   /* the query buffer starts out empty (length 0) */
    /* ...rest of function omitted... */
```

readQueryFromClient (networking.c):

```c
void readQueryFromClient(aeEventLoop *el, int fd, void *privdata, int mask) {
    redisClient *c = (redisClient*) privdata;
    int nread, readlen;
    size_t qblen;
    REDIS_NOTUSED(el);
    REDIS_NOTUSED(mask);

    server.current_client = c;
    readlen = REDIS_IOBUF_LEN;
    /* If this is a multi bulk request, and we are processing a bulk reply
     * that is large enough, try to maximize the probability that the query
     * buffer contains exactly the SDS string representing the object, even
     * at the risk of requiring more read(2) calls. This way the function
     * processMultiBulkBuffer() can avoid copying buffers to create the
     * Redis Object representing the argument. */
    if (c->reqtype == REDIS_REQ_MULTIBULK && c->multibulklen && c->bulklen != -1
        && c->bulklen >= REDIS_MBULK_BIG_ARG)
    {
        int remaining = (unsigned)(c->bulklen+2)-sdslen(c->querybuf);

        if (remaining < readlen) readlen = remaining;
    }

    qblen = sdslen(c->querybuf);
    if (c->querybuf_peak < qblen) c->querybuf_peak = qblen;
    c->querybuf = sdsMakeRoomFor(c->querybuf, readlen);  /* the buffer is grown here */
    /* ...rest of function omitted... */
```
So c->querybuf is grown to at least 1024*32 bytes as soon as a connection reads its first command (readlen defaults to REDIS_IOBUF_LEN, i.e. 16k, and sdsMakeRoomFor preallocates roughly twice the required space for buffers under 1MB). With that in mind, the resize logic is clearly flawed: every query buffer that has actually been used is at least 1024*32 bytes, yet the cleanup threshold is > 1024. That means every connection that has been used and then sits idle for more than 2 seconds gets resized, only to be re-grown to 1024*32 on its next request. On a busy cluster, memory is therefore reclaimed and reallocated over and over for no benefit. So we changed the cleanup condition as follows, which avoids most of the unnecessary resizes:
```c
    if (((querybuf_size > REDIS_MBULK_BIG_ARG) &&
         (querybuf_size/(c->querybuf_peak+1)) > 2) ||
        (querybuf_size > 1024*32 && idletime > 2))
    {
        /* Only resize the query buffer if it is actually wasting space. */
        if (sdsavail(c->querybuf) > 1024*32) {
            c->querybuf = sdsRemoveFreeSpace(c->querybuf);
        }
    }
```
The cost of this change is memory: at about 5k connections per instance, 5000*1024*32 ≈ 160MB, which is perfectly acceptable on servers with hundreds of gigabytes of RAM.
【The Problem Recurs】
After deploying the Redis server built from the modified source, the problem recurred anyway: clients still reported the same kind of errors, and server memory still jittered when they did. The stack traces captured this time were:
```
Thu Jun 14 21:56:54 CST 2018
#3 0x0000003729ee893d in clone () from /lib64/libc.so.6
Thread 1 (Thread 0x7f2dc108d720 (LWP 27851)):
#0 0x0000003729ee5400 in madvise () from /lib64/libc.so.6
#1 0x0000000000493a1e in je_pages_purge ()
#2 0x000000000048cf40 in arena_purge ()
#3 0x00000000004a7dad in je_tcache_bin_flush_large ()
#4 0x00000000004a85e9 in je_tcache_event_hard ()
#5 0x000000000042c0b5 in decrRefCount ()
#6 0x000000000042744d in resetClient ()
#7 0x000000000042963b in processInputBuffer ()
#8 0x0000000000429762 in readQueryFromClient ()
#9 0x000000000041847c in aeProcessEvents ()
#10 0x000000000041873b in aeMain ()
#11 0x0000000000420fce in main ()
Thu Jun 14 21:56:54 CST 2018
Thread 1 (Thread 0x7f2dc108d720 (LWP 27851)):
#0 0x0000003729ee5400 in madvise () from /lib64/libc.so.6
#1 0x0000000000493a1e in je_pages_purge ()
#2 0x000000000048cf40 in arena_purge ()
#3 0x00000000004a7dad in je_tcache_bin_flush_large ()
#4 0x00000000004a85e9 in je_tcache_event_hard ()
#5 0x000000000042c0b5 in decrRefCount ()
#6 0x000000000042744d in resetClient ()
#7 0x000000000042963b in processInputBuffer ()
#8 0x0000000000429762 in readQueryFromClient ()
#9 0x000000000041847c in aeProcessEvents ()
#10 0x000000000041873b in aeMain ()
#11 0x0000000000420fce in main ()
```
Clearly the frequent query-buffer resizes had been optimized away, yet clients still reported errors, and we were stuck once more. Could something else be slowing the query-buffer resize down? We captured pstack again, and this time jemalloc caught our attention. Recall how Redis manages memory: to avoid the heavy fragmentation that results when libc does not release memory, Redis uses jemalloc as its default allocator, and the stacks around these errors all end in je_pages_purge(), i.e. Redis calling into jemalloc to reclaim dirty pages. Let's see what jemalloc is doing there:
arena_purge (arena.c):

```c
static void
arena_purge(arena_t *arena, bool all)
{
    arena_chunk_t *chunk;
    size_t npurgatory;
    if (config_debug) {
        size_t ndirty = 0;

        arena_chunk_dirty_iter(&arena->chunks_dirty, NULL,
            chunks_dirty_iter_cb, (void *)&ndirty);
        assert(ndirty == arena->ndirty);
    }
    assert(arena->ndirty > arena->npurgatory || all);
    assert((arena->nactive >> opt_lg_dirty_mult) < (arena->ndirty -
        arena->npurgatory) || all);

    if (config_stats)
        arena->stats.npurge++;

    npurgatory = arena_compute_npurgatory(arena, all);
    arena->npurgatory += npurgatory;

    while (npurgatory > 0) {
        size_t npurgeable, npurged, nunpurged;

        /* Get next chunk with dirty pages. */
        chunk = arena_chunk_dirty_first(&arena->chunks_dirty);
        if (chunk == NULL) {
            arena->npurgatory -= npurgatory;
            return;
        }
        npurgeable = chunk->ndirty;
        assert(npurgeable != 0);

        if (npurgeable > npurgatory && chunk->nruns_adjac == 0) {
            arena->npurgatory += npurgeable - npurgatory;
            npurgatory = npurgeable;
        }

        arena->npurgatory -= npurgeable;
        npurgatory -= npurgeable;
        npurged = arena_chunk_purge(arena, chunk, all);
        nunpurged = npurgeable - npurged;
        arena->npurgatory += nunpurged;
        npurgatory += nunpurged;
    }
}
```
On every purge, this version of jemalloc walks all the chunks that actually need cleaning and counts them, which is an expensive habit for a latency-sensitive system, so we looked at upgrading jemalloc to speed up purging. Redis 4.0 brought large performance improvements and can reclaim memory on demand via a command (MEMORY PURGE), and we were already preparing to roll it out. The jemalloc bundled with Redis 4.0 comes from the 4.x line, where arena_purge() was heavily reworked: the counting call is gone, much of the branching is simplified, a new arena_stash_dirty() merges the earlier computation and checks, and a purge_runs_sentinel is introduced; dirty runs are kept in a per-arena LRU instead of in the dirty-run-containing chunks of the arena tree, which greatly reduces the volume purged, and memory blocks are no longer moved during reclamation. The code:
arena_purge (arena.c):

```c
static void
arena_purge(arena_t *arena, bool all)
{
    chunk_hooks_t chunk_hooks = chunk_hooks_get(arena);
    size_t npurge, npurgeable, npurged;
    arena_runs_dirty_link_t purge_runs_sentinel;
    extent_node_t purge_chunks_sentinel;

    arena->purging = true;

    /*
     * Calls to arena_dirty_count() are disabled even for debug builds
     * because overhead grows nonlinearly as memory usage increases.
     */
    if (false && config_debug) {
        size_t ndirty = arena_dirty_count(arena);
        assert(ndirty == arena->ndirty);
    }
    assert((arena->nactive >> arena->lg_dirty_mult) < arena->ndirty || all);

    if (config_stats)
        arena->stats.npurge++;

    npurge = arena_compute_npurge(arena, all);
    qr_new(&purge_runs_sentinel, rd_link);
    extent_node_dirty_linkage_init(&purge_chunks_sentinel);

    npurgeable = arena_stash_dirty(arena, &chunk_hooks, all, npurge,
        &purge_runs_sentinel, &purge_chunks_sentinel);
    assert(npurgeable >= npurge);
    npurged = arena_purge_stashed(arena, &chunk_hooks,
        &purge_runs_sentinel, &purge_chunks_sentinel);
    assert(npurged == npurgeable);
    arena_unstash_purged(arena, &chunk_hooks, &purge_runs_sentinel,
        &purge_chunks_sentinel);

    arena->purging = false;
}
```
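For reference, the primitive at the top of all the captured purge stacks is the same in both jemalloc versions: madvise(MADV_DONTNEED) on runs of dirty pages. The sketch below (our own illustration, not jemalloc source; the 64MB region size is arbitrary) shows what that call does on Linux: it hands the physical pages back to the kernel while keeping the mapping valid, and doing this over many runs, together with the bookkeeping around it, is where the latency seen in the stacks comes from.

```c
/* Minimal sketch (not jemalloc source) of what "purging dirty pages" means at
 * the OS level: jemalloc's je_pages_purge() ends up calling
 * madvise(MADV_DONTNEED) on page runs it no longer needs, which is the frame
 * at the top of the captured stacks. */
#define _DEFAULT_SOURCE
#include <stddef.h>
#include <sys/mman.h>

int main(void) {
    size_t len = 64u * 1024 * 1024;   /* 64MB anonymous mapping */
    char *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
                   MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (p == MAP_FAILED) return 1;

    /* Touch every page so the region becomes dirty/resident. */
    for (size_t i = 0; i < len; i += 4096) p[i] = 1;

    /* Give the physical pages back to the kernel. The mapping stays valid and
     * reads back as zeroes; repeating this over many dirty runs is the work
     * that showed up as latency in the pstack samples. */
    madvise(p, len, MADV_DONTNEED);

    munmap(p, len);
    return 0;
}
```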
【Solving the Problem】
In practice we had several options: switch from jemalloc to Google's tcmalloc, upgrade jemalloc, and so on. Based on the analysis above we went with upgrading jemalloc, which in practice meant upgrading Redis itself. After upgrading Redis to 4.0.9 and watching the cluster, the nagging client connection-timeout problem was gone.
【Problem Summary】
Redis is widely used in production because it handles high concurrency, responds quickly, and is easy to operate, but for operators those latency expectations bring all sorts of problems, and connection timeouts are among the most typical. Starting from the symptom of client connection timeouts, we worked through captures of client-server network traffic and server stack traces to localize the problem, were misled by a few red herrings along the way, and finally solved it by upgrading jemalloc (by upgrading Redis). The most valuable thing to take away from this case is the overall approach to the analysis.