PostgreSQL: counting rows without count

The usual approach

select count(1) from table_name;

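This scans the entire table, so the more rows there are, the slower the query.

An alternative

PostgreSQL does provide another route: the system catalog pg_class. It stores statistics for every table, and reltuples is the estimated number of rows in the corresponding table. The estimate is maintained in the background: a separate process (autovacuum, which runs VACUUM and ANALYZE) periodically scans tables, collects their statistics, and saves them in the system catalogs.

The query looks like this: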

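-- reltuples is a planner estimate, not an exact count;
-- bigint avoids overflow on tables with more than ~2.1 billion rows
select
  reltuples::bigint as total
from
  pg_class
where
  relname = 'table_name'
  and relnamespace = (select oid from pg_namespace where nspname = 'schema');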

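This approach is not general purpose. If you need an exact figure, still use select count(1). For something like pagination, where there are a large number of pages and the total does not have to be precise, it is a good method.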
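For the pagination case, here is a minimal sketch of how the estimate might be consumed on the application side. The helper name `page_count_from_estimate` and its fallback behaviour are illustrative assumptions, not part of PostgreSQL. The one fact it relies on is that since PostgreSQL 14, reltuples is -1 for a table that has never been vacuumed or analyzed.

```python
import math
from typing import Optional


def page_count_from_estimate(reltuples: float, page_size: int) -> Optional[int]:
    """Return an approximate page count, or None if no estimate exists.

    Hypothetical helper: `reltuples` is the value read from pg_class.
    A negative value (-1 since PostgreSQL 14) means the table has never
    been vacuumed or analyzed, so the caller should fall back to an
    exact count instead.
    """
    if page_size <= 0:
        raise ValueError("page_size must be positive")
    if reltuples < 0:
        return None
    return math.ceil(reltuples / page_size)


print(page_count_from_estimate(9999872.0, 20))  # estimate available
print(page_count_from_estimate(-1.0, 20))       # no statistics yet
```

Returning None rather than 0 keeps "no statistics yet" distinguishable from "the table looks empty", which matters when deciding whether to run an exact count.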

Counting rows with count(1) over

select count(1) over(), * from table_name;
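A small runnable sketch of what this window form returns when combined with paging. SQLite (3.25 or newer, needed for window functions) is used purely as a stand-in so the example runs without a server; the table contents are made up, and the paging clause is an addition for illustration.

```python
import sqlite3

# SQLite stands in for PostgreSQL here; the table and rows are hypothetical.
conn = sqlite3.connect(":memory:")
conn.execute("create table table_name (id integer primary key, content text)")
conn.executemany(
    "insert into table_name (id, content) values (?, ?)",
    [(i, f"row {i}") for i in range(1, 11)],
)

# The window function is evaluated before LIMIT, so `total` is the size
# of the whole result set even though only one page of rows comes back.
page = conn.execute(
    "select count(1) over () as total, id, content "
    "from table_name order by id limit 3 offset 0"
).fetchall()

for total, row_id, content in page:
    print(total, row_id, content)
```

Every returned row carries the same total. The total is still computed by reading every qualifying row, so this form saves a second round trip rather than the counting work itself.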

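Addendum

count is one of the most commonly used aggregate functions. It looks simple, but it has a few pitfalls:

1. count(*): returns the number of rows in the result set; rows containing NULL are counted too.

2. count(1): essentially the same as count(*). Before PostgreSQL 9.2 both always scanned the whole table; 9.2 added index-only scans, so the planner may scan an index (typically the primary key) instead. It is sometimes claimed that count(1) is faster when the table has no primary key or has many columns, because it "does not consider all the columns". That does not hold: count(*) does not read the column values either, and the section on count(1) versus count(*) later in this article reaches the same conclusion.

3. count(field): returns the number of rows in which the given column is not NULL.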
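The three cases can be checked side by side. This is a minimal sketch that uses SQLite from the Python standard library as a stand-in (an assumption made so it runs without a PostgreSQL server); the NULL handling of count shown here follows the SQL standard, which PostgreSQL follows as well. The table `t` and its rows are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("create table t (id integer primary key, field text)")
# Two of the four rows have NULL in `field`.
conn.executemany(
    "insert into t (id, field) values (?, ?)",
    [(1, "a"), (2, None), (3, "c"), (4, None)],
)

star, one, field = conn.execute(
    "select count(*), count(1), count(field) from t"
).fetchone()

# count(*) and count(1) count rows; count(field) skips NULLs.
print(star, one, field)
```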

Further reading: understanding the behavior of count in PostgreSQL

How to use count has long been debated, especially in MySQL. Does PostgreSQL, whose popularity keeps growing, have similar issues? Let's run some experiments to understand how count behaves in PostgreSQL.

Building the test database

Create a test database and a test table. The table has three columns: an auto-increment ID, a creation time, and a content field. The auto-increment ID is the primary key.

create database performance_test;

create table test_tbl (id serial primary key, created_at timestamp, content varchar(512));

Generating test data

Use generate_series to produce the IDs and now() for the created_at column. For the content column, repeat(md5(random()::text), 10) produces a 32-character md5 string repeated 10 times (320 characters). The following statement inserts 10 million rows for testing.

performance_test=# insert into test_tbl select generate_series(1,10000000),now(),repeat(md5(random()::text),10);
INSERT 0 10000000
Time: 212184.223 ms (03:32.184)

Questions raised by the count statement

psql does not display statement execution time by default, so turn it on manually to make the comparisons below easier.

\timing on

Whether count(*) and count(1) differ in performance is a frequently discussed question. Run one query with each.

performance_test=# select count(*) from test_tbl;
 count
----------
 10000000
(1 row)
 
Time: 115090.380 ms (01:55.090)
 
performance_test=# select count(1) from test_tbl;
 count
----------
 10000000
(1 row)
 
Time: 738.502 ms

The two queries differ enormously in speed. Does count(1) really give that much of a speedup? Run both queries again.

performance_test=# select count(*) from test_tbl;
 count
----------
 10000000
(1 row)
 
Time: 657.831 ms
 
performance_test=# select count(1) from test_tbl;
 count
----------
 10000000
(1 row)
 
Time: 682.157 ms

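The first query was very slow, while the next three were fast and took about the same time. This raises two questions:

Why was the first query so slow?

Is there actually a performance difference between count(*) and count(1)?

Caching

Re-run the query with explain: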

explain (analyze,buffers,verbose) select count(*) from test_tbl;

The output looks like this:

Finalize Aggregate (cost=529273.69..529273.70 rows=1 width=8) (actual time=882.569..882.570 rows=1 loops=1)
  Output: count(*)
  Buffers: shared hit=96 read=476095
  -> Gather (cost=529273.48..529273.69 rows=2 width=8) (actual time=882.492..884.170 rows=3 loops=1)
     Output: (PARTIAL count(*))
     Workers Planned: 2
     Workers Launched: 2
     Buffers: shared hit=96 read=476095
     -> Partial Aggregate (cost=528273.48..528273.49 rows=1 width=8) (actual time=881.014..881.014 rows=1 loops=3)
        Output: PARTIAL count(*)
        Buffers: shared hit=96 read=476095
        Worker 0: actual time=880.319..880.319 rows=1 loops=1
         Buffers: shared hit=34 read=158206
        Worker 1: actual time=880.369..880.369 rows=1 loops=1
         Buffers: shared hit=29 read=156424
        -> Parallel Seq Scan on public.test_tbl (cost=0.00..517856.98 rows=4166598 width=0) (actual time=0.029..662.165 rows=3333333 loops=3)
           Buffers: shared hit=96 read=476095
           Worker 0: actual time=0.026..661.807 rows=3323029 loops=1
            Buffers: shared hit=34 read=158206
           Worker 1: actual time=0.030..660.197 rows=3285513 loops=1
            Buffers: shared hit=29 read=156424
 Planning time: 0.043 ms
 Execution time: 884.207 ms

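Note the Buffers lines. shared hit counts pages found in PostgreSQL's shared buffers, while shared read counts pages that had to be requested from the operating system, which may still serve them from its own page cache rather than from disk. Here almost every page is a read, yet the query finishes in under a second, so the data is evidently cached in memory. That explains why the later queries were so much faster than the first. Next, clear the cache and restart PostgreSQL.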

service postgresql stop
echo 1 > /proc/sys/vm/drop_caches
service postgresql start

Run the SQL statement again; it is now much slower.

 Finalize Aggregate (cost=529273.69..529273.70 rows=1 width=8) (actual time=50604.564..50604.564 rows=1 loops=1)
  Output: count(*)
  Buffers: shared read=476191
  -> Gather (cost=529273.48..529273.69 rows=2 width=8) (actual time=50604.508..50606.141 rows=3 loops=1)
     Output: (PARTIAL count(*))
     Workers Planned: 2
     Workers Launched: 2
     Buffers: shared read=476191
     -> Partial Aggregate (cost=528273.48..528273.49 rows=1 width=8) (actual time=50591.550..50591.551 rows=1 loops=3)
        Output: PARTIAL count(*)
        Buffers: shared read=476191
        Worker 0: actual time=50585.182..50585.182 rows=1 loops=1
         Buffers: shared read=158122
        Worker 1: actual time=50585.181..50585.181 rows=1 loops=1
         Buffers: shared read=161123
        -> Parallel Seq Scan on public.test_tbl (cost=0.00..517856.98 rows=4166598 width=0) (actual time=92.491..50369.691 rows=3333333 loops=3)
           Buffers: shared read=476191
           Worker 0: actual time=122.170..50362.271 rows=3320562 loops=1
            Buffers: shared read=158122
           Worker 1: actual time=14.020..50359.733 rows=3383583 loops=1
            Buffers: shared read=161123
 Planning time: 11.537 ms
 Execution time: 50606.215 ms

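shared read means the page was not found in shared buffers and had to be requested from the operating system; with the OS cache dropped as well, every page now comes from disk. From this we can infer that, of the four queries in the previous section, the first did not hit the cache and the remaining three did.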
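To compare the two plans numerically, the Buffers lines can be parsed directly. The sketch below is a hypothetical helper, not a PostgreSQL tool, and it assumes the default 8 kB block size when converting pages to MiB.

```python
import re

PAGE_SIZE = 8192  # assumption: default PostgreSQL block size in bytes


def buffer_stats(line: str) -> dict:
    """Parse a 'Buffers: shared hit=N read=M' line from EXPLAIN output."""
    hit = re.search(r"\bhit=(\d+)", line)
    read = re.search(r"\bread=(\d+)", line)
    hits = int(hit.group(1)) if hit else 0
    reads = int(read.group(1)) if read else 0
    total = hits + reads
    return {
        "hit": hits,
        "read": reads,
        "hit_ratio": hits / total if total else 0.0,
        "mib_touched": total * PAGE_SIZE / 2**20,
    }


# Top-level Buffers lines copied from the two plans above.
warm = buffer_stats("Buffers: shared hit=96 read=476095")
cold = buffer_stats("Buffers: shared read=476191")
print(warm)
print(cold)
```

Both runs touch the same 476191 pages. In the fast run only 96 of them came from shared buffers, which suggests that most of the speedup came from the operating system's page cache rather than from PostgreSQL's own buffers.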

The difference between count(1) and count(*)

Now for the difference between count(1) and count(*). Think back to the first four queries: the first used count(*) and the second used count(1), yet the second still benefited from the cache. Doesn't that suggest count(1) and count(*) are the same?

In fact, the official PostgreSQL reply to the question "is there a difference performance-wise between select count(1) and select count(*)?" confirms this:

Nope. In fact, the latter is converted to the former during parsing.[2]

Since count(1) performs no better than count(*), count(*) is the better choice.

Sequential scan and index scan

Next, measure the speed of count(*) at different data sizes. Put the query in a file named count.sql and benchmark it with pgbench.

pgbench -c 5 -t 20 performance_test -r -f count.sql

Measure the time taken by the count statement for 2 million to 10 million rows:

Rows	count time (ms)
2,000,000	738.758
3,000,000	1035.846
4,000,000	1426.183
5,000,000	1799.866
6,000,000	2117.247
7,000,000	2514.691
8,000,000	2526.441
9,000,000	2568.240
10,000,000	2650.434
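The shape of these numbers is easier to see as differences between consecutive measurements. A minimal sketch using the figures from the table; the 100 ms cut-off used to detect the flattening is an arbitrary assumption chosen for illustration.

```python
# Timings copied from the table above: (rows, count time in ms).
timings = [
    (2_000_000, 738.758),
    (3_000_000, 1035.846),
    (4_000_000, 1426.183),
    (5_000_000, 1799.866),
    (6_000_000, 2117.247),
    (7_000_000, 2514.691),
    (8_000_000, 2526.441),
    (9_000_000, 2568.240),
    (10_000_000, 2650.434),
]

# Extra time spent for each additional million rows.
steps = [
    (rows_b, round(ms_b - ms_a, 3))
    for (rows_a, ms_a), (rows_b, ms_b) in zip(timings, timings[1:])
]
for rows, delta in steps:
    print(f"up to {rows:>10,} rows: +{delta:8.3f} ms")

# First step whose increase falls below the (hypothetical) 100 ms cut-off.
knee = next(rows for rows, delta in steps if delta < 100)
print("growth flattens at the step ending at", f"{knee:,}", "rows")
```

Each of the first five steps adds roughly 300 to 400 ms, while the remaining steps add well under 100 ms each.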

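[Figure: count time (ms) plotted against the number of rows]

The curve bends at around 7 million rows: from 2 million to 7 million rows the time grows roughly linearly, and beyond 7 million the count time stays about the same. Use explain to see how the count statement is executed at 6 million and at 7 million rows.

7 million rows: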

Finalize Aggregate (cost=502185.93..502185.94 rows=1 width=8) (actual time=894.361..894.361 rows=1 loops=1)
  Output: count(*)
  Buffers: shared hit=16344 read=352463
  -> Gather (cost=502185.72..502185.93 rows=2 width=8) (actual time=894.232..899.763 rows=3 loops=1)
     Output: (PARTIAL count(*))
     Workers Planned: 2
     Workers Launched: 2
     Buffers: shared hit=16344 read=352463
     -> Partial Aggregate (cost=501185.72..501185.73 rows=1 width=8) (actual time=889.371..889.371 rows=1 loops=3)
        Output: PARTIAL count(*)
        Buffers: shared hit=16344 read=352463
        Worker 0: actual time=887.112..887.112 rows=1 loops=1
         Buffers: shared hit=5459 read=118070
        Worker 1: actual time=887.120..887.120 rows=1 loops=1
         Buffers: shared hit=5601 read=117051
        -> Parallel Index Only Scan using test_tbl_pkey on public.test_tbl (cost=0.43..493863.32 rows=2928960 width=0) (actual time=0.112..736.376 rows=2333333 loops=3)
           Index Cond: (test_tbl.id  7000000)
           Heap Fetches: 2328492
           Buffers: shared hit=16344 read=352463
           Worker 0: actual time=0.107..737.180 rows=2344479 loops=1
            Buffers: shared hit=5459 read=118070
           Worker 1: actual time=0.133..737.960 rows=2327028 loops=1
            Buffers: shared hit=5601 read=117051
 Planning time: 0.165 ms
 Execution time: 899.857 ms

6 million rows:

Finalize Aggregate (cost=429990.94..429990.95 rows=1 width=8) (actual time=765.575..765.575 rows=1 loops=1)
  Output: count(*)
  Buffers: shared hit=13999 read=302112
  -> Gather (cost=429990.72..429990.93 rows=2 width=8) (actual time=765.557..770.889 rows=3 loops=1)
     Output: (PARTIAL count(*))
     Workers Planned: 2
     Workers Launched: 2
     Buffers: shared hit=13999 read=302112
     -> Partial Aggregate (cost=428990.72..428990.73 rows=1 width=8) (actual time=763.821..763.821 rows=1 loops=3)
        Output: PARTIAL count(*)
        Buffers: shared hit=13999 read=302112
        Worker 0: actual time=762.742..762.742 rows=1 loops=1
         Buffers: shared hit=4638 read=98875
        Worker 1: actual time=763.308..763.308 rows=1 loops=1
         Buffers: shared hit=4696 read=101570
        -> Parallel Index Only Scan using test_tbl_pkey on public.test_tbl (cost=0.43..422723.16 rows=2507026 width=0) (actual time=0.053..632.199 rows=2000000 loops=3)
           Index Cond: (test_tbl.id  6000000)
           Heap Fetches: 2018490
           Buffers: shared hit=13999 read=302112
           Worker 0: actual time=0.059..633.156 rows=1964483 loops=1
            Buffers: shared hit=4638 read=98875
           Worker 1: actual time=0.038..634.271 rows=2017026 loops=1
            Buffers: shared hit=4696 read=101570
 Planning time: 0.055 ms
 Execution time: 770.921 ms

Judging from the above, PostgreSQL appears to use an index scan only when the number of rows being counted is below some fraction of the table's size. The official wiki contains a related description:

It is important to realise that the planner is concerned with minimising the total cost of the query. With databases, the cost of I/O typically dominates. For that reason, "count(*) without any predicate" queries will only use an index-only scan if the index is significantly smaller than its table. This typically only happens when the table's row width is much wider than some indexes'.[3]

According to an answer on Stack Overflow, once the number of rows covered by the count query exceeds 3/4 of the table, a full table scan is used instead of an index scan[4].

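Conclusions

Do not use count(1) or count(column_name) in place of count(*)

count is itself an expensive operation

count may use an index scan or a sequential scan, depending on what fraction of the table is being counted

The above is based on personal experience and is offered as a reference. If anything is wrong or incomplete, corrections are welcome.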
