PostgreSQL Primary/Standby Switchover: The Trigger-File Method Explained

The tests in this article follow the book 《PostgreSQL实战》.

Test environment for this article:

Primary IP: 192.168.40.130  Hostname: postgres  Port: 5442

Standby IP: 192.168.40.131  Hostname: postgreshot  Port: 5442

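In PostgreSQL 9.0, a streaming-replication switchover could only be performed by creating a trigger file. This section walks through that method. The test environment is one primary and one standby using asynchronous streaming replication: the database on host postgres is the primary and the database on host postgreshot is the standby. The main steps of a manual trigger-file switchover are: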

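1) On the standby, set the trigger_file parameter in recovery.conf to the path and name of the trigger file that will activate the standby.

2) Shut down the primary; -m fast mode is recommended.

3) Create the trigger file on the standby to activate it. If recovery.conf has been renamed to recovery.done, the standby has been promoted to primary.

4) Turn the old primary into a standby: create a recovery.conf file in the old primary's $PGDATA directory (if no recovery.conf exists there, copy one from the template $PGHOME/share/recovery.conf.sample; if a recovery.done file exists there, rename it to recovery.conf). The configuration is the same as on the old standby, except that the IP in primary_conninfo is changed to the peer's IP.

5) Start the old primary and check that the primary and standby processes are running normally; if they are, the switchover succeeded.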

1. First, configure recovery.conf on the standby, as shown below:

[postgres@postgreshot pg11]$ cat recovery.conf | grep -v '^#'
recovery_target_timeline = 'latest'
standby_mode = on
primary_conninfo = 'host=192.168.40.130 port=5442 user=replica application_name=pg1'  # e.g. 'host=localhost port=5432'
trigger_file = '/home/postgres/pg11/trigger'
[postgres@postgreshot pg11]$

trigger_file may point to an ordinary file or a hidden file. After changing the parameters above, restart the standby so that they take effect.
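Before creating the trigger file it can help to confirm which path the standby is actually watching. The following is a minimal sketch, not part of the original procedure: `get_trigger_file` is a hypothetical helper that reads a recovery.conf-style file and prints the quoted trigger_file value, ignoring commented-out lines.

```shell
# Hypothetical helper (not a PostgreSQL tool): print the trigger_file value
# from a recovery.conf-style file. Lines starting with '#' never match, and
# if the parameter appears more than once only the last match is printed.
get_trigger_file() {
    sed -n "s/^[[:space:]]*trigger_file[[:space:]]*=[[:space:]]*'\([^']*\)'.*/\1/p" "$1" | tail -n 1
}

# Example call (path assumed from this article's environment):
#   get_trigger_file /home/postgres/pg11/recovery.conf
```

With the recovery.conf listed above, this prints /home/postgres/pg11/trigger, which is the path to pass to touch in step 3.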

2. Shut down the primary, as shown below:

[postgres@postgres pg11]$ pg_ctl stop -m fast
waiting for server to shut down.... done
server stopped
[postgres@postgres pg11]$

3. Create the trigger file on the standby to activate it, as shown below:

[postgres@postgreshot pg11]$ ll recovery.conf 
-rwx------ 1 postgres postgres 5.9K Mar 26 18:47 recovery.conf
[postgres@postgreshot pg11]$ 
[postgres@postgreshot pg11]$ touch /home/postgres/pg11/trigger
[postgres@postgreshot pg11]$ ll recovery*
-rwx------ 1 postgres postgres 5.9K Mar 26 18:47 recovery.done
[postgres@postgreshot pg11]$

The trigger file's name and path must match the trigger_file setting in recovery.conf. Listing the recovery files again shows that the suffix has changed from .conf to .done.
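The rename from recovery.conf to recovery.done can also be checked from a script rather than by eye. A minimal sketch, assuming a PostgreSQL 11-style data directory as used in this article; `is_promoted` and `wait_for_promotion` are hypothetical helper names, and the 30-second default timeout is an arbitrary choice.

```shell
# Hypothetical helpers, not PostgreSQL tools.

# Succeed when the data directory shows the post-promotion file layout:
# recovery.done is present and recovery.conf is gone.
is_promoted() {
    [ -f "$1/recovery.done" ] && [ ! -e "$1/recovery.conf" ]
}

# Poll once per second until is_promoted succeeds or the timeout expires.
# Usage: wait_for_promotion DATADIR [TIMEOUT_SECONDS]
wait_for_promotion() {
    dir=$1
    timeout=${2:-30}
    waited=0
    while [ "$waited" -lt "$timeout" ]; do
        is_promoted "$dir" && return 0
        sleep 1
        waited=$((waited + 1))
    done
    is_promoted "$dir"
}

# Example call:
#   wait_for_promotion "$PGDATA" 60 && echo "standby promoted"
```

The file rename is only one signal. The server log and pg_controldata output remain the fuller confirmation.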

Check the standby's database log, as shown below:

2019-03-26 23:30:19.399 EDT [93162] LOG: replication terminated by primary server
2019-03-26 23:30:19.399 EDT [93162] DETAIL: End of WAL reached on timeline 3 at 0/50003D0.
2019-03-26 23:30:19.399 EDT [93162] FATAL: could not send end-of-streaming message to primary: no COPY in progress
2019-03-26 23:30:19.399 EDT [93158] LOG: invalid record length at 0/50003D0: wanted 24, got 0
2019-03-26 23:30:19.405 EDT [125172] FATAL: could not connect to the primary server: server closed the connection unexpectedly
  This probably means the server terminated abnormally
  before or while processing the request.
2019-03-26 23:30:24.410 EDT [125179] FATAL: could not connect to the primary server: could not connect to server: Connection refused
  Is the server running on host "192.168.40.130" and accepting
  TCP/IP connections on port 5442?
2019-03-26 23:31:49.505 EDT [93158] LOG: trigger file found: /home/postgres/pg11/trigger
2019-03-26 23:31:49.506 EDT [93158] LOG: redo done at 0/5000360
2019-03-26 23:31:49.506 EDT [93158] LOG: last completed transaction was at log time 2019-03-26 19:03:11.202845-04
2019-03-26 23:31:49.516 EDT [93158] LOG: selected new timeline ID: 4
2019-03-26 23:31:50.063 EDT [93158] LOG: archive recovery complete
2019-03-26 23:31:50.083 EDT [93157] LOG: database system is ready to accept connections

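Because the primary was shut down, the standby's log first reports that it cannot connect to the primary, then that the trigger file was found, and finally that recovery completed and the database switched to read-write mode.

Now verify with the pg_controldata output, as shown below: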

[postgres@postgreshot ~]$ pg_controldata | grep cluster
Database cluster state:  in production
[postgres@postgreshot ~]$
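The same check can be turned into a small role probe. The sketch below is an illustration only: `cluster_role` is a hypothetical helper that reads pg_controldata output on standard input and maps the two states seen in this article to a role name. It assumes English (untranslated) pg_controldata messages.

```shell
# Hypothetical helper: classify a node from pg_controldata output on stdin.
# "in production" is reported as primary, "in archive recovery" as standby;
# any other state (for example a stopped server) is printed as "other".
cluster_role() {
    state=$(sed -n 's/^Database cluster state:[[:space:]]*//p')
    case "$state" in
        "in production")       echo "primary" ;;
        "in archive recovery") echo "standby" ;;
        *)                     echo "other: $state" ;;
    esac
}

# Example call on a database host:
#   pg_controldata | cluster_role
```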

The output shows that the database now has the primary role. On postgreshot, create a table named test_alived2 and insert a row, as shown below:

postgres=# CREATE TABLE test_alived2(id int4);
CREATE TABLE
postgres=# INSERT INTO test_alived2 VALUES(1);
INSERT 0 1
postgres=#

4. Prepare to turn the old primary into a standby by configuring recovery.conf on the old primary, as shown below:

[postgres@postgres pg11]$ cat recovery.conf | grep -v '^#'
recovery_target_timeline = 'latest'
standby_mode = on
primary_conninfo = 'host=192.168.40.131 port=5442 user=replica application_name=pg2'  # e.g. 'host=localhost port=5432'
trigger_file = '/home/postgres/pg11/trigger'
[postgres@postgres pg11]$

This configuration is essentially the same as the recovery.done file on postgreshot, except that the host option of primary_conninfo is set to the peer host's IP.
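Since the two nodes' files differ only in the peer address and application name, the file can be generated instead of edited by hand. A minimal sketch, assuming the PostgreSQL 11 recovery.conf format used in this article; `make_recovery_conf` is a hypothetical helper, and the replica user and trigger path are copied from the listings above rather than being defaults of any tool.

```shell
# Hypothetical helper: print a recovery.conf body for a standby that
# follows the given peer. The user name and trigger path are the ones
# used in this article's test environment.
# Usage: make_recovery_conf PEER_HOST PEER_PORT APPLICATION_NAME
make_recovery_conf() {
    peer_host=$1
    peer_port=$2
    app_name=$3
    cat <<EOF
recovery_target_timeline = 'latest'
standby_mode = on
primary_conninfo = 'host=$peer_host port=$peer_port user=replica application_name=$app_name'
trigger_file = '/home/postgres/pg11/trigger'
EOF
}

# Example call on the old primary (writes the file shown above):
#   make_recovery_conf 192.168.40.131 5442 pg2 > "$PGDATA/recovery.conf"
```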

Next, on host postgres, create a ~/.pgpass file in the user's home directory, as shown below:

[postgres@postgres ~]$ touch ~/.pgpass
[postgres@postgres ~]$ chmod 600 ~/.pgpass

Then add the following content to ~/.pgpass:

[postgres@postgres ~]$ cat .pgpass
192.168.40.130:5442:replication:replica:replica
192.168.40.131:5442:replication:replica:replica
[postgres@postgres ~]$
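The touch, chmod and edit steps can be folded into one repeatable helper. This is a sketch only: `add_pgpass_entry` is a hypothetical name, and the entry string is whatever line you want in the file. It keeps the 0600 mode that libpq requires (a password file readable by group or others is ignored) and avoids duplicate lines. Entries for streaming replication use replication in the database field, as in the listing above.

```shell
# Hypothetical helper: append ENTRY to a .pgpass-style FILE unless the exact
# line is already present, and make sure the file mode is 0600.
# Usage: add_pgpass_entry FILE ENTRY
add_pgpass_entry() {
    file=$1
    entry=$2
    touch "$file"
    chmod 600 "$file"
    # -x: whole-line match, -F: fixed string, -q: quiet
    grep -qxF -- "$entry" "$file" || printf '%s\n' "$entry" >> "$file"
}

# Example call:
#   add_pgpass_entry ~/.pgpass "192.168.40.131:5442:replication:replica:replica"
```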

Then start the database on postgres, as shown below:

[postgres@postgres ~]$ pg_ctl start
waiting for server to start....2019-03-26 23:38:50.424 EDT [55380] LOG: listening on IPv4 address "0.0.0.0", port 5442
2019-03-26 23:38:50.424 EDT [55380] LOG: listening on IPv6 address "::", port 5442
2019-03-26 23:38:50.443 EDT [55380] LOG: listening on Unix socket "/tmp/.s.PGSQL.5442"
2019-03-26 23:38:50.465 EDT [55381] LOG: database system was shut down in recovery at 2019-03-26 23:38:20 EDT
2019-03-26 23:38:50.465 EDT [55381] LOG: entering standby mode
2019-03-26 23:38:50.483 EDT [55381] LOG: consistent recovery state reached at 0/50003D0
2019-03-26 23:38:50.483 EDT [55381] LOG: invalid record length at 0/50003D0: wanted 24, got 0
2019-03-26 23:38:50.483 EDT [55380] LOG: database system is ready to accept read only connections
 done
server started
[postgres@postgres ~]$ 2019-03-26 23:38:50.565 EDT [55385] LOG: fetching timeline history file for timeline 4 from primary server
2019-03-26 23:38:50.588 EDT [55385] LOG: started streaming WAL from primary at 0/5000000 on timeline 3
2019-03-26 23:38:50.589 EDT [55385] LOG: replication terminated by primary server
2019-03-26 23:38:50.589 EDT [55385] DETAIL: End of WAL reached on timeline 3 at 0/50003D0.
2019-03-26 23:38:50.592 EDT [55381] LOG: new target timeline is 4
2019-03-26 23:38:50.594 EDT [55385] LOG: restarted WAL streaming at 0/5000000 on timeline 4
2019-03-26 23:38:50.717 EDT [55381] LOG: redo starts at 0/50003D0
 
[postgres@postgres ~]$ pg_controldata | grep cluster
Database cluster state:  in archive recovery
[postgres@postgres ~]$ 
 
postgres=# select * from test_alived2;
 id 
----
 1
(1 row)
 
postgres=#

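In addition, a WAL receiver process is now running on postgres and a WAL sender process on postgreshot, which shows that the old primary has been switched to a standby. These are all the steps of the switchover.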

Why must the primary be shut down cleanly in step 2? On shutdown the database first performs a checkpoint and then notifies the WAL sender process that it is about to stop. The WAL sender ships the WAL stream up to that checkpoint to the standby's WAL receiver, and the standby applies this final WAL, reaching a state consistent with the primary.

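Another question to consider: if the primary's host crashes and the standby is activated, will the standby's data match the primary's exactly? This environment is one primary and one standby with asynchronous streaming replication, so the standby lags behind the primary. WAL for transactions already committed on the primary may not have been sent to the standby before the host went down, so an asynchronous standby carries a risk of losing transactions.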
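How much WAL a standby is missing can be expressed in bytes by subtracting two WAL positions (LSNs). The sketch below shows the arithmetic only: an LSN such as 0/50003D0 is the two hexadecimal halves of a 64-bit position. `lsn_to_bytes` and `lsn_diff` are hypothetical helper names; on a live PostgreSQL 10 or later server the built-in pg_wal_lsn_diff() function performs the same subtraction.

```shell
# Hypothetical helpers illustrating LSN arithmetic.

# Convert an LSN written as HIGH/LOW (both hexadecimal) to a byte position.
lsn_to_bytes() {
    hi=${1%%/*}
    lo=${1#*/}
    echo $(( (0x$hi << 32) + 0x$lo ))
}

# Print the distance in bytes from the second LSN to the first.
lsn_diff() {
    echo $(( $(lsn_to_bytes "$1") - $(lsn_to_bytes "$2") ))
}

# Sample inputs taken from the standby log earlier in this article:
# end of WAL at 0/50003D0, last redo record at 0/5000360.
lsn_diff 0/50003D0 0/5000360    # prints 112
```

The 112 bytes here are just the distance between two positions in that log, not a measured replication lag. After a real crash the primary's final WAL position may not be available for comparison, which is why loss on an asynchronous standby can go unnoticed.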

Switchover with pg_ctl promote

The trigger-file method was covered above. PostgreSQL 9.1 added the pg_ctl promote method, which is more convenient than the trigger file. The promote command syntax is:

pg_ctl promote [-D datadir]

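-D specifies the data directory; if it is omitted, the value of the $PGDATA environment variable is used. After the promote command is issued, the running standby leaves recovery mode and becomes a read-write primary.

The pg_ctl promote switchover steps are largely the same as for the trigger-file method, except that the trigger_file parameter from step 1 does not need to be configured in recovery.conf, and step 3 uses pg_ctl promote instead, as follows: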

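1) Shut down the primary; -m fast mode is recommended.

2) Run pg_ctl promote on the standby to activate it. If recovery.conf has been renamed to recovery.done, the standby has been promoted to primary.

3) Turn the old primary into a standby: create a recovery.conf file in the old primary's $PGDATA directory (if no recovery.conf exists there, copy one from the template $PGHOME/share/recovery.conf.sample; if a recovery.done file exists there, rename it to recovery.conf). The configuration is the same as on the old standby, except that the IP in primary_conninfo is changed to the peer's IP.

4) Start the old primary and check that the primary and standby processes are running normally; if they are, the switchover succeeded. These are the main steps of a pg_ctl promote switchover. This subsection does not demonstrate them; the next subsection, which introduces the pg_rewind tool, includes an example of a switchover with pg_ctl promote.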

pg_rewind

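pg_rewind is a very useful data synchronization tool for maintaining streaming replication. The previous section described a switchover in five main steps, where step 2 shuts down the primary before the standby is activated. What happens if step 2 is skipped? The demonstration below uses one primary and one standby with asynchronous streaming replication: the database on postgres is the primary and the database on postgreshot is the standby.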

Switchover

-- Standby recovery.conf configuration: run on postgreshot

The standby's recovery.conf is configured as follows:

[postgres@postgreshot pg11]$ cat recovery.conf | grep -v '^#'
recovery_target_timeline = 'latest'
standby_mode = on
primary_conninfo = 'host=192.168.40.130 port=5442 user=replica application_name=pg1'  # e.g. 'host=localhost port=5432'
trigger_file = '/home/postgres/pg11/trigger'
[postgres@postgreshot pg11]$

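-- Activate the standby: run on postgreshot

Check the streaming replication status and, once it is confirmed healthy, run the following command on the standby host to activate the standby: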

[postgres@postgreshot pg11]$ pg_ctl promote -D $PGDATA
waiting for server to promote.... done
server promoted
[postgres@postgreshot pg11]$ 
[postgres@postgreshot pg11]$

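The standby's database log shows the database opening normally and accepting connections, which indicates that the activation succeeded. Check the database role on postgreshot, as shown below: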

[postgres@postgreshot pg11]$ pg_controldata | grep cluster
Database cluster state:  in production
[postgres@postgreshot pg11]$

The pg_controldata output also shows that the database on postgreshot is now a primary. At this point the old primary (the database on postgres) is still running. We plan to convert postgres to a standby; first check its current database role, as shown below:

[postgres@postgres pg11]$ pg_controldata | grep cluster
Database cluster state:  in production
[postgres@postgres pg11]$

-- After the standby is activated, create a test table on it and insert data

postgres=# create table test_1(id int4);
CREATE TABLE
postgres=# insert into test_1(id) select n from generate_series(1,10) n;
INSERT 0 10
postgres=#

-- Stop the original primary: run on postgres

[postgres@postgres pg11]$ pg_controldata | grep cluster
Database cluster state:  in production
[postgres@postgres pg11]$ 
[postgres@postgres pg11]$ pg_ctl stop -m fast -D $PGDATA
2019-03-27 01:10:46.714 EDT [64858] LOG: received fast shutdown request
waiting for server to shut down....2019-03-27 01:10:46.716 EDT [64858] LOG: aborting any active transactions
2019-03-27 01:10:46.717 EDT [64858] LOG: background worker "logical replication launcher" (PID 64865) exited with exit code 1
2019-03-27 01:10:46.718 EDT [64860] LOG: shutting down
2019-03-27 01:10:46.731 EDT [64858] LOG: database system is shut down
 done
server stopped
[postgres@postgres pg11]$

-- pg_rewind: run on postgres

[postgres@postgres pg11]$ pg_rewind --target-pgdata $PGDATA --source-server='host=192.168.40.131 port=5442 user=replica password=replica'
 
target server needs to use either data checksums or "wal_log_hints = on"
Failure, exiting
[postgres@postgres pg11]$

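Note: pg_rewind requires that the cluster was created with data checksums enabled at initdb time, or that "wal_log_hints = on" is set. Set the wal_log_hints parameter on both the primary and the standby, restart the databases, and then run pg_rewind again.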
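A quick pre-flight check before running pg_rewind is to look at what the target's configuration file says. This is a rough sketch under stated assumptions: `wal_log_hints_setting` is a hypothetical helper, it only understands plain `name = value` lines in a single file (no include directives, no postgresql.auto.conf), and it does not detect data checksums. Running SHOW wal_log_hints; on the live server remains the authoritative answer.

```shell
# Hypothetical helper: print the wal_log_hints value found in a
# postgresql.conf-style file. Commented-out lines are ignored and the last
# uncommented assignment is used. Prints "off" (the PostgreSQL default)
# when the parameter is not set in the file. Other boolean spellings such
# as true or 1 are printed exactly as written.
wal_log_hints_setting() {
    value=$(sed -n "s/^[[:space:]]*wal_log_hints[[:space:]]*=[[:space:]]*'\{0,1\}\([A-Za-z0-9]*\).*/\1/p" "$1" | tail -n 1)
    echo "${value:-off}"
}

# Example call:
#   wal_log_hints_setting "$PGDATA/postgresql.conf"
```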

[postgres@postgres pg11]$ pg_rewind --target-pgdata $PGDATA --source-server='host=192.168.40.131 port=5442 user=replica password=replica'
servers diverged at WAL location 0/70001E8 on timeline 5
rewinding from last common checkpoint at 0/6000098 on timeline 5
Done!
[postgres@postgres pg11]$ 
[postgres@postgres pg11]$

Note: pg_rewind succeeded.

-- Adjust the recovery.conf file: run on postgres

[postgres@postgres pg11]$ mv recovery.done recovery.conf
[postgres@postgres pg11]$ 
[postgres@postgres pg11]$ cat recovery.conf | grep -v '^#'
recovery_target_timeline = 'latest'
standby_mode = on
primary_conninfo = 'host=192.168.40.131 port=5442 user=replica application_name=pg2'  # e.g. 'host=localhost port=5432'
trigger_file = '/home/postgres/pg11/trigger'
[postgres@postgres pg11]$

-- Start the original primary: run on postgres

[postgres@postgres pg11]$ pg_ctl start -D $PGDATA
waiting for server to start....2019-03-27 01:14:48.028 EDT [66323] LOG: listening on IPv4 address "0.0.0.0", port 5442
2019-03-27 01:14:48.028 EDT [66323] LOG: listening on IPv6 address "::", port 5442
2019-03-27 01:14:48.031 EDT [66323] LOG: listening on Unix socket "/tmp/.s.PGSQL.5442"
2019-03-27 01:14:48.045 EDT [66324] LOG: database system was interrupted while in recovery at log time 2019-03-27 01:08:08 EDT
2019-03-27 01:14:48.045 EDT [66324] HINT: If this has occurred more than once some data might be corrupted and you might need to choose an earlier recovery target.
2019-03-27 01:14:48.084 EDT [66324] LOG: entering standby mode
2019-03-27 01:14:48.089 EDT [66324] LOG: redo starts at 0/6000060
2019-03-27 01:14:48.091 EDT [66324] LOG: invalid record length at 0/7024C98: wanted 24, got 0
2019-03-27 01:14:48.096 EDT [66331] LOG: started streaming WAL from primary at 0/7000000 on timeline 6
2019-03-27 01:14:48.109 EDT [66324] LOG: consistent recovery state reached at 0/7024CD0
2019-03-27 01:14:48.110 EDT [66323] LOG: database system is ready to accept read only connections
 done
server started
[postgres@postgres pg11]$ 
[postgres@postgres pg11]$ pg_controldata | grep cluster
Database cluster state:  in archive recovery
[postgres@postgres pg11]$

-- Verify the data: run on postgres

[postgres@postgres pg11]$ p
psql (11.1)
Type "help" for help.
 
postgres=# select count(*) from test_1;
 count 
-------
 10
(1 row)
 
postgres=#

Note: pg_rewind succeeded. The original primary now starts in the standby role, and the table test_1 has been synchronized to it as well.

How pg_rewind works

The basic idea is to copy everything from the new cluster to the old cluster, except for the blocks that we know to be the same.

1) Scan the WAL log of the old cluster, starting from the last checkpoint before the point where the new cluster's timeline history forked off from the old cluster. For each WAL record, make a note of the data blocks that were touched. This yields a list of all the data blocks that were changed in the old cluster, after the new cluster forked off.

2) Copy all those changed blocks from the new cluster to the old cluster.

3) Copy all other files like clog, conf files etc. from the new cluster to old cluster. Everything except the relation files.

4) Apply the WAL from the new cluster, starting from the checkpoint created at failover. (Strictly speaking, pg_rewind doesn't apply the WAL, it just creates a backup label file indicating that when PostgreSQL is started, it will start replay from that checkpoint and apply all the required WAL.)

Addendum: pitfalls when setting up a PostgreSQL primary/standby pair

A collection of pitfalls encountered while setting up PostgreSQL primary/standby streaming replication

1: Socket path error

Solution: in postgresql.conf, change unix_socket_directories to the path shown in the error message, then restart the server.

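2: When setting up the pair, the standby's data directory must be created from a base backup of the primary. The base backup can be taken with pg_basebackup or by packing and unpacking a tar archive.

If the standby has already been initialized by mistake, delete everything under its data directory and start it again from the primary's base backup.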

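3: On startup the standby reports an error such as FATAL: no pg_hba.conf entry for replication connection from host "172.20.0.16", user "repl"

For example: master IP *.1, standby IP *.2, replication account repl.

In this case, a single line host replication repl *.2 md5 in pg_hba.conf was not enough.

A line host all all *.2 md5 also had to be added before that record.

The standby first needs to be able to reach the primary before it can use the repl account for replication.

The above is personal experience, offered as a reference. If anything is wrong or incomplete, corrections are welcome.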
