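I covered how Redis Sentinel works in detail in the advanced series, in "Distributed Frameworks for High Performance: Redis Sentinel Mode" (《分布式框架之高性能:Redis哨兵模式》). Readers who are not familiar with it may want to read that first.
In this chapter, I deploy a three-node Sentinel cluster, show how failover works with Sentinel, and cover some production-grade configuration options.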
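1. Sentinel Deployment
Let's start by building a Sentinel cluster. A Sentinel cluster generally needs at least three independent machines, so that Sentinel itself stays highly available. In the previous chapters I deployed a one-Master, two-Slave Redis architecture. In this chapter we deploy one Sentinel on each of the existing hosts ressmix-dsf01, ressmix-dsf02, and ressmix-dsf03, listening on port 5000. The overall logical architecture is shown below:
1.1 Sentinel Configuration
I use the ressmix-dsf01 node as the example. The Sentinel configuration file, sentinel.conf, sits in the root directory of the Redis installation package. First, create three directories: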
# Directory for the Sentinel configuration file
mkdir /etc/sentinal
# Sentinel working directory
mkdir -p /var/sentinel/5000
# Directory for Sentinel logs
mkdir -p /var/log/sentinel/5000
Next, copy the configuration file into /etc/sentinal and rename it to 5000.conf. Then change the following parameters in it:
port 5000
bind 192.168.0.107
daemonize yes
logfile /var/log/sentinel/5000/sentinel.log
dir /var/sentinel/5000
sentinel monitor mymaster ressmix-dsf01 6379 2
sentinel down-after-milliseconds mymaster 30000
sentinel parallel-syncs mymaster 1
sentinel failover-timeout mymaster 180000
sentinel auth-pass mymaster ressmix
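Here is what the most important parameters mean:
- port 5000: the port Sentinel itself listens on. The default is 26379; we changed it to 5000.
- bind 192.168.0.107: by default Sentinel is only reachable from 127.0.0.1. We bind it to the machine's IP so that the Sentinels can talk to each other.
- dir /var/sentinel/5000: Sentinel's own working directory.
- sentinel monitor mymaster ressmix-dsf01 6379 2: mymaster is the name of the master group that Sentinel monitors. It can be any name; I call it "mymaster". ressmix-dsf01 is the address of the Master node (I use the hostname rather than the IP), and 6379 is the Master's port. The final 2 is the quorum: once at least quorum Sentinels agree that the master is unreachable, it is marked objectively down, and a Sentinel leader is elected to carry out the failover. Since we have three Sentinels, the majority rule gives quorum = floor(3/2) + 1 = 2.
- down-after-milliseconds: if a Sentinel cannot reach the Redis master for longer than this time, it considers the instance down. This is the subjectively down (sdown) state.
- parallel-syncs: after a failover elects a new Master, the number of Slave nodes that may be reconfigured to sync from the new Master at the same time. I use the default of 1.
- failover-timeout: the timeout for the failover run by the Sentinel leader. If the failover has not finished within this time, a new Sentinel leader is elected to run it again. The default is 3 minutes.
- Finally, if the Master node requires a password, do not forget to add sentinel auth-pass with that password to the configuration of every Sentinel.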
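The quorum arithmetic above can be sketched in a few lines of Python. This is only an illustration: the function names `majority` and `can_fail_over` are my own, not part of Redis. It assumes the rule that marking a master objectively down needs `quorum` Sentinels, while actually running the failover also needs a majority of all Sentinels to elect a leader.

```python
def majority(num_sentinels: int) -> int:
    """Smallest number of Sentinels that forms a strict majority."""
    return num_sentinels // 2 + 1


def can_fail_over(num_sentinels: int, quorum: int, reachable: int) -> bool:
    """Illustrative check (hypothetical helper, not a Redis API).

    `reachable` is how many Sentinels are alive and agree the master is down.
    They must satisfy the quorum (to flag the master objectively down) and
    also form a majority (to elect the Sentinel leader).
    """
    return reachable >= quorum and reachable >= majority(num_sentinels)


if __name__ == "__main__":
    # Three Sentinels, quorum 2, as configured in this chapter.
    print("majority of 3:", majority(3))
    print("2 of 3 Sentinels alive:", can_fail_over(3, 2, 2))
    print("1 of 3 Sentinels alive:", can_fail_over(3, 2, 1))
```

With three Sentinels and a quorum of 2, losing one Sentinel still allows a failover, while losing two does not.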
Once everything is configured, start Sentinel with the following command:
redis-server /etc/sentinal/5000.conf --sentinel
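The same configuration and start command are repeated on the other two nodes. As a rough sketch, the three files could be rendered from one template. I am assuming here that only the bind address differs between nodes, using the IPs from this article's topology; `NODES`, `TEMPLATE`, and `render_sentinel_conf` are hypothetical names for illustration.

```python
# Host -> bind IP, taken from the topology used in this article.
NODES = {
    "ressmix-dsf01": "192.168.0.107",
    "ressmix-dsf02": "192.168.0.109",
    "ressmix-dsf03": "192.168.0.110",
}

# The parameters shown above; only port and bind are filled in per node.
TEMPLATE = """\
port {port}
bind {bind_ip}
daemonize yes
logfile /var/log/sentinel/{port}/sentinel.log
dir /var/sentinel/{port}
sentinel monitor mymaster ressmix-dsf01 6379 2
sentinel down-after-milliseconds mymaster 30000
sentinel parallel-syncs mymaster 1
sentinel failover-timeout mymaster 180000
sentinel auth-pass mymaster ressmix
"""


def render_sentinel_conf(bind_ip: str, port: int = 5000) -> str:
    """Return the Sentinel parameters for one node as text."""
    return TEMPLATE.format(port=port, bind_ip=bind_ip)


if __name__ == "__main__":
    for host, ip in NODES.items():
        print(f"# ---- {host}: /etc/sentinal/5000.conf ----")
        print(render_sentinel_conf(ip))
```

The output is meant to be merged into each node's copy of sentinel.conf by hand; the sketch does not write any files.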
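As shown in the figure below, the Sentinel discovered the Master's two Slave nodes as well as the other two Sentinels. The Sentinels communicate with each other through the publish/subscribe mechanism I described in the advanced series:
1.2 Sentinel Status
We can check Sentinel's status from the command line: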
redis-cli -h 192.168.0.109 -p 5000
Check the state of the Master node:
192.168.0.109:5000> sentinel master mymaster
1) "name"
2) "mymaster"
3) "ip"
4) "192.168.0.107"
5) "port"
6) "6379"
7) "runid"
8) "5cb6aed556e853b886a0722170ef024aebc1ace4"
9) "flags"
10) "master"
11) "link-pending-commands"
12) "0"
13) "link-refcount"
14) "1"
15) "last-ping-sent"
16) "0"
17) "last-ok-ping-reply"
18) "362"
19) "last-ping-reply"
20) "362"
21) "down-after-milliseconds"
22) "30000"
23) "info-refresh"
24) "4481"
25) "role-reported"
26) "master"
27) "role-reported-time"
28) "285881"
29) "config-epoch"
30) "0"
31) "num-slaves"
32) "2"
33) "num-other-sentinels"
34) "2"
35) "quorum"
36) "2"
37) "failover-timeout"
38) "180000"
39) "parallel-syncs"
40) "1"
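The reply above is a flat list of alternating field names and values. When scripting against it, it helps to fold it into a dictionary first. Below is a minimal sketch, assuming the reply has already been read into a Python list of strings; `pairs_to_dict` and `master_looks_healthy` are hypothetical helpers, and the health rule is just one reasonable reading of the fields shown.

```python
def pairs_to_dict(flat):
    """Turn ['k1', 'v1', 'k2', 'v2', ...] into {'k1': 'v1', ...}."""
    if len(flat) % 2 != 0:
        raise ValueError("expected an even number of items")
    return dict(zip(flat[0::2], flat[1::2]))


def master_looks_healthy(info):
    """Illustrative rule: role is master, no down flags, enough Sentinels."""
    flags = info["flags"].split(",")
    # This Sentinel plus the others it knows about.
    known_sentinels = int(info["num-other-sentinels"]) + 1
    return (
        "master" in flags
        and "s_down" not in flags
        and "o_down" not in flags
        and known_sentinels >= int(info["quorum"])
    )


if __name__ == "__main__":
    # A subset of the fields from the output above.
    reply = [
        "name", "mymaster",
        "ip", "192.168.0.107",
        "port", "6379",
        "flags", "master",
        "num-slaves", "2",
        "num-other-sentinels", "2",
        "quorum", "2",
    ]
    info = pairs_to_dict(reply)
    print(info["ip"], info["port"], master_looks_healthy(info))
```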
Check the state of the Slave nodes:
192.168.0.109:5000> SENTINEL slaves mymaster
1) 1) "name"
2) "192.168.0.109:6379"
3) "ip"
4) "192.168.0.109"
5) "port"
6) "6379"
7) "runid"
8) "fe25299543962d47a64197664b281ce0d9e49410"
9) "flags"
10) "slave"
11) "link-pending-commands"
12) "0"
13) "link-refcount"
14) "1"
15) "last-ping-sent"
16) "0"
17) "last-ok-ping-reply"
18) "612"
19) "last-ping-reply"
20) "612"
21) "down-after-milliseconds"
22) "30000"
23) "info-refresh"
24) "9237"
25) "role-reported"
26) "slave"
27) "role-reported-time"
28) "340548"
29) "master-link-down-time"
30) "0"
31) "master-link-status"
32) "ok"
33) "master-host"
34) "192.168.0.107"
35) "master-port"
36) "6379"
37) "slave-priority"
38) "100"
39) "slave-repl-offset"
40) "77323"
2) 1) "name"
2) "192.168.0.110:6379"
3) "ip"
4) "192.168.0.110"
5) "port"
6) "6379"
7) "runid"
8) "168c3cb9c0162f91c6a047e8b20c0b1562356a2f"
9) "flags"
10) "slave"
11) "link-pending-commands"
12) "0"
13) "link-refcount"
14) "1"
15) "last-ping-sent"
16) "0"
17) "last-ok-ping-reply"
18) "612"
19) "last-ping-reply"
20) "612"
21) "down-after-milliseconds"
22) "30000"
23) "info-refresh"
24) "9237"
25) "role-reported"
26) "slave"
27) "role-reported-time"
28) "340842"
29) "master-link-down-time"
30) "0"
31) "master-link-status"
32) "ok"
33) "master-host"
34) "192.168.0.107"
35) "master-port"
36) "6379"
37) "slave-priority"
38) "100"
39) "slave-repl-offset"
40) "77323"
Check the state of the other Sentinels monitoring this master:
192.168.0.109:5000> SENTINEL sentinels mymaster
1) 1) "name"
2) "cd48456d2f4342db47efb9f33bf679aa5b611e56"
3) "ip"
4) "192.168.0.110"
5) "port"
6) "5000"
7) "runid"
8) "cd48456d2f4342db47efb9f33bf679aa5b611e56"
9) "flags"
10) "sentinel"
11) "link-pending-commands"
12) "0"
13) "link-refcount"
14) "1"
15) "last-ping-sent"
16) "0"
17) "last-ok-ping-reply"
18) "72"
19) "last-ping-reply"
20) "72"
21) "down-after-milliseconds"
22) "30000"
23) "last-hello-message"
24) "221"
25) "voted-leader"
26) "?"
27) "voted-leader-epoch"
28) "0"
2) 1) "name"
2) "7ed8bb8d42e7d443aa90d3d2cfabc9dbd8f77217"
3) "ip"
4) "192.168.0.107"
5) "port"
6) "5000"
7) "runid"
8) "7ed8bb8d42e7d443aa90d3d2cfabc9dbd8f77217"
9) "flags"
10) "sentinel"
11) "link-pending-commands"
12) "0"
13) "link-refcount"
14) "1"
15) "last-ping-sent"
16) "0"
17) "last-ok-ping-reply"
18) "72"
19) "last-ping-reply"
20) "72"
21) "down-after-milliseconds"
22) "30000"
23) "last-hello-message"
24) "254"
25) "voted-leader"
26) "?"
27) "voted-leader-epoch"
28) "0"
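2. Failover Drill
With the Sentinel cluster deployed, we can run a disaster-recovery drill to see whether Sentinel really performs automatic failover. My Redis master-slave layout currently looks like this:
Master, deployed on ressmix-dsf01: 192.168.0.107
Slave1, deployed on ressmix-dsf02: 192.168.0.109
Slave2, deployed on ressmix-dsf03: 192.168.0.110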
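2.1 Failover
First I kill the Master node with kill -9 and also delete its pid file (/var/run/redis_6379.pid) to simulate a Master crash. After waiting about 30 seconds (the down-after-milliseconds value), the log shows that Sentinel performed an automatic failover. Below is the Sentinel log on the ressmix-dsf01 node: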
1385:X 25 Apr 2020 14:16:45.880 # +sdown master mymaster 192.168.0.107 6379
1385:X 25 Apr 2020 14:16:45.935 # +odown master mymaster 192.168.0.107 6379 #quorum 3/2
1385:X 25 Apr 2020 14:16:45.935 # +new-epoch 1
1385:X 25 Apr 2020 14:16:45.935 # +try-failover master mymaster 192.168.0.107 6379
1385:X 25 Apr 2020 14:16:45.971 # +vote-for-leader 7ed8bb8d42e7d443aa90d3d2cfabc9dbd8f77217 1
1385:X 25 Apr 2020 14:16:45.973 # 8e23c3b5d6d9edc4dbb845dc8b8e858e4ce2142c voted for 8e23c3b5d6d9edc4dbb845dc8b8e858e4ce2142c 1
1385:X 25 Apr 2020 14:16:46.020 # cd48456d2f4342db47efb9f33bf679aa5b611e56 voted for 7ed8bb8d42e7d443aa90d3d2cfabc9dbd8f77217 1
1385:X 25 Apr 2020 14:16:46.037 # +elected-leader master mymaster 192.168.0.107 6379
1385:X 25 Apr 2020 14:16:46.037 # +failover-state-select-slave master mymaster 192.168.0.107 6379
1385:X 25 Apr 2020 14:16:46.090 # +selected-slave slave 192.168.0.110:6379 192.168.0.110 6379 @ mymaster 192.168.0.107 6379
1385:X 25 Apr 2020 14:16:46.090 * +failover-state-send-slaveof-noone slave 192.168.0.110:6379 192.168.0.110 6379 @ mymaster 192.168.0.107 6379
1385:X 25 Apr 2020 14:16:46.148 * +failover-state-wait-promotion slave 192.168.0.110:6379 192.168.0.110 6379 @ mymaster 192.168.0.107 6379
1385:X 25 Apr 2020 14:16:46.913 # +promoted-slave slave 192.168.0.110:6379 192.168.0.110 6379 @ mymaster 192.168.0.107 6379
1385:X 25 Apr 2020 14:16:46.913 # +failover-state-reconf-slaves master mymaster 192.168.0.107 6379
1385:X 25 Apr 2020 14:16:46.932 * +slave-reconf-sent slave 192.168.0.109:6379 192.168.0.109 6379 @ mymaster 192.168.0.107 6379
1385:X 25 Apr 2020 14:16:47.127 # -odown master mymaster 192.168.0.107 6379
1385:X 25 Apr 2020 14:16:47.915 * +slave-reconf-inprog slave 192.168.0.109:6379 192.168.0.109 6379 @ mymaster 192.168.0.107 6379
1385:X 25 Apr 2020 14:19:46.900 # +failover-end-for-timeout master mymaster 192.168.0.107 6379
1385:X 25 Apr 2020 14:19:46.900 # +failover-end master mymaster 192.168.0.107 6379
1385:X 25 Apr 2020 14:19:46.900 * +slave-reconf-sent-be slave 192.168.0.109:6379 192.168.0.109 6379 @ mymaster 192.168.0.107 6379
1385:X 25 Apr 2020 14:19:46.900 # +switch-master mymaster 192.168.0.107 6379 192.168.0.110 6379
1385:X 25 Apr 2020 14:19:46.900 * +slave slave 192.168.0.109:6379 192.168.0.109 6379 @ mymaster 192.168.0.110 6379
1385:X 25 Apr 2020 14:19:46.900 * +slave slave 192.168.0.107:6379 192.168.0.107 6379 @ mymaster 192.168.0.110 6379
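As the log shows, the new Master is now 192.168.0.110, i.e. ressmix-dsf03. Note that +switch-master was only logged at 14:19:46, right after +failover-end-for-timeout: the reconfiguration of 192.168.0.109 was still in progress when the 180000 ms failover-timeout expired. Running info replication on the new Master confirms that its role has changed to master: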
192.168.0.110:6379> info replication
# Replication
role:master
connected_slaves:0
master_replid:07383ad9fb832365fc67b9a578c54a36a21ff274
master_replid2:f5180409b8a45a9f71eed3c8241bcedbc8986a48
master_repl_offset:354070
second_repl_offset:297908
repl_backlog_active:1
repl_backlog_size:1048576
repl_backlog_first_byte_offset:1
repl_backlog_histlen:354070
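The Sentinel log earlier in this section follows a regular layout (pid, role marker, timestamp, level mark, event name, details), so the failover can be summarized with a small parser. This is a sketch under that assumption: the regular expression is derived only from the lines shown in this article, and `parse_events` and `failover_summary` are hypothetical helpers.

```python
import re
from datetime import datetime

# Matches lines such as:
# 1385:X 25 Apr 2020 14:16:45.880 # +sdown master mymaster 192.168.0.107 6379
LINE_RE = re.compile(
    r"^\d+:X (?P<ts>\d{2} \w{3} \d{4} \d{2}:\d{2}:\d{2}\.\d{3}) "
    r"[#*] (?P<event>[+-][\w-]+)\s*(?P<rest>.*)$"
)


def parse_events(lines):
    """Return (timestamp, event, details) for every +event / -event line."""
    events = []
    for line in lines:
        m = LINE_RE.match(line.strip())
        if not m:
            continue  # e.g. "<id> voted for <id> <epoch>" has no event name
        ts = datetime.strptime(m.group("ts"), "%d %b %Y %H:%M:%S.%f")
        events.append((ts, m.group("event"), m.group("rest")))
    return events


def failover_summary(events):
    """Old/new master and seconds from master +sdown to +switch-master."""
    sdown = next(ts for ts, ev, rest in events
                 if ev == "+sdown" and rest.startswith("master "))
    ts, _, rest = next(e for e in events if e[1] == "+switch-master")
    name, old_ip, old_port, new_ip, new_port = rest.split()
    return {
        "master": name,
        "old": f"{old_ip}:{old_port}",
        "new": f"{new_ip}:{new_port}",
        "seconds": (ts - sdown).total_seconds(),
    }


if __name__ == "__main__":
    sample = [
        "1385:X 25 Apr 2020 14:16:45.880 # +sdown master mymaster 192.168.0.107 6379",
        "1385:X 25 Apr 2020 14:16:45.935 # +odown master mymaster 192.168.0.107 6379 #quorum 3/2",
        "1385:X 25 Apr 2020 14:19:46.900 # +switch-master mymaster 192.168.0.107 6379 192.168.0.110 6379",
    ]
    print(failover_summary(parse_events(sample)))
```

Fed the three lines above, the summary reports the switch from 192.168.0.107:6379 to 192.168.0.110:6379 and the elapsed time between the two events.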
2.2 Recovery
Next, we bring the redis node on ressmix-dsf01 back up. It rejoins the cluster as a Slave node. We can check the cluster state through any Sentinel:
192.168.0.109:5000> SENTINEL slaves mymaster
1) 1) "name"
2) "192.168.0.107:6379"
3) "ip"
4) "192.168.0.107"
5) "port"
6) "6379"
7) "runid"
8) "e7ec2725eca9313417e6823ebf00ac3867c74abc"
9) "flags"
10) "slave"
11) "link-pending-commands"
12) "0"
13) "link-refcount"
14) "1"
15) "last-ping-sent"
16) "0"
17) "last-ok-ping-reply"
18) "383"
19) "last-ping-reply"
20) "383"
21) "down-after-milliseconds"
22) "30000"
23) "info-refresh"
24) "2896"
25) "role-reported"
26) "slave"
27) "role-reported-time"
28) "34123"
29) "master-link-down-time"
30) "0"
31) "master-link-status"
32) "ok"
33) "master-host"
34) "192.168.0.110"
35) "master-port"
36) "6379"
37) "slave-priority"
38) "100"
39) "slave-repl-offset"
40) "418255"
2) 1) "name"
2) "192.168.0.109:6379"
3) "ip"
4) "192.168.0.109"
5) "port"
6) "6379"
7) "runid"
8) "fe25299543962d47a64197664b281ce0d9e49410"
9) "flags"
10) "slave"
11) "link-pending-commands"
12) "0"
13) "link-refcount"
14) "1"
15) "last-ping-sent"
16) "0"
17) "last-ok-ping-reply"
18) "300"
19) "last-ping-reply"
20) "300"
21) "down-after-milliseconds"
22) "30000"
23) "info-refresh"
24) "300"
25) "role-reported"
26) "slave"
27) "role-reported-time"
28) "585068"
29) "master-link-down-time"
30) "616000"
31) "master-link-status"
32) "err"
33) "master-host"
34) "192.168.0.110"
35) "master-port"
36) "6379"
37) "slave-priority"
38) "100"
39) "slave-repl-offset"
40) "297907"
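In the output above, the recovered node 192.168.0.107 reports master-link-status "ok", while 192.168.0.109 reports "err" with a non-zero master-link-down-time. A check like this is easy to automate. Here is a minimal sketch, assuming each replica's reply is available as a flat list of strings; `unhealthy_replicas` is a hypothetical helper.

```python
def unhealthy_replicas(replicas):
    """Return (name, master-link-down-time) for replicas whose link is not 'ok'.

    `replicas` is a list of flat ['field', 'value', ...] lists, one per replica,
    mirroring the layout of the SENTINEL slaves reply.
    """
    bad = []
    for flat in replicas:
        info = dict(zip(flat[0::2], flat[1::2]))
        if info.get("master-link-status") != "ok":
            bad.append((info["name"], info.get("master-link-down-time")))
    return bad


if __name__ == "__main__":
    # A subset of the fields from the output above.
    reply = [
        ["name", "192.168.0.107:6379",
         "master-link-down-time", "0",
         "master-link-status", "ok"],
        ["name", "192.168.0.109:6379",
         "master-link-down-time", "616000",
         "master-link-status", "err"],
    ]
    for name, down_ms in unhealthy_replicas(reply):
        print(f"{name}: link to master down for {down_ms} ms")
```

A replica flagged this way is worth checking by hand, for example by running info replication on it and reading its own log.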
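3. Summary
In this chapter, I walked through setting up Redis Sentinel mode. Hands-on practice gives a deeper understanding of how Redis high availability works, and I encourage readers to follow the steps above on their own machines to reinforce it.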