2023-08-12
Original author: Ressmix  Original URL: https://www.tpvlog.com/article/207

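I covered the principles behind Redis Sentinel in detail in the advanced-series article 《分布式框架之高性能:Redis哨兵模式》 ("High Performance in Distributed Frameworks: Redis Sentinel Mode"). Readers who are not familiar with it may want to read that first.

In this chapter, I will walk you through deploying a three-node Sentinel cluster, show how Sentinel performs failover, and touch on some production-grade configuration options.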

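1. Sentinel Deployment

Let's start by building a Sentinel cluster. A Sentinel cluster generally needs at least three independent machines so that Sentinel itself stays highly available. In the previous chapters I deployed a Redis setup with one Master and two Slaves. In this chapter we deploy one Sentinel on each of ressmix-dsf01, ressmix-dsf02, and ressmix-dsf03, each listening on port 5000. The overall logical architecture is shown below: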

[Figure: logical architecture of the three-node Sentinel deployment (202308122224295031.png)]

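1.1 Sentinel Configuration

I will use the ressmix-dsf01 node as the example. The Sentinel configuration file, sentinel.conf, sits in the root directory of the Redis distribution. First, create three directories: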

    # Holds the Sentinel configuration file
    mkdir /etc/sentinal
    # Sentinel's working directory
    mkdir -p /var/sentinel/5000
    # Holds the Sentinel logs
    mkdir -p /var/log/sentinel/5000

Next, copy the configuration file into /etc/sentinal and rename it to 5000.conf. Then change the following parameters in it:

    port 5000
    bind 192.168.0.107
    daemonize yes
    logfile /var/log/sentinel/5000/sentinel.log
    
    dir /var/sentinel/5000
    sentinel monitor mymaster ressmix-dsf01 6379 2
    sentinel down-after-milliseconds mymaster 30000
    sentinel parallel-syncs mymaster 1
    sentinel failover-timeout mymaster 180000
    sentinel auth-pass mymaster ressmix

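Here is what the most important parameters mean:

  • port 5000: the port Sentinel itself listens on. The default is 26379; we changed it to 5000.
  • bind 192.168.0.107: by default Sentinel is only reachable from 127.0.0.1. We bind it to the machine's own IP so that the Sentinels can talk to each other. On the other two nodes, use that node's own IP.
  • dir /var/sentinel/5000: Sentinel's working directory.
  • sentinel monitor mymaster ressmix-dsf01 6379 2: mymaster is the name of the master group to monitor. It is user-defined, and I call it "mymaster". ressmix-dsf01 is the address of the Master node (I use the hostname here rather than the IP), and 6379 is the Master's port. The final 2 is the quorum: once quorum Sentinels agree that the Master is unreachable, it is marked objectively down, and the Sentinels elect a leader to carry out the failover. With three Sentinels, the majority rule gives quorum = (3/2) + 1 = 2, using integer division.
  • down-after-milliseconds: if a Sentinel cannot reach the Redis Master for longer than this, it considers the instance down. This is the subjectively down state.
  • parallel-syncs: after a failover elects a new Master, this is how many Slaves are reconfigured to sync from it at the same time. I use the default value of 1.
  • failover-timeout: the timeout for the leader Sentinel's failover. If the failover has not finished within this time, a new leader election is held to run it again. The default is 3 minutes.
  • Finally, if the Master requires a password, do not forget to set it with sentinel auth-pass in every Sentinel's configuration.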
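
Since only the bind address differs between the three nodes, the configuration can be generated instead of edited by hand. The following is a minimal Python sketch, not part of the original setup: the helper names majority and render_sentinel_conf are hypothetical, and the host-to-IP mapping is assumed from the node list used later in this chapter.

```python
def majority(sentinel_count):
    """Smallest number of Sentinels that forms a strict majority."""
    return sentinel_count // 2 + 1


def render_sentinel_conf(bind_ip, sentinel_count, master_host="ressmix-dsf01",
                         master_name="mymaster", port=5000, password="ressmix"):
    """Render the settings from this section for one node (hypothetical helper)."""
    quorum = majority(sentinel_count)
    lines = [
        f"port {port}",
        f"bind {bind_ip}",
        "daemonize yes",
        f"logfile /var/log/sentinel/{port}/sentinel.log",
        f"dir /var/sentinel/{port}",
        f"sentinel monitor {master_name} {master_host} 6379 {quorum}",
        f"sentinel down-after-milliseconds {master_name} 30000",
        f"sentinel parallel-syncs {master_name} 1",
        f"sentinel failover-timeout {master_name} 180000",
        f"sentinel auth-pass {master_name} {password}",
    ]
    return "\n".join(lines) + "\n"


# Host-to-IP mapping assumed from the node list in this chapter.
NODES = {
    "ressmix-dsf01": "192.168.0.107",
    "ressmix-dsf02": "192.168.0.109",
    "ressmix-dsf03": "192.168.0.110",
}

if __name__ == "__main__":
    for host, ip in NODES.items():
        print(f"# {host}")
        print(render_sentinel_conf(ip, sentinel_count=len(NODES)))
```

The majority helper reproduces the quorum arithmetic above: three Sentinels give 2, and five would give 3.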

With everything configured, start Sentinel with the following command:

    redis-server /etc/sentinal/5000.conf --sentinel

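As the figure below shows, Sentinel discovered the Master's two Slave nodes as well as the other two Sentinels. The Sentinels communicate with each other through the publish/subscribe mechanism I described in the advanced series: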

[Figure: Sentinel startup log showing the discovered Slaves and Sentinels (202308122224301922.png)]

1.2 Sentinel Status

We can check Sentinel's status by connecting to it with redis-cli:

    redis-cli -h 192.168.0.109 -p 5000

Check the state of the monitored Master node:

    192.168.0.109:5000> sentinel master mymaster
     1) "name"
     2) "mymaster"
     3) "ip"
     4) "192.168.0.107"
     5) "port"
     6) "6379"
     7) "runid"
     8) "5cb6aed556e853b886a0722170ef024aebc1ace4"
     9) "flags"
    10) "master"
    11) "link-pending-commands"
    12) "0"
    13) "link-refcount"
    14) "1"
    15) "last-ping-sent"
    16) "0"
    17) "last-ok-ping-reply"
    18) "362"
    19) "last-ping-reply"
    20) "362"
    21) "down-after-milliseconds"
    22) "30000"
    23) "info-refresh"
    24) "4481"
    25) "role-reported"
    26) "master"
    27) "role-reported-time"
    28) "285881"
    29) "config-epoch"
    30) "0"
    31) "num-slaves"
    32) "2"
    33) "num-other-sentinels"
    34) "2"
    35) "quorum"
    36) "2"
    37) "failover-timeout"
    38) "180000"
    39) "parallel-syncs"
    40) "1"
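
The reply is a flat list of alternating field names and values, which is awkward to read at a glance. As a rough sketch (the function name parse_flat_reply is hypothetical, and the sample is abridged from the output above), the redis-cli text can be folded into a dictionary like this:

```python
import re

# Matches a redis-cli reply line such as:  35) "quorum"
REPLY_LINE = re.compile(r'^\s*\d+\) "(.*)"\s*$')


def parse_flat_reply(text):
    """Turn redis-cli's alternating field/value lines into a dict."""
    values = []
    for line in text.splitlines():
        match = REPLY_LINE.match(line)
        if match:
            values.append(match.group(1))
    if len(values) % 2 != 0:
        raise ValueError("expected an even number of field/value lines")
    return dict(zip(values[0::2], values[1::2]))


# Abridged from the SENTINEL master output above.
SAMPLE = '''
 1) "name"
 2) "mymaster"
 9) "flags"
10) "master"
31) "num-slaves"
32) "2"
33) "num-other-sentinels"
34) "2"
35) "quorum"
36) "2"
'''

master = parse_flat_reply(SAMPLE)
# This Sentinel plus the others it knows about.
known_sentinels = int(master["num-other-sentinels"]) + 1
print(master["flags"], known_sentinels, master["quorum"])  # prints: master 3 2
```

With the fields in a dictionary it is easy to confirm that the Sentinel sees two Slaves, two other Sentinels, and a quorum of 2, as configured.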

Check the state of the Slave nodes:

    192.168.0.109:5000> SENTINEL slaves mymaster
    1)  1) "name"
        2) "192.168.0.109:6379"
        3) "ip"
        4) "192.168.0.109"
        5) "port"
        6) "6379"
        7) "runid"
        8) "fe25299543962d47a64197664b281ce0d9e49410"
        9) "flags"
       10) "slave"
       11) "link-pending-commands"
       12) "0"
       13) "link-refcount"
       14) "1"
       15) "last-ping-sent"
       16) "0"
       17) "last-ok-ping-reply"
       18) "612"
       19) "last-ping-reply"
       20) "612"
       21) "down-after-milliseconds"
       22) "30000"
       23) "info-refresh"
       24) "9237"
       25) "role-reported"
       26) "slave"
       27) "role-reported-time"
       28) "340548"
       29) "master-link-down-time"
       30) "0"
       31) "master-link-status"
       32) "ok"
       33) "master-host"
       34) "192.168.0.107"
       35) "master-port"
       36) "6379"
       37) "slave-priority"
       38) "100"
       39) "slave-repl-offset"
       40) "77323"
    2)  1) "name"
        2) "192.168.0.110:6379"
        3) "ip"
        4) "192.168.0.110"
        5) "port"
        6) "6379"
        7) "runid"
        8) "168c3cb9c0162f91c6a047e8b20c0b1562356a2f"
        9) "flags"
       10) "slave"
       11) "link-pending-commands"
       12) "0"
       13) "link-refcount"
       14) "1"
       15) "last-ping-sent"
       16) "0"
       17) "last-ok-ping-reply"
       18) "612"
       19) "last-ping-reply"
       20) "612"
       21) "down-after-milliseconds"
       22) "30000"
       23) "info-refresh"
       24) "9237"
       25) "role-reported"
       26) "slave"
       27) "role-reported-time"
       28) "340842"
       29) "master-link-down-time"
       30) "0"
       31) "master-link-status"
       32) "ok"
       33) "master-host"
       34) "192.168.0.107"
       35) "master-port"
       36) "6379"
       37) "slave-priority"
       38) "100"
       39) "slave-repl-offset"
       40) "77323"

Check the state of the other Sentinels monitoring this Master:

    192.168.0.109:5000> SENTINEL sentinels mymaster
    1)  1) "name"
        2) "cd48456d2f4342db47efb9f33bf679aa5b611e56"
        3) "ip"
        4) "192.168.0.110"
        5) "port"
        6) "5000"
        7) "runid"
        8) "cd48456d2f4342db47efb9f33bf679aa5b611e56"
        9) "flags"
       10) "sentinel"
       11) "link-pending-commands"
       12) "0"
       13) "link-refcount"
       14) "1"
       15) "last-ping-sent"
       16) "0"
       17) "last-ok-ping-reply"
       18) "72"
       19) "last-ping-reply"
       20) "72"
       21) "down-after-milliseconds"
       22) "30000"
       23) "last-hello-message"
       24) "221"
       25) "voted-leader"
       26) "?"
       27) "voted-leader-epoch"
       28) "0"
    2)  1) "name"
        2) "7ed8bb8d42e7d443aa90d3d2cfabc9dbd8f77217"
        3) "ip"
        4) "192.168.0.107"
        5) "port"
        6) "5000"
        7) "runid"
        8) "7ed8bb8d42e7d443aa90d3d2cfabc9dbd8f77217"
        9) "flags"
       10) "sentinel"
       11) "link-pending-commands"
       12) "0"
       13) "link-refcount"
       14) "1"
       15) "last-ping-sent"
       16) "0"
       17) "last-ok-ping-reply"
       18) "72"
       19) "last-ping-reply"
       20) "72"
       21) "down-after-milliseconds"
       22) "30000"
       23) "last-hello-message"
       24) "254"
       25) "voted-leader"
       26) "?"
       27) "voted-leader-epoch"
       28) "0"

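2. Disaster Recovery Drill

With the Sentinel cluster deployed, we can run a disaster recovery drill to see whether Sentinel really performs automatic failover. My Redis master-slave setup currently looks like this:

    Master, deployed on ressmix-dsf01: 192.168.0.107
    Slave1, deployed on ressmix-dsf02: 192.168.0.109
    Slave2, deployed on ressmix-dsf03: 192.168.0.110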

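2.1 Failover

I first kill the Master with kill -9 and then delete its pid file (/var/run/redis_6379.pid) to simulate a Master crash. After waiting 30 seconds (the down-after-milliseconds value), the logs show that Sentinel performed an automatic failover. Below is the Sentinel log from the ressmix-dsf01 node: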

    1385:X 25 Apr 2020 14:16:45.880 # +sdown master mymaster 192.168.0.107 6379
    1385:X 25 Apr 2020 14:16:45.935 # +odown master mymaster 192.168.0.107 6379 #quorum 3/2
    1385:X 25 Apr 2020 14:16:45.935 # +new-epoch 1
    1385:X 25 Apr 2020 14:16:45.935 # +try-failover master mymaster 192.168.0.107 6379
    1385:X 25 Apr 2020 14:16:45.971 # +vote-for-leader 7ed8bb8d42e7d443aa90d3d2cfabc9dbd8f77217 1
    1385:X 25 Apr 2020 14:16:45.973 # 8e23c3b5d6d9edc4dbb845dc8b8e858e4ce2142c voted for 8e23c3b5d6d9edc4dbb845dc8b8e858e4ce2142c 1
    1385:X 25 Apr 2020 14:16:46.020 # cd48456d2f4342db47efb9f33bf679aa5b611e56 voted for 7ed8bb8d42e7d443aa90d3d2cfabc9dbd8f77217 1
    1385:X 25 Apr 2020 14:16:46.037 # +elected-leader master mymaster 192.168.0.107 6379
    1385:X 25 Apr 2020 14:16:46.037 # +failover-state-select-slave master mymaster 192.168.0.107 6379
    1385:X 25 Apr 2020 14:16:46.090 # +selected-slave slave 192.168.0.110:6379 192.168.0.110 6379 @ mymaster 192.168.0.107 6379
    1385:X 25 Apr 2020 14:16:46.090 * +failover-state-send-slaveof-noone slave 192.168.0.110:6379 192.168.0.110 6379 @ mymaster 192.168.0.107 6379
    1385:X 25 Apr 2020 14:16:46.148 * +failover-state-wait-promotion slave 192.168.0.110:6379 192.168.0.110 6379 @ mymaster 192.168.0.107 6379
    1385:X 25 Apr 2020 14:16:46.913 # +promoted-slave slave 192.168.0.110:6379 192.168.0.110 6379 @ mymaster 192.168.0.107 6379
    1385:X 25 Apr 2020 14:16:46.913 # +failover-state-reconf-slaves master mymaster 192.168.0.107 6379
    1385:X 25 Apr 2020 14:16:46.932 * +slave-reconf-sent slave 192.168.0.109:6379 192.168.0.109 6379 @ mymaster 192.168.0.107 6379
    1385:X 25 Apr 2020 14:16:47.127 # -odown master mymaster 192.168.0.107 6379
    1385:X 25 Apr 2020 14:16:47.915 * +slave-reconf-inprog slave 192.168.0.109:6379 192.168.0.109 6379 @ mymaster 192.168.0.107 6379
    1385:X 25 Apr 2020 14:19:46.900 # +failover-end-for-timeout master mymaster 192.168.0.107 6379
    1385:X 25 Apr 2020 14:19:46.900 # +failover-end master mymaster 192.168.0.107 6379
    1385:X 25 Apr 2020 14:19:46.900 * +slave-reconf-sent-be slave 192.168.0.109:6379 192.168.0.109 6379 @ mymaster 192.168.0.107 6379
    1385:X 25 Apr 2020 14:19:46.900 # +switch-master mymaster 192.168.0.107 6379 192.168.0.110 6379
    1385:X 25 Apr 2020 14:19:46.900 * +slave slave 192.168.0.109:6379 192.168.0.109 6379 @ mymaster 192.168.0.110 6379
    1385:X 25 Apr 2020 14:19:46.900 * +slave slave 192.168.0.107:6379 192.168.0.107 6379 @ mymaster 192.168.0.110 6379
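
To read a log like this more systematically, the lines can be parsed into timestamps and events. The sketch below is illustrative only: the helper names are hypothetical, the line pattern is inferred from the excerpt above rather than from a format specification, and the sample lines are copied from that excerpt.

```python
import re
from collections import Counter
from datetime import datetime

MONTHS = {name: i for i, name in enumerate(
    "Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dec".split(), start=1)}

# "<pid>:X <day> <Mon> <year> <hh:mm:ss.mmm> <level> <message>"
LINE_RE = re.compile(
    r"^\s*\d+:X (\d{1,2}) (\w{3}) (\d{4}) (\d{2}):(\d{2}):(\d{2})\.(\d{3}) [#*.\-] (.+)$")


def parse_line(line):
    """Return (timestamp, message) for one log line, or None if it does not match."""
    match = LINE_RE.match(line)
    if match is None:
        return None
    day, mon, year, hh, mm, ss, ms, message = match.groups()
    stamp = datetime(int(year), MONTHS[mon], int(day),
                     int(hh), int(mm), int(ss), int(ms) * 1000)
    return stamp, message


def seconds_between(lines, first_event, last_event):
    """Seconds from the first occurrence of first_event to that of last_event."""
    first_seen = {}
    for line in lines:
        parsed = parse_line(line)
        if parsed:
            stamp, message = parsed
            first_seen.setdefault(message.split()[0], stamp)
    return (first_seen[last_event] - first_seen[first_event]).total_seconds()


def tally_votes(lines):
    """Count leader votes: the local '+vote-for-leader' plus '<id> voted for <id>'."""
    votes = Counter()
    for line in lines:
        parsed = parse_line(line)
        if not parsed:
            continue
        parts = parsed[1].split()
        if parts[0] == "+vote-for-leader":
            votes[parts[1]] += 1
        elif len(parts) >= 4 and parts[1:3] == ["voted", "for"]:
            votes[parts[3]] += 1
    return votes


SAMPLE = [
    "1385:X 25 Apr 2020 14:16:45.880 # +sdown master mymaster 192.168.0.107 6379",
    "1385:X 25 Apr 2020 14:16:45.971 # +vote-for-leader 7ed8bb8d42e7d443aa90d3d2cfabc9dbd8f77217 1",
    "1385:X 25 Apr 2020 14:16:45.973 # 8e23c3b5d6d9edc4dbb845dc8b8e858e4ce2142c voted for 8e23c3b5d6d9edc4dbb845dc8b8e858e4ce2142c 1",
    "1385:X 25 Apr 2020 14:16:46.020 # cd48456d2f4342db47efb9f33bf679aa5b611e56 voted for 7ed8bb8d42e7d443aa90d3d2cfabc9dbd8f77217 1",
    "1385:X 25 Apr 2020 14:19:46.900 # +switch-master mymaster 192.168.0.107 6379 192.168.0.110 6379",
]

print(seconds_between(SAMPLE, "+sdown", "+switch-master"))
print(tally_votes(SAMPLE).most_common(1))
```

On these lines, the Sentinel 7ed8bb8d... collects two of the three votes, which is a majority. The +switch-master event arrives about 181 seconds after +sdown, which appears to line up with the +failover-end-for-timeout entry and the configured failover-timeout of 180000 ms.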

As we can see, the new Master is 192.168.0.110, that is, ressmix-dsf03. Running info replication on it confirms that its role is now master:

    192.168.0.110:6379> info replication
    # Replication
    role:master
    connected_slaves:0
    master_replid:07383ad9fb832365fc67b9a578c54a36a21ff274
    master_replid2:f5180409b8a45a9f71eed3c8241bcedbc8986a48
    master_repl_offset:354070
    second_repl_offset:297908
    repl_backlog_active:1
    repl_backlog_size:1048576
    repl_backlog_first_byte_offset:1
    repl_backlog_histlen:354070
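
The same check can be scripted. As a small sketch (parse_info is a hypothetical helper, and the sample is abridged from the output above), the key:value lines of an INFO section can be read into a dictionary so that the role can be checked by a script:

```python
def parse_info(text):
    """Parse the key:value lines of a Redis INFO section into a dict."""
    fields = {}
    for raw in text.splitlines():
        line = raw.strip()
        # Skip blank lines and section headers such as "# Replication".
        if not line or line.startswith("#"):
            continue
        key, sep, value = line.partition(":")
        if sep:
            fields[key] = value
    return fields


# Abridged from the info replication output above.
SAMPLE = """\
# Replication
role:master
connected_slaves:0
master_repl_offset:354070
"""

info = parse_info(SAMPLE)
print(info["role"], info["connected_slaves"])  # prints: master 0
```

Note that connected_slaves is 0 in this snapshot, so at the moment it was taken no Slave had yet connected to the new Master.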

2.2 Recovery

Next, we bring the Redis node on ressmix-dsf01 back up, and it rejoins the cluster as a Slave. We can check the cluster state through any of the Sentinels:

    192.168.0.109:5000> SENTINEL slaves mymaster
    1)  1) "name"
        2) "192.168.0.107:6379"
        3) "ip"
        4) "192.168.0.107"
        5) "port"
        6) "6379"
        7) "runid"
        8) "e7ec2725eca9313417e6823ebf00ac3867c74abc"
        9) "flags"
       10) "slave"
       11) "link-pending-commands"
       12) "0"
       13) "link-refcount"
       14) "1"
       15) "last-ping-sent"
       16) "0"
       17) "last-ok-ping-reply"
       18) "383"
       19) "last-ping-reply"
       20) "383"
       21) "down-after-milliseconds"
       22) "30000"
       23) "info-refresh"
       24) "2896"
       25) "role-reported"
       26) "slave"
       27) "role-reported-time"
       28) "34123"
       29) "master-link-down-time"
       30) "0"
       31) "master-link-status"
       32) "ok"
       33) "master-host"
       34) "192.168.0.110"
       35) "master-port"
       36) "6379"
       37) "slave-priority"
       38) "100"
       39) "slave-repl-offset"
       40) "418255"
    2)  1) "name"
        2) "192.168.0.109:6379"
        3) "ip"
        4) "192.168.0.109"
        5) "port"
        6) "6379"
        7) "runid"
        8) "fe25299543962d47a64197664b281ce0d9e49410"
        9) "flags"
       10) "slave"
       11) "link-pending-commands"
       12) "0"
       13) "link-refcount"
       14) "1"
       15) "last-ping-sent"
       16) "0"
       17) "last-ok-ping-reply"
       18) "300"
       19) "last-ping-reply"
       20) "300"
       21) "down-after-milliseconds"
       22) "30000"
       23) "info-refresh"
       24) "300"
       25) "role-reported"
       26) "slave"
       27) "role-reported-time"
       28) "585068"
       29) "master-link-down-time"
       30) "616000"
       31) "master-link-status"
       32) "err"
       33) "master-host"
       34) "192.168.0.110"
       35) "master-port"
       36) "6379"
       37) "slave-priority"
       38) "100"
       39) "slave-repl-offset"
       40) "297907"

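3. Summary

In this chapter I walked through setting up Redis Sentinel. Working through it hands-on gives a deeper understanding of how Redis high availability works. I encourage readers to follow the steps above on their own machines to reinforce what they have learned.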
