Redis Sentinel Cluster -- Master-Replica Failover Experiment

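Sentinel

Sentinel is essentially Redis running in a special mode. It can be thought of as a "client" of the Redis deployment it watches, and it is usually deployed as a cluster of several Sentinel processes itself.

It only provides high availability for the Redis deployment; it does not guarantee zero data loss.

What it does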

This is the full list of Sentinel capabilities at a macroscopical level (i.e. the big picture):

Monitoring. Sentinel constantly checks if your master and replica instances are working as expected.

Notification. Sentinel can notify the system administrator, or other computer programs, via an API, that something is wrong with one of the monitored Redis instances.

Automatic failover. If a master is not working as expected, Sentinel can start a failover process where a replica is promoted to master, the other additional replicas are reconfigured to use the new master, and the applications using the Redis server are informed about the new address to use when connecting.

Configuration provider. Sentinel acts as a source of authority for clients service discovery: clients connect to Sentinels in order to ask for the address of the current Redis master responsible for a given service. If a failover occurs, Sentinels will report the new address.

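In short:
Monitoring. Sentinel regularly checks whether the master and replica instances are working properly.
Notification. When a monitored Redis instance fails, Sentinel can notify the administrator or other programs through an API.
Automatic failover. If the master fails, Sentinel can run a failover that promotes one replica to master and reconfigures the other replicas to follow the new master; applications using the Redis service are told the new address when they connect.
Configuration provider. Sentinel is the authoritative source for client service discovery: clients connect to the Sentinels to obtain the address of the current master, and after a failover the Sentinels return the new address.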
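
To make the configuration-provider role concrete, here is a minimal sketch of the wire-level exchange, assuming the master was registered under the hypothetical name mymaster. It only builds the RESP request for SENTINEL get-master-addr-by-name and decodes a two-element reply. The helper names are illustrative, and a real client library would also handle sockets, errors, and retries.

```python
from typing import Optional, Tuple


def encode_command(*args: str) -> bytes:
    """Encode a command as a RESP array of bulk strings."""
    parts = [f"*{len(args)}\r\n".encode()]
    for arg in args:
        data = arg.encode()
        parts.append(b"$" + str(len(data)).encode() + b"\r\n" + data + b"\r\n")
    return b"".join(parts)


def parse_master_addr(reply: bytes) -> Optional[Tuple[str, int]]:
    """Decode the reply to SENTINEL get-master-addr-by-name.

    A known master yields an array of two bulk strings (host, port).
    Anything else (for example a null reply) is treated as "unknown".
    """
    lines = reply.split(b"\r\n")
    if lines[0] != b"*2":
        return None
    # lines: ["*2", "$<len>", host, "$<len>", port, ""]
    host = lines[2].decode()
    port = int(lines[4])
    return host, port


# "mymaster" is a placeholder for whatever name the Sentinels monitor.
request = encode_command("SENTINEL", "get-master-addr-by-name", "mymaster")
sample_reply = b"*2\r\n$9\r\n127.0.0.1\r\n$4\r\n6379\r\n"
print(parse_master_addr(sample_reply))
```

After a failover the same request simply returns a different address, which is why clients ask the Sentinels instead of hard-coding the master.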

Data loss scenarios

Loss caused by asynchronous replication

1. The client writes some data, say 'x', to M (the master).
    +----+
    | C  |
    | S3 |
    +----+
       |
    +----+          +----+
    | M  |----+-----| R1 |
    | S1 |          | S2 |
    |'x' |          |    |
    +----+          +----+

2. Because the master replicates to its replicas asynchronously, M can fail before it has sent 'x' to the replica. Sentinel then promotes R1 to master (M'), the client C reconnects to M', and 'x' is lost.
                    +----+
                    | C  |
                    | S3 |
                    +----+
                       |
    +----+          +----+
    | M  |----//----| M' |
    | S1 |          | S2 |
    |'x' |          |    |
    +----+          +----+
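
The two steps above can be reproduced with a toy model. This is only a sketch: the Node class and its pending queue are hypothetical stand-ins for a Redis instance and its replication stream, not Redis internals.

```python
from collections import deque


class Node:
    """A toy Redis node: a dict plus a queue of writes not yet replicated."""

    def __init__(self, name):
        self.name = name
        self.data = {}
        self.pending = deque()  # acknowledged to the client, not yet sent on

    def write(self, key, value):
        # The master replies to the client before the replica has the write.
        self.data[key] = value
        self.pending.append((key, value))
        return "OK"

    def replicate_to(self, replica):
        # Asynchronous replication: runs some time after write() returned.
        while self.pending:
            key, value = self.pending.popleft()
            replica.data[key] = value


master, replica = Node("M"), Node("R1")
master.write("k1", "a")
master.replicate_to(replica)   # this write reaches R1 in time
master.write("k2", "x")        # acknowledged, but M fails before replicating it

new_master = replica           # Sentinel promotes R1 to M'
print(new_master.data)         # the acknowledged write 'x' is missing
```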

Mitigation: set the following in /etc/redis/6379.conf.

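# Require at least 1 replica that the master can replicate writes to.
# 1. If the master itself is in trouble (for example, cut off from the network), it cannot reach any replica.
# 2. If all replicas are down, the master cannot replicate its data, and the deployment cannot serve reads from replicas.
min-slaves-to-write 1

# A replica counts as available only if its replication lag is at most 10 seconds.
min-slaves-max-lag 10

# If fewer than min-slaves-to-write replicas are within min-slaves-max-lag, the master stops accepting client writes.
# (Since Redis 5 these options are also spelled min-replicas-to-write and min-replicas-max-lag.)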

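When that happens, the client should degrade gracefully: buffer the writes in a local cache or on disk, and rate-limit incoming traffic to slow the flood of requests.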
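
The interaction between the two settings and the client-side fallback can be sketched as follows. ToyMaster and ToyClient are hypothetical names for illustration; the only behavior taken from Redis is the rule that a write is refused when too few replicas are within the allowed lag.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class ToyMaster:
    min_slaves_to_write: int = 1
    min_slaves_max_lag: int = 10
    # Seconds since each replica last acknowledged the replication stream.
    replica_lags: List[float] = field(default_factory=list)
    data: dict = field(default_factory=dict)

    def accepts_writes(self) -> bool:
        good = sum(1 for lag in self.replica_lags if lag <= self.min_slaves_max_lag)
        return good >= self.min_slaves_to_write

    def set(self, key, value) -> bool:
        if not self.accepts_writes():
            return False  # Redis would reply with an error here
        self.data[key] = value
        return True


@dataclass
class ToyClient:
    master: ToyMaster
    local_buffer: list = field(default_factory=list)

    def write(self, key, value) -> str:
        if self.master.set(key, value):
            return "written"
        self.local_buffer.append((key, value))  # degrade: keep the write locally
        return "buffered"

    def flush(self) -> int:
        """Replay buffered writes once the master accepts writes again."""
        replayed = 0
        while self.local_buffer and self.master.accepts_writes():
            key, value = self.local_buffer.pop(0)
            self.master.set(key, value)
            replayed += 1
        return replayed


master = ToyMaster(replica_lags=[2.0])
client = ToyClient(master)
first = client.write("a", 1)    # the replica is healthy
master.replica_lags = [30.0]    # the replica falls more than 10 s behind
second = client.write("b", 2)   # the master refuses; the client buffers
master.replica_lags = [1.0]     # the replica catches up
replayed = client.flush()
print(first, second, replayed)
```

The trade-off is deliberate: the master gives up some write availability so that the window of acknowledged but unreplicated writes stays bounded.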

Cluster split-brain


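Auto-discovery

Sentinels discover each other through Redis Pub/Sub. Every Sentinel publishes a message to the __sentinel__:hello channel; all other Sentinels receive it and thereby learn that it exists.
Every 2 seconds, each Sentinel publishes a message to the __sentinel__:hello channel of every master and replica it monitors. The message carries the Sentinel's own IP, port, and run ID, plus its monitoring configuration for that master. Each Sentinel also subscribes to these channels to detect the other Sentinels.
Sentinels also exchange their monitoring configuration for the master with each other, keeping it in sync.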
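
The discovery mechanism can be illustrated with an in-memory stand-in for Pub/Sub. This is a sketch under simplifying assumptions: ToyPubSub and ToySentinel are hypothetical classes, and the hello payload is reduced to ip,port,run_id, whereas the real message also carries the master configuration.

```python
from collections import defaultdict


class ToyPubSub:
    """Stands in for the Pub/Sub channels of one monitored Redis instance."""

    def __init__(self):
        self.subscribers = defaultdict(list)

    def subscribe(self, channel, callback):
        self.subscribers[channel].append(callback)

    def publish(self, channel, message):
        for callback in self.subscribers[channel]:
            callback(message)


HELLO = "__sentinel__:hello"


class ToySentinel:
    def __init__(self, ip, port, run_id, bus):
        self.ip, self.port, self.run_id = ip, port, run_id
        self.known = {}  # run_id -> (ip, port) of the other Sentinels
        self.bus = bus
        bus.subscribe(HELLO, self.on_hello)

    def say_hello(self):
        # Real Sentinels repeat this every 2 seconds.
        self.bus.publish(HELLO, f"{self.ip},{self.port},{self.run_id}")

    def on_hello(self, message):
        ip, port, run_id = message.split(",")
        if run_id != self.run_id:  # ignore our own announcement
            self.known[run_id] = (ip, int(port))


bus = ToyPubSub()
sentinels = [ToySentinel(f"10.0.0.{i}", 26379, f"run-{i}", bus) for i in (1, 2, 3)]
for s in sentinels:
    s.say_hello()
print(sorted(sentinels[0].known))
```

Because discovery goes through the monitored instance, a Sentinel's configuration only needs to name the master, not the other Sentinels.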

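Selecting the replica to promote

Disconnect time. A replica that has been disconnected from the master for more than roughly 10 times the configured timeout (the down-after-milliseconds option) is considered unsuitable to become the new master.
Priority. Among the remaining replicas, the one with the lowest replica-priority (default 100) is preferred.
Replication offset. If priorities are equal, the replica with the larger offset is preferred, because it has received more of the master's data.
Run ID. If offsets are also equal, the replica with the smaller run ID is chosen (every Redis instance is assigned a run ID at startup).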
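
The selection order amounts to a filter followed by a three-part sort key. The sketch below uses a hypothetical Replica record, and it simplifies the disconnect check to a single time limit of 10 times down-after-milliseconds, which is an assumption of this example.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Replica:
    run_id: str
    priority: int          # replica-priority, default 100
    repl_offset: int       # how much of the master's stream was received
    disconnected_ms: int   # time since the link to the master went down


def select_replica(replicas: List[Replica], down_after_ms: int) -> Optional[Replica]:
    # Step 1: drop replicas that have been disconnected for too long.
    max_disconnect = 10 * down_after_ms
    candidates = [r for r in replicas if r.disconnected_ms <= max_disconnect]
    if not candidates:
        return None
    # Steps 2-4: lower priority, then higher offset, then smaller run ID.
    return min(candidates, key=lambda r: (r.priority, -r.repl_offset, r.run_id))


replicas = [
    Replica("c", 100, 500, 0),
    Replica("b", 100, 900, 0),
    Replica("a", 100, 900, 0),
    Replica("z", 50, 999, 400_000),  # best priority, but disconnected too long
]
chosen = select_replica(replicas, down_after_ms=30_000)
print(chosen.run_id)
```

Here "z" is filtered out first, "c" loses on offset, and "a" beats "b" on run ID.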

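Electing the Sentinel that performs the failover

Sentinels do not run a failover concurrently; exactly one Sentinel carries it out. A Sentinel may perform the failover only after a majority of the Sentinels have authorized it.

Raft algorithm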
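
A minimal sketch of the majority rule, assuming each Sentinel grants at most one vote per epoch on a first-come, first-served basis. VotingSentinel and run_election are hypothetical names, and the real protocol involves more state than shown here.

```python
class VotingSentinel:
    def __init__(self, name):
        self.name = name
        self.voted = {}  # epoch -> candidate this Sentinel voted for

    def request_vote(self, epoch, candidate):
        # One vote per epoch: the first request wins, later ones get the same answer.
        return self.voted.setdefault(epoch, candidate)


def run_election(candidate, epoch, sentinels):
    votes = sum(1 for s in sentinels if s.request_vote(epoch, candidate) == candidate)
    majority = len(sentinels) // 2 + 1
    return votes >= majority


sentinels = [VotingSentinel(n) for n in ("S1", "S2", "S3")]
# S1 asks first and collects every vote for epoch 1.
s1_wins = run_election("S1", 1, sentinels)
# S2 asks for the same epoch, but the votes are already taken.
s2_wins = run_election("S2", 1, sentinels)
print(s1_wins, s2_wins)
```

Since a vote cannot be given twice in the same epoch, two Sentinels can never both hold a majority for that epoch, which is what keeps failovers from running in parallel.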

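Propagating the configuration after failover

Each Sentinel keeps a monitoring configuration for the Redis deployment it watches.
The Sentinel that performs the failover obtains a configuration epoch for the master it is failing over. The epoch acts as a version number, and every failover must get a unique one.
If the Sentinel elected first fails to complete the failover, the other Sentinels wait for a period governed by failover-timeout and then take over; the Sentinel that takes over obtains a new configuration epoch as the new version.
After the failover completes, the Sentinel updates its local master configuration and propagates it to the other Sentinels through Pub/Sub messages.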
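
Treating the configuration epoch as a version number suggests a simple update rule on the receiving side: adopt an announced configuration only when its epoch is higher than the local one. The sketch below assumes that rule; MasterConfig and ConfigHolder are hypothetical names, and the addresses are made up.

```python
from dataclasses import dataclass


@dataclass
class MasterConfig:
    name: str
    addr: str
    config_epoch: int


class ConfigHolder:
    """The part of a Sentinel that tracks the current master configuration."""

    def __init__(self, config):
        self.config = config

    def on_hello(self, announced: MasterConfig) -> bool:
        """Adopt an announced config only if its epoch is newer than ours."""
        same_master = announced.name == self.config.name
        if same_master and announced.config_epoch > self.config.config_epoch:
            self.config = announced
            return True
        return False


follower = ConfigHolder(MasterConfig("mymaster", "10.0.0.1:6379", 1))
# Announcement published after a successful failover.
updated = follower.on_hello(MasterConfig("mymaster", "10.0.0.2:6379", 2))
# A late message that still carries the old configuration.
stale = follower.on_hello(MasterConfig("mymaster", "10.0.0.1:6379", 1))
print(follower.config.addr, updated, stale)
```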

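Cluster experiment

Configuration

Authentication between Redis nodes

Any node in the deployment may become a master or a replica, so if authentication is required, a password must be configured on every node. Edit /etc/redis/6379.conf. (Alternatively, set no password on any node.)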

requirepass redispass
masterauth redispass

Sentinel node configuration

Copy the sentinel.conf file from the Redis directory to /etc/redis/sentinel/26379.conf, then edit 26379.conf.

# bind must be set to the hostname or IP of the node's host<