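Moving servers to a new IP subnet is a fairly common event in production. A Redis cluster records the IP:PORT of every node when it is created, so once the IP addresses change, the cluster stops working. How can the cluster be brought back into service while keeping its existing data?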

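In many cases you can simply delete every node's data files (the RDB file named by dbfilename, the AOF file named by appendfilename, and the cluster config file named by cluster-config-file) and rebuild the cluster from scratch. If the data must not be lost, a different approach is needed.
The following walks through a recovery that keeps all data, using a cluster of three masters and three replicas (one replica per master) as the example. The state of the cluster before the change: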

    [root@test1 bin]# ./redis-cli -a password --cluster check 192.168.66.101:7000
    Warning: Using a password with '-a' or '-u' option on the command line interface may not be safe.
    192.168.66.101:7000 (d1ddeaa7...) -> 334 keys | 5461 slots | 1 slaves.
    192.168.66.102:7003 (d21ce248...) -> 341 keys | 5462 slots | 1 slaves.
    192.168.66.101:7001 (bb5c5e76...) -> 325 keys | 5461 slots | 1 slaves.
    [OK] 1000 keys in 3 masters.
    0.06 keys per slot on average.
    >>> Performing Cluster Check (using node 192.168.66.101:7000)
    M: d1ddeaa7c77e35b3df50953fc09834b662cbac8b 192.168.66.101:7000
       slots:[0-5460] (5461 slots) master
       1 additional replica(s)
    M: d21ce2482179af3b76a9f29d870848bae18a3214 192.168.66.102:7003
       slots:[5461-10922] (5462 slots) master
       1 additional replica(s)
    S: 089b2e16dff1f68c399a1efc73580e7cbbbfa71b 192.168.66.101:7002
       slots: (0 slots) slave
       replicates d21ce2482179af3b76a9f29d870848bae18a3214
    S: 92d8208b582c6111bd383b6fdfc2d80a86f47350 192.168.66.102:7005
       slots: (0 slots) slave
       replicates d1ddeaa7c77e35b3df50953fc09834b662cbac8b
    S: ea68bec54e3deb0bd209f151151098ae6d8cf0b4 192.168.66.102:7004
       slots: (0 slots) slave
       replicates bb5c5e768ab4aff9c92d7fd3f2d55007e2736c65
    M: bb5c5e768ab4aff9c92d7fd3f2d55007e2736c65 192.168.66.101:7001
       slots:[10923-16383] (5461 slots) master
       1 additional replica(s)
    [OK] All nodes agree about slots configuration.
    >>> Check for open slots...
    >>> Check slots coverage...
    [OK] All 16384 slots covered.

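Suppose every node originally had an address in 192.168.66.* and the whole cluster has now been moved to the 192.168.77.* subnet. Running the cluster check directly shows that it still tries to reach the old IPs and eventually fails with a timeout: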

    [root@test1 bin]# ./redis-cli -a password --cluster check 192.168.77.101:7000
    Warning: Using a password with '-a' or '-u' option on the command line interface may not be safe.
    Could not connect to Redis at 192.168.66.102:7003: Connection timed out
    ......

First, stop all nodes.
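One possible way to stop the nodes on a host is a small loop over their ports. This is a sketch under assumptions that are not in the original: the helper name stop_nodes, the DRY_RUN switch, and passing the password through the REDISCLI_AUTH environment variable instead of -a are all illustrative choices. With DRY_RUN=1 (the default here) it only prints the commands it would run.

```shell
#!/usr/bin/env bash
# Sketch: ask each local node to save its data and shut down.
# Assumptions (hypothetical): redis-cli lives at ./redis-cli, the password is
# exported as REDISCLI_AUTH, and DRY_RUN=1 means "print, do not execute".

REDIS_CLI="${REDIS_CLI:-./redis-cli}"
DRY_RUN="${DRY_RUN:-1}"

stop_nodes() {
    # Usage: stop_nodes HOST PORT [PORT...]
    host="$1"
    shift
    for port in "$@"; do
        if [ "$DRY_RUN" = "1" ]; then
            # Show the command instead of running it.
            echo "$REDIS_CLI -h $host -p $port shutdown save"
        else
            # SHUTDOWN SAVE forces a final save before the server exits.
            "$REDIS_CLI" -h "$host" -p "$port" shutdown save
        fi
    done
}

# The three nodes on the first host in this article's example.
stop_nodes 192.168.77.101 7000 7001 7002
```

Set DRY_RUN=0 only after checking the printed commands, and repeat on the second host with ports 7003 to 7005.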
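The key file is the cluster config file that each node's cluster-config-file setting points to (in this example, /data/redis/cluster/7000/nodes_7000.conf). Opening it shows the IP and port recorded for every node: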

    [root@test1 ~]# cat /data/redis/cluster/7000/nodes_7000.conf
    d1ddeaa7c77e35b3df50953fc09834b662cbac8b 192.168.66.101:7000@17000 myself,master - 0 1626244031000 1 connected 0-5460
    ea68bec54e3deb0bd209f151151098ae6d8cf0b4 192.168.66.102:7004@17004 slave bb5c5e768ab4aff9c92d7fd3f2d55007e2736c65 0 1626244034813 5 connected
    d21ce2482179af3b76a9f29d870848bae18a3214 192.168.66.102:7003@17003 master - 0 1626244033803 4 connected 5461-10922
    089b2e16dff1f68c399a1efc73580e7cbbbfa71b 192.168.66.101:7002@17002 slave d21ce2482179af3b76a9f29d870848bae18a3214 0 1626244032793 4 connected
    bb5c5e768ab4aff9c92d7fd3f2d55007e2736c65 192.168.66.101:7001@17001 master - 0 1626244030770 2 connected 10923-16383
    92d8208b582c6111bd383b6fdfc2d80a86f47350 192.168.66.102:7005@17005 slave d1ddeaa7c77e35b3df50953fc09834b662cbac8b 0 1626244031782 6 connected
    vars currentEpoch 6 lastVoteEpoch 0

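To see at a glance which addresses a cluster config file still holds, the address and role columns can be pulled out with awk. The sketch below is illustrative only: list_node_addrs is a hypothetical helper name, and it runs against a scratch file containing lines copied from the listing above rather than a live nodes_7000.conf.

```shell
#!/usr/bin/env bash
# Sketch: print "ip:port flags" for every node recorded in a cluster config file.
# list_node_addrs and the scratch file are hypothetical, for illustration only.

list_node_addrs() {
    # Field 2 is ip:port@cport and field 3 holds the flags. The trailing
    # "vars ..." line is not a node entry, so it is skipped.
    awk '$1 != "vars" && NF >= 8 { split($2, addr, "@"); print addr[1], $3 }' "$1"
}

# Scratch copy with two node lines and the vars line from the listing.
sample="$(mktemp)"
cat > "$sample" <<'EOF'
d1ddeaa7c77e35b3df50953fc09834b662cbac8b 192.168.66.101:7000@17000 myself,master - 0 1626244031000 1 connected 0-5460
ea68bec54e3deb0bd209f151151098ae6d8cf0b4 192.168.66.102:7004@17004 slave bb5c5e768ab4aff9c92d7fd3f2d55007e2736c65 0 1626244034813 5 connected
vars currentEpoch 6 lastVoteEpoch 0
EOF

list_node_addrs "$sample"
```

On a real host the same function would be called with the path from cluster-config-file, for example list_node_addrs /data/redis/cluster/7000/nodes_7000.conf.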
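The fix is straightforward: replace the old IPs with the new IPs in every node's cluster config file. Run the matching replacement command on each machine: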

    # Run on the host that was 192.168.66.101 (dots are escaped so they match literally)
    sed -i 's/192\.168\.66\./192.168.77./g' /data/redis/cluster/7000/nodes_7000.conf /data/redis/cluster/7001/nodes_7001.conf /data/redis/cluster/7002/nodes_7002.conf
    # Run on the host that was 192.168.66.102
    sed -i 's/192\.168\.66\./192.168.77./g' /data/redis/cluster/7003/nodes_7003.conf /data/redis/cluster/7004/nodes_7004.conf /data/redis/cluster/7005/nodes_7005.conf

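Because the cluster config file is the only record of the topology, it may be worth keeping a backup and checking the result before restarting anything. The following is a minimal sketch, not the author's procedure: rewrite_nodes_conf, the .bak suffix, and the OLD_PREFIX/NEW_PREFIX variables are hypothetical names, and the demo runs against a scratch file instead of a real nodes_*.conf.

```shell
#!/usr/bin/env bash
# Sketch: back up a cluster config file, rewrite the old subnet prefix,
# then verify that no old address is left.
# All names here are hypothetical and chosen for illustration.

OLD_PREFIX='192.168.66.'
NEW_PREFIX='192.168.77.'

rewrite_nodes_conf() {
    file="$1"
    # Escape the dots so each one matches only a literal "." in the regex.
    old_re="$(printf '%s' "$OLD_PREFIX" | sed 's/\./\\./g')"
    # Keep an untouched copy so the change can be rolled back.
    cp -p "$file" "$file.bak" || return 1
    # Read from the backup and write the rewritten text to the original path.
    sed "s/$old_re/$NEW_PREFIX/g" "$file.bak" > "$file" || return 1
    # Fail loudly if any old address survived the rewrite.
    if grep -q "$old_re" "$file"; then
        echo "old prefix still present in $file" >&2
        return 1
    fi
}

# Demo on a scratch file holding one line from this article's example.
demo="$(mktemp)"
printf '%s\n' 'd1ddeaa7c77e35b3df50953fc09834b662cbac8b 192.168.66.101:7000@17000 myself,master - 0 1626244031000 1 connected 0-5460' > "$demo"
rewrite_nodes_conf "$demo" && cat "$demo"
```

If the restarted cluster does not come up as expected, copying the .bak file back over the config file restores the previous state.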
Then restart all nodes.
Verify the cluster state again:

    [root@test1 bin]# ./redis-cli -a password --cluster check 192.168.77.101:7000
    Warning: Using a password with '-a' or '-u' option on the command line interface may not be safe.
    192.168.77.101:7000 (d1ddeaa7...) -> 334 keys | 5461 slots | 1 slaves.
    192.168.77.102:7003 (d21ce248...) -> 341 keys | 5462 slots | 1 slaves.
    192.168.77.101:7001 (bb5c5e76...) -> 325 keys | 5461 slots | 1 slaves.
    [OK] 1000 keys in 3 masters.
    0.06 keys per slot on average.
    >>> Performing Cluster Check (using node 192.168.77.101:7000)
    M: d1ddeaa7c77e35b3df50953fc09834b662cbac8b 192.168.77.101:7000
       slots:[0-5460] (5461 slots) master
       1 additional replica(s)
    S: 92d8208b582c6111bd383b6fdfc2d80a86f47350 192.168.77.102:7005
       slots: (0 slots) slave
       replicates d1ddeaa7c77e35b3df50953fc09834b662cbac8b
    M: d21ce2482179af3b76a9f29d870848bae18a3214 192.168.77.102:7003
       slots:[5461-10922] (5462 slots) master
       1 additional replica(s)
    M: bb5c5e768ab4aff9c92d7fd3f2d55007e2736c65 192.168.77.101:7001
       slots:[10923-16383] (5461 slots) master
       1 additional replica(s)
    S: 089b2e16dff1f68c399a1efc73580e7cbbbfa71b 192.168.77.101:7002
       slots: (0 slots) slave
       replicates d21ce2482179af3b76a9f29d870848bae18a3214
    S: ea68bec54e3deb0bd209f151151098ae6d8cf0b4 192.168.77.102:7004
       slots: (0 slots) slave
       replicates bb5c5e768ab4aff9c92d7fd3f2d55007e2736c65
    [OK] All nodes agree about slots configuration.
    >>> Check for open slots...
    >>> Check slots coverage...
    [OK] All 16384 slots covered.
    [root@test1 bin]# ./redis-cli -a password --cluster info 192.168.77.101:7000
    Warning: Using a password with '-a' or '-u' option on the command line interface may not be safe.
    192.168.77.101:7000 (d1ddeaa7...) -> 334 keys | 5461 slots | 1 slaves.
    192.168.77.102:7003 (d21ce248...) -> 341 keys | 5462 slots | 1 slaves.
    192.168.77.101:7001 (bb5c5e76...) -> 325 keys | 5461 slots | 1 slaves.
    [OK] 1000 keys in 3 masters.
    0.06 keys per slot on average.

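The cluster is healthy again and the nodes can reach one another. The key counts are identical to those from before the IP change (334, 341, and 325 keys, 1000 in total), so no data was lost.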
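Comparing the key totals by eye works for a small cluster; for a scripted check, the total can be extracted from saved --cluster info output. The sketch below assumes the summary line keeps the "[OK] N keys in M masters." shape shown in the listings, and total_keys is a hypothetical helper name.

```shell
#!/usr/bin/env bash
# Sketch: extract the cluster-wide key total from `redis-cli --cluster info`
# output and compare the values captured before and after the IP change.
# total_keys is a hypothetical helper, not part of redis-cli.

total_keys() {
    # Reads the command output on stdin and prints N from
    # the "[OK] N keys in M masters." summary line.
    awk '$1 == "[OK]" && $3 == "keys" && $4 == "in" { print $2 }'
}

# Summary lines taken from this article's listings before and after the change.
before='[OK] 1000 keys in 3 masters.'
after='[OK] 1000 keys in 3 masters.'

if [ "$(printf '%s\n' "$before" | total_keys)" = "$(printf '%s\n' "$after" | total_keys)" ]; then
    echo "key totals match"
else
    echo "key totals differ" >&2
fi
```

Against a live cluster the function would be fed directly, for example ./redis-cli -a password --cluster info 192.168.77.101:7000 2>/dev/null | total_keys.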
Test reads and writes against the cluster:

    [root@test1 bin]# ./redis-cli -a password -c -h 192.168.77.101 -p 7000
    Warning: Using a password with '-a' or '-u' option on the command line interface may not be safe.
    192.168.77.101:7000> keys *
     1) "name725"
     2) "name359"
    ......
    192.168.77.101:7000> get name7
    "hellon"
    192.168.77.101:7000> get name400
    -> Redirected to slot [11448] located at 192.168.77.101:7001
    "hellon"
    192.168.77.101:7001> set testkey 'testvalue'
    -> Redirected to slot [4757] located at 192.168.77.101:7000
    OK
    192.168.77.101:7000> get testkey
    "testvalue"

The existing data reads back correctly and newly written keys can be stored and retrieved, so the cluster is fully functional again.
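Summary
The whole recovery takes only three steps: stop all nodes, replace the old IPs with the new IPs in every node's cluster-config-file, and restart all nodes. On restart each node loads the updated addresses from its cluster config file, while node IDs, slot assignments, and data files stay untouched, so the cluster comes back with its original data. The same approach should carry over to other subnet migrations in which only the IP addresses change, and it avoids the data loss that comes with wiping the nodes and rebuilding the cluster.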
