backend static
server node1.cluster.com 2.2.2.5:80 check inter 2000 rise 2 fall 5
backend dynamic
server node2.cluster.com 2.2.2.6:80 check inter 2000 rise 2 fall 5
For example, with five nodes A, B, C, D, E:
A and B are in the same failover domain, so the service can only fail over between A and B.
B, C and D are in the same failover domain, with B at priority 2, C at priority 1 and D at priority 3; if C goes down, the service fails over to B first (a lower priority number means higher priority).
C, D and E are in the same failover domain with no priorities configured; if one of them goes down, the service fails over arbitrarily to either of the other two.
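In cluster.conf, a prioritized failover domain like the B/C/D one above looks roughly like the sketch below (the domain name and node hostnames are made up for illustration):
<rm>
    <failoverdomains>
        <failoverdomain name="bcd_domain" ordered="1" restricted="1" nofailback="0">
            <failoverdomainnode name="B.cluster.com" priority="2"/>
            <failoverdomainnode name="C.cluster.com" priority="1"/>
            <failoverdomainnode name="D.cluster.com" priority="3"/>
        </failoverdomain>
    </failoverdomains>
</rm>
Here ordered="1" turns priority ordering on, restricted="1" confines the service to these three nodes, and nofailback="0" lets the higher-priority node take the service back when it returns (the same "No Failback" option discussed later).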
3. On the machine where luci is installed, start luci:
# /etc/init.d/luci start
Point your web browser to https://li.cluster.com:8084 (or equivalent) to access luci
--the URL printed at startup is the address you will later use to reach the conga web interface
# chkconfig luci on
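If you want to confirm luci is actually listening (an optional quick check, using the default 8084 port shown in the startup message):
# lsof -i:8084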
On the machines where ricci is installed (node1 and node2), start ricci:
# /etc/init.d/ricci start
# lsof -i:11111
# chkconfig ricci on
4. Manually try to start the cman service (the cluster manager) on node1 and node2; you will see the error below, which we need to fix in advance, because later, when the cluster is configured through the web UI, cluster startup will automatically start cman, rgmanager (the service that handles active/standby switchover) and modclusterd in that order;
if it is not fixed here, cman will also fail to start when the cluster is brought up after the graphical configuration.
# /etc/init.d/cman start
Starting cluster:
Checking if cluster has been disabled at boot... [ OK ]
Checking Network Manager...
Network Manager is either running or configured to run. Please disable it in the cluster.
[FAILED] --startup fails because of a conflict with the NetworkManager service, so you have to disable it with chkconfig NetworkManager off (merely stopping the service is not enough), and then start cman again
Stopping cluster:
Leaving fence domain... [ OK ]
Stopping gfs_controld... [ OK ]
Stopping dlm_controld... [ OK ]
Stopping fenced... [ OK ]
Stopping cman... [ OK ]
Unloading kernel modules... [ OK ]
Unmounting configfs... [ OK ]
So do the following steps on all nodes (node1 and node2):
# /etc/init.d/NetworkManager stop
# chkconfig NetworkManager off
# chkconfig cman on
# chkconfig rgmanager on
# chkconfig modclusterd on
--no need to start these here; the cluster has not been configured yet, so they would not start anyway; we are only enabling them to start at boot
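To double-check the boot settings on each node, something like this works (just a verification of the chkconfig changes above):
# chkconfig --list | egrep 'NetworkManager|cman|rgmanager|modclusterd'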
All nodes (node1 and node2) need an authentication password set for the ricci user (a non-interactive variant is noted after the output below)
# passwd ricci --the password set here is the one used for node communication (luci authenticating to the ricci agents)
Changing password for user ricci.
New password:
BAD PASSWORD: it is WAY too short
BAD PASSWORD: is too simple
Retype new password:
passwd: all authentication tokens updated successfully.
--see figure rhcs02.png
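If you prefer to set the password without the prompts (a sketch; 'YourRicciPassword' is only a placeholder, use your own), passwd --stdin can be used:
# echo 'YourRicciPassword' | passwd --stdin ricci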
If the ricci password is entered incorrectly, the following error is reported when creating the cluster:
The following errors occurred while creating cluster "web_ha": Authentication to the ricci agent at node1.cluster.com:11111 failed, Authentication to the ricci agent at node2.cluster.com:11111 failed
9. Go back to the conga web interface and start the service that was just configured.
Tick the checkbox in front of apache_service and then click Start at the top (check whether the 'Automatically Start This Service' option was ticked in the earlier rhcs08.png figure; if it was, there is no need to click Start here, since the service starts automatically on the primary node once httpd is installed).
After refreshing the page you can see the status; in my case apache started on node1.cluster.com.
--see rhcs10.png
Alternatively, you can check from the shell on either node (node1 or node2) with the clustat command (a continuously refreshing variant is noted after the output below):
[root@node1 ~]# clustat
Cluster Status for web_ha @ Sat Feb 22 15:06:43 2014
Member Status: Quorate
Member Name ID Status
------ ---- ---- ------
node1.cluster.com 1 Online, Local, rgmanager
node2.cluster.com 2 Online, rgmanager
Service Name Owner (Last) State
------- ---- ----- ------ -----
service:apache_service node1.cluster.com started
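If you want clustat to keep refreshing instead of printing once, it can be run with an interval (the 2-second value is just an example):
[root@node1 ~]# clustat -i 2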
================================================================
--On rhel6.5 I ran into the problems below; I could not find anything about them online, and my preliminary conclusion is that they are stability issues with this release on rhel6.5
Starting cluster "web_ha" service "apache_service" from node "node1.cluster.com" failed: parseXML(): couldn't parse xml
or the following error:
Restarting cluster "web_ha" service "apache_service" from node "node1.cluster.com" failed: apache_service is in unknown state 115
Member Name ID Status
------ ---- ---- ------
node1.cluster.com 1 Offline --here you can see node1 has gone offline
node2.cluster.com 2 Online, Local, rgmanager
Service Name Owner (Last) State
------- ---- ----- ------ -----
service:apache_service node2.cluster.com started --here you can see node2 has taken over the resources
Member Name ID Status
------ ---- ---- ------
node1.cluster.com 1 Online, rgmanager --after node1 comes back up successfully, it shows as online again
node2.cluster.com 2 Online, Local, rgmanager
Service Name Owner (Last) State
------- ---- ----- ------ -----
service:apache_service node1.cluster.com started --here, once node1 has rebooted OK, it takes the resources back, because I did not tick the 'No Failback' option earlier (meaning the higher-priority node reclaims the service once it is back up)
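In cluster.conf this option shows up as the nofailback attribute on the failover domain (as in the earlier sketch); a quick way to see its current value on a node:
# grep nofailback /etc/cluster/cluster.conf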
You can also look at the /var/log/messages log on node2 to see what happened during this process (a filtered view is noted after the raw log below).
The relevant log entries are as follows:
Jul 19 20:31:46 node2 rgmanager[4926]: Member 1 shutting down
Jul 19 20:31:46 node2 rgmanager[4926]: Starting stopped service service:apache_service
Jul 19 20:31:47 node2 rgmanager[11725]: [ip] Adding IPv4 address 192.168.122.100/24 to eth2
Jul 19 20:31:49 node2 avahi-daemon[1452]: Registering new address record for 192.168.122.100 on eth2.IPv4.
Jul 19 20:31:51 node2 rgmanager[11826]: [script] Executing /etc/init.d/httpd start
Jul 19 20:31:52 node2 rgmanager[4926]: Service service:apache_service started
Jul 19 20:32:16 node2 corosync[4657]: [QUORUM] Members[1]: 2
Jul 19 20:32:16 node2 corosync[4657]: [TOTEM ] A processor joined or left the membership and a new membership was formed.
Jul 19 20:32:16 node2 corosync[4657]: [CPG ] chosen downlist: sender r(0) ip(192.168.122.140) ; members(old:2 left:1)
Jul 19 20:32:16 node2 corosync[4657]: [MAIN ] Completed service synchronization, ready to provide service.
Jul 19 20:32:16 node2 kernel: dlm: closing connection to node 1
Jul 19 20:32:24 node2 rgmanager[12138]: [script] Executing /etc/init.d/httpd status
Jul 19 20:32:49 node2 corosync[4657]: [TOTEM ] A processor joined or left the membership and a new membership was formed.
Jul 19 20:32:49 node2 corosync[4657]: [QUORUM] Members[2]: 1 2
Jul 19 20:32:49 node2 corosync[4657]: [QUORUM] Members[2]: 1 2
Jul 19 20:32:49 node2 corosync[4657]: [CPG ] chosen downlist: sender r(0) ip(192.168.122.139) ; members(old:1 left:0)
Jul 19 20:32:49 node2 corosync[4657]: [MAIN ] Completed service synchronization, ready to provide service.
Jul 19 20:32:54 node2 rgmanager[12560]: [script] Executing /etc/init.d/httpd status
Jul 19 20:32:57 node2 kernel: dlm: got connection from 1
Jul 19 20:33:09 node2 rgmanager[4926]: State change: node1.cluster.com UP
Jul 19 20:33:09 node2 rgmanager[4926]: Relocating service:apache_service to better node node1.cluster.com
Jul 19 20:33:10 node2 rgmanager[4926]: Stopping service service:apache_service
Jul 19 20:33:10 node2 rgmanager[12778]: [script] Executing /etc/init.d/httpd stop
Jul 19 20:33:10 node2 rgmanager[12843]: [ip] Removing IPv4 address 192.168.122.100/24 from eth2
Jul 19 20:33:10 node2 avahi-daemon[1452]: Withdrawing address record for 192.168.122.100 on eth2.
Jul 19 20:33:21 node2 rgmanager[4926]: Service service:apache_service is stopped
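If the log is noisy, filtering for just the rgmanager lines makes the sequence above easier to follow (a simple grep; adjust the line count to taste):
[root@node2 ~]# grep rgmanager /var/log/messages | tail -30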
3. We use KVM to emulate the fencing power device, and the command for that is fence_virsh; however, the conga configuration UI has no fence_virsh option. It does have an 'APC power switch' option, which calls the fence_apc command, so we play a trick here so that when fence_apc is called, fence_virsh actually runs.
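One common way to pull off this trick (a sketch; this assumes both agents live in /usr/sbin and is not necessarily the exact method used in the original setup) is to move the real fence_apc aside and point its name at fence_virsh:
# mv /usr/sbin/fence_apc /usr/sbin/fence_apc.orig --keep the original agent as a backup
# ln -s /usr/sbin/fence_virsh /usr/sbin/fence_apc --calls to fence_apc now actually run fence_virsh
This can work because the stock fence agents read their arguments as key=value pairs on stdin (ipaddr, login, passwd, port), so fence_virsh understands what fenced would have passed to fence_apc.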
A summary of some common problems:
1. Generally speaking, after you modify the configuration in the conga web interface and click Submit, the changed configuration is applied to all nodes.
You can check the configuration version number in /etc/cluster/cluster.conf on every node to confirm that both sides match (see the quick check after the ccs_sync example below).
If the two sides differ, you can scp the cluster.conf with the newer version (the larger config_version number) over to the other node so that both sides match.
You can also use the ccs_sync command to synchronize. Below is the process of syncing cluster.conf from node2 to node1.
[root@node2 ~]# ccs_sync --the first sync asks for the local node's ricci password first, then the other node's ricci password
You have not authenticated to the ricci daemon on node2.cluster.com
Password:
You have not authenticated to the ricci daemon on node1.cluster.com
Password:
[root@node2 ~]# ccs_sync --the second sync no longer asks for a password
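To compare the version numbers mentioned in this item, grepping the config_version attribute on each node is enough (the number and attribute layout below are only illustrative):
# grep config_version /etc/cluster/cluster.conf
<cluster config_version="12" name="web_ha">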