Discussion:
[ovirt-users] Strange fencing behaviour 3.5.3
Martin Breault
2015-09-11 19:14:23 UTC
Hello,

I manage 2 oVirt clusters that are not associated in any way; each has
its own management engine running ovirt-engine-3.5.3.1-1. The servers
are Dell 6xx series, power management is configured using the idrac5
settings, and each cluster is a pair of hypervisors.

The engines are both in a datacenter that had an electrical issue; each
cluster is at a different, unrelated location. The problem was caused by
a downed switch: the individual engines continued to function but no
longer had connectivity to their respective clusters. Once the switch
was replaced (about 30 minutes of downtime) and connectivity was
restored, both engines chose to fence one of the two "unresponsive
hypervisors" by sending an iDRAC command to power it down.

For some reason, the downed hypervisor in Cluster1 received an iDRAC
power-up command 8 minutes later. When I logged into the engine, the
guests that had been running on the powered-down host were in the "off"
state, and I simply powered them back on.

The downed hypervisor in Cluster2 stayed off and was unresponsive
according to the engine, but the VMs that had been running on it were in
an unknown state. I had to power on the host and confirm the "host has
been rebooted" dialog before the cluster would free these guests to be
booted again.

My question is: is it normal for the engine to fence one or more hosts
when it loses connectivity to all the hypervisors in the cluster? Is
there a minimum of 3 hosts per cluster required to avoid this behaviour?
I'd like to know what I can troubleshoot, or how I can avoid an issue
like this, should the engine be disconnected from the hypervisors
temporarily and then regain connectivity only to kill the well-running
guests.

Thanks in advance,

Marty
Martin Perina
2015-09-15 07:51:56 UTC
Hi,

sorry for the late response, I somehow missed your email :-(

I cannot completely understand your exact issue from the description,
but the situation where the engine loses connection to all hypervisors
is always bad. Fortunately, we made a few improvements in 3.5 which
should help in these scenarios. Please take a look at the "Fencing
policy" tab in the "Edit cluster" dialog (a configuration sketch follows
the list of options below):

1. Skip fencing if host has live lease on storage
   - when a host is connected to storage, it has to renew its
     storage lease at least every 60 seconds
   - so if this option is enabled and the engine tries to fence the
     host using a fence proxy (another host in the cluster/DC which
     has a good connection), the fence proxy checks whether the
     non-responsive host renewed its storage lease within the last
     90 seconds; if the lease was renewed, fencing is aborted

2. Skip fencing on cluster connectivity issues
   - if this option is enabled, the engine checks, prior to fencing,
     how many of the hosts in the cluster have connectivity issues;
     if the percentage of hosts with connectivity issues is higher
     than the configured threshold, fencing is aborted
   - of course this option is useless in clusters with fewer than
     3 hosts

3. Enable fencing
   - by disabling this option you can completely disable fencing
     for hosts in the cluster
   - this is useful when you expect connectivity issues between the
     engine and hosts (for example during a switch replacement): you
     can disable fencing, replace the switch and, when the connection
     is restored, enable fencing again
   - however, if you disable fencing completely, your HA VMs won't
     be restarted on different hosts, so please use this option
     with caution
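
If you prefer to script these settings instead of using the "Edit
cluster" dialog, the same fencing policy can be set through the engine
API. Below is a minimal sketch using the oVirt Python SDK in its
version 4 style (the 3.5-era SDK differs); the engine URL, credentials,
cluster name and the 50% threshold are example values only, not taken
from this thread:

# Minimal sketch: configure the cluster fencing policy via the oVirt
# Python SDK (version 4 API). Engine URL, credentials, cluster name and
# the threshold value below are example values.
import ovirtsdk4 as sdk
import ovirtsdk4.types as types

connection = sdk.Connection(
    url='https://engine.example.com/ovirt-engine/api',
    username='admin@internal',
    password='password',
    insecure=True,  # example only; verify the CA certificate in production
)

clusters_service = connection.system_service().clusters_service()
cluster = clusters_service.list(search='name=Cluster1')[0]
cluster_service = clusters_service.cluster_service(cluster.id)

# Keep fencing enabled, but skip it when the non-responsive host still
# renews its storage lease, or when the percentage of hosts in the
# cluster with connectivity issues exceeds 50%.
cluster_service.update(
    types.Cluster(
        fencing_policy=types.FencingPolicy(
            enabled=True,
            skip_if_sd_active=types.SkipIfSdActive(enabled=True),
            skip_if_connectivity_broken=types.SkipIfConnectivityBroken(
                enabled=True,
                threshold=50,
            ),
        ),
    ),
)

connection.close()

Note that, as mentioned in option 2 above, the connectivity-based check
cannot help in a two-host cluster; in that case the storage-lease check
and temporarily disabling fencing are the relevant knobs.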

Please let me know if you have any other issues/questions with fencing.

Thanks

Martin Perina

