We had a WAN router upgrade, which on its own should not have affected LAN connectivity. However, to take advantage of the scheduled downtime, a firmware upgrade was performed on the network switches as well. The firmware upgrade required rebooting the switches, and this caused all the vSphere hosts to go into isolated mode: with no LAN connectivity, all heartbeat monitoring failed. As a result, all the virtual machines were shut down according to the host isolation response setting.
As part of the VMware HA configuration, “Enable Host Monitoring” is enabled.
Host monitoring needs to be enabled in order for vSphere hosts within the same cluster to monitor each other's heartbeats. This enables restarting virtual machines on another host if the original host fails. VMware Fault Tolerance (FT) also requires host monitoring to be enabled for its recovery process to work properly.
In our case, the vSphere hosts did not fail; they all lost their network connectivity. This means all the vSphere hosts were still running but could not communicate with each other (no heartbeat) and could not ping the default gateway. In this situation, every vSphere host is considered isolated and the Host Isolation Response kicks in to handle the virtual machines.
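The isolation check described above can be sketched in a few lines of Python. This is only an illustration of the reasoning, not VMware's implementation; the function name `host_is_isolated` and its two boolean inputs are made up for this example.

```python
def host_is_isolated(peer_heartbeat_seen: bool, gateway_reachable: bool) -> bool:
    """Simplified model: a running host declares itself isolated only when
    it hears no heartbeats AND cannot ping the default gateway."""
    return not peer_heartbeat_seen and not gateway_reachable


# Hypothetical scenarios: (peer_heartbeat_seen, gateway_reachable)
scenarios = {
    "normal operation": (True, True),
    "no heartbeats, gateway still reachable": (False, True),
    "switches rebooting (our case)": (False, False),
}

for name, (heartbeat, gateway) in scenarios.items():
    print(f"{name}: isolated={host_is_isolated(heartbeat, gateway)}")
```

Note that a host that has actually failed never runs this check at all, which is why a LAN outage and a host failure end up being handled differently.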
The Host Isolation Response can be set to one of the following.
- Leave powered on: As implied, virtual machines continue to run.
- Power off: Virtual machines are powered off forcibly, without a guest shutdown.
- Shut down: Perform a graceful shutdown. VMware Tools needs to be installed on the virtual machines. Virtual machines without VMware Tools, or that exceed the shutdown timeout, are powered off instead.
Shut down was the default setting for us and so all our virtual machines were shut down.
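To make the three options concrete, here is a small Python model of what happens to a single virtual machine on an isolated host. It is a sketch of the behaviour described above, not VMware code; the enum, the function `action_for_vm` and the returned strings are all names invented for this example.

```python
from enum import Enum


class IsolationResponse(Enum):
    LEAVE_POWERED_ON = "leave powered on"
    POWER_OFF = "power off"
    SHUT_DOWN = "shut down"


def action_for_vm(response: IsolationResponse,
                  tools_installed: bool,
                  shutdown_timed_out: bool = False) -> str:
    """Return what happens to one VM when its host is isolated."""
    if response is IsolationResponse.LEAVE_POWERED_ON:
        return "keep running"
    if response is IsolationResponse.POWER_OFF:
        return "hard power off"
    # SHUT_DOWN: graceful only with VMware Tools and a guest that stops in time.
    if tools_installed and not shutdown_timed_out:
        return "graceful shutdown"
    return "hard power off"


print(action_for_vm(IsolationResponse.SHUT_DOWN, tools_installed=True))
print(action_for_vm(IsolationResponse.SHUT_DOWN, tools_installed=False))
```

With Shut down selected, as it was for us, every virtual machine ends up stopped one way or the other; only Leave powered on would have kept them running through the switch reboot.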
To prevent this from happening, host monitoring should be disabled before performing the firmware upgrade. Disabling host monitoring suspends the Host Isolation Response. Re-enable host monitoring after the firmware upgrade has completed.
To add a bit of complication, our vCenter is a virtual machine, so we needed to connect directly to its vSphere host using the vSphere Client to bring up vCenter first.
It sure is no fun powering up all the virtual machines and making sure all of them are still intact. On the bright side, it happened within the scheduled downtime and we managed to bring the virtual machines back in time.
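If the maintenance is scripted, the disable, upgrade, re-enable sequence fits naturally into a context manager, so that host monitoring is switched back on even when the upgrade step fails. The sketch below is plain Python and does not talk to vCenter: `set_host_monitoring` is a hypothetical callback standing in for whatever you use to change the cluster setting (the vSphere Client, PowerCLI, an API call), and the fake version here only records the calls.

```python
from contextlib import contextmanager


@contextmanager
def host_monitoring_suspended(set_host_monitoring):
    """Disable host monitoring for the duration of the block, then
    re-enable it, even if the block raises an exception."""
    set_host_monitoring(False)
    try:
        yield
    finally:
        set_host_monitoring(True)


# Stand-in for the real cluster change; it only records what was requested.
calls = []

def fake_set_host_monitoring(enabled: bool) -> None:
    calls.append(enabled)


with host_monitoring_suspended(fake_set_host_monitoring):
    print("upgrading switch firmware...")  # placeholder for the real work

print(calls)  # [False, True]
```

The `finally` clause is the important part: forgetting to re-enable host monitoring would leave the cluster without HA protection after the maintenance window.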
Hi,
I'm facing the same problem you described in your post.
How did you manage to get the hosts out of isolation?
Meanwhile I managed to get the VM with the vCenter on it working again. Restarting the management and vpxa services didn't do the trick.
thanks in advance.
regards.
My hosts went into isolation because of a temporary LAN outage. Everything went back to normal once LAN connectivity was restored.
If you keep facing this issue, then you probably need to check on the network side.