When an All Paths Down (APD) event occurs, LUNs connected to ESXi may remain inaccessible after the paths to the LUNs recover. The 140-second APD timeout expires even though the paths to storage have recovered. You see the following events, in this order, in /var/log/vmkernel.log:
- Device enters APD.
- Device exits APD.
- Heartbeat recovery and filesystem operations on the device fail due to "timeout" or "not found" or "busy".
- The APD timeout expires despite the fact that the device exited APD previously.
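If you want to check a copy of the log for this pattern, the sequence above can be sketched as a small Python scan. This is only a rough sketch: the three message fragments are assumptions about how the APD events are worded in vmkernel.log and may differ between ESXi builds, and the function name `find_stale_apd_timeouts` is made up for this example. Adjust the strings to whatever your log actually contains.

```python
import re

# Assumed message fragments (not guaranteed for every ESXi build);
# compare against your own vmkernel.log and adjust if needed.
ENTER = "has entered the All Paths Down state"
EXIT = "has exited the All Paths Down state"
TIMEOUT = "has entered the All Paths Down Timeout state"

# Assumes the device ID is logged as: identifier [naa.xxxx]
DEVICE_RE = re.compile(r"identifier \[([^\]]+)\]")


def find_stale_apd_timeouts(lines):
    """Return device IDs whose APD timeout fired after the device had already exited APD."""
    last_event = {}  # device ID -> most recent APD event seen
    affected = []
    for line in lines:
        match = DEVICE_RE.search(line)
        if not match:
            continue  # not an APD event line (e.g. heartbeat errors)
        device = match.group(1)
        if TIMEOUT in line:
            # Timeout directly after an exit is the problem pattern.
            if last_event.get(device) == "exit" and device not in affected:
                affected.append(device)
            last_event[device] = "timeout"
        elif ENTER in line:
            last_event[device] = "enter"
        elif EXIT in line:
            last_event[device] = "exit"
    return affected
```

To use it, pass the lines of a log file, for example `find_stale_apd_timeouts(open("vmkernel.log"))`. Any device it returns entered APD, exited APD, and still hit the APD timeout afterwards.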
This is a real problem in production environments: the LUN does not recover after being gone for only a few seconds, even if it is back online well within the 140-second timeout.
Because the LUN is no longer accessible, all virtual machines on it are affected (down), and some hosts may no longer be manageable in vCenter. The hosts themselves continue to operate, but quite often they show up as disconnected in vCenter.
To resolve the issue, you need to kill the outstanding I/O to the LUN and then reboot the ESXi host.
Kill the outstanding I/O: http://kb.vmware.com/kb/2014155