How to fix a frozen Virtual Machine that is stuck at 95% or timed out after trying to power off or restart
That's quite the catchy title, eh? Well, I couldn't think of any other way to put it.
I have had this problem 3 times now, so I'm blogging about it to potentially save other people some time.
Symptoms: Your virtual machine has frozen. You open up the console and you get no mouse movement, it's literally frozen. It will not respond to ping requests, RDP, anything.
UPDATE 10/15/2009: THIS METHOD WORKS FOR ESX 3.5. HAVE NOT TESTED ON VSPHERE. THIS WILL NOT WORK ON ESXi 4.
"@LeGrandMeaulnes @KendrickColeman So you may want to make a note ... those steps don't work in ESXi 4. No vmware-cmd, nothing useful in /proc."
The thing every n00b (myself included) tries is to hit either the restart or power off button from Virtual Center. You will notice that once this happens, it starts the process and then dies at 95%. After about 7-10 minutes you will get an error from Virtual Center saying 'the machine is no longer responding' (or something like that).
Try this instead of hitting the power button: set up a PuTTY session to the host that has the frozen VM running.
Use 'cd' to get into the LUN or share where the VM and its files are held. In this case, we will be using a machine by the name of FRX.
Once you get to the VM's folder and its contents, view the state of the VM by typing the command: vmware-cmd FRX.vmx getstate
You will see if the VM is on or off. If it's off, you're done. If it's on, we still have more work to do.
Try running the command:
vmware-cmd FRX.vmx stop hard
This command forces, well, a hard stop on the process. More than likely, this will error out and you will not be able to stop the machine.
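The getstate check and the hard stop above could be wrapped up in a small script. This is only a rough sketch: the helper names parse_state and stop_if_running are made up for illustration, and it assumes vmware-cmd reports the state as a line of the form "getstate() = on" or "getstate() = off".

```shell
#!/bin/sh
# Sketch only: helper names are hypothetical, and the "getstate() = <state>"
# output format is an assumption about vmware-cmd on ESX 3.5.

# Read vmware-cmd getstate output on stdin and print just the state word.
parse_state() {
    sed -n 's/^getstate() = \(.*\)$/\1/p'
}

# Try a hard stop only if the VM does not already report "off".
# $1 = path to the .vmx file
stop_if_running() {
    state=$(vmware-cmd "$1" getstate | parse_state)
    if [ "$state" = "off" ]; then
        echo "$1 is already off"
    else
        echo "$1 reports '$state', trying a hard stop"
        vmware-cmd "$1" stop hard
    fi
}

# Usage on the host, from the VM's folder:
#   stop_if_running FRX.vmx
```

If the hard stop errors out, as it probably will, carry on with the manual steps.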
The next task is to find the VMID of your particular VM
From your PuTTY session type in:
A list of all the VMs running on that host will scroll by. In the left column you will see a VMID. Remember the VMID of the frozen VM you are trying to fix.
Type in (note *VMID* is where you type in your VMID):
less -S /proc/vmware/vm/*VMID*/cpu/status
This will open up a screen where you will see information about the particular VM. Scroll to the right by pressing the right arrow. Close to the end of the row there will be a column that says:
That SOME-NUMBER is usually 1 less than the VMID. Press q to quit.
Now it is time to kill the running process on that VM. To run the kill command on that ID type (note: *VMID-1* is the number found in the previous step that is represented by SOME-NUMBER):
/usr/lib/vmware/bin/vmkload_app -k 9 *VMID-1*
Give it a few minutes and continually check the state of the VM by typing: vmware-cmd FRX.vmx getstate
Hopefully in a few minutes you will see that your VM has powered off.
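Instead of retyping getstate by hand, the waiting could be done with a small polling loop. This is a minimal sketch, not something from the original procedure: wait_for_off is a hypothetical helper, and it assumes the state command prints the word "off" once the VM is powered down.

```shell
#!/bin/sh
# Sketch only: wait_for_off and its arguments are hypothetical helpers.

# Poll a state-reporting command until its output contains "off".
# $1 = maximum number of checks, $2 = seconds between checks,
# remaining arguments = the command to run for each check.
wait_for_off() {
    tries=$1
    delay=$2
    shift 2
    i=1
    while [ "$i" -le "$tries" ]; do
        # Assumes a powered-off VM is reported with the word "off".
        if "$@" | grep -q 'off'; then
            echo "powered off after $i check(s)"
            return 0
        fi
        sleep "$delay"
        i=$((i + 1))
    done
    echo "still not off after $tries checks"
    return 1
}

# Usage on the host, from the VM's folder (20 checks, 30 seconds apart):
#   wait_for_off 20 30 vmware-cmd FRX.vmx getstate
```

The check count and delay are arbitrary; pick whatever matches how long you are willing to wait.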
If you continue to have problems (as I did), here are some more methods to try after doing the above. This is from my experience, so symptoms and results may differ.
Once the machine powered off, Virtual Center said the machine was 'invalid'. To remedy this I tried removing the VM from Virtual Center's inventory, going back into the LUN, and re-adding the VM to inventory.
At this point, Virtual Center got stuck on the 'processing' stage and froze up. I had to reboot my Virtual Center server.
Upon reboot of my Virtual Center server, I tried re-adding the machine into inventory and it went much faster. I then tried to power on the machine, but it would get to about 90% and then fail. After some investigation I found out that the FRX-flat.vmdk file was locked by another host. A way to remedy this is by restarting the vpxa service on every host. I decided (well, the VMware rep decided) it was best to try cloning.
I opened up the VM settings under Virtual Center and removed the locked 'hard drive' from the machine, then clicked OK to reconfigure the virtual machine.
I opened up a PuTTY session to the host that held the VM and changed directories until I got to the storage for the VM.
Once I was in the folder of the VM I typed:
vmkfstools -i FRX-flat.vmdk FLR_new-flat.vmdk
This command clones the drive, giving you a copy that isn't locked by another host. At this point you will have to wait for the disk to clone, which can take a while depending on the size of the disk. A 25 GB drive took about 15 minutes. After the clone completed, I went back to Virtual Center and added a new 'hard disk' to the VM, this time selecting the newly created drive, clicked OK, and powered on the VM. We ran into one small problem: the SCSI adapter was originally set to BusLogic because the machine was a P2V. Make sure the SCSI adapter is set to LSI Logic, and you are good to go.
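For a very rough idea of how long a clone might take, you could scale from the one data point above (25 GB in about 15 minutes). This is a back-of-the-envelope sketch only: estimate_clone_minutes is a made-up helper, and it assumes clone time grows linearly with disk size, which will not hold across different storage or load.

```shell
#!/bin/sh
# Sketch only: estimate_clone_minutes is a hypothetical helper, and linear
# scaling from the single 25 GB / 15 minute data point is an assumption.

# $1 = disk size in whole GB; prints an estimated clone time in minutes.
estimate_clone_minutes() {
    echo $(( $1 * 15 / 25 ))
}

# Usage:
#   estimate_clone_minutes 100
```

Treat the number as a ballpark for planning downtime, nothing more.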
Hope this helps some people out there having problems.