After remediating a cluster against a newly created baseline, one of the ESXi hosts really took its time entering maintenance mode. In the tasks viewer the progress bar looked frozen at a certain percentage, but the task was actually progressing, just extremely slowly (see network monitor screenshot).
The host maintenance task eventually ended with a timeout error because of how long it took to vMotion the VMs off the server.
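To get a feel for why a slow uplink ends in a timeout, here is a rough back-of-the-envelope sketch. The function name and any numbers you feed it are hypothetical, and it ignores memory pages that have to be copied again during a live migration, so treat the result as a lower bound rather than a prediction.

```python
def transfer_seconds(memory_gib, throughput_mbit_s):
    """Lower-bound time to copy `memory_gib` of VM memory at a given link rate.

    Assumption: a single pass over memory, no re-copy of dirtied pages,
    no protocol overhead.
    """
    bits_to_move = memory_gib * 1024**3 * 8           # GiB -> bits
    bits_per_second = throughput_mbit_s * 1_000_000   # Mbit/s -> bit/s
    return bits_to_move / bits_per_second


def evacuation_seconds(vm_memory_gib, throughput_mbit_s):
    """Sum the per-VM estimates, assuming the VMs are migrated one after another."""
    return sum(transfer_seconds(gib, throughput_mbit_s) for gib in vm_memory_gib)
```

Calling `evacuation_seconds` with the memory sizes of the VMs on the host and the throughput you see in the network graph gives a quick idea of whether the evacuation can finish before the task times out.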
Okay. Troubleshooting time.
Looking at the network graph for the vmnics concerned, the slow performance is clearly visible, especially on vmnic5.
You can also check this in esxtop: SSH into the ESXi host, run esxtop, and press ‘n’ (network) to see the network activity of each vmnic.
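If you would rather capture the numbers than watch them live, esxtop also has a batch mode (for example `esxtop -b -n 1 > /tmp/esxtop.csv`) that writes its counters as CSV. The sketch below pulls one counter per vmnic out of such a file. The column naming it expects (`\\host\Network Port(switch:portid:vmnicN)\MBits Transmitted/sec`) is an assumption, so check it against the header of your own capture; the function name is made up for this example.

```python
import csv
import io
import re

# Assumed column naming in esxtop batch output; verify against your own CSV header.
PORT_COLUMN = re.compile(r"\\Network Port\((?P<port>[^)]*)\)\\(?P<counter>.+)$")


def vmnic_counters(csv_text, counter="MBits Transmitted/sec"):
    """Return {vmnic: value} for `counter`, read from the first sample row."""
    reader = csv.reader(io.StringIO(csv_text))
    header = next(reader)
    first_sample = next(reader)
    result = {}
    for name, value in zip(header, first_sample):
        match = PORT_COLUMN.search(name)
        if not match or match.group("counter") != counter:
            continue
        # Port label assumed to end with the client name, e.g. "vSwitch1:33554436:vmnic5".
        client = match.group("port").split(":")[-1]
        if client.startswith("vmnic"):
            result[client] = float(value)
    return result
```

Copy the CSV off the host, read it with `open(path, newline="").read()`, and pass the text to `vmnic_counters` to get a dictionary keyed by vmnic name.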
So what could cause the slow performance on this vmnic? We keep our hosts up to date, patching at least once a month. None of the other ESXi hosts in the cluster have this problem. A network issue was also ruled out after a session with our network colleagues.
Could it be a hardware issue?
This is the network configuration for the vMotion stack:
Uplink 1 > vmnic4
Uplink 2 > vmnic5
After excluding vmnic5 from the configuration, the vMotion speed over vmnic4 skyrocketed. See the network graph below.
After excluding vmnic5
Values in esxtop after vmnic5 exclusion
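The reasoning here is a simple comparison: two uplinks on the same host carry the same kind of traffic, so if one of them is far slower than the other, that uplink is the suspect. A minimal sketch of that check, with a hypothetical function name and an arbitrary 50% threshold that you would tune to your own environment:

```python
def degraded_uplinks(throughput_mbit_s, ratio=0.5):
    """Return the uplinks running below `ratio` times the fastest uplink.

    `throughput_mbit_s` maps an uplink name to its measured rate under load.
    The threshold is an assumption for illustration, not a VMware guideline.
    """
    if not throughput_mbit_s:
        return []
    best = max(throughput_mbit_s.values())
    return sorted(
        nic for nic, rate in throughput_mbit_s.items() if rate < ratio * best
    )
```

Feed it the per-vmnic rates measured while a vMotion is running. The comparison only means something when both uplinks are actually under load at the same time.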
It is safe to say the cause of the slow performance is hardware related.
After replacing the faulty CNA module, the performance issue was solved.