Not long ago I noticed a 10G network interface flapping on one of the vSAN nodes.
I started investigating straight away. Interestingly, the port belonged to the server’s integrated NIC, and it was the only port generating connection errors.
After consulting with the network team, we found the following:
- The interface was up,
- The switch port was passing traffic in one direction only (the switch could send to the server but received nothing from it),
- No errors were recorded,
- SFP signals were good.
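
The combination above (link up, no errors, traffic in one direction only) is easy to miss when you only look at link state. Below is a minimal sketch of how two counter snapshots could be compared to spot it. The `PortCounters` fields and the `diagnose` function are hypothetical names used for illustration; the numbers would have to be copied from whatever your switch or host reports (on ESXi, for example, from `esxcli network nic stats get -n vmnicX`).

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PortCounters:
    """One snapshot of a port's counters (hypothetical field names)."""
    rx_packets: int
    tx_packets: int
    rx_errors: int
    tx_errors: int


def diagnose(link_up: bool, before: PortCounters, after: PortCounters) -> str:
    """Classify a port by comparing two counter snapshots taken some time apart."""
    if not link_up:
        return "link down"

    rx = after.rx_packets - before.rx_packets
    tx = after.tx_packets - before.tx_packets
    errors = (after.rx_errors - before.rx_errors) + (after.tx_errors - before.tx_errors)

    if errors > 0:
        return "errors increasing"
    if tx > 0 and rx == 0:
        return "one-way: sending but not receiving"
    if rx > 0 and tx == 0:
        return "one-way: receiving but not sending"
    if rx == 0 and tx == 0:
        return "idle"
    return "traffic in both directions"


if __name__ == "__main__":
    # Made-up numbers as seen from the switch port: TX grows, RX stays flat.
    first = PortCounters(rx_packets=1000, tx_packets=5000, rx_errors=0, tx_errors=0)
    second = PortCounters(rx_packets=1000, tx_packets=5600, rx_errors=0, tx_errors=0)
    print(diagnose(True, first, second))  # prints "one-way: sending but not receiving"
```

The error check comes first on purpose: a port that is logging errors is a different problem from a clean port that simply hears nothing back, which is what we had here.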
The plan of attack was to replace the SFPs: first on the server and, if that didn’t help, on the switch. During this operation we found that the SFP on the server side was unusually warm. Unfortunately, replacing the SFPs didn’t help, and after approximately 15 minutes of complete silence the disconnects resumed.
The next move was to contact the vendor. In our case, it was Dell EMC.
We lodged a support request and sent them the SupportAssist Collection. Support’s response was to replace the embedded NIC with a new one. Considering the server was in use, it all sounded tricky to me.
However, thanks to the algorithm that ESXi has used since version 5.5 to assign device names to I/O devices, it all went smoothly. VMware states the following:
The number of ports on the embedded NIC didn’t change. As a result, the hypervisor assigned the same aliases to the onboard ports.
ESXi initialised the new ports, and the vSAN configuration was updated successfully without any human intervention.
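
If you want to confirm this for yourself after a similar replacement, one option is to note which alias sits at which PCI address before and after the swap and compare the two. The sketch below assumes you have written both snapshots down by hand (for example from `esxcli network nic list`); the `alias_changes` helper and the sample addresses are made up for illustration.

```python
from typing import Dict, Optional, Tuple

# Maps an alias such as "vmnic0" to a (old, new) pair of PCI addresses.
Change = Tuple[Optional[str], Optional[str]]


def alias_changes(before: Dict[str, str], after: Dict[str, str]) -> Dict[str, Change]:
    """Return every alias whose PCI address differs between two snapshots.

    An alias missing from one snapshot is reported with None on that side.
    An empty result means the aliases survived the hardware swap unchanged.
    """
    changes: Dict[str, Change] = {}
    for alias in sorted(set(before) | set(after)):
        old = before.get(alias)
        new = after.get(alias)
        if old != new:
            changes[alias] = (old, new)
    return changes


if __name__ == "__main__":
    # Made-up alias -> PCI address snapshots taken before and after the swap.
    before = {"vmnic0": "0000:01:00.0", "vmnic1": "0000:01:00.1"}
    after = {"vmnic0": "0000:01:00.0", "vmnic1": "0000:01:00.1"}
    print(alias_changes(before, after))  # prints {}
```

The comparison deliberately ignores MAC addresses, because those are expected to change when the card is replaced.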
As a bonus, when the server was booting after the card replacement, Lifecycle Controller detected an older version of firmware on the device and initiated a firmware update operation automatically.
All in all, I am impressed by how robust the modern platforms from both Dell EMC and VMware are.