Fixing VeriSign Certificates on Windows Servers

One item I’ve seen repeatedly cause issues in new Exchange or Lync environments centers on certificates from public providers such as VeriSign, DigiCert, or Entrust. These providers generally use multiple tiers of certificates, so when you purchase a certificate it is usually issued by a subordinate, or issuing, certificate authority rather than by the root certificate authority. The way SSL certificate chains work, an end client only needs to trust the topmost, or root, certificate in the chain in order to accept the server certificate as valid. But in order to properly present the full SSL chain to a client, a server must first have the correct trusted root and intermediate certificate authorities loaded. The bottom line here is that if you haven’t loaded the full certificate chain on the server, you may see clients have trouble connecting.

This becomes especially problematic with VeriSign’s latest chain. If you are using a modern Windows client such as Windows 7 or Server 2008 R2, you’ll find the VeriSign Class 3 Public Primary Certification Authority – G5 certificate which expires in 2036, thumbprint 4e b6 d5 78 49 9b 1c cf 5f 58 1e ad 56 be 3d 9b 67 44 a5 e5, installed in the Trusted Root Certification Authorities store by default. Some extra confusion is created because there is also a VeriSign Class 3 Public Primary Certification Authority – G5 certificate which expires in 2021, thumbprint 32 f3 08 82 62 2b 87 cf 88 56 c6 3d b8 73 df 08 53 b4 dd 27, installed in the Intermediate Certification Authorities store by default. The names of these certificates are identical, but they are clearly different certificates expiring on different dates.
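If you want to confirm which one is which on a given machine, certutil can dump a certificate from a store by its SHA1 thumbprint. A quick sketch, using the machine store names (Root for Trusted Root Certification Authorities, CA for Intermediate Certification Authorities) and the two thumbprints above with the spaces removed:

rem 2036 G5 in the Trusted Root Certification Authorities (machine) store
certutil -store Root 4eb6d578499b1ccf5f581ead56be3d9b6744a5e5

rem 2021 G5 in the Intermediate Certification Authorities (machine) store
certutil -store CA 32f30882622b87cf8856c63db873df0853b4dd27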

What you’ll find after purchasing a VeriSign certificate is that the CA which actually issues your server certificate, VeriSign Class 3 Secure Server CA – G3, is cross-signed by both of the G5 certificates. This means there are now two different certificate chains you could present to clients, and which one is actually presented depends on how you configure the server. The two chain options are displayed below; while one is a bit longer, both paths are valid.

image

So if a client trusts either of the G5 certificates as a trusted root, it will trust any certificate issued by a subordinate CA such as the G3. What ends up happening is that the certificate chain will look correct when a Windows 7 client or 2008 R2 server connects, because those operating systems already have the 2036 G5 CA as a trusted root. You’ll see only a 3-tier chain presented, and the connection will work just fine.

image

There’s nothing actually wrong with this if all you have are newer clients. In fact, that’s one advantage of cross-signing – that a client can leverage the shortest possible certificate chain. But any kind of downlevel client, such as Lync Phone Edition, does not trust that newer G5 CA by default. This means that when those devices try to connect to the site they are presented with the 2036 G5 certificate as the top-level root CA, and since they do not trust that root they will drop the connection. In order to support the lowest common denominator of devices the chain should actually contain 4 tiers, like in the following screenshot. Older devices typically have the VeriSign Class 3 Public Primary CA already installed as a trusted root, so you may get better compatibility this way.

image

The screenshots above are all of the same certificate; the difference is how the chain is presented. In order for a server to present the full chain, you must log on to each server hosting the certificate and open the Certificates MMC for the local computer. Locate the VeriSign Class 3 Public Primary Certification Authority – G5 certificate in the Trusted Root Certification Authorities node, right-click it, and open the Properties. Select Disable all purposes for this certificate and press OK to save your changes.

image

By disabling the incorrect trusted root certificate, the server will now present the full chain. The big ‘gotcha’ here is that you can’t easily test this. If you browse to the site from a Windows 7 client and open the Certification Path tab for the certificate, it’s still going to look the same as before. The reason is that Windows 7 also has the VeriSign Class 3 Public Primary Certification Authority – G5 certificate in the Trusted Root Certification Authorities machine node by default, and because Windows 7 trusts that as a root CA, it will trust any certificate below that point. Certificate testing tools you find on the Internet aren’t going to be much help here either, because they also already trust the 2036 G5 certificate. The only way to verify the full chain is to delete or disable that cert from the client you’re testing on. And no, this is not something you should ever attempt on multiple machines – I’m suggesting it only for testing purposes. If you’re using any kind of SSL decryption at a load balancer to insert cookies for persistence, you’ll want to make sure the load balancer admin has loaded the full chain as well.
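If you happen to have an OpenSSL build handy, another way to see exactly which chain the server hands out – independent of what your test client trusts – is to pull it straight off the wire. The host name below is a placeholder, and the certutil line is the quick way to strip the 2036 G5 root from a dedicated test box, per the warning above:

rem Show the certificate chain the server actually presents (replace the host name with your own)
openssl s_client -connect mail.contoso.com:443 -showcerts

rem On a throwaway TEST client only: remove the 2036 G5 root so the client must rely on the presented chain
certutil -delstore Root 4eb6d578499b1ccf5f581ead56be3d9b6744a5e5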

So now you’ve fixed the chain completely, but after the next Windows Update cycle you’ll probably find the G5 certificate enabled again on the server. The root certificate updates for Windows will actually re-enable this certificate for you (how kind of them!), resulting in a broken chain for older clients again. To prevent this from occurring, you can stop automatic root certificate updates from installing via Windows Update. This can be controlled through the Group Policy setting displayed here:

image
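For reference, the setting in question should be Turn off Automatic Root Certificates Update (under Computer Configuration > Administrative Templates > System > Internet Communication Management > Internet Communication settings), and it boils down to a single registry value. Assuming that is indeed the policy shown above, a local equivalent for a non-domain-joined server would look roughly like this:

rem Local equivalent of the "Turn off Automatic Root Certificates Update" policy (assumed to be the setting shown above)
reg add HKLM\SOFTWARE\Policies\Microsoft\SystemCertificates\AuthRoot /v DisableRootAutoUpdate /t REG_DWORD /d 1 /f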

Source IP Address Preference with Multiple IPs on a NIC

Something I’m finding myself doing more and more lately is using multiple IP addresses on a single NIC for a Windows server. The reasons vary, but it’s generally to support a single server running two different services on the same port. This can happen with Lync on your Edge servers (or when skirting the reverse proxy requirement on Front End servers), or with Exchange when creating multiple receive connectors on a server.

A behavior that changed with the introduction of Server 2008 is that the source IP address on a NIC will always be the lowest numerical IP. So that whole idea of your primary IP being the first one you put on the NIC? Throw it out the window.

For example, let’s say we build a new Exchange server and configure the NIC with IP 10.0.0.100. This IP is registered in DNS and the server uses this IP as the source when communicating with other servers. Our fantastic network administrator has also created a NAT rule on the firewall to map this IP to a particular public IP for outbound SMTP so that our PTR lookups match up.

But now you want to add another IP for a custom receive connector, and the network admin hands you a free IP which happens to be 10.0.0.50. You add this as an additional IP on the NIC and voilà – you have a couple of issues:

  • You just registered two names for the same server in DNS if dynamic registration is enabled.
  • Your server is now sending all outbound traffic from 10.0.0.50! (because 50 is lower than 100)

One of these is easily solved – just turn off dynamic registration and manually create the DNS records for the server. The other is a little trickier, because Server 2008 and 2008 R2 will still send traffic from the 10.0.0.50 IP. In the case of Exchange, this could create some ugliness for outgoing SMTP, because now your firewall is not NATing to the correct public IP and you start bouncing mail due to PTR lookup failures.
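For the DNS half, both the registration checkbox and the manual record can be handled from the command line if you prefer. This is only a sketch: the adapter name, DNS server address, DNS server name, zone, and host name below are placeholders, and note that the netsh line also sets the adapter to static DNS servers, so use your real DNS server address there:

rem Stop this adapter from registering its addresses in DNS (placeholder adapter name and DNS server)
netsh interface ipv4 set dnsservers name="Local Area Connection" source=static address=10.0.0.10 register=none

rem Manually create the A record on the DNS server (placeholder server, zone, and host names)
dnscmd dc01 /recordadd contoso.com exch01 A 10.0.0.100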

Fortunately, we have a way to tell Windows not to use the lower-numbered IP as a source address by adding the IP via the netsh.exe command. For Server 2008 SP2 and 2008 R2 RTM we need to apply a hotfix first; 2008 R2 SP1 includes the fix, so it is not required there. Without the hotfix or SP1 you’ll find netsh.exe does not display or recognize the special flag.

Hotfix Downloads:

The key here is that the IP address must be added via netsh.exe with a particular flag, so if you’ve already added the IP address via the GUI you’ll need to remove it first. After that, use this command to add the secondary IP:

netsh int ipv4 add address "Local Area Connection" 1.2.3.4/24 SkipAsSource=true
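Tying this back to the earlier example: if 10.0.0.50 is already on the NIC courtesy of the GUI, removing it and re-adding it would look roughly like this (the connection name and /24 mask are just examples – adjust to match your server):

rem Remove the IP that was added through the GUI (connection name is an example)
netsh int ipv4 delete address "Local Area Connection" 10.0.0.50

rem Re-add it so it is never used as a source address or registered in DNS
netsh int ipv4 add address "Local Area Connection" 10.0.0.50/24 SkipAsSource=true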

The SkipAsSource flag does two things – first, it instructs Windows not to use this IP as a source IP for outgoing traffic. Second, it prevents the registration of this IP in DNS if dynamic registration is enabled. Two birds with one stone!

You can always view the IPs and their SkipAsSource state with the following command:

netsh int ipv4 show ipaddresses level=verbose
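On a box that has the fix, the verbose output should include a per-address skip-as-source field (the exact label may vary by build). If you just want to eyeball the addresses and that field, a rough filter like this works:

rem Filter the verbose listing down to the addresses and their skip-as-source state
netsh int ipv4 show ipaddresses level=verbose | findstr /i "address skip"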

Broadcom NIC Teaming and Hyper-V on Server 2008 R2

The short of this is that if you’re trying to use NIC teaming for the virtual adapter on Server 2008 R2, save yourself the headache, pony up a few extra dollars, and buy Intel NICs. The Broadcoms have a bug in the driver that prevents this from working correctly on Server 2008 R2 Hyper-V when using a team for the Hyper-V virtual switch. Per the Broadcom driver release notes this is supposed to be a supported configuration now, but it does not work correctly. There are two scenarios so far where I’ve been able to reproduce the problem:

  • VM guest has a static MAC assigned and is running on a VM host. Shut down the VM, assign it a dynamic MAC and start it again on the same host. You’ll find it has no network connectivity.

  • VM guest is running on VM Host A with a dynamic MAC. Live Migrate the VM guest to Host B. It has network connectivity at this point, but if you restart the VM on the opposite host you’ll find it receives a new MAC and no longer has network connectivity.

Take a look at this diagram (showing only the NICs relevant to Hyper-V) and you’ll see the setup that causes the issue. We have two Broadcom NICs on Dell R710s, each connected to a different physical switch to protect against a port, NIC, or switch failure. They are teamed in an Active/Passive configuration – no load balancing or link aggregation going on here. The virtual adapter composed of the two team members is then passed through as a virtual switch to Hyper-V, and it is not shared with the host operating system. The host itself has a team for its own management and for the Live Migration network, both of which, I’ll point out, work flawlessly – the issue here is purely related to Broadcom’s teaming through a Hyper-V virtual switch.

image

Say I have a VM running on Host A where the NIC team has a hypothetical MAC called MAC A. When it boots up, it receives a dynamic MAC address we’ll call MAC C from Host A’s pool. If you try to ping the VM guest’s IP 1.1.1.1 and then look at your ARP table you’ll see something like:

Internet Address      Physical Address      Type
1.1.1.1               MAC A                 Dynamic

This is because the NIC team is responsible for answering requests on behalf of the VM. When the NIC team receives traffic for the VM’s IP it accepts it and then passes it along to the Hyper-V virtual switch. If you take a packet trace off the NIC, you’ll see the team has modified the Layer 2 destination address to MAC C, the dynamic MAC the VM got when it booted. This is how the teaming is supposed to work.
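For what it’s worth, the ARP entries above are read from whatever machine you’re pinging from, so checking and clearing them while testing is simple (1.1.1.1 being the example VM IP from above):

rem Check which MAC is currently answering for the VM's IP
arp -a | findstr 1.1.1.1

rem Clear the cached entry so the next ping forces a fresh ARP resolution
arp -d 1.1.1.1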

Now say I migrate the VM to Host B (where the NIC team has a MAC called MAC B) via Live or Quick Migration. The VM retains connectivity, and if you take a look at your ARP table you’ll now see something like:

Internet Address      Physical Address      Type
1.1.1.1               MAC B                 Dynamic

Yup, the MAC for Host B’s NIC team is now answering requests for the VM’s IP. Again, this is how the teaming is supposed to work. Everything is peachy and you might think your clustering is working out great, until you restart the VM.

image

When the VM restarts, upon booting it receives a new dynamic MAC from Host B’s pool and you’ll find it has no network connectivity. Your ARP table hasn’t changed (it shouldn’t – the same team is still responsible for the VM), but the guest has been effectively dropped. When I pulled a packet trace, what I noticed was that the team was still receiving traffic for the VM’s IP, which ruled out a switching problem, but it was still modifying the packets and sending them to MAC C – when in fact, now that the VM has restarted, it has MAC D. The problem is that somebody (the driver) forgot to notice the VM has a new MAC and is sending packets to the wrong destination, so the VM never receives any traffic.

image

I found that toggling the NIC team within the host actually fixes the problem. If you simply disable the virtual team adapter and then re-enable it, the VM will instantly get its connectivity back, so it seems that during its startup process the team reads the VM MACs it’s supposed to service. I would think this is something it should be doing constantly to prevent this exact issue, but for now it looks like it’s done only at initialization.
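Toggling the team doesn’t require a trip to the Network Connections GUI; bouncing the team’s virtual adapter from an elevated prompt does the same thing. The adapter name below is a placeholder for whatever your Broadcom team adapter is actually called:

rem Bounce the team's virtual adapter so it re-reads the VM MACs (adapter name is an example)
netsh interface set interface name="BASP Virtual Adapter" admin=disabled
netsh interface set interface name="BASP Virtual Adapter" admin=enabled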

The most practical workaround I’ve found so far is to just set static MAC addresses on the VMs within the Hyper-V settings. If the VM’s MAC never changes, this problem simply doesn’t exist. So while that defeats the purpose of the dynamic MAC pool on a Hyper-V host, it allows the teaming failover to operate properly while you restart VMs and move them between cluster nodes.

I’ve raised the issue with Dell/Broadcom and they agree it’s a driver problem. There is supposedly a driver update due mid-March, but there’s no guarantee this will be addressed in that update. The update after that isn’t slated until June, which is a long time to wait – hence the recommendation to just use Intel NICs.

Other notes for the inquisitive:

  • Disabling the team and using only a single adapter makes this work properly.
  • Happens with or without the TOE, checksum offload, and RSS features enabled.
  • No VLAN tagging in use.
  • Issue persists when team members are plugged into the same switch.
  • Latest drivers from Dell/Broadcom (12/15/2009) as of this writing.
  • Happens whether teaming is configured before or after Hyper-V role is installed.