Saturday morning, $CABLEMONOPOLY replaces the cable modem at Customer Site with a new unit so they can upgrade from 18/2 to 100/10 Mbps service.
Our remote management and monitoring (RMM) system uses HTTP or HTTPS requests to check in and communicate. Servers check in every 30 seconds, workstations every 5 minutes.
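For illustration, the check-in loop works roughly like the sketch below. This is a minimal sketch, not our RMM vendor's actual agent; the endpoint URL, payload, and intervals are hypothetical placeholders.

    import time
    import urllib.request

    # Minimal sketch of an agent-style HTTPS check-in loop. The URL, payload,
    # and timeout are hypothetical placeholders, not the real RMM internals.
    CHECKIN_URL = "https://rmm.example.com/checkin"
    SERVER_INTERVAL = 30  # seconds; workstations would use 300 instead

    def check_in() -> bool:
        """Send one check-in; the console flags a host 'down' after enough misses."""
        req = urllib.request.Request(CHECKIN_URL, data=b"host=server01", method="POST")
        with urllib.request.urlopen(req, timeout=10) as resp:
            return resp.status == 200

    while True:
        try:
            check_in()
        except OSError:
            pass  # a run of consecutive misses is what produces the "server down" alert
        time.sleep(SERVER_INTERVAL)

The point being: if those little HTTPS hits don't make it through for a few minutes, the console screams "server down" even though nothing is wrong with the server itself.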
Starting about 3 hours after the installer leaves, we start getting server down notifications, each followed by a server up between 4 and 6 minutes later. The servers impacted are all at this site, but there's no rhyme or reason to which ones; it's a mix of physical and virtual machines. There are 12 servers at the site, and no more than 2 of them report down at the same time during any of this.
End users call to bitch about websites timing out. For testing purposes, I temporarily disable the web proxy on the perimeter UTM, with no change. So it's not that.
I finally manage to get a Wireshark capture from both our RMM server and an impacted machine, covering normal check-ins, the badness, and the resumption of normal check-ins. Guess what I see?
On our server, I see several minutes of no packets, followed by what Wireshark flags as a TCP retransmission and a bunch of TCP duplicate ACKs.
On the customer side, I see ACKs suddenly stop, and then a bunch of retransmits.
Pretty clear cut, right?
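If you want to reproduce that without eyeballing the capture, Wireshark's display filters tcp.analysis.retransmission and tcp.analysis.duplicate_ack will pull those frames out directly. The snippet below is just a scripted version of the same idea, assuming scapy is available and using a hypothetical capture filename; it uses a crude repeated-sequence-number heuristic rather than Wireshark's full TCP analysis.

    from collections import defaultdict
    from scapy.all import rdpcap, IP, TCP

    # Crude retransmission check: any data-bearing segment whose 4-tuple and
    # sequence number repeat is a likely retransmit. Capture name is a placeholder.
    packets = rdpcap("rmm_server_side.pcap")
    seen = defaultdict(int)

    for pkt in packets:
        if IP in pkt and TCP in pkt and len(pkt[TCP].payload) > 0:
            key = (pkt[IP].src, pkt[IP].dst, pkt[TCP].sport, pkt[TCP].dport, pkt[TCP].seq)
            seen[key] += 1
            if seen[key] > 1:
                print(f"likely retransmit: {key[0]}:{key[2]} -> {key[1]}:{key[3]} seq={key[4]}")

Either way, both ends are screaming retransmits at each other while something in the middle eats the traffic.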

I would like to point out that this is the same $CABLEMONOPOLY that we had to drag in front of the state's Public Service Commission before they would admit they had fucked up an ACL on a CMTS somewhere and blocked TCP/UDP port 5060 for about 1,000 of their customers, let alone fix it.
I'm either going to need a new liver later, or bail money. I know where their support call center is, and I've now been on hold for 15 minutes waiting for someone at the NOC, assuming the aforementioned mouthbreather is in fact escalating me as requested.
FML