Saturday morning, $CABLEMONOPOLY replaces the cable modem at Customer Site with a new unit so they can upgrade from 18/2 to 100/10 Mbps service.
Our remote management and monitoring (RMM) system uses HTTP or HTTPS requests to check in and communicate. Servers check in every 30 seconds, workstations every 5 minutes.
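For illustration, the check-in loop works roughly like the sketch below. This is a minimal sketch, not our RMM vendor's actual agent; the endpoint URL, payload, and intervals are hypothetical placeholders.

    import time
    import urllib.request

    # Minimal sketch of an agent-style HTTPS check-in loop. The URL, payload,
    # and timeout are hypothetical placeholders, not the real RMM internals.
    CHECKIN_URL = "https://rmm.example.com/checkin"
    SERVER_INTERVAL = 30  # seconds; workstations would use 300 instead

    def check_in() -> bool:
        """Send one check-in; the console flags a host 'down' after enough misses."""
        req = urllib.request.Request(CHECKIN_URL, data=b"host=server01", method="POST")
        with urllib.request.urlopen(req, timeout=10) as resp:
            return resp.status == 200

    while True:
        try:
            check_in()
        except OSError:
            pass  # a run of consecutive misses is what produces the "server down" alert
        time.sleep(SERVER_INTERVAL)

The point being: if those little HTTPS hits don't make it through for a few minutes, the console screams "server down" even though nothing is wrong with the server itself.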
Starting about 3 hours after the installer leaves, we start getting server down notifications, each followed by a server up between 4 and 6 minutes later. The servers impacted are all at this site, but there's no rhyme or reason to which ones; it's a mix of physical and virtual machines. There are 12 servers at the site, and no more than 2 of them report down at the same time during any of this.
End users call to bitch about websites timing out. For testing purposes, I temporarily disable the web proxy on the perimeter UTM, with no change. So it's not that.
I finally manage to get a Wireshark capture from both our RMM server and an impacted machine, covering normal check-ins, the badness, and the resumption of normal check-ins. Guess what I see?
On our server, I see several minutes of no packets, followed by what Wireshark flags as a TCP retransmission and a bunch of TCP duplicate ACKs.
On the customer side, I see ACKs suddenly stop, and then a bunch of retransmits.
Pretty clear cut, right?
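If you want to reproduce that without eyeballing the capture, Wireshark's display filters tcp.analysis.retransmission and tcp.analysis.duplicate_ack will pull those frames out directly. The snippet below is just a scripted version of the same idea, assuming scapy is available and using a hypothetical capture filename; it uses a crude repeated-sequence-number heuristic rather than Wireshark's full TCP analysis.

    from collections import defaultdict
    from scapy.all import rdpcap, IP, TCP

    # Crude retransmission check: any data-bearing segment whose 4-tuple and
    # sequence number repeat is a likely retransmit. Capture name is a placeholder.
    packets = rdpcap("rmm_server_side.pcap")
    seen = defaultdict(int)

    for pkt in packets:
        if IP in pkt and TCP in pkt and len(pkt[TCP].payload) > 0:
            key = (pkt[IP].src, pkt[IP].dst, pkt[TCP].sport, pkt[TCP].dport, pkt[TCP].seq)
            seen[key] += 1
            if seen[key] > 1:
                print(f"likely retransmit: {key[0]}:{key[2]} -> {key[1]}:{key[3]} seq={key[4]}")

Either way, both ends are screaming retransmits at each other while something in the middle eats the traffic.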

I would like to point out that this is the same $CABLEMONOPOLY that we had to drag in front of the state's Public Service Commission before they would admit they had fucked up an ACL on a CMTS somewhere and blocked TCP/UDP port 5060 for about 1,000 of their customers, let alone fix it.
I'm either going to need a new liver later, or bail money. I know where their support call center is, and I've now been on hold for 15 minutes waiting for someone at the NOC, assuming the aforementioned mouthbreather is in fact escalating me as requested.
FML