Funny

  • Gamers are weird

    Many years ago I worked for a small Mom-and-Pop type ISP in New York state (I was the only network / technical person there) — it was a very freewheeling place and I built the network by doing whatever made sense at the time.

    One of my “favorite” customers (Joe somebody) was somehow related to the owner of the ISP and was a gamer. This was back in the day when the gaming magazines would give you useful tips like “Type ‘tracert $gameserver’ and make sure that there are less than N hops”. Joe would call up tech support, me, the owner, etc. and complain that there were N+3 hops and most of them were in our network. I spent much time explaining things about packet loss, latency, etc. but couldn’t shake his belief that hop count was the only metric that mattered.

    Finally, one night he called me at home well after midnight (no, I didn’t give him my home phone number, he looked me up in the phonebook!) to complain that his gaming was suffering because it was “too many hops to get out of your network”. I finally snapped and built a static GRE tunnel from the RAS box that he connected to, snaking all over the network — it was a thing of beauty: it went through almost every device that we owned and took the most convoluted path I could come up with. “Yay!”, I figured, “now I can demonstrate that latency is more important than hop count”, and I went to bed.
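    For the curious, such a tunnel would have looked roughly like this. This is a minimal, hypothetical IOS-style sketch (the interface names and addresses are invented, and the real path was chained through many more devices via static routes):

    ```
    ! On the RAS box: a GRE tunnel whose far end is reached via a
    ! deliberately convoluted chain of static routes (hypothetical addresses)
    interface Tunnel0
     ip address 192.168.100.1 255.255.255.252
     tunnel source Serial0
     tunnel destination 10.99.9.1
    !
    ! Policy routing so that only Joe's traffic is forced into the tunnel
    access-list 100 permit ip host 10.0.1.42 any
    route-map JOE-TUNNEL permit 10
     match ip address 100
     set interface Tunnel0
    ```

    (Pinning one user to a fixed IP via RADIUS is what makes the access-list able to match him at all.)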

    The next morning I get a call from him. He is ecstatic and wildly impressed by how well the network is working for him now and how great his gaming performance is. “Oh well”, I think, “at least he is happy and will leave me alone now”. I don’t document the purpose of this GRE anywhere and after some time forget about it.

    A few months later I am doing some routine cleanup work and stumble across a weird-looking tunnel — it’s bizarre, it goes all over the place and is all kinds of crufty — there are static routes and policy routing and bizarre things being done on the RADIUS server to make sure some user always gets a certain IP… I look in my pile of notes and old configs and then decide to just yank it out.

    That night I get an enraged call (at home again) from Joe *screaming* that the network is all broken again because it is now way too many hops to get out of the network and that people keep shooting him…

    What I learnt from this:

    1: Make sure you document everything (and no, the network isn’t documentation)
    2: Gamers are weird.
    3: Making changes to your network in anger provides short term pleasure but long term pain.

  • What happens when a Foundry loses its mind

    At a previous company we had a large number of Foundry Networks layer-3 switches. They participated in our OSPF network and had a really annoying bug. Every now and then one of them would get somewhat confused and would corrupt its OSPF database (there seemed to be some pointer that would end up off by one). It would then cleverly realize that its LSDB was different to everyone else’s and so would flood this corrupt database to all other OSPF speakers. Some vendors would do a better job of sanity checking the LSAs and would ignore the bad LSAs, other vendors would install them — now you have different link state databases on different devices and OSPF becomes unhappy.

    Nov 24 22:23:53.633 EST: %OSPF-4-BADLSAMASK: Bad LSA mask: Type 5, LSID 0.9.32.5 
    Mask 10.160.8.0 from 10.178.255.252 
    NOTE: This route will not be installed in the routing table.
    Nov 26 11:01:32.997 EST: %OSPF-4-BADLSAMASK: Bad LSA mask: Type 5, LSID 0.4.2.3
    Mask 10.2.153.0 from 10.178.255.252 
    NOTE: This route will not be installed in the routing table.
    Nov 27 23:14:00.660 EST: %OSPF-4-BADLSAMASK: Bad LSA mask: Type 5, LSID 0.4.2.3
    Mask 10.2.153.0 from 10.178.255.252 
    NOTE: This route will not be installed in the routing table.

    If you look at the output, you can see that there is some garbage in the LSID field and the bit that should be there is now in the Mask section. I also saw a more extreme version of the same bug; in my favorite example the mask was 115.104.111.119 and further down there was 105.110.116.114 — if you take these as decimal numbers and look up their ASCII values you get “show” and “inte” — I wrote a tool to scrape bits from these errors and ended up with a large amount of the CLI help text.
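    The decoding trick is easy to reproduce. Here is a small Python sketch of both halves: interpreting the dotted-decimal octets as ASCII, and the kind of contiguity sanity check that the stricter vendors presumably applied before installing an LSA (the check shown is my own illustration, not any vendor’s actual code):

    ```python
    def dotted_to_ascii(dotted):
        """Interpret each dotted-decimal octet as an ASCII character code."""
        return "".join(chr(int(octet)) for octet in dotted.split("."))

    def is_valid_mask(dotted):
        """A real netmask is a contiguous run of 1-bits followed by 0-bits."""
        value = 0
        for octet in dotted.split("."):
            value = (value << 8) | int(octet)
        # Inverting a contiguous mask gives 2^n - 1 for some n,
        # so (inverted + 1) must be a power of two.
        inverted = value ^ 0xFFFFFFFF
        return (inverted & (inverted + 1)) == 0

    print(dotted_to_ascii("115.104.111.119"))  # → show
    print(is_valid_mask("255.255.248.0"))      # → True
    print(is_valid_mask("10.160.8.0"))         # → False (the corrupt mask above)
    ```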

  • The day the circuit died….

    So, I am working at this dotcom in New York City. They have some offices on the seventh floor of the building and more offices on the 19th floor — and a really slow elevator between them. The actual servers, database machines, etc. all lived in a datacenter a few miles away and we had a single T1 from the offices to the datacenter (for management).

    The DBA folks have some huge maintenance planned — I’m not a database guy, but it sounded all impressive and they had been planning it for weeks. They had this big plan that involved failing over from one database server to the other, making some changes and then failing back — there would be an outage, but it would be less than 10 seconds and the backends would just queue requests for that period. 

    Anyway, they start the maintenance and get to the critical 10-second bit… and suddenly their SSH sessions stop responding. After the initial panic that somehow unmounting the database had made the machine die, we figure out that the network from the office to the database is down. I run off and log into the router — the serial interface to the T1 (which had been up and stable for more than a year) is now showing down/down. I hit our website (Internet access goes through a different, larger circuit) and the site is now showing “We’re sorry, $COMPANY is down for maintenance” — the backends have decided that they cannot reach the database for too long and are failing healthchecks — not good.

    I start calling the carrier and shout to the guy I work with: “Joe, run up to the demarc on the 19th floor and check the lights on the SmartJack” — Joe scurries off and runs up 12 flights of stairs… A few minutes later he rushes into the room with sweat dripping off his forehead. “What did the lights show?!”, I ask. Apparently he has forgotten the combination to the push-button lock on the door — I shout out “1-3, 2, 5” and Joe runs off again — up and down 12 flights of stairs… He comes panting in and says the combination doesn’t work. I swear quietly and explain that you need to press 1 and 3 together, then 2, then 5. Joe runs off again. A few minutes later and the circuit is still down… Joe comes stumbling back in looking like he is about to pass out, but also looking oddly sheepish.

     

    “What happened? What did the lights show?”

    “Well, I run up all the stairs…”

    “Yes?!”

    “And then type in number on the door….”

    “Yes — what did the lights show?!” (I am starting to lose my cool about now)

    “And I find the demarc and I find the shelf and I look at the lights…”

    “JOE! What did the lights show?!!!!”

    “and then I remember — I’m color-blind… I can’t tell if the lights are red or green”.

     

  • Lockable Cages

    So I am working at this dotcom in New York — our main datacenter is a few subway stops away from the office. Our cage in the datacenter is one of those standard cages made out of the really thick steel mesh type stuff, with a sliding door.

     

    Bob (not his real name) goes off to install some new gear in the datacenter and I take the subway over to Grand Central to catch the train home. I’m sitting on the train, about to pull out of the station, when I get this frantic call from Bob — it went something like this:

    “Warren! You’ve got to help me!”

    “Ok, calm down — what’s wrong?!” (I’m picturing flames shooting out of routers, etc. at this point)

    “I’m STUCK!”

    “Huh? What?”

    “So, I got someone to open the cage for me…”

    “Ok..”

    “And then I had to move the router over, so I slid the door shut…”

    “Ok…”

    “and now the lock is jammed! I’M STUCK!”

    At this point I lose it and start giggling maniacally. I explain that he just needs to turn the knob on the inside of the door, but he says it is jammed. I suggest climbing over the top of the cage wall — apparently it’s too high and Bob is sounding fairly freaked by this point. I suggest lifting a floor tile or shouting, but he is no longer listening to me, so I tell him I’ll call the datacenter and get someone to come rescue him. I hang up with him… as the train pulls out of the Grand Central station and is underground for about half an hour — with no cell phone reception.

     

    Eventually I get cell phone reception back and call the datacenter owner — “Ok, once you stop laughing, could you please go let Bob out of cage 19-314? Apparently he has locked himself in….”

     

  • Switch with uptime > 6 years

    This was posted on NANOG by someone (I cannot remember who).

    ----------------------- 
    c2948g-4.sc5> sh ver
    WS-C2948 Software, Version NmpSW: 5.5(2)
    Copyright (c) 1995-2000 by Cisco Systems, Inc.
    NMP S/W compiled on Jul 28 2000, 17:21:27
    GSP S/W compiled on Jul 28 2000, 15:57:45

    System Bootstrap Version: 4.4(1)

    Hardware Version: 2.3 Model: WS-C2948 Serial #: JAB041808VK

    Mod Port Model     Serial #             Versions
    --- ---- --------- -------------------- ---------------------------------
    1   0    WS-X2948  JAB041808VK          Hw : 2.3
                                            Gsp: 5.5(2.0)
                                            Nmp: 5.5(2)
    2   50   WS-C2948G JAB041808VK          Hw : 2.3

           DRAM                    FLASH                   NVRAM
    Module Total   Used    Free    Total   Used    Free    Total Used  Free
    ------ ------- ------- ------- ------- ------- ------- ----- ----- -----
    1      65536K  34318K  31218K  12288K  8583K   3705K   480K  93K   387K

    Global checksum failed.

    Uptime is 2333 days, 13 hours, 33 minutes
    ----------------------

    That’s 6 years, 142 days.

    Global checksum failed indeed.
  • The NANOG Meta-argument

    This was posted to the NANOG mailing list sometime in 2004 by Alex Bligh — it remains true…

    This argument (at least on NANOG) seems to be characterized by the following:

    1. A suggests X, where X is a member of S, being a set of largely well known
       solutions.

    2. B1 … Bn, where n>>1 says X is without value as X does not solve
       the entire problem, each using a different definition of “problem”.

    3. C1 … Cn, where n>>1 says X violates a “fundamental principle of
       the internet” (in general without quoting chapter & verse as to
       its definition, or noting that for its entire history, fundamental
       principles, such as they exist, have often been in conflict, for
       instance “end-to-end connectivity”, and “taking responsibility for
       ones own network” in the context of (for instance) packets sourced
       from 127.0.0.1 etc.)

    4. D1 .. Dn, where n>>1 says X will put an enormous burden on some
       network operators and/or inconvenience users (normally without
       reference to the burden/inconvenience from the problem itself,
       albeit asymmetrically distributed, and normally without reference
       to the extent or otherwise that similar problems have been
       solved in a pragmatic manner before – viz route filtering, bogon
       filtering etc.)

    5. E1 .. En, where n>>1 insert irrelevant and ill-argued invective
       thus obscuring any new points in 1..4 above.

    6. Goto 1.

     

  • UPSes have BIG batteries…

    I’m still working at the place mentioned earlier — I was only there for 3 months (actually one day less than 3 months; I know this because the recruiter only got his commission if I stayed for at least three months — if I’d known this I would have stuck it out for another few days), but I have more “funny” stories from this place than any other. Anyway, on to the story:

    One of the server rooms becomes unusable and needs to be rebuilt[0], so everything needs to be migrated out of the existing room and into new space — this includes a large APC Symmetra UPS. We shut down the UPS and pull all of the batteries out of both it and the expansion shelves so that we can move it with a pallet lift. We move everything into the new space and it’s time to put the UPS back together. I quickly decide that lifting large numbers of heavy batteries into the shelves is not fun, so I show the random helper dude what to do… “You pick up this big, heavy thing and put it into this cubbyhole type spot, then you connect this large connector and slide the battery back; lather, rinse, repeat…”.

    I watch him do the first one and he seems to have it figured out… I wander off to go hook up some fiber or something and peer down the corridor every now and then to make sure he still has this under control. Surprisingly enough he is managing ok and hasn’t wandered off to take a nap or anything. He gets down to the last few batteries and seems to be having some issues, but I figure he’ll work it out, so I carry on with what I am doing… I peer down the corridor again and he is sitting on the floor with his back braced against something, pushing the battery into place with his feet… “Whoa, this can’t be good”, I think, just as there is a LARGE bang, a big flash and much smoke and fire….

    Turns out that for the last battery he managed to get the cables caught between the side of the battery and the side of the (sheet-metal) case. When it didn’t just slide easily back, he pushed it really hard and the edge of the case chomped through the cable, creating a dead short — this literally vaporized a crescent of metal from the case around 5 inches in radius, flung bits of molten case and battery leads all over the place and ignited the cardboard that we had put on the pallet to soften it…

    Much hilarity ensues…

     

    [0]: Have you ever noticed that places that use gas fire suppression systems either have doors that open outwards and / or big dampers (like http://www.c-sgroup.com/product_home.php?section=explovent&page=3) ? Ever wonder why? 🙂

  • Hiring cheap electricians is a bad idea…

    So I’m working at this place that is really cheap… Our CTO believes that it is stupid to pay for electricians that have experience working in datacenters, because after all, power is power, right?

    So, he calls a bunch of people in the Yellow Pages and hires the cheapest guy he can find. Said person arrives and looks a little goggle-eyed at all the power stuff — I wander back in a few hours later and he is sitting in the middle of the floor reading the User’s Manual for the UPS…

    Anyway, he manages to run the three new circuits for us without killing himself (although for some reason he keeps switching the UPS between online and bypass) and then starts walking out the door… He stops at the door, looks at the big red glowing switch marked “Emergency Power Off” — and then pushes it… Everything goes quiet, apart from Rob, who got startled and dropped the shelf he was mounting onto his foot.

    After we got things turned back on we ask the electrician what exactly he was thinking… “Well, I figured the light was on because you were running on Emergency Power…”

  • Proximity card access systems

    Another posting from ComWest. The CTO is the sort of person who reads “Information Week” and similar magazines and, every month, has to implement whatever the featured article is about — what I should have done was sign up for an expedited subscription and then, before he received his copy, start suggesting we do whatever that month’s topic was…

     

    Anyway, one month they had some article on the advantages of using a badge reader system, so, of course, we suddenly had to have that. The CTO calls up a few places and is staggered by the cost — so, once again, he finds the cheapest possible solution and buys it without doing any sort of research.

    The next few days are filled with some contractor running wires through the walls, accidentally setting off one of the sprinkler heads with a snake, etc. Eventually they finish up and install the machine that runs the whole system — it’s a Windows 95 box that they put in one of the server rooms. They show someone how to enrol badges and then leave — all goes well until the CTO decides that we need a generator, so he hires “Generators-R-Us” or someone to install it… Of course they manage to drop power to the building for the better part of a day while trying to install it… The UPSs all run out of battery and the machines go down. They get power back and we go to help the servers come back up…

     

    There’s a little green light on the badge reader on the server room door, but nothing happens when I try to badge in… I go find the CTO and ask him for the keys to the lock on the server room door — for “security” reasons he has put both the physical keys and the master override badge in the key-safe… in the server room…

    It’s the day before the end of the quarter and the finance folks are hopping up and down about closing the books or something equally bizarre, so we get a hammer from the janitor and break a hole through the wall…

     

    The Windows 95 machine is sitting on a black screen complaining that Floppy Drive A is not working and asking me to press F1 to continue…