Along with @maciejwolczyk we’ve been training a neural network that learns how to play NetHack, an old roguelike game that looks like the one in the screenshot. Recently, something unexpected happened.
A “fun” one I ran into was all our tests passing on my desk but failing in the test farm.
After a month, we realized that having an HDMI cable plugged into the unit was corrupting the SD card due to a memory overwrite in the graphics stack.
The weirdest bug… is that there was no bug. We just didn’t know how the game worked.
Story of my entire programming life.
Their problem:
So apparently NetHack has a mechanic that slightly changes how the game plays whenever there is a full moon according to your system clock.
The model wasn’t trained on a full moon. They had a system to set up the environment for replicable results but it didn’t include modifying the system time.
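For anyone who wants to guard an evaluation run against this, here is a rough Python sketch. The moon-phase arithmetic is my recollection of the `phase_of_the_moon` routine in NetHack's source, ported by hand, so treat it as an approximation and check it against the game's code before relying on it. The names `is_friday_13th` and `date_sensitive_reasons` are made up for illustration, and the assumption that only the new moon, the full moon, and Friday the 13th matter is mine.

```python
from datetime import date


def phase_of_the_moon(day: date) -> int:
    """Return a moon phase from 0 to 7, where 0 is new and 4 is full.

    Hand port of the calculation as I remember it from NetHack; unverified.
    """
    day_of_year = day.timetuple().tm_yday - 1  # C's tm_yday is 0-based
    golden = (day.year - 1900) % 19 + 1        # C's tm_year counts from 1900
    epact = (11 * golden + 18) % 30
    if (epact == 25 and golden > 11) or epact == 24:
        epact += 1
    return ((((day_of_year + epact) * 6 + 11) % 177) // 22) & 7


def is_friday_13th(day: date) -> bool:
    # date.weekday() returns 4 for Friday
    return day.day == 13 and day.weekday() == 4


def date_sensitive_reasons(day: date) -> list:
    """List the (assumed) reasons this date could change gameplay."""
    reasons = []
    phase = phase_of_the_moon(day)
    if phase == 4:
        reasons.append("full moon")
    elif phase == 0:
        reasons.append("new moon")
    if is_friday_13th(day):
        reasons.append("Friday the 13th")
    return reasons


if __name__ == "__main__":
    print(date_sensitive_reasons(date.today()))
```

A benchmark harness could log this list next to each score, or refuse to compare runs whose lists differ. Pinning the clock the game sees would be the more thorough fix, but how to do that depends on the setup.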
It reminds me of another bug with the system time, which a friend of mine encountered. He was working on hardware and he was getting a lot of units that worked fine at the factory, immediately failed at the client’s location, and then worked again when they were returned to the factory. It turned out that when these machines were turned on, their embedded OS automatically queried some server to update the current time. The client’s internet connection had such high latency that the server’s response only came back after the machine was already in use. This generated a huge delta-t value that triggered the sanity checks and shut the machine down. The factory had a much lower-latency connection and so the race condition could never be replicated there.
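I don't know how that firmware computed its delta-t, but the usual defence against this class of bug is to measure elapsed time with a monotonic clock, which a time sync cannot move, instead of the wall clock. A minimal Python sketch of the idea follows; `ElapsedTimer`, `delta_is_sane`, and the one-second threshold are all hypothetical.

```python
import time


class ElapsedTimer:
    """Measure time between ticks using an injectable clock.

    The default, time.monotonic, never jumps when the system time is
    adjusted, so a late time-server response cannot inflate the delta.
    """

    def __init__(self, clock=time.monotonic):
        self._clock = clock
        self._last = clock()

    def tick(self) -> float:
        """Return the seconds elapsed since the previous tick (or creation)."""
        now = self._clock()
        delta = now - self._last
        self._last = now
        return delta


def delta_is_sane(delta_t: float, max_delta: float = 1.0) -> bool:
    """Hypothetical sanity check: reject negative or implausibly large steps."""
    return 0.0 <= delta_t <= max_delta


if __name__ == "__main__":
    timer = ElapsedTimer()
    time.sleep(0.05)
    print(delta_is_sane(timer.tick()))
```

If the timer were built on `time.time` instead, a clock correction arriving between two ticks would show up as one huge delta, which is the failure described above.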
As for the weirdest bug I ever encountered myself: a compiler generating bad machine code. I have often said that the worst part of programming is that the computer always does exactly what you tell it to, but that was the one and only time in twenty years that the computer actually didn’t.
That reminds me of “The case of the 500-mile email”.
Warning: this is secretly a Nethack thread!
So, the model was playing on average 2,000 points worse because the player was luckier? The thing about werewolves and dogs is a factor, but it is statistically insignificant.
Nethack has a couple of other gotchas like this. They should be grateful they weren’t playing on Friday the 13th…
Java was giving a no such method exception at runtime, but it compiled fine. Granted, that method was recently added to the class, but it was pretty simple and again, you’d expect the compiler to detect things like that.
Turns out the code I inherited from a not-great team had that class in two different places. Maven replaced the one I worked on with the untouched copy, which went into the build.
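In hindsight, a quick scan for class files that appear in more than one jar would have caught it. Something like the Python sketch below is what I have in mind; it assumes the duplicates end up in built jars (in my case they may well have been in source trees instead), and `find_duplicate_classes` is just an illustrative name.

```python
import sys
import zipfile
from collections import defaultdict


def find_duplicate_classes(jar_paths):
    """Map each .class entry to the jars containing it, keeping only duplicates."""
    owners = defaultdict(list)
    for jar in jar_paths:
        with zipfile.ZipFile(jar) as archive:
            # A set avoids counting a jar twice if it repeats an entry.
            for name in set(archive.namelist()):
                if name.endswith(".class"):
                    owners[name].append(str(jar))
    return {name: jars for name, jars in owners.items() if len(jars) > 1}


if __name__ == "__main__":
    # Usage: python find_dups.py first.jar second.jar ...
    for name, jars in sorted(find_duplicate_classes(sys.argv[1:]).items()):
        print(name, "->", ", ".join(jars))
```

Which copy wins at runtime depends on classpath order, so any hit from a scan like this is worth investigating.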
I had an issue where a client reported a crash on login. The exception and stack trace reported were very generic and offered no clues to the cause. I tried debugging but could not reproduce it. I eventually figured out that the crash only happened for release (non-debug) builds that were obfuscated. I couldn’t find the troublesome code, so I figured out which release introduced the issue, then which commit, then went change by change until I was able to find the cause. It turned out to be a log message in a location that was completely unrelated to login. That exact log message was fine a few lines up. Other code worked fine in that location. For some unknown reason, having that log message in that specific location caused a crash in a completely different area of code.
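The release-then-commit narrowing is a binary search over history. Here is a small Python sketch of the idea, assuming the first candidate is known good, the last is known bad, and the bug never disappears once introduced; `first_bad` and `is_bad` are hypothetical names, and in practice `is_bad` would build that version and try to reproduce the crash.

```python
def first_bad(candidates, is_bad):
    """Return the index of the first bad candidate.

    Assumes candidates[0] is good, candidates[-1] is bad, and every
    candidate after the first bad one is also bad.
    """
    good, bad = 0, len(candidates) - 1  # known-good and known-bad indices
    while bad - good > 1:
        mid = (good + bad) // 2
        if is_bad(candidates[mid]):
            bad = mid   # the regression is at mid or earlier
        else:
            good = mid  # the regression is after mid
    return bad


if __name__ == "__main__":
    releases = ["1.0", "1.1", "1.2", "1.3", "1.4", "1.5"]
    broken = {"1.3", "1.4", "1.5"}  # made-up data for the demo
    print(releases[first_bad(releases, lambda r: r in broken)])
```

For commits, `git bisect` automates the same search, so the manual version is mostly useful for coarser units such as releases or individual changes within a commit.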