debugging · distributed-systems · java

Debugging Distributed Systems: When the Error Is Three Hops Away



In a monolithic application, debugging is uncomfortable but tractable. You have one heap, one thread dump, one log file. Something throws an exception and the stack trace tells you exactly where.

In a distributed system you have five hosts, ten JVMs, and an exception that says Connection reset by peer. The real cause might be on a different machine entirely, might have happened 30 seconds earlier, and might have left no visible trace in the log you are looking at.

At Motorola our NMS ran across five hosts. By 1997 I had developed a systematic approach to distributed debugging that replaced intuition with evidence gathering.

The First Rule: Correlate Before You Guess

The most common mistake is to look at one log, form a hypothesis, and start changing code. In a distributed system this wastes hours. Together, the logs from all hosts tell a story that none tells alone.

Our first step on any production issue was to collect logs from all hosts, normalise the timestamps, and merge them into a single timeline:

# On each host, extract the relevant time window and prefix with hostname
{
    ssh nms-primary  "grep '1997-08-22 14:3' /var/log/nms.log" | sed 's/^/primary:  /'
    ssh nms-standby  "grep '1997-08-22 14:3' /var/log/nms.log" | sed 's/^/standby:  /'
    ssh nms-poller1  "grep '1997-08-22 14:3' /var/log/nms.log" | sed 's/^/poller-1: /'
    ssh nms-poller2  "grep '1997-08-22 14:3' /var/log/nms.log" | sed 's/^/poller-2: /'
} > merged.log

# Sort on the timestamp (fields 2 and 3, after the hostname prefix)
sort -k2,3 merged.log > timeline.log

Sorted by timestamp, the merged file shows the sequence of events across the whole system. Causality becomes visible.

Clock Skew Is Your Enemy

Distributed logs are only mergeable if the clocks agree. In 1997 NTP was not universally deployed. We had two hosts whose clocks drifted by up to 8 seconds relative to each other.

Eight seconds of clock skew means events that happened in sequence appear interleaved or reversed in a merged log. We learnt to check clock sync before trusting any timeline:

# Check time on all hosts
for host in nms-primary nms-standby nms-poller1 nms-poller2; do
    echo "$host: $(ssh $host date)"
done

After the first incident caused by clock skew, we added NTP synchronisation to all hosts and added the hostname to every log line so we could detect skew during post-mortems.

Thread Dumps for Hung JVMs

When a JVM stopped responding but did not crash, the first tool was a thread dump. On JDK 1.1:

# Send SIGQUIT to the JVM process — dumps all thread stacks to stderr
kill -QUIT $(cat /var/run/nms.pid)

The output showed every thread and what it was doing:

Full thread dump:

"poller-3" (TID:0x..., sys_thread_t:0x..., state:MW) prio=5
    at java.net.SocketInputStream.socketRead(Native Method)
    at java.net.SocketInputStream.read(SocketInputStream.java:86)
    at com.motorola.nms.SnmpClient.readResponse(SnmpClient.java:142)
    at com.motorola.nms.SnmpV2PollStrategy.poll(SnmpV2PollStrategy.java:67)
    at com.motorola.nms.DevicePoller.run(DevicePoller.java:34)

"poller-7" (TID:0x..., sys_thread_t:0x..., state:MW) prio=5
    at java.net.SocketInputStream.socketRead(Native Method)
    ...

state:MW is the "monitor wait" state, and the stack shows where each thread is stuck. All eight pollers were blocked in native socket reads on connections that had stopped responding. The issue was not in our code: a network switch had silently dropped the TCP connections without sending RST packets, leaving our sockets open but dead.

The fix was socket timeouts, which we should have had from the start.
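
A minimal sketch of the fix, hung off the SnmpClient from the stack trace above; the timeout value and method shape are illustrative, not the original code:

import java.io.InputStream;
import java.io.InterruptedIOException;
import java.net.Socket;

public class SnmpClient {
    // Illustrative limit: a device that says nothing for 30 seconds is dead
    private static final int READ_TIMEOUT_MS = 30 * 1000;

    public int readResponse(Socket socket, byte[] buf) throws Exception {
        // With SO_TIMEOUT set, a blocking read throws instead of hanging forever
        socket.setSoTimeout(READ_TIMEOUT_MS);
        InputStream in = socket.getInputStream();
        try {
            return in.read(buf);
        } catch (InterruptedIOException e) {
            // On JDK 1.1 a timed-out read surfaces as InterruptedIOException;
            // SocketTimeoutException only arrived in JDK 1.4
            throw new Exception("SNMP read timed out: " + socket.getInetAddress());
        }
    }
}

A timed-out read becomes an exception the poller can handle: close the socket, reconnect, mark the device unreachable. No more threads parked forever in socketRead.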

Binary Search Through the Code Path

When you cannot reproduce a bug locally and must debug in production, binary search through the code path is faster than reading every line:

  1. Identify the entry point and exit point of the suspected code path.
  2. Add a log at the midpoint: did we reach it?
  3. Yes: the bug is in the second half. No: it is in the first half.
  4. Repeat.

In a JVM you can do this without a rebuild, provided the code already carries conditional logging guarded by a system property:

// Add a system property to enable verbose logging at runtime
// java -Dnms.debug.poller=true ...
if (Boolean.getBoolean("nms.debug.poller")) {
    Logger.debug("DevicePoller", "Before SNMP call for " + ip);
}

Restart the specific JVM with the property set, reproduce, and check the log. No rebuild required.

The Heisenbug: Problems That Disappear Under Observation

Some bugs disappear when you add logging because the logging introduces synchronisation or changes timing. We had one race condition that vanished whenever we added debug logging, and reappeared when we removed it.

The approach for timing-dependent bugs:

  1. Add lightweight logging that does not change thread scheduling: write to a circular buffer in memory and dump it only on failure (see the sketch after this list).
  2. Look at the code for shared mutable state accessed without synchronisation — the race is almost always there.
  3. Use synchronized conservatively to eliminate candidates one at a time.
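
A minimal sketch of that kind of buffer; the class name and entry format are my own, not the original code:

// Appending does no file I/O, so it perturbs thread timing far less than
// a conventional logger. It still takes a brief lock, which is the price
// of a consistent buffer.
public class RingLog {
    private final String[] slots;
    private int next = 0;

    public RingLog(int capacity) {
        slots = new String[capacity];
    }

    public synchronized void log(String msg) {
        slots[next] = System.currentTimeMillis() + " "
                + Thread.currentThread().getName() + " " + msg;
        next = (next + 1) % slots.length;
    }

    // Called only once the failure has been detected: oldest entry first
    public synchronized void dump(java.io.PrintStream out) {
        for (int i = 0; i < slots.length; i++) {
            String entry = slots[(next + i) % slots.length];
            if (entry != null) {
                out.println(entry);
            }
        }
    }
}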

For our bug, the extra latency from file I/O in the logging path was masking a race between the poller thread and the alert thread updating the same DeviceStatus object without synchronisation. The fix was straightforward once we stopped relying on luck.
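
The shape of the fix, with field names invented for illustration: every access to the shared object goes through a synchronized method, so the two threads can no longer interleave half-finished updates.

public class DeviceStatus {
    private int state;
    private long lastPollTime;

    // The poller thread and the alert thread both come through here;
    // the fields are only ever touched while holding this object's monitor
    public synchronized void update(int newState, long pollTime) {
        state = newState;
        lastPollTime = pollTime;
    }

    public synchronized int getState() {
        return state;
    }

    public synchronized long getLastPollTime() {
        return lastPollTime;
    }
}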

What a Distributed Debugger Would Look Like

In 1997 we had no distributed tracing. What we needed — and what did not exist until Dapper (2010) and Zipkin (2012) — was a correlation ID attached to each request that propagated across all service calls and appeared in every log line. With that, you would search for the ID and see the entire request journey across all hosts in one view.

We built a rudimentary version: a random request ID generated at entry to each operation, passed through method calls, logged at every step. It was not propagated across network calls, which limited its usefulness. The concept was sound; the infrastructure to support it came later.
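
A sketch of that scheme; sendSnmpGet is a hypothetical helper, and Logger is the same logging class used earlier:

import java.util.Random;

public class DevicePoller {
    private static final Random IDS = new Random();

    public void pollOnce(String ip) {
        // Random request ID generated at the entry point of the operation
        String reqId = Integer.toHexString(IDS.nextInt());
        Logger.debug("DevicePoller", "[" + reqId + "] start poll of " + ip);
        byte[] response = sendSnmpGet(reqId, ip);
        Logger.debug("DevicePoller", "[" + reqId + "] got " + response.length + " bytes");
    }

    // Hypothetical helper: each layer logs the same ID, so grepping for it
    // reconstructs the operation's path, but only within this one JVM
    private byte[] sendSnmpGet(String reqId, String ip) {
        Logger.debug("SnmpClient", "[" + reqId + "] GET " + ip);
        return new byte[0];
    }
}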