Building High Availability: Early Lessons from Distributed Systems at Motorola

Before cloud, before Kubernetes, before load balancers were cheap — this is how we kept critical network management systems running in 1997.

The network management system we built at Motorola was critical infrastructure. When it went down, network engineers lost visibility across hundreds of devices. Downtime was not acceptable. This was 1997: no AWS, no managed load balancers, no Kubernetes. High availability meant designing it yourself.

Here is the architecture we built and what we learned from running it.

The Failure Modes We Designed Against

Before writing a line of HA code, we listed what could fail:

  1. Process crash — the NMS JVM exits unexpectedly.
  2. Host failure — the server hardware fails or the OS crashes.
  3. Network partition — the NMS server cannot reach devices even though both are running.
  4. Database failure — the configuration database is unavailable.

Each failure mode required a different response. A process crash is recoverable in seconds by restarting the process. A host failure requires failover to a standby. A network partition requires split-brain handling, so that two nodes never both act as primary.
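
To make that mapping concrete, here is a minimal sketch of a dispatch table for these responses. The type and method names (FailureMode, RecoveryPolicy, and the handlers) are illustrative, not from our original code:

enum FailureMode {
    PROCESS_CRASH, HOST_FAILURE, NETWORK_PARTITION, DATABASE_FAILURE
}

public class RecoveryPolicy {
    public void respond(FailureMode mode) {
        switch (mode) {
            case PROCESS_CRASH:
                restartProcess();     // supervisor relaunches the JVM in seconds
                break;
            case HOST_FAILURE:
                promoteStandby();     // standby takes over work and the virtual IP
                break;
            case NETWORK_PARTITION:
                resolveSplitBrain();  // never promote blindly; consult a tiebreaker first
                break;
            case DATABASE_FAILURE:
                serveDegraded();      // keep monitoring from in-memory state
                break;
        }
    }

    private void restartProcess()    { /* elided */ }
    private void promoteStandby()    { /* elided */ }
    private void resolveSplitBrain() { /* elided */ }
    private void serveDegraded()     { /* elided */ }
}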

Primary–Standby with Heartbeat

The core HA pattern was a primary/standby pair. The primary handled all work. The standby mirrored state and monitored the primary. When the primary failed, the standby promoted itself.

The heartbeat mechanism:

import java.io.IOException;
import java.net.DatagramPacket;
import java.net.DatagramSocket;
import java.net.InetAddress;

public class HeartbeatSender extends Thread {
    private final DatagramSocket socket;
    private final InetAddress    standbyAddress;
    private final int            standbyPort;
    private final int            intervalMs = 1000;

    public HeartbeatSender(String standbyHost, int port) throws Exception {
        this.socket         = new DatagramSocket();
        this.standbyAddress = InetAddress.getByName(standbyHost);
        this.standbyPort    = port;
    }

    @Override
    public void run() {
        while (!isInterrupted()) {
            try {
                byte[] payload = buildHeartbeat();
                DatagramPacket pkt =
                    new DatagramPacket(payload, payload.length, standbyAddress, standbyPort);
                socket.send(pkt);
                Thread.sleep(intervalMs);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            } catch (IOException e) {
                // log — heartbeat failure is monitored separately
            }
        }
    }

    private byte[] buildHeartbeat() {
        // timestamp only in this sketch; the real payload also carried a
        // sequence number and current load metrics
        long now = System.currentTimeMillis();
        return ("HB:" + now).getBytes();
    }
}

The standby's heartbeat monitor:

import java.io.IOException;
import java.net.DatagramPacket;
import java.net.DatagramSocket;
import java.net.SocketTimeoutException;

public class HeartbeatMonitor extends Thread {
    private final DatagramSocket socket;
    private volatile long        lastHeartbeat;
    private final long           timeoutMs = 5000; // 5 seconds

    public HeartbeatMonitor(int listenPort) throws Exception {
        this.socket = new DatagramSocket(listenPort);
        this.socket.setSoTimeout((int) timeoutMs); // bound each receive()
        this.lastHeartbeat = System.currentTimeMillis();
    }

    @Override
    public void run() {
        // run the timeout check on its own thread so a blocked receive()
        // never delays failover detection
        new Thread(this::checkTimeout).start();
        byte[]         buf = new byte[256];
        DatagramPacket pkt = new DatagramPacket(buf, buf.length);
        while (!isInterrupted()) {
            try {
                socket.receive(pkt);
                lastHeartbeat = System.currentTimeMillis();
            } catch (SocketTimeoutException e) {
                // no heartbeat inside the window; checkTimeout decides
            } catch (IOException e) {
                interrupt(); // socket failure: stop the monitor
            }
        }
    }

    private void checkTimeout() {
        // isInterrupted() here refers to the monitor thread, so
        // interrupting the monitor stops this loop as well
        boolean promoted = false;
        while (!isInterrupted() && !promoted) {
            try {
                Thread.sleep(1000);
                if (System.currentTimeMillis() - lastHeartbeat > timeoutMs) {
                    promoted = true; // fire failover exactly once
                    triggerFailover();
                }
            } catch (InterruptedException e) {
                return;
            }
        }
    }

    private void triggerFailover() {
        // promote this node to primary
        // take over the virtual IP address
        // restart inbound connections
    }
}

State Replication

A standby is only useful if its state matches the primary. We replicated state changes over TCP using a simple log-shipping mechanism:

import java.io.IOException;
import java.io.PrintWriter;
import java.net.Socket;
import java.util.ArrayList;
import java.util.List;

public class StateReplicator {
    private final Socket       connection;
    private final PrintWriter  out;
    private final long         batchIntervalMs = 100;
    private final List<String> pendingOps      = new ArrayList<>();

    public StateReplicator(String standbyHost, int port) throws IOException {
        this.connection = new Socket(standbyHost, port);
        this.out        = new PrintWriter(connection.getOutputStream(), true);

        // background flusher: ship pending operations every batchIntervalMs
        Thread flusher = new Thread(() -> {
            while (!Thread.currentThread().isInterrupted()) {
                try {
                    Thread.sleep(batchIntervalMs);
                    flush();
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                }
            }
        });
        flusher.setDaemon(true);
        flusher.start();
    }

    public synchronized void recordUpdate(String deviceIp, DeviceStatus status) {
        pendingOps.add("UPDATE:" + deviceIp + ":" + status.toString());
    }

    public synchronized void flush() {
        for (String op : pendingOps) {
            out.println(op); // the autoflush PrintWriter pushes each line
        }
        pendingOps.clear();
    }
}

We flushed every 100ms. This meant at most 100ms of state loss on failover — acceptable for a monitoring system.
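
On the standby side, the applier is a short loop that replays shipped lines into local state. A minimal sketch, assuming the UPDATE line format above; applyUpdate is a hypothetical hook into the standby's in-memory device table:

import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.net.ServerSocket;
import java.net.Socket;

public class StateApplier extends Thread {
    private final ServerSocket listener;

    public StateApplier(int port) throws IOException {
        this.listener = new ServerSocket(port);
    }

    @Override
    public void run() {
        try (Socket conn = listener.accept();
             BufferedReader in = new BufferedReader(
                 new InputStreamReader(conn.getInputStream()))) {
            String line;
            while ((line = in.readLine()) != null) {
                // lines arrive as "UPDATE:<deviceIp>:<status>"
                String[] parts = line.split(":", 3);
                if (parts.length == 3 && "UPDATE".equals(parts[0])) {
                    applyUpdate(parts[1], parts[2]);
                }
            }
        } catch (IOException e) {
            // replication link lost; the heartbeat monitor, not this
            // thread, decides whether to promote
        }
    }

    private void applyUpdate(String deviceIp, String status) {
        // write into the standby's device table (elided)
    }
}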

Virtual IP for Transparent Failover

Clients connected to a virtual IP address that was normally held by the primary. On failover, the standby broadcast a gratuitous ARP to claim the virtual IP. Clients' TCP connections dropped but reconnected immediately, landing on the standby, which now owned the address.

This was entirely handled at the OS/network layer — no application code required. The concept is identical to what cloud providers call a floating IP or an elastic IP.
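
In our deployment the takeover was scripted at the OS level, but it reduces to two steps: alias the virtual IP onto a local interface, then broadcast a gratuitous ARP so switches and clients learn the new owner. A hypothetical sketch of driving both from a promotion path; the interface name, address, and Linux-style arping flags are assumptions, and the exact commands vary by OS:

import java.io.IOException;

public class VipTakeover {
    // placeholders for illustration only
    private static final String ALIAS = "eth0:1";
    private static final String VIP   = "10.0.0.100";

    public static void claimVirtualIp() {
        try {
            // bring the virtual IP up as an alias on the local interface
            Runtime.getRuntime()
                   .exec(new String[] { "ifconfig", ALIAS, VIP, "up" })
                   .waitFor();
            // unsolicited (gratuitous) ARP announces the new MAC-to-IP mapping
            Runtime.getRuntime()
                   .exec(new String[] { "arping", "-U", "-c", "3", "-I", "eth0", VIP })
                   .waitFor();
        } catch (IOException | InterruptedException e) {
            // a failed takeover must alert an operator (elided)
        }
    }
}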

The Lessons

Test failover regularly. We ran failover drills every month. The first drill found three bugs. Untested failover paths fail when you need them.

Fail fast, recover fast. Better to restart a crashed process in two seconds than to limp along in a degraded state. We used a supervisor process that monitored the NMS JVM and restarted it on exit.
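
A minimal version of that supervisor loop, sketched with the modern ProcessBuilder API; the nms.jar command line is a placeholder:

import java.io.IOException;

public class Supervisor {
    public static void main(String[] args) throws InterruptedException {
        ProcessBuilder launcher = new ProcessBuilder("java", "-jar", "nms.jar")
                .inheritIO();                 // share the supervisor's stdout/stderr
        while (true) {
            try {
                Process nms = launcher.start();
                int exitCode = nms.waitFor(); // block until the NMS process dies
                System.err.println("NMS exited with code " + exitCode + ", restarting");
            } catch (IOException e) {
                System.err.println("failed to launch NMS: " + e.getMessage());
            }
            Thread.sleep(2000);               // brief backoff to avoid a crash loop
        }
    }
}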

Distinguish planned from unplanned downtime. Maintenance windows, software upgrades, host reboots — these are planned. Design the system to remain available during planned operations, not just failures.

Accept that distributed systems fail. The goal is not zero downtime — it is fast, automatic recovery. Design around that reality.