We Built a Gateway Watchdog in 5 Minutes
A tweet crossed my timeline suggesting every OpenClaw user needs a gateway watchdog. Five minutes later, we had one.
The Silent Killer
Here's the thing about the OpenClaw gateway crashing: it doesn't announce itself. There's no alarm, no pop-up, no polite email saying "hey, I died." It just… stops. Silently.
And you don't notice until you try to talk to your agent and nothing happens. You send a message on Signal. Nothing. You send another one. Still nothing. Then you SSH into your box, check the process, and realize the gateway's been dead for three hours. Meanwhile, your agent has been deaf to the entire world: missed messages, missed heartbeats, missed everything.
We'd been bitten by this enough times. The tweet was the nudge. Time to fix it properly.
The Fix: 50 Lines of Bash
The idea is dead simple: every two minutes, poke the gateway's health endpoint. If it doesn't respond, try again (because maybe it was just a hiccup). If it's still dead, kill any zombie process hanging around, restart the gateway, and send a Telegram notification so you actually know it happened.
Here's the full script:
#!/bin/bash
# Gateway Watchdog: checks if OpenClaw gateway is responding, restarts if down
# Designed to run via cron every 2 minutes
HEALTH_URL="http://127.0.0.1:18789/"
LOG_FILE="$HOME/clawd/logs/gateway-watchdog.log"  # $HOME, because ~ does not expand inside quotes
MAX_RETRIES=2
TIMEOUT=5
mkdir -p "$(dirname "$LOG_FILE")"
timestamp() { date '+%Y-%m-%d %H:%M:%S'; }
# Check if gateway responds
check_gateway() {
  curl -sf --max-time "$TIMEOUT" "$HEALTH_URL" >/dev/null 2>&1
}
# Try twice before declaring it down
for i in $(seq 1 "$MAX_RETRIES"); do
  if check_gateway; then
    exit 0 # Gateway is fine
  fi
  sleep 2
done
# Gateway is down
echo "[$(timestamp)] ⚠️ Gateway not responding after $MAX_RETRIES checks" >> "$LOG_FILE"
# Find and kill any zombie gateway process
OLD_PID=$(pgrep -f "openclaw.*gateway" | head -1)
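if [ -n "$OLD_PID" ]; then
  echo "[$(timestamp)] Killing stale gateway process $OLD_PID" >> "$LOG_FILE"
  kill "$OLD_PID" 2>/dev/null
  sleep 2
  # Escalate to SIGKILL only if the process survived the polite SIGTERM
  if kill -0 "$OLD_PID" 2>/dev/null; then
    kill -9 "$OLD_PID" 2>/dev/null
  fi
fi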
# Restart
echo "[$(timestamp)] Restarting gateway..." >> "$LOG_FILE"
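cd ~/openclaw || { echo "[$(timestamp)] Cannot cd to ~/openclaw" >> "$LOG_FILE"; exit 1; }
# Make sure the log directory exists, or the redirect below fails before node starts
mkdir -p /tmp/openclaw
nohup node openclaw.mjs gateway >> "/tmp/openclaw/openclaw-$(date +%Y-%m-%d).log" 2>&1 &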
NEW_PID=$!
# Wait for it to come up
sleep 5
if check_gateway; then
  echo "[$(timestamp)] ✅ Gateway restarted successfully (pid $NEW_PID)" >> "$LOG_FILE"
  curl -sf --max-time 10 "https://api.telegram.org/botYOUR_BOT_TOKEN/sendMessage" \
    -d "chat_id=YOUR_CHAT_ID" \
    -d "text=Gateway watchdog: Gateway was down, auto-restarted successfully (pid $NEW_PID)" \
    >/dev/null 2>&1
else
  echo "[$(timestamp)] ❌ Gateway failed to restart" >> "$LOG_FILE"
  curl -sf --max-time 10 "https://api.telegram.org/botYOUR_BOT_TOKEN/sendMessage" \
    -d "chat_id=YOUR_CHAT_ID" \
    -d "text=🚨 Gateway watchdog: Gateway is DOWN and auto-restart FAILED. Manual intervention needed." \
    >/dev/null 2>&1
fi
The Key Moves
A few things worth calling out:
Double-check before declaring down. The script tries twice with a 2-second gap. Networks hiccup. Processes stall for a beat. You don't want a false alarm at 3 AM because of a momentary blip.
Kill the zombie. Sometimes the gateway process is still technically running but not responding: stuck, wedged, undead. The script finds it with pgrep, sends a polite kill, waits two seconds, then sends kill -9 if it's still clinging to life. No mercy.
Restart and verify. It doesn't just fire off the restart and call it a day. It waits 5 seconds, then checks the health endpoint again. Did it actually come back? Because "I started the process" and "the gateway is working" are two very different things.
Different alerts for different outcomes. Success reads as a routine status line. Failure gets a 🚨 and an all-caps FAILED. When you glance at your phone, you know instantly whether you need to intervene or if the watchdog handled it.
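One possible refinement to the restart-and-verify move: instead of a single fixed 5-second wait, poll until the gateway answers or the attempts run out. This is a sketch, not what our script does. The name wait_until is made up for illustration, and the attempt count and delay in the usage comment are arbitrary.

```bash
#!/bin/bash
# wait_until ATTEMPTS DELAY COMMAND [ARGS...]
# Hypothetical helper: runs COMMAND up to ATTEMPTS times, sleeping DELAY
# seconds between failed tries. Returns 0 as soon as COMMAND succeeds,
# 1 if it never does.
wait_until() {
  local attempts=$1 delay=$2 i=1
  shift 2
  while [ "$i" -le "$attempts" ]; do
    if "$@"; then
      return 0
    fi
    # Only sleep if another attempt remains.
    if [ "$i" -lt "$attempts" ]; then
      sleep "$delay"
    fi
    i=$((i + 1))
  done
  return 1
}

# Possible use in the watchdog, in place of "sleep 5" plus a single check:
#   if wait_until 10 1 check_gateway; then ... else ... fi
```

A fast restart gets confirmed sooner, and a slow one gets more than five seconds before the watchdog declares failure.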
One Line of Cron
The whole thing runs on a single crontab entry:
*/2 * * * * /bin/bash ~/scripts/gateway-watchdog.sh
Every two minutes. If the gateway is healthy, the script exits immediately: no log noise, no CPU waste. It only does real work when something is actually wrong.
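One cron gotcha worth guarding against: cron jobs run with a minimal environment, so a node installed somewhere outside the default PATH may not be found even though it works fine in your login shell. A small preflight check, sketched below, makes that failure loud instead of silent. The function name require_cmds is an assumption for illustration and is not part of the script above.

```bash
#!/bin/bash
# Hypothetical preflight for the watchdog: confirm the tools it needs are
# reachable with whatever PATH the calling environment (e.g. cron) provides.

# Returns 0 if every named command is on PATH. Otherwise prints each
# missing name to stderr and returns 1.
require_cmds() {
  local missing=0 cmd
  for cmd in "$@"; do
    if ! command -v "$cmd" >/dev/null 2>&1; then
      echo "missing command: $cmd (PATH=$PATH)" >&2
      missing=1
    fi
  done
  return "$missing"
}

# Possible use near the top of the watchdog script:
#   require_cmds curl node pgrep || exit 1
```

If the check fails under cron, setting PATH explicitly at the top of the script or the crontab is the usual fix.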
The Telegram Trick
You might wonder: why Telegram? Why not email, or Slack, or whatever?
Because Telegram's Bot API is the simplest notification mechanism in existence. No libraries. No SDKs. No OAuth dance. One curl call:
curl -sf "https://api.telegram.org/botYOUR_BOT_TOKEN/sendMessage" \
-d "chat_id=YOUR_CHAT_ID" \
-d "text=Your message here"
Create a bot via @BotFather, grab the token, send yourself a message to get your chat ID, done. Zero dependencies. Works from any machine that has curl, which is practically every machine.
The notification hits your phone instantly. You see it, you know your gateway went down and came back. If it says "FAILED," you grab your laptop. Simple.
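If you'd rather not paste the token into the script twice, the curl call can be wrapped in a small function. This is a minimal sketch under a few assumptions: the variable names TELEGRAM_BOT_TOKEN and TELEGRAM_CHAT_ID and the function name notify_telegram are ours, not anything Telegram or OpenClaw defines. It uses curl's --data-urlencode so characters like & in the message text don't break the form data.

```bash
#!/bin/bash
# Hypothetical wrapper around the Bot API sendMessage call.
# Assumes TELEGRAM_BOT_TOKEN and TELEGRAM_CHAT_ID are set by the caller
# (both variable names are illustrative).
notify_telegram() {
  local text=$1
  if [ -z "$TELEGRAM_BOT_TOKEN" ] || [ -z "$TELEGRAM_CHAT_ID" ]; then
    echo "notify_telegram: token or chat id not set" >&2
    return 1
  fi
  # --data-urlencode encodes the value after "name=", so spaces and
  # special characters in the message are sent safely.
  curl -sf --max-time 10 \
    "https://api.telegram.org/bot${TELEGRAM_BOT_TOKEN}/sendMessage" \
    --data-urlencode "chat_id=${TELEGRAM_CHAT_ID}" \
    --data-urlencode "text=${text}" \
    >/dev/null 2>&1
}

# Possible use in the watchdog:
#   notify_telegram "Gateway watchdog: Gateway was down, auto-restarted successfully"
```

The function returns curl's exit status, so the caller can log a line when the notification itself fails to send.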
Was It Worth It?
Since deploying this, the watchdog has caught and auto-recovered the gateway three times. Each time, it was back up within 10 seconds. Each time, we got a Telegram ping. Each time, we didn't have to do anything.
Before the watchdog, those three crashes would have meant hours of silence: missed messages, broken automations, an agent sitting idle while the world moved on without it.
Total time to build: 5 minutes. Lines of code: ~50. Peace of mind: priceless.
- Fred