From signal to execution: polybot's NNG pipeline

How a strategy signal becomes an on-venue order, traced end-to-end through polybot's async service mesh. The design trade-offs, and why NNG instead of Redis or in-process queues.

Published Apr 9, 2026


polybot’s services communicate via NNG (nanomsg-next-gen) pub/sub and push/pull channels, not direct function calls. New operators ask why. The answer is in what happens when a strategy emits a signal — traced through five processes in under a second.

The actors

Five independent OS processes (or containers), each able to crash and restart without taking down the rest: the scanner (market-data ingest), the strategies (signal generation), the risk service (limit checks), the executor (order signing and submission), and analytics (persistence). All communicate through NNG sockets.

A trace

  1. T+0ms: Polymarket WebSocket emits a book update for market X.
  2. T+5ms: Scanner normalises to a BookUpdate, publishes on the prices PUB socket.
  3. T+6ms: arbitrage strategy (subscriber) receives it, computes edge. Edge > threshold. Emits Signal(market=X, side=YES, size=$80) on the signals PUSH socket.
  4. T+8ms: Risk service (PULL side) receives the signal. Checks limits. OK. Converts to Order. Pushes on orders PUSH socket.
  5. T+10ms: Executor pulls the order. Serialises to Polymarket’s CLOB format. Signs with EIP-712. Submits via py-clob-client.
  6. T+300ms: Venue ACKs (CLOB latency). Executor emits OrderAck on the executions PUB socket.
  7. T+350ms: Executor also pushes the paired NO leg (from the original Signal pair).
  8. T+600ms: Both legs fill. Executor emits Fill events. Analytics consumes, writes to DuckDB. Strategies receive, update internal state.

Total elapsed: ~600ms, dominated by on-chain venue latency. polybot’s in-process time is ~10ms.
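The strategy step of the trace can be sketched in a few lines. The field names and the edge formula below are illustrative assumptions, not polybot's actual schemas; the point is that each hop is a small, self-describing message serialised onto an NNG socket.

```python
import json
from dataclasses import dataclass, asdict

# Hypothetical message shapes for the hops above. Real polybot
# schemas may differ; JSON stands in for whatever wire format is used.

@dataclass
class BookUpdate:          # scanner -> strategies (prices PUB)
    market: str
    best_yes_ask: float
    best_no_ask: float

@dataclass
class Signal:              # strategies -> risk (signals PUSH)
    market: str
    side: str              # "YES" or "NO"
    size_usd: float

def encode(msg) -> bytes:
    """One JSON object per NNG message."""
    return json.dumps(asdict(msg)).encode()

def decode(cls, raw: bytes):
    return cls(**json.loads(raw))

# Step 3 of the trace: compute edge, compare to a threshold, emit.
update = BookUpdate(market="X", best_yes_ask=0.46, best_no_ask=0.51)
edge = 1.0 - (update.best_yes_ask + update.best_no_ask)  # YES+NO under $1
if edge > 0.02:                                          # assumed threshold
    wire = encode(Signal(market=update.market, side="YES", size_usd=80.0))
```

In the real pipeline `wire` would go out on the signals PUSH socket; here it just demonstrates that a round trip through the wire format is lossless.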

Why NNG and not alternatives

Not direct function calls

A monolithic async app can call risk.check(signal) directly. Splitting processes buys the properties the rest of this design relies on: a strategy that crashes mid-signal cannot take the executor down with it, the risk service can be restarted with new limits while strategies keep running, and each stage scales independently. The price is serialisation at every hop, a few milliseconds out of a ~10ms in-process budget.

Not Redis pub/sub

Redis was considered. NNG wins because it is brokerless: there is no Redis server to deploy, monitor, or lose, and messages travel peer-to-peer in one hop instead of two. Redis pub/sub is also fan-out only; getting the exactly-once work-queue semantics of PUSH/PULL would mean layering lists or streams on top.

Not Kafka

Overkill. polybot’s message rate is < 10k msgs/sec at peak. Kafka is the right answer at 1M msgs/sec with replay requirements. We don’t have either.

Not ZeroMQ

NNG (nanomsg-next-gen) is the successor to nanomsg, the library ZeroMQ's original author wrote to fix ZeroMQ's design issues (in-process threading, socket lifecycle). For greenfield Python code, NNG + pynng is the cleaner path.

Socket taxonomy

prices      PUB (scanner)  → SUB (strategies, analytics)
events      PUB (scanner)  → SUB (strategies, analytics)
signals     PUSH (strategies) → PULL (risk)                  [round-robin]
orders      PUSH (risk)    → PULL (executor)                 [round-robin]
executions  PUB (executor) → SUB (strategies, analytics)
control     REQ (cli)      → REP (services)                  [request/reply]

PUSH/PULL is used where every message must be processed exactly once (signals, orders). PUB/SUB is used where fan-out is desirable (prices, executions). REQ/REP is for the CLI to query state.

Backpressure

Each service publishes a lag metric to Prometheus. If the executor’s inbound queue grows beyond a threshold, strategies get a throttle signal on the control channel and slow down signal emission. This is cooperative backpressure — not bulletproof, but enough in practice. Hard backpressure (blocking publishers) was rejected because a slow executor should never block the scanner.
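The cooperative throttle can be sketched as two small state machines, one on each side of the control channel. The names, thresholds, and delay are hypothetical, not polybot's actual API; the hysteresis band (throttle high, resume low) is what keeps the system from oscillating around a single threshold.

```python
import time
from dataclasses import dataclass

THROTTLE_AT = 500   # assumed inbound-queue depth that triggers throttling
RESUME_AT = 100     # depth at which strategies may speed up again

@dataclass
class ThrottleController:
    """Executor side: turns a lag gauge into control-channel messages."""
    throttled: bool = False

    def on_lag_sample(self, queue_depth: int):
        """Return a control message to broadcast, or None if no change."""
        if not self.throttled and queue_depth > THROTTLE_AT:
            self.throttled = True
            return b"THROTTLE"
        if self.throttled and queue_depth < RESUME_AT:
            self.throttled = False
            return b"RESUME"
        return None

@dataclass
class StrategyEmitter:
    """Strategy side: inserts a delay between signals while throttled."""
    delay_s: float = 0.0

    def on_control(self, msg: bytes):
        self.delay_s = 0.05 if msg == b"THROTTLE" else 0.0

    def emit(self, signal: bytes, send):
        if self.delay_s:
            time.sleep(self.delay_s)  # cooperative slow-down, not a block
        send(signal)

ctrl = ThrottleController()
assert ctrl.on_lag_sample(600) == b"THROTTLE"  # crossed the high threshold
assert ctrl.on_lag_sample(300) is None         # inside the hysteresis band
assert ctrl.on_lag_sample(50) == b"RESUME"     # drained below the low mark
```

Note the asymmetry: the scanner and executor never block on each other; the only coupling is an advisory message that strategies are free to honour, which is the "cooperative, not bulletproof" trade described above.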

What this lets you do

Fault isolation (a crashed strategy never takes the executor with it), hot-reload of individual services, independent scaling of each stage, and a trivial backtesting story: point the strategies' SUB socket at a replayer instead of the live scanner and the rest of the pipeline is unchanged.

What this costs you

Deployment complexity: five processes to supervise instead of one, serialisation at every hop, and debugging that spans process boundaries instead of a single stack trace.

The design takeaway

Pick your coupling at the right layer. For polybot, the right layer was “services communicate by messages, not calls”. It cost us complexity in deployment and won us fault isolation, hot-reload, independent scaling, and a trivial backtesting story.

If you’re building a real-time agent system — prediction markets, algorithmic trading, any domain with async venue I/O and strict risk invariants — it’s the layout worth copying.

Need an agent system built like this?

Cryptuon builds production AI agents, MCP integrations, and trading systems. polybot is our open-source showcase.