How to run CrossConnect across several application copies (replicas) at once, so the service stays up and carries more load. We cover the load balancer, the application replicas, the shared state they coordinate through, the single PostgreSQL system of record, and the trap that catches every multi-replica rollout: scheduled jobs that fire once on every node.
This reference names the mechanism behind each claim, not a vague adjective. It tells you which state lives on a single replica, which is shared, which background job is safe to run at N copies and which is not, and which config key controls each one. Most CrossConnect deployments run a single application node, and the Performance & Capacity Planning Guide sizes that one node well into the thousands of devices. You move to more than one replica for two reasons: to survive a node failure or a rolling upgrade with no downtime, and to serve more requests or operators than one node should carry. The rest of this document is the reference shape for that move.
CrossConnect ships as a set of containers: one application image, PostgreSQL, and an optional Batfish analysis sidecar. You can run them under Docker Compose, Kubernetes, or a managed-database setup such as Cloud Run plus Cloud SQL. The application tier scales out (add more replicas), with one boundary you have to engineer on purpose. Four facts frame every decision below.
Dashboards serve from in-memory rollups; reads and writes go to PostgreSQL. No lasting per-user state is kept between calls, so a request can land on any replica. That is what lets the tier scale out.
Every replica points at the same PostgreSQL primary, exactly as in a single-node install. The connection budget is the number that bites at scale: the primary has to hold the sum of every replica's pool at once.
The API rate-limit counters and signed-in sessions are held in memory on each replica today. Across N replicas that becomes N separate copies, until you put a shared store in front of them or pin each session to one replica (session affinity).
Roughly twenty @Scheduled jobs
(discovery, drift, purges, report delivery) run on every replica on their own. The build has no leader election
today, so N replicas means N runs unless you gate them. This is the main trap when you scale out.
Here is the overall shape. A health-checked load balancer spreads traffic across identical application replicas. Every replica shares one PostgreSQL primary and one rate-limit / session store. The primary streams to an optional read replica that absorbs reporting and AI-retrieval reads, keeping that load off the primary. Configuration analysis runs against a Batfish sidecar reached on a separate path (out of band). Every edge below is labeled with its transport.
flowchart TB
LB["Load balancer
L7 · TLS term · health checks
session affinity"]
subgraph APP["APPLICATION TIER · stateless, scale 2..N"]
direction LR
A1["App replica 1
:8080 · UI · REST"]
A2["App replica 2
:8080 · UI · REST"]
AN["App replica N
:8080 · UI · REST"]
end
subgraph STORE["SHARED STATE · one of each"]
direction LR
PG[("PostgreSQL primary
:5432 · system of record")]
RO[("Read replica
reporting · AI reads")]
RS[("Shared counter / session store
rate limit · sessions")]
end
BF["Batfish sidecar
:8888 · single session
config analysis"]
LB -- "HTTPS · affinity by cookie" --> A1
LB --> A2
LB --> AN
A1 -- "JDBC / TLS" --> PG
A2 --> PG
AN --> PG
PG -- "streaming replication" --> RO
A1 -. "shared counters / sessions" .-> RS
A2 -.-> RS
AN -.-> RS
A1 -- "RPC · out of band" --> BF
A2 --> BF
AN --> BF
classDef app fill:#173a6b,stroke:#0f2a4f,color:#ffffff;
classDef store fill:#1797b3,stroke:#0d7d90,color:#ffffff;
classDef ext fill:#ffffff,stroke:#9aa8c0,color:#173a6b;
class A1,A2,AN app;
class PG,RO,RS store;
class LB,BF ext;
| Component | Role | Cardinality | State |
|---|---|---|---|
| Load balancer | L7 entry, TLS termination, health checks, session affinity | 1 (HA pair) | None (routing only) |
| App replica | Serves UI (:8080) and REST; identical image, identical config | 2 to N | Stateless request path |
| PostgreSQL primary | System of record: all writes and uncached reads (:5432) | 1 | Durable, authoritative |
| Shared counter / session store | Holds rate-limit counters for the whole fleet, and optionally sessions; not part of the default install, you add it | 1 (HA pair) | Ephemeral, shared |
| PostgreSQL read replica | Reporting, exports, and AI-retrieval reads | 0 to N | Replicated, read-only |
| Batfish sidecar | Config analysis (drift, reachability, change impact); single shared session | 1 host or pool | Snapshot-scoped |
The default ports come from the shipped configuration: the application listens on
server.port (8080), PostgreSQL on 5432, and the Batfish sidecar on
8888 via CROSSCONNECT_BATFISH_URL. The managed reference (Cloud Run) sets
containerConcurrency: 50 and autoscales from minScale: 1 to maxScale: 5. It
keeps one warm replica so requests do not pay the roughly five-second JVM cold start.
CrossConnect serves dashboards from in-memory rollups, and reads and writes PostgreSQL as its system of record. The request path itself keeps no lasting per-user state between calls, so a request can land on any replica. But three pieces of state have to stay consistent across replicas, and most of this architecture exists to handle them correctly.
| State | Where it lives today | N-up consequence | What you do |
|---|---|---|---|
| System of record | One PostgreSQL primary, shared by every replica | None; it is already shared in a single-node install | Point all replicas at the same primary; size the connection budget (§10) |
| API rate-limit budget | Per replica in-memory fixed-window counters (RateLimitFilter) | The effective limit multiplies by the replica count | Front with a shared counter store, or accept the multiplied limit (§6) |
| UI sessions | Per replica in-memory; long-lived by default | A session is valid only on the replica that issued it | Session affinity at the LB, a shared session store, or both (§5) |
The configuration-analysis engine (Batfish) is shared too, but it is reached on a separate path and holds no per-request state, so it scales as its own pool rather than as part of the request path (§8). The background sweeps are a fourth concern, covered in §7.
flowchart LR
REQ["Operator / API request"] --> LB{"Load balancer
route by session cookie"}
LB --> R["Any app replica
stateless"]
R --> RL{"Rate-limit check
per-tenant · per-IP"}
RL -- "within budget" --> SES["Resolve session
affinity or shared store"]
RL -. "429 + Retry-After" .-> REJ["Rejected"]
SES --> SRV["Serve from in-memory rollups
or read / write PostgreSQL"]
classDef app fill:#173a6b,stroke:#0f2a4f,color:#ffffff;
classDef gate fill:#fdf0dd,stroke:#e0892a,color:#173a6b;
classDef store fill:#e3f3f6,stroke:#1797b3,color:#173a6b;
classDef ext fill:#ffffff,stroke:#9aa8c0,color:#173a6b;
class R,SRV app;
class LB,RL gate;
class SES store;
class REQ,REJ ext;
When the rate-limit budget and the session both resolve against shared state, the operator sees one consistent limit and one continuous session no matter how the load balancer spreads their requests. Add or remove a replica and none of that changes. When they resolve against replica-local memory instead (the shipped default), the budget and session belong only to whichever replica answered. That is exactly why §5 and §6 exist.
Affinity: the load balancer pins each session to one
replica by cookie. This is the simplest option, but if that replica is lost, its pinned operators have to log in
again. Shared store: sessions live off-node, so any replica can serve any session and losing a replica is
invisible to users. Affinity plus a shared store gives you both locality and survivability. Sessions are
long-lived by default (close-idle-sessions: false), so affinity alone strands fewer users than a
short timeout would.
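The trade-off between affinity and a shared store can be sketched with a small in-process model. Everything here is an assumption made for illustration: the `Replica` class, the `route` function, and the dict standing in for a shared session store are hypothetical and are not CrossConnect code.

```python
import hashlib


class Replica:
    """Hypothetical app replica; sessions live locally unless a shared store is given."""

    def __init__(self, name, shared_store=None):
        self.name = name
        self.local_sessions = {}
        self.shared_store = shared_store  # a dict shared by all replicas, or None

    def _store(self):
        return self.shared_store if self.shared_store is not None else self.local_sessions

    def login(self, session_id, user):
        self._store()[session_id] = user

    def resolve(self, session_id):
        return self._store().get(session_id)  # None means "log in again"


def route(session_id, replicas):
    """Cookie affinity: hash the session cookie onto the list of healthy replicas."""
    digest = hashlib.sha256(session_id.encode()).digest()
    return replicas[int.from_bytes(digest[:8], "big") % len(replicas)]


def survives_replica_loss(use_shared_store):
    shared = {} if use_shared_store else None
    replicas = [Replica(f"app-{i}", shared) for i in range(3)]
    pinned = route("sess-abc", replicas)
    pinned.login("sess-abc", "operator-1")
    # The pinned replica is lost; the load balancer re-routes among the survivors.
    survivors = [r for r in replicas if r is not pinned]
    return route("sess-abc", survivors).resolve("sess-abc") is not None
```

With affinity alone, `survives_replica_loss(False)` returns `False`: the surviving replica has never seen the session. With the shared store it returns `True`, which is the "invisible to users" case.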
Point every replica at the same counter store so the budget applies across the whole fleet. The limiter is in-memory per replica in the shipped build. The code is structured so the bucket store can move behind a port, and a shared version plugs in the first time a deployment needs to scale out (§6). Without it, the real limit multiplies by the number of replicas.
All replicas write to one primary, so they share the
connection budget. Size the primary's max_connections for the sum of every replica's
SPRING_DATASOURCE_HIKARI_MAXIMUM_POOL_SIZE, plus room for replication, maintenance, and admin
sessions. See the connection math in §10.
Reporting, exports, and AI retrieval read a lot of data and can tolerate some delay. Route them to a read replica to keep that load off the primary's write path. The read replica is optional: start without it, and add it once reporting or AI reads begin to compete with interactive traffic.
This is the shared-state trap most teams hit first. The API rate limiter (RateLimitFilter, applied to
/api/v1/*) is a fixed-window counter kept per (tenant, IP) in an in-memory map on each replica. The default is
crossconnect.ratelimit.requests-per-window: 100 over
crossconnect.ratelimit.window-seconds: 60, which is 100 requests per minute per key. The source code
itself is clear that this is a single-instance design: “In-memory for v1, single-instance only. When the
first deploy needs horizontal scale, the bucket store moves behind a port and a Redis impl plugs in.”
The shared-counter implementation is a defined extension point, not yet shipped.
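The multiplication effect is easy to reproduce with a toy fixed-window counter. This is a sketch under stated assumptions, not `RateLimitFilter` itself: the class name, the dict-based store, and the round-robin helper are hypothetical, and only the defaults (100 requests per 60-second window, keyed by tenant and IP) come from the text above.

```python
import time


class FixedWindowLimiter:
    """Fixed-window counter per (tenant, ip) key; illustrative only."""

    def __init__(self, requests_per_window=100, window_seconds=60, store=None):
        self.limit = requests_per_window
        self.window = window_seconds
        # Pass the same dict to several limiters to model a shared counter store.
        # Old windows are never evicted here; a real store would expire them.
        self.store = store if store is not None else {}

    def allow(self, tenant, ip, now=None):
        now = time.time() if now is None else now
        bucket = (tenant, ip, int(now // self.window))
        count = self.store.get(bucket, 0)
        if count >= self.limit:
            return False  # the caller would answer 429 with Retry-After
        self.store[bucket] = count + 1
        return True


def admitted(limiters, total_requests, now=0.0):
    """Spread requests round-robin across replicas and count how many pass."""
    return sum(
        limiters[i % len(limiters)].allow("tenant-a", "10.0.0.1", now)
        for i in range(total_requests)
    )
```

Three limiters with their own stores admit 300 of 1,000 requests in one window. Three limiters built with the same `store` dict admit 100. Setting `requests_per_window=100 // 3` on three independent limiters admits 99, which is the per-replica-share workaround.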
flowchart LR
subgraph NOW["DEFAULT · per-replica counters"]
direction TB
C1["Replica 1
budget 100/min"]
C2["Replica 2
budget 100/min"]
C3["Replica 3
budget 100/min"]
EFF["Effective fleet limit
100 × N per min"]
C1 --> EFF
C2 --> EFF
C3 --> EFF
end
subgraph FIX["SHARED · one counter store"]
direction TB
S1["Replica 1"]
S2["Replica 2"]
S3["Replica 3"]
SS[("Shared counter
one 100/min budget")]
S1 --> SS
S2 --> SS
S3 --> SS
end
classDef app fill:#173a6b,stroke:#0f2a4f,color:#ffffff;
classDef gate fill:#fdf0dd,stroke:#e0892a,color:#173a6b;
classDef store fill:#1797b3,stroke:#0d7d90,color:#ffffff;
class C1,C2,C3,S1,S2,S3 app;
class EFF gate;
class SS store;
The occupancy API setting crossconnect.occupancy.api.rate-per-minute defaults to 120 per tenant.
If you do not add a shared counter, set requests-per-window to one replica's share of the total you want. Per-tenant overrides
(crossconnect.ratelimit.overrides) apply per replica in the same way. Do not leave the default
unexamined at N replicas: the protection quietly weakens as you scale out.
CrossConnect runs roughly twenty @Scheduled background jobs: discovery, reachability probing,
golden-config drift, report delivery, retention purges, and AI-intent expiry among them. Each one is a Spring scheduled
method that fires on a fixedRate or fixedDelay timer. The shipped build has no leader
election, ShedLock, or PostgreSQL advisory lock. On a single node that is correct and simple. At N replicas,
every timer fires on every replica, so each job runs N times per interval. This is the single most important thing
to engineer before you go multi-replica.
flowchart TB
T(["Timer fires
fixedRate / fixedDelay"])
T --> R1["Replica 1
runs the sweep"]
T --> R2["Replica 2
runs the sweep"]
T --> R3["Replica 3
runs the sweep"]
R1 --> DUP{"No leader gate
in the build"}
R2 --> DUP
R3 --> DUP
DUP --> X["N× discovery probes · N× report emails
racing deletes on purge sweeps
N× drift analysis"]
classDef gate fill:#fdf0dd,stroke:#e0892a,color:#173a6b;
classDef app fill:#173a6b,stroke:#0f2a4f,color:#ffffff;
classDef ext fill:#ffffff,stroke:#9aa8c0,color:#173a6b;
class R1,R2,R3 app;
class DUP gate;
class T,X ext;
| Sweep | Default cadence | N-up effect if ungated |
|---|---|---|
| Discovery worker | discovery.interval-ms 5 min (off by default) | N-fold probing of every device; redundant load on the network |
| Reachability collector | health.reachability.interval-ms 2 min | N-fold ICMP/TCP probes; N observations per device per tick |
| Golden-config drift sweep | goldenconfig.drift-sweep-fixed-rate-ms 15 min | Duplicate analysis; N hits on the shared Batfish session |
| Scheduled report sweep | reporting.sweep-fixed-rate-ms 60 s | Duplicate report deliveries (N copies of each email) |
| AI write-intent sweep | ai.intent-sweep-fixed-rate-ms 60 s | Harmless duplication; idempotent expiry of proposals past TTL |
| Staging / audit / webhook / device purges | 24 h each | Replicas race to delete the same rows; redundant, not corrupting |
| Batfish health probe | fixedDelay 60 s | N pings of the sidecar; benign, read-only |
| Multicast / AV-drift / occupancy sweeps | 5 min to 1 h | Duplicate scans and snapshot writes per tick |
Most timers are already held back by configuration that is off by default: discovery
(crossconnect.discovery.enabled: false) and automation
(crossconnect.automation.enabled: false) do not run until you enable them. So a worker-role split is
often just a matter of turning them on for the worker replica and leaving them off everywhere else.
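Where a worker-role split is not enough, the alternative is a lock that only one replica can win per tick. The sketch below shows one common shape for such a gate, a lease row updated with a conditional `UPDATE`. It is an assumption, not something the shipped build contains: the `sweep_lock` table, the function names, and the use of SQLite as a stand-in for the shared PostgreSQL primary are all hypothetical.

```python
import sqlite3


def init_locks(conn):
    conn.execute(
        "CREATE TABLE IF NOT EXISTS sweep_lock ("
        "name TEXT PRIMARY KEY, locked_until REAL NOT NULL, locked_by TEXT)"
    )


def try_acquire(conn, name, replica, now, lease_seconds):
    """Return True only for the one replica that wins the lease for this tick."""
    with conn:  # one transaction: commit on success, roll back on error
        # Make sure the row exists; harmless if another replica created it first.
        conn.execute(
            "INSERT OR IGNORE INTO sweep_lock (name, locked_until, locked_by) "
            "VALUES (?, 0, NULL)",
            (name,),
        )
        # The WHERE clause is the gate: it matches only when the lease has expired.
        cur = conn.execute(
            "UPDATE sweep_lock SET locked_until = ?, locked_by = ? "
            "WHERE name = ? AND locked_until <= ?",
            (now + lease_seconds, replica, name, now),
        )
        return cur.rowcount == 1


def run_sweep(conn, name, replica, now, lease_seconds, job):
    """Run the job on this replica only if it holds the lease."""
    if try_acquire(conn, name, replica, now, lease_seconds):
        job(replica)
        return True
    return False
```

In this model the lease is kept slightly shorter than the sweep cadence (for example 50 s against a 60 s report sweep) so the next tick can be won by any replica. Replica clocks are assumed to be roughly in sync; a real deployment would more likely compare against the database's own clock.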
The Batfish sidecar runs a single shared analysis session. If several /analyze calls hit one session
at once, they line up and thrash a single snapshot, so each call can run past its timeout. CrossConnect handles this
inside each replica with two mechanisms in BatfishConfigBackend. A fair ReentrantLock
queues analyze calls so each one finishes quickly and returns real results, rather than falling back to a rough
estimate after a timeout. A SingleFlight coalescer merges identical-config requests that arrive at the
same time into one computation. Results are cached and addressed by the config's SHA-256 hash in a persistent
finding cache, so repeat questions skip the engine entirely.
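The coalescing idea can be illustrated outside the JVM. The following is a minimal sketch of the single-flight pattern, not the `SingleFlight` class in `BatfishConfigBackend`: the method names and the `followers_joined` counter are hypothetical, and Python's `threading.Lock` is not FIFO-fair the way a fair `ReentrantLock` is.

```python
import threading
import time
from concurrent.futures import Future


class SingleFlight:
    """Collapse concurrent calls that share a key into one computation."""

    def __init__(self):
        self._lock = threading.Lock()
        self._in_flight = {}
        self.followers_joined = 0  # callers that reused a leader's result

    def do(self, key, compute):
        with self._lock:
            future = self._in_flight.get(key)
            leader = future is None
            if leader:
                future = self._in_flight[key] = Future()
            else:
                self.followers_joined += 1
        if leader:
            try:
                future.set_result(compute())
            except BaseException as exc:  # followers must never be left waiting
                future.set_exception(exc)
            finally:
                with self._lock:
                    del self._in_flight[key]  # cleared on completion
        return future.result()  # followers block here until the leader finishes


def demo_coalescing(callers=4):
    flight = SingleFlight()
    engine_lock = threading.Lock()  # serializes engine calls (not fair/FIFO)
    release = threading.Event()
    engine_calls, results = [], []

    def engine(config_text):
        with engine_lock:
            engine_calls.append(config_text)
            release.wait(timeout=5)  # hold the call open so the others pile up
            return len(config_text)

    def call():
        results.append(flight.do("hostname r1", lambda: engine("hostname r1")))

    threads = [threading.Thread(target=call) for _ in range(callers)]
    for t in threads:
        t.start()
    deadline = time.monotonic() + 5
    while flight.followers_joined < callers - 1 and time.monotonic() < deadline:
        time.sleep(0.01)
    release.set()
    for t in threads:
        t.join(timeout=5)
    return len(engine_calls), results
```

`demo_coalescing(4)` runs four concurrent requests for the same config and reaches the engine once; every caller receives the same result. Note that the coalescer lives in process memory, which is why it cannot merge requests arriving on different replicas.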
flowchart LR
subgraph R["Each replica"]
direction TB
REQ["Analyze requests
warmer + UI · per device"]
SF{"SingleFlight
collapse identical configs"}
LK{"Fair ReentrantLock
serialize one at a time"}
REQ --> SF --> LK
end
CACHE[("Finding cache
keyed by config SHA-256")]
BF["Batfish sidecar
single shared session"]
LK -- "cache hit" --> CACHE
LK -- "cache miss · /analyze" --> BF
BF --> CACHE
classDef app fill:#173a6b,stroke:#0f2a4f,color:#ffffff;
classDef gate fill:#fdf0dd,stroke:#e0892a,color:#173a6b;
classDef store fill:#1797b3,stroke:#0d7d90,color:#ffffff;
classDef ext fill:#ffffff,stroke:#9aa8c0,color:#173a6b;
class REQ app;
class SF,LK gate;
class CACHE store;
class BF ext;
If several replicas call /analyze for the same config at once, the sidecar still runs
them one at a time, but there is no merging of identical requests across replicas.
Watch crossconnect.batfish.analyze.lockwait and the
crossconnect.batfish.fallback counter to spot contention before users feel it.
Every in-memory cache in CrossConnect is per-replica; the build has no distributed cache. That is on purpose and almost always fine, because each cache is either time-bounded (it expires on a TTL) or hash-addressed: two replicas may briefly hold different values, but neither holds a wrong one for long.
| Cache | What it holds | Keying / TTL | N-up behaviour |
|---|---|---|---|
| Dashboard rollups (TtlMemo) | Network atlas, hotspots, data-quality scorecard | TTL-bounded, single-flight per key | N independent caches; bounded staleness, can disagree for the TTL window |
| Batfish finding cache | Analysis results | Content-addressed by config SHA-256; no TTL, purged at 30 days | Stable across replicas and restarts for identical config; safe to share on disk |
| SingleFlight coalescer | In-flight expensive computations | Cleared on completion | Coalesces within a replica only; N replicas may each compute the same miss once |
| Discovery run history | Recent discovery runs (UI view) | In-memory deque | Each replica shows only its own runs; expected once sweeps are worker-gated |
Because the finding cache keys on a plain hash of the configuration text, it stays stable across replicas and restarts: the same config always produces the same key. So a shared on-disk cache directory lets replicas reuse each other's analysis results without any coordination. The time-bounded dashboard rollups are the only caches where replicas can visibly disagree, and only for the length of the TTL.
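A content-addressed cache of this kind can be sketched in a few lines. The file layout (one JSON file per hash), the `FindingCache` class, and the sample finding are assumptions made for illustration; only the SHA-256 keying and the shared directory come from the text above.

```python
import hashlib
import json
import os
import tempfile
from pathlib import Path


def cache_key(config_text):
    """The key depends only on the config bytes, so every replica derives the same one."""
    return hashlib.sha256(config_text.encode("utf-8")).hexdigest()


class FindingCache:
    """Hypothetical on-disk cache: one JSON file per config hash in a shared directory."""

    def __init__(self, directory):
        self.directory = Path(directory)
        self.directory.mkdir(parents=True, exist_ok=True)

    def get_or_compute(self, config_text, analyze):
        key = cache_key(config_text)
        path = self.directory / f"{key}.json"
        if path.exists():
            return json.loads(path.read_text()), True  # hit: the engine is skipped
        findings = analyze(config_text)
        # Write to a private temp file, then rename, so readers never see a partial file.
        tmp = self.directory / f"{key}.{os.getpid()}.tmp"
        tmp.write_text(json.dumps(findings))
        tmp.replace(path)
        return findings, False


def demo_shared_cache():
    calls = []

    def analyze(config_text):
        calls.append(config_text)
        return {"findings": ["example-finding"]}

    with tempfile.TemporaryDirectory() as shared_dir:
        replica_a = FindingCache(shared_dir)
        replica_b = FindingCache(shared_dir)
        _, hit_a = replica_a.get_or_compute("hostname r1\n", analyze)
        _, hit_b = replica_b.get_or_compute("hostname r1\n", analyze)
    return hit_a, hit_b, len(calls)
```

The first replica misses and computes; the second finds the file the first one wrote, so the engine runs once. If two replicas miss at the same moment they both compute, but they write identical content under the same name, so neither result is wrong.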
This is the one number that bites at scale. Each replica opens its own HikariCP connection pool
(SPRING_DATASOURCE_HIKARI_MAXIMUM_POOL_SIZE), and the primary has to hold all of those pools at once,
plus a reserve for replication, maintenance, and admin sessions. Keep each replica's pool to what one node needs
(10 to 25, per the Capacity Planning tiers), not what the whole fleet needs.
| Replicas | Pool per replica | App connections | Reserve (repl. + admin) | Primary max_connections |
|---|---|---|---|---|
| 2 | 20 | 40 | 20 | 100 |
| 3 | 20 | 60 | 20 | 100 |
| 4 | 20 | 80 | 25 | 150 |
| 6 | 15 | 90 | 30 | 150 |
| 8 | 15 | 120 | 40 | 200 |
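
The sizing rule behind the table is a single sum: replicas times pool size, plus the reserve, must fit under the primary's `max_connections`. A small check, with hypothetical function names, makes it easy to re-run the arithmetic for other fleet shapes.

```python
def required_connections(replicas, pool_per_replica, reserve):
    """Connections the primary must be able to hold at once."""
    return replicas * pool_per_replica + reserve


def fits(replicas, pool_per_replica, reserve, max_connections):
    return required_connections(replicas, pool_per_replica, reserve) <= max_connections


# (replicas, pool per replica, reserve, primary max_connections), copied from the table.
ROWS = [
    (2, 20, 20, 100),
    (3, 20, 20, 100),
    (4, 20, 25, 150),
    (6, 15, 30, 150),
    (8, 15, 40, 200),
]

for row in ROWS:
    print(row, required_connections(*row[:3]), fits(*row))
```

Every row fits. The four-replica row needs 105 connections, which is why it steps up from 100 to 150.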
Past roughly six replicas, put a connection pooler in front of the primary instead of raising max_connections without limit. A few
hundred real PostgreSQL connections is a practical ceiling, and the pooler lets many replica pools share them. The
Capacity Planning Guide's per-tier max_connections (150 at the Standard tier) is the single-node
figure; this table is what replaces it once you scale out.
flowchart LR
V1["Replica @ v1
serving"] --> DR["Drain
readiness → false
LB stops routing"]
DR --> RP["Replace @ v2
warm caches in background"]
RP --> RD{"Readiness probe
warm complete?"}
RD -- "ready" --> V2["Replica @ v2
serving"]
RD -. "not yet" .-> RP
classDef app fill:#173a6b,stroke:#0f2a4f,color:#ffffff;
classDef gate fill:#fdf0dd,stroke:#e0892a,color:#173a6b;
classDef ext fill:#ffffff,stroke:#9aa8c0,color:#173a6b;
class V1,V2 app;
class RD gate;
class DR,RP ext;
The /actuator/health/readiness probe includes a check that the database is reachable;
/actuator/health/liveness reports process health. The load balancer and orchestrator both watch
readiness, so traffic never reaches a cold or draining replica.
A surge policy of maxUnavailable: 0, maxSurge: 1 brings a new replica up before retiring an old
one, so capacity stays flat through the rollout.
The managed reference (Cloud Run) autoscales up to maxScale: 5 with containerConcurrency: 50.
A new replica warms its caches in the background (DashboardCacheWarmer and, where Batfish is reachable, BatfishWarmer). Readiness gating
hides that from users; time to readiness on restart is roughly nine seconds at the reference workload.
When several replicas start together, each replica's BatfishWarmer queues against the shared
sidecar session at the same time. A guard inside each replica keeps a warm from overlapping itself, but there is no
coordination across replicas. That is another reason to concentrate the heavy Batfish callers on a single worker
role (§7, §8).
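The effect of a surge policy on serving capacity can be checked with a small simulation. This is a simplified model under stated assumptions: a new replica counts toward capacity only once it is ready, each step is instantaneous, and the function name and parameters are hypothetical rather than taken from any orchestrator.

```python
def rolling_upgrade(replicas, max_surge=1, max_unavailable=0):
    """Simulate a rollout; return the ready-replica count after every step."""
    if max_surge == 0 and max_unavailable == 0:
        raise ValueError("a rollout needs surge or unavailability headroom")
    old_ready, new_ready = replicas, 0
    history = [old_ready + new_ready]
    while old_ready > 0 or new_ready < replicas:
        # Surge: start new replicas, never exceeding replicas + max_surge in total.
        headroom = replicas + max_surge - (old_ready + new_ready)
        started = max(0, min(headroom, replicas - new_ready))
        new_ready += started  # counted only once the readiness probe passes
        history.append(old_ready + new_ready)
        # Retire old replicas without dropping below replicas - max_unavailable.
        spare = old_ready + new_ready - (replicas - max_unavailable)
        old_ready -= max(0, min(old_ready, spare))
        history.append(old_ready + new_ready)
    return history
```

With the policy described above (`max_surge=1`, `max_unavailable=0`), a three-replica fleet moves through 3, 4, 3, 4, 3, 4, 3 ready replicas and never dips below three. Swapping to `max_surge=0`, `max_unavailable=1` lets it fall to two during the rollout.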
flowchart TB
subgraph F["Each shared component fails into a bounded state"]
direction TB
AR["App replica down
health check fails"] --> ARX["LB reroutes to healthy replicas
affinity sessions re-auth"]
PGF["PostgreSQL primary down"] --> PGX["Writes pause · cached reads continue
promote hot standby"]
BFF["Batfish down"] --> BFX["available = false · flagged in UI
all other reads / writes unaffected"]
RSF["Counter / session store down"] --> RSX["Rate limit falls open per-replica
shared sessions re-authenticate"]
end
classDef app fill:#173a6b,stroke:#0f2a4f,color:#ffffff;
classDef store fill:#1797b3,stroke:#0d7d90,color:#ffffff;
classDef gate fill:#fdf0dd,stroke:#e0892a,color:#173a6b;
classDef ext fill:#ffffff,stroke:#9aa8c0,color:#173a6b;
class AR,PGF app;
class BFF,RSF gate;
class ARX,PGX,BFX,RSX ext;
| What fails | What happens | What you do |
|---|---|---|
| An app replica | Its health check fails, so the load balancer stops routing to it. Affinity-pinned sessions log in again, or continue without interruption if sessions are in a shared store. | The orchestrator replaces it, and capacity restores automatically. |
| PostgreSQL primary | Writes pause until a standby is promoted. In the meantime, cached reads keep serving from replica memory. | Promote the hot standby and repoint the replicas. Use managed failover where available. |
| Batfish sidecar | Config analysis reports available = false and is flagged in the UI, and analyze calls fall back to a rough estimate (counted on crossconnect.batfish.fallback). All other reads and writes are unaffected. | Restart the engine, and analysis resumes. No data is lost. |
| Counter / session store | Rate limiting falls back to the per-replica in-memory counters so requests are still served; shared sessions fall back to logging in again. | Restore the store (an HA pair makes this rare), and counters and sessions go back to sharing. |
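
The last row describes a fall-open policy: when the shared store is unreachable, each replica keeps serving and counts locally. A minimal sketch of that behaviour follows. The classes, the `StoreUnavailable` exception, and the `fallbacks` counter are hypothetical, and the time window is left out to keep the fallback logic in view.

```python
class StoreUnavailable(Exception):
    """Raised by the hypothetical shared counter client when the store is down."""


class SharedCounter:
    """Stand-in for a shared counter store that can be switched off."""

    def __init__(self):
        self.counts = {}
        self.up = True

    def incr(self, key):
        if not self.up:
            raise StoreUnavailable(key)
        self.counts[key] = self.counts.get(key, 0) + 1
        return self.counts[key]


class FallOpenLimiter:
    def __init__(self, shared, limit=100):
        self.shared = shared
        self.limit = limit
        self.local = {}     # per-replica fallback counters
        self.fallbacks = 0  # hypothetical counter, worth exporting as a metric

    def allow(self, key):
        try:
            count = self.shared.incr(key)
        except StoreUnavailable:
            # Fall open: keep serving, but enforce only this replica's own count.
            self.fallbacks += 1
            self.local[key] = self.local.get(key, 0) + 1
            count = self.local[key]
        return count <= self.limit


def demo_outage(limit=2):
    shared = SharedCounter()
    a, b = FallOpenLimiter(shared, limit), FallOpenLimiter(shared, limit)
    healthy = [a.allow("t1"), b.allow("t1"), a.allow("t1")]
    shared.up = False
    degraded = [a.allow("t1"), b.allow("t1"), a.allow("t1"), a.allow("t1")]
    return healthy, degraded
```

While the store is up, the two replicas share one budget and the third request is rejected. During the outage each replica enforces the limit on its own, so the fleet-wide limit temporarily multiplies again, which is the degraded state the table describes.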
Every replica emits the same metrics it does on a single node, exported at /actuator/prometheus, with
OpenTelemetry tracing always wired and span export gated by a runtime toggle
(CROSSCONNECT_TRACING_EXPORT_ENABLED, off by default). The extra work at scale is
aggregation: tag each series by replica so you can see both the fleet total and any single replica drifting away
from the pack.
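As an illustration of that aggregation step, the sketch below takes one metric sampled per replica and returns both the fleet total and the replicas that sit well above the median. The function name, the 1.5× tolerance, and the sample values are assumptions for the example, not recommended thresholds.

```python
from statistics import median


def fleet_view(samples, tolerance=1.5):
    """samples maps replica name -> one metric value, e.g. active DB connections."""
    mid = median(samples.values())
    drifting = sorted(
        name for name, value in samples.items() if value > mid * tolerance
    )
    return {
        "fleet_total": sum(samples.values()),
        "median": mid,
        "drifting": drifting,
    }


# Example: active pool connections tagged by replica.
view = fleet_view({"app-0": 12, "app-1": 11, "app-2": 19})
print(view)
```

Here the fleet total is what you compare against the primary's `max_connections`, while the `drifting` list points at the single replica that deserves a closer look.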
Per replica: memory (jvm.memory.used vs max), request latency
(http.server.requests p50/p95/p99), cache hit rate, DB pool in-use
(hikaricp.connections.active / pending), and time to readiness on restart.
Shared dependencies: primary connections in use against max_connections, replication lag
to the read replica, counter/session store availability, and Batfish contention
(crossconnect.batfish.analyze, …analyze.lockwait, …fallback).
| Area | Confirm before go-live |
|---|---|
| Single-node sizing | Each replica is sized per the Capacity Planning Guide for its share of the fleet. |
| Rate limiter | Either a shared counter fronts RateLimitFilter so the fleet-wide limit is enforced as configured, or requests-per-window is set to the per-replica share of the intended total. |
| Sessions | Affinity and/or a shared session store configured; a replica loss does not strand signed-in operators beyond your tolerance. |
| Scheduled sweeps | A single-runner gate is in place: a dedicated worker role or a distributed lock. Discovery and report delivery do not run on more than one replica. |
| Batfish | Heavy analyze callers (warmer, drift sweep) concentrated on one worker role; operator replicas read the shared cache; lockwait and fallback are within tolerance. |
| Database | max_connections covers the sum of replica pools plus reserve; a pooler is in place past ~6 replicas. |
| Readiness | The readiness probe gates traffic; a draining or cold replica receives none. |
| Rolling upgrade | Surge policy holds capacity flat; a full rollout completes with no measured downtime. |
| Failure drills | Replica kill, primary failover, Batfish stop, and counter-store stop each behave as §12 describes. |
| Observability | Per-replica and aggregate dashboards are live; shared-dependency and sweep-duplication metrics are alerting. |