How fast CrossConnect runs, the settings you can tune, and a sizing model for a deployment. This guide shows where the cost actually lives, why supporting many operators at once stays cheap, and how much CPU, memory, and storage to set aside for the size of your fleet, the number of operators, and the length of history you keep.
This is a sizing reference, not a marketing sheet. Each section explains what is behind a number: the cache and how long it holds (its TTL), the lock in front of the config engine, the pool default, the meter you watch. The sizing tiers are conservative starting points, anchored to the measured reference workload at the back of this guide. Test at your own scale before going live.
CrossConnect is a single application process backed by a PostgreSQL database, plus an optional helper service for config analysis (a Batfish sidecar). Sizing it comes down to finding the few places where work is genuinely expensive and confirming the cheap paths stay cheap. Four facts frame every number here:
Most operator traffic is reads: loading dashboards and lists. The summaries operators ask for repeatedly are served from a short-lived in-memory cache, so adding more simultaneous operators adds CPU, not database load. That is why median dashboard latency in the measured workload stayed in the tens of milliseconds (about 15 ms at 90 concurrent loads, about 34 ms at 150).
Analysis across the whole fleet runs against a single Batfish session, guarded by a fair lock. It is the only place where simultaneous requests wait in line rather than run side by side, which is why it moves onto its own host first as you grow.
CPU and memory are sized once per tier and rarely change. The database volume grows steadily as metric and flow history and config snapshots pile up, so the recurring cost decision is the storage plan, not the compute plan.
Discovery, webhook delivery, and the intent sweep run on small fixed-size worker pools on a fixed schedule. They do not grow with operator traffic, and discovery ships turned off, so they never surprise a sizing estimate.
The figure below is both the architecture and the sizing model. Each arrow is labelled with what makes that path cheap or expensive, because that is what you are sizing for. Cached summaries served from memory keep many operators cheap to support; writes and uncached lists go to PostgreSQL; analysis across the whole fleet is the one path that runs a single request at a time on a single engine session.
flowchart LR
subgraph CLIENTS["OPERATORS & INTEGRATIONS"]
direction TB
OP["Operators
dashboards · lists · UI"]
API["Integrations
/api/v1/* · rate-limited"]
end
subgraph APPNODE["CROSSCONNECT APPLICATION NODE"]
direction TB
APP["App process
Java 21 · Spring Boot 3.4 · Vaadin"]
CACHE[("In-memory caches
memoized · 15-60s TTL")]
POOL["HikariCP pool
sized per tier"]
end
PG[("PostgreSQL
system of record")]
BF["Batfish sidecar
single session · fair lock"]
OP -- "cached read (cheap)" --> APP
API -- "fixed-window limit" --> APP
APP -- "serve hot rollup" --> CACHE
APP -- "write / uncached list" --> POOL
POOL -- "JDBC / TLS" --> PG
CACHE -. "miss: fleet-read 60s TTL" .-> POOL
APP -- "analyze (queues on lock)" --> BF
BF -. "JDBC: per-device config read on miss" .-> PG
classDef app fill:#173a6b,stroke:#0f2a4f,color:#ffffff;
classDef store fill:#e3f3f6,stroke:#1797b3,color:#173a6b;
classDef gate fill:#fdf0dd,stroke:#e0892a,color:#173a6b;
classDef ext fill:#ffffff,stroke:#9aa8c0,color:#173a6b;
class APP app;
class CACHE,PG store;
class BF gate;
class OP,API,POOL ext;
The sizing model follows directly from that path. You size four things, in order: the application node (CPU for handling many operators at once, memory for the data it keeps in flight), the database (connections matched to the pool, RAM for the active working set, disk for how long you keep history), the config engine (memory that grows with the size of the fleet, moved off the app node once it gets heavy), and the storage volume (the one number that keeps growing over time). The later sections cover each one in turn.
A Standard node handled 150 dashboard loads at once with memory near half a gigabyte because the expensive summaries are computed once and then reused. Several caches do this work. The dashboard report memos and both Batfish answer caches use a per-key single-flight lock, so even when many requests for the same cold entry arrive together, the result is computed exactly once, not once per request. The fleet-read cache is a plain short-TTL cache: it holds the rebuilt fleet config map for 60 seconds, which collapses the repeated per-question reads inside a single page load.
| Cache | What it holds | TTL | Why it matters for sizing |
|---|---|---|---|
| Fleet-read cache | Per-tenant device running-configs, rebuilt for each Batfish question | 60 s | A single page asks 5 or more config questions; without this cache the whole fleet would be read from PostgreSQL 5× per page load |
| Report memos (TtlMemo) | Per-tenant dashboard rollups: data-quality, hotspots, maturity, occupancy, mDNS health, AV/QoS reports | 15–30 s | The most-loaded dashboard reads; served from memory between refreshes, so more operators add only CPU |
| Batfish result cache | Config-analysis answers, keyed by tenant + question + config hash | invalidate-on-change | A warm (already-computed) analysis answers in microseconds; the cost is paid once per config version |
| Batfish finding cache | Two layers: in-memory plus on-disk (keyed by analyzer version + config hash) | 30 d on disk | Survives a process restart, so a rolling upgrade does not have to redo cold config analysis |
All report memos are per-tenant. A multi-tenant deployment multiplies cache memory by the number of active tenants, but each entry is a small summary, so the total stays modest next to overall heap.
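The single-flight behaviour described above can be illustrated with a minimal sketch. This is not the CrossConnect implementation: the class name `SingleFlightTtlMemo`, its `get` method, and the injectable clock are hypothetical, and the real memos run inside the Java application. The sketch only shows the technique: a per-key lock with a re-check, so that concurrent requests for the same cold entry trigger one computation.

```python
import threading
import time
from typing import Callable, Dict, Hashable, Optional, Tuple


class SingleFlightTtlMemo:
    """Hypothetical per-key single-flight memo with a fixed TTL (illustration only)."""

    def __init__(self, ttl_seconds: float, clock: Callable[[], float] = time.monotonic):
        self._ttl = ttl_seconds
        self._clock = clock
        self._entries: Dict[Hashable, Tuple[float, object]] = {}  # key -> (expires_at, value)
        self._key_locks: Dict[Hashable, threading.Lock] = {}
        self._guard = threading.Lock()  # protects the two dicts above

    def _fresh(self, key: Hashable) -> Optional[Tuple[float, object]]:
        with self._guard:
            entry = self._entries.get(key)
        if entry is not None and entry[0] > self._clock():
            return entry
        return None

    def get(self, key: Hashable, loader: Callable[[], object]) -> object:
        hit = self._fresh(key)
        if hit is not None:
            return hit[1]  # warm path: no lock on the key, no recomputation
        with self._guard:
            key_lock = self._key_locks.setdefault(key, threading.Lock())
        with key_lock:
            hit = self._fresh(key)  # re-check: another caller may have filled it
            if hit is not None:
                return hit[1]
            value = loader()  # computed once per key per TTL window
            with self._guard:
                self._entries[key] = (self._clock() + self._ttl, value)
            return value
```

With a key such as `("tenant-a", "hotspots")`, eight threads that miss at the same moment run the loader once and share the result; a plain short-TTL cache without the per-key lock would let each of them recompute.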
Formal analysis across the whole fleet (reachability, ACL, and IP-address-conflict questions) runs against a single Batfish session. This is the only place in the request path where simultaneous requests wait in line rather than run side by side, so it pays to know how it behaves under load and why it is the first thing to move off the app node.
flowchart LR
R1["request A
config X"] --> SF{"single-flight
same config?"}
R2["request B
config X"] --> SF
R3["request C
config Y"] --> LK
SF -- "collapse to one" --> LK["fair ReentrantLock
one Batfish session"]
LK -- "analyze
12s request timeout" --> BF["Batfish sidecar
5s connect timeout"]
BF -- "result" --> CACHE[("result + finding cache
warm = microseconds")]
BF -. "error / timeout" .-> FB["heuristic fallback
crossconnect.batfish.fallback++"]
classDef app fill:#173a6b,stroke:#0f2a4f,color:#ffffff;
classDef gate fill:#fdf0dd,stroke:#e0892a,color:#173a6b;
classDef store fill:#e3f3f6,stroke:#1797b3,color:#173a6b;
classDef ext fill:#ffffff,stroke:#9aa8c0,color:#173a6b;
class BF app;
class SF,LK,FB gate;
class CACHE store;
class R1,R2,R3 ext;
Requests for the same config collapse to a single analysis under single-flight; distinct analyses queue on the fair ReentrantLock so each finishes in order. A Batfish error or timeout increments crossconnect.batfish.fallback and answers from a built-in rule-of-thumb estimate, so the engine degrades gracefully instead of blocking.
The fair ReentrantLock lines up distinct analyses so each finishes quickly and the cache fills cleanly. Watch crossconnect.batfish.analyze.lockwait to see how deep the queue gets on the shared session.
PostgreSQL is the single system of record. The application reaches it through a HikariCP connection pool, and the pool size is the main dial between how much the app does at once and how much load lands on the database. The build does not override Hikari, so the framework defaults apply until you set the pool size yourself for each tier.
| HikariCP property | Default (unset) | How to set it |
|---|---|---|
| maximum-pool-size | 10 | SPRING_DATASOURCE_HIKARI_MAXIMUM_POOL_SIZE (size per tier, §10) |
| minimum-idle | 10 (equals max) | leave at max; the app is steady-state, not bursty |
| connection-timeout | 30 s | framework default; pool waits show up as hikaricp.connections.pending |
| max-lifetime | 30 min | framework default |
Set PostgreSQL max_connections to at least (app pool size × number of app replicas) + 20, leaving the +20 for admin work and background jobs. With one app replica on the Standard tier (pool 25) that is 25 + 20 = 45, so the Standard recommendation of 150 leaves plenty of room. Setting the pool larger than Postgres can serve does not help; it just moves the bottleneck from the app to the database.
Flyway (version 10.20.1) owns the database schema; Hibernate runs in validate mode and never changes the schema on its own. Primary keys use UUIDv7 (time-ordered IDs), which keeps related rows close together in the index on insert. That holds down index bloat and extra write work as history piles up.
Background work runs on small, fixed-size worker pools on a fixed schedule. None of it grows with how many operators are online, and the heaviest job (discovery) ships turned off, so it never inflates a sizing estimate unless you turn it on.
| Background job | Cadence / concurrency | Default | Tunable |
|---|---|---|---|
| Discovery sweep | fixed-rate, single-threaded: tenants one at a time, then devices one at a time | 5 min interval, 60 s initial delay, disabled | CROSSCONNECT_DISCOVERY_INTERVAL_MS, …_ENABLED |
| Webhook dispatch | async on a 4-thread worker pool; 6 attempts with exponential backoff (1s→1h cap) | 4 worker threads, 3 s per-call timeout | CROSSCONNECT_WEBHOOKS_WORKER_THREADS, …_TIMEOUT_MS |
| AI intent sweep | fixed-rate sweep of expired confirm-before-commit intents | 60 s rate, 30 s initial delay, 15 min intent TTL | CROSSCONNECT_AI_INTENT_SWEEP_FIXED_RATE_MS |
| Retention purges | scheduled staging / audit / sensor sweeps (chain-aware on the audit trail) | operator-set rolling windows | retention settings |
Integration (API) traffic on /api/v1/* is rate-limited: each tenant-and-IP pair gets a fixed
number of requests per time window. The operator UI is not rate-limited. The default is 100 requests per 60
seconds; raise it to your measured integration rate plus some headroom.
| Setting | Default | Purpose |
|---|---|---|
| CROSSCONNECT_RATELIMIT_REQUESTS_PER_WINDOW | 100 | requests allowed per window per (tenant, IP) |
| CROSSCONNECT_RATELIMIT_WINDOW_SECONDS | 60 | window length in seconds |
| crossconnect.ratelimit.overrides | (none) | comma-separated tenantId=capacity per-tenant overrides |
A rejected request returns HTTP 429 with an RFC-7807 problem-detail body and a Retry-After header.
Today the counter lives in memory on each replica, so the first time you run more than one replica it has to
become shared state behind a common store (see §13).
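The fixed-window limit described above can be sketched as follows. This is an illustration under stated assumptions, not the shipped limiter: the names `FixedWindowLimiter`, `check`, and `parse_overrides` are hypothetical, and the sketch assumes windows aligned to multiples of the window length, which the guide does not specify. It keeps one in-memory counter per (tenant, IP) pair and returns the number of seconds a rejected caller should wait, the value a Retry-After header would carry.

```python
import math
from typing import Dict, Optional, Tuple


def parse_overrides(spec: str) -> Dict[str, int]:
    """Parse a comma-separated tenantId=capacity string into a dict."""
    overrides: Dict[str, int] = {}
    for item in spec.split(","):
        item = item.strip()
        if not item:
            continue
        tenant_id, capacity = item.split("=", 1)
        overrides[tenant_id.strip()] = int(capacity)
    return overrides


class FixedWindowLimiter:
    """Hypothetical in-memory fixed-window limiter keyed by (tenant, IP)."""

    def __init__(self, capacity: int = 100, window_seconds: int = 60,
                 overrides: Optional[Dict[str, int]] = None):
        self.capacity = capacity
        self.window = window_seconds
        self.overrides = overrides or {}
        self._counters: Dict[Tuple[str, str], Tuple[int, int]] = {}  # key -> (window_start, count)

    def check(self, tenant_id: str, ip: str, now: float) -> Tuple[bool, int]:
        """Return (allowed, retry_after_seconds) for one request at time `now`."""
        # Assumption: windows are aligned to multiples of the window length.
        window_start = int(now // self.window) * self.window
        key = (tenant_id, ip)
        start, count = self._counters.get(key, (window_start, 0))
        if start != window_start:  # a new window began: reset the counter
            start, count = window_start, 0
        limit = self.overrides.get(tenant_id, self.capacity)
        if count >= limit:
            retry_after = max(1, math.ceil(start + self.window - now))
            return False, retry_after
        self._counters[key] = (start, count + 1)
        return True, 0
```

Because the counters live in a plain dictionary inside one process, two replicas running this sketch would each allow the full capacity, which is the reason the counter has to move to a shared store before scaling out.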
Five numbers drive the model. Fill these in first; everything after derives from them.
| Input | What to enter | Drives |
|---|---|---|
| A · Managed devices | Fleet today plus 12-month growth (switches, routers, APs, firewalls) | Tier, RAM, engine heap, storage |
| B · Peak concurrent operators | The busy-hour peak, not total accounts. If you do not know it, use 5–10% of your operator headcount | App CPU, DB pool |
| C · History retention | How many months of metrics, flows, and change history to keep. If unsure, start at 6 | Database storage |
| D · Integration request rate | Automation / API calls per minute; 0 if none | Rate-limit window (§7) |
| E · Availability target | Single node, or no-downtime upgrades (multi-replica) | Deployment shape |
Find input A (devices) in the first column. That row is your tier for the rest of the model. If input B (operators) points you to a higher tier than A does, use the higher one: more operators at once add CPU, not memory.
| Tier | Devices (A) | Peak operators (B) | Deployment shape | Total RAM | Total vCPU |
|---|---|---|---|---|---|
| Pilot | up to 500 | up to 10 | Single node (all-in-one) | 8 GB | 4 |
| Small | up to 1,000 | up to 15 | Single node | 16 GB | 4 |
| Standard | up to 5,000 | up to 25 | Single node (larger) | 32 GB | 8 |
| Large | up to 10,000 | up to 50 | Two nodes (engine split out) | 64 GB | 16 |
| X-Large | up to 25,000 | up to 75 | Three+ nodes by role | 128 GB | 32 |
| Very large | 50,000+ | 100+ | Distributed (engage CybrIQ) | 256 GB+ | 64+ |
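The lookup rule for the tier table (take the row that fits input A, then move up if input B needs a higher row) can be written as a short sketch. The function name `pick_tier` is hypothetical, and the sketch makes one assumption the table does not state: anything beyond the X-Large limits is reported as "Very large", including the range between 25,000 and 50,000 devices.

```python
# (tier name, max devices (A), max peak operators (B)), copied from the tier table.
TIERS = [
    ("Pilot", 500, 10),
    ("Small", 1_000, 15),
    ("Standard", 5_000, 25),
    ("Large", 10_000, 50),
    ("X-Large", 25_000, 75),
]


def pick_tier(devices: int, peak_operators: int) -> str:
    """Return the smallest tier that covers both inputs (the higher of the two rows)."""

    def row_for(value: int, column: int) -> int:
        for index, row in enumerate(TIERS):
            if value <= row[column]:
                return index
        return len(TIERS)  # beyond X-Large

    index = max(row_for(devices, 1), row_for(peak_operators, 2))
    # Assumption: everything past X-Large is treated as "Very large".
    return TIERS[index][0] if index < len(TIERS) else "Very large"


print(pick_tier(1_800, 20))  # fleet of ~1,800 devices, 20 peak operators
print(pick_tier(400, 40))    # small fleet, but the operator count drives the tier
```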
For your tier, build each piece to these specs. On Pilot, Small, and Standard they all run together on one node; from Large up they split apart. The figure shows how the layout changes as the config engine, the one place work runs one at a time, is pulled off the shared node.
flowchart TB
subgraph T1["PILOT / SMALL / STANDARD · single node"]
direction LR
A1["App"] --- D1[("PostgreSQL")]
A1 --- B1["Batfish
co-located"]
end
subgraph T2["LARGE · two nodes"]
direction LR
A2["App + PostgreSQL"] --- B2["Batfish
dedicated host"]
end
subgraph T3["X-LARGE+ · distributed by role"]
direction LR
LB["Load balancer"] --> A3["App replicas"]
A3 --> D3[("PostgreSQL
primary + replica")]
A3 --> B3["Batfish
cluster"]
end
T1 -.->|"fleet grows"| T2
T2 -.->|"fleet + operators grow"| T3
classDef app fill:#173a6b,stroke:#0f2a4f,color:#ffffff;
classDef store fill:#e3f3f6,stroke:#1797b3,color:#173a6b;
classDef gate fill:#fdf0dd,stroke:#e0892a,color:#173a6b;
classDef ext fill:#ffffff,stroke:#9aa8c0,color:#173a6b;
class A1,A2,A3 app;
class D1,D3 store;
class B1,B2,B3 gate;
class LB ext;
| Tier | vCPU | RAM | JVM max heap | DB pool | Notes |
|---|---|---|---|---|---|
| Pilot | 2 | 3 GB | -Xmx2g | 10 | Default footprint |
| Small | 2 | 4 GB | -Xmx3g | 15 | |
| Standard | 4 | 6 GB | -Xmx4g | 25 | More CPU for concurrent reporting |
| Large | 6 | 8 GB | -Xmx6g | 40 | Shares a node with the database |
| X-Large | 8 | 12 GB | -Xmx8g | 60 | Consider 2 replicas behind a load balancer |
Leave roughly 30% of node RAM above the heap for off-heap memory, metaspace, and the operating system. The reference node used about 0.5 GB of heap at 155 devices, so these figures are comfortable, not tight.
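As a quick check of the headroom guidance, the sketch below computes the share of application RAM left above the heap for each row of the table. The dictionary and function names are hypothetical; the numbers are copied from the table.

```python
# Tier -> (application RAM in GB, JVM max heap in GB), from the table above.
APP_NODE = {
    "Pilot": (3, 2),
    "Small": (4, 3),
    "Standard": (6, 4),
    "Large": (8, 6),
    "X-Large": (12, 8),
}


def headroom_fraction(ram_gb: float, xmx_gb: float) -> float:
    """Share of RAM left for off-heap memory, metaspace, and the operating system."""
    return (ram_gb - xmx_gb) / ram_gb


for tier, (ram_gb, xmx_gb) in APP_NODE.items():
    print(f"{tier}: {headroom_fraction(ram_gb, xmx_gb):.0%} above the heap")
```

The rows land between 25% and about 33%, which is the "roughly 30%" band the guidance describes.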
| Tier | vCPU | RAM | max_connections | shared_buffers | work_mem |
|---|---|---|---|---|---|
| Pilot | 2 | 3 GB | 50 | 1 GB | 16 MB |
| Small | 2 | 6 GB | 75 | 2 GB | 24 MB |
| Standard | 4 | 12 GB | 150 | 4 GB | 48 MB |
| Large | 6 | 16 GB | 200 | 6 GB | 64 MB |
| X-Large | 8 | 32 GB | 300 | 10 GB | 96 MB |
Connection math from §5: max_connections ≥ (pool × replicas) + 20. Use
JDBC over TLS (an encrypted connection) for any database link that is not on the same machine.
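The connection math can be checked with a few lines. The function names are hypothetical; the formula and the per-tier pool and max_connections values come from this guide.

```python
def min_max_connections(pool_size: int, replicas: int, reserve: int = 20) -> int:
    """Smallest PostgreSQL max_connections for the given app pool and replica count."""
    return pool_size * replicas + reserve


def replicas_supported(max_connections: int, pool_size: int, reserve: int = 20) -> int:
    """How many app replicas a given max_connections can serve at this pool size."""
    return (max_connections - reserve) // pool_size


# Tier -> (app DB pool, recommended max_connections), from the tier tables.
TIER_DB = {
    "Pilot": (10, 50),
    "Small": (15, 75),
    "Standard": (25, 150),
    "Large": (40, 200),
    "X-Large": (60, 300),
}

for tier, (pool, max_conn) in TIER_DB.items():
    print(tier, min_max_connections(pool, 1), replicas_supported(max_conn, pool))
```

For the Standard tier this gives a minimum of 45 for one replica, and the recommended 150 would cover up to five replicas at pool size 25.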
| Tier | Placement | Heap | Why |
|---|---|---|---|
| Pilot / Small | Co-located on the node | 2 to 4 GB | Light analysis load |
| Standard | Co-located, fixed heap | 8 GB (fixed) | Snapshots grow with the fleet |
| Large + | Dedicated host | 16 GB+ | A heavy analysis must not starve the app or database, and it runs one request at a time on a single session |
The engine is optional. If no sidecar is reachable, config-analysis answers fall back to a
built-in rule-of-thumb estimate (counted on crossconnect.batfish.fallback) instead of failing.
Set these on the application container. The first three are sized from §10; the rest are the same for every production deployment. The full tunables list is in Appendix A.
| Setting (environment variable) | Value | Purpose |
|---|---|---|
| JAVA_TOOL_OPTIONS | -Xmx<heap>g (from 10a) | Application heap ceiling |
| SPRING_DATASOURCE_HIKARI_MAXIMUM_POOL_SIZE | DB pool (from 10a) | Concurrent database connections |
| CROSSCONNECT_RATELIMIT_REQUESTS_PER_WINDOW | 100 default, or input D + headroom | API requests per window per tenant per IP |
| CROSSCONNECT_SECURITY_REQUIRE_SECRETS | true | Refuse to start in prod without secrets set |
| CROSSCONNECT_AUTH_ADMIN_SECRET | a strong secret | Locks the admin API |
| CROSSCONNECT_AUTH_SIGNING_SECRET | a strong secret | Stable login tokens and audit-chain signing across restarts |
| SPRING_DATASOURCE_URL / USERNAME / PASSWORD | your database | System-of-record connection |
Set the heap with -Xmx<heap>g from 10a, or with -XX:MaxRAMPercentage=70 and the container memory limit set to the 10a RAM. Keep about 30% of the limit free for off-heap memory, metaspace, and threads. If the limit is too small, the system's out-of-memory killer ends the process; you do not get a tidy heap dump.
Distributed tracing is always built in, but exporting the data is a runtime switch that is off by default, so you do not need a collector to run. When you want traces, turn it on and point OTEL_EXPORTER_OTLP_ENDPOINT at your collector (OpenTelemetry exporter 1.43.0).
The database storage in §10b is a starting point. It grows mainly with three things you control, and over time storage is the biggest cost driver.
| Driver | Effect | Lever |
|---|---|---|
| Metric & flow history (input C) | The fastest grower over time | Set a retention window; older samples are removed on the scheduled purge sweep |
| Change & audit history | Tamper-evident, append-only, hash-chained | A chain-aware purge trims data past the window while keeping the hash chain intact |
| Config snapshots | Grow with fleet size × how often configs change | Held in check by snapshot retention |
The endpoints and settings to wire into Kubernetes, Cloud Run, or your scheduler, plus the meters that warn you when a component is running out of headroom.
| Endpoint | Use | Notes |
|---|---|---|
| GET /actuator/health/liveness | Liveness probe | Says the process is alive; fails only on a broken JVM. Do not tie this to dependencies. |
| GET /actuator/health/readiness | Readiness probe | Says it can serve right now, including that the database is reachable. Send traffic only when this passes. |
| GET /actuator/metrics/{name} | Read one meter | Readable with no extra infrastructure; the quick way to spot-check the Batfish meters. |
| GET /actuator/prometheus | Metrics scrape | Prometheus-format metrics to scrape into your time-series database. Optional. |
Use a startupProbe on /actuator/health/readiness with about a 30 s budget (boot takes about 9 s), then separate readinessProbe and livenessProbe. Keeping them separate stops the orchestrator from killing a healthy pod that is still warming up.
For rolling upgrades, set maxUnavailable: 0, maxSurge: 1. A new replica serves right away; its first dashboard read fills the cache under the single-flight memo (§3), the background warmer fills it ahead of time, and the disk-backed finding cache means config analysis does not have to be redone.
| Meter | Healthy | Alert when | Action |
|---|---|---|---|
| jvm.memory.used / max | < 70% | > 85% sustained | Raise heap / node RAM |
| hikaricp.connections.pending | ~0 | > 0 sustained | Raise pool + Postgres max_connections |
| hikaricp.connections.active | < max | at max sustained | Pool saturated; add connections |
| http.server.requests p95 | at baseline | > 2× baseline | Add CPU or a replica |
| system.cpu.usage | < 80% | > 90% sustained | Add CPU / scale out |
| crossconnect.batfish.analyze.lockwait | low | rising sustained | Requests are queueing on the engine session; move the engine onto its own host |
| crossconnect.batfish.fallback | 0 | > 0 | Config engine is degraded; check that Batfish is reachable and has enough heap |
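The alert thresholds in the meter table can be expressed as a small evaluation function. This is a sketch only: the function name and the dictionary keys are hypothetical, the sample is assumed to be already averaged over a sustained window, and the lock-wait meter is left out because the table gives a trend ("rising") and no fixed threshold for it.

```python
from typing import Dict, List


def capacity_actions(sample: Dict[str, float]) -> List[str]:
    """Map one smoothed meter sample to the actions listed in the meter table."""
    actions: List[str] = []
    if sample["heap_used"] / sample["heap_max"] > 0.85:
        actions.append("Raise heap / node RAM")
    if sample["pool_pending"] > 0:
        actions.append("Raise pool + Postgres max_connections")
    if sample["pool_active"] >= sample["pool_max"]:
        actions.append("Pool saturated; add connections")
    if sample["p95_ms"] > 2 * sample["p95_baseline_ms"]:
        actions.append("Add CPU or a replica")
    if sample["cpu_usage"] > 0.90:
        actions.append("Add CPU / scale out")
    if sample["batfish_fallback"] > 0:
        actions.append("Check that Batfish is reachable and has enough heap")
    return actions


healthy = {
    "heap_used": 0.5, "heap_max": 4.0,       # GB
    "pool_pending": 0, "pool_active": 6, "pool_max": 25,
    "p95_ms": 40, "p95_baseline_ms": 35,
    "cpu_usage": 0.45,                        # fraction of all cores
    "batfish_fallback": 0,
}
print(capacity_actions(healthy))  # an empty list means no action is needed
```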
Under load, watch heap use (as a share of -Xmx), how long requests wait for a database connection (hikaricp.connections.pending near zero), p95 response time on the dashboards and your busiest lists, and crossconnect.batfish.fallback (should stay at 0).
Inputs: 1,200 devices growing to ~1,800 (A), 20 peak operators (B), 9 months retention (C), an integration sync at ~200 req/min (D), single node acceptable (E).
Tier: ~1,800 devices and 20 operators → Standard (covers up to 5,000 / 25).
The deploy (§§10–12):
- App: -Xmx4g, DB pool 25.
- Database: max_connections=150, shared_buffers=4GB, work_mem=48MB.
- Settings: rate limit 300 (200 for the sync + headroom), require-secrets on, admin + signing secrets set.

Validate: load test at 20 concurrent operators plus the 300/min sync; expect sub-100 ms dashboards and near-zero pool waits, with crossconnect.batfish.fallback at 0, then go live.
The sizing above is anchored to a real, measured reference workload (a Standard-class node, 4 vCPU, 155 devices):
| Measure | Result |
|---|---|
| Dashboard home, 90 concurrent loads | ~15 ms (median) |
| Dashboard home, 150 concurrent loads | ~34 ms (median) |
| Device list (hundreds of rows), 60 concurrent | ~66 ms |
| Errors / throttled requests under load | 0 |
| Heap used at 155 devices | ~0.5 GB |
| Time to readiness on restart | ~9 s |
The full method and complete result set are in the CrossConnect Performance Report.
The settings that matter for sizing, with the values the build ships with. The defaults are chosen to work out of the box; a production deployment sets the heap, pool, secrets, and rate limit explicitly.
| Setting | Default | Notes |
|---|---|---|
| Server port | 8080 | SERVER_PORT; response compression on, 1 KB min |
| HikariCP max pool | 10 (framework) | SPRING_DATASOURCE_HIKARI_MAXIMUM_POOL_SIZE; size per tier |
| Hibernate ddl-auto | validate | Flyway 10.20.1 owns schema; Hibernate never mutates it |
| Fleet-read cache TTL | 60 s | per-tenant device configs; plain short-TTL cache |
| Report memo TTL | 15–30 s | per-tenant dashboard rollups; single-flight on miss |
| Batfish finding cache (disk) | 30 d | crossconnect.batfish.cache-dir (default temp dir) |
| Batfish connect / analyze timeout | 5 s / 12 s | fair-lock-serialized single session; fallback on timeout |
| Discovery sweep | 5 min, disabled | CROSSCONNECT_DISCOVERY_INTERVAL_MS / …_ENABLED; 20 s SSH timeout |
| Webhook workers / timeout | 4 / 3 s | CROSSCONNECT_WEBHOOKS_WORKER_THREADS / …_TIMEOUT_MS; 6 attempts, backoff to 1h |
| Rate limit | 100 / 60 s | CROSSCONNECT_RATELIMIT_REQUESTS_PER_WINDOW / …_WINDOW_SECONDS, per (tenant, IP) |
| AI intent TTL / sweep | 15 min / 60 s | confirm-before-commit intents; CROSSCONNECT_AI_INTENT_SWEEP_FIXED_RATE_MS |
| Session token TTL | 8 h | CROSSCONNECT_AUTH_TOKEN_TTL_HOURS |
| Tracing export | off | OTEL_EXPORTER_OTLP_ENDPOINT when enabled (OTel 1.43.0) |
The custom meters most useful for capacity work. Read them at GET /actuator/metrics/{name} with no
extra infrastructure, or scrape them from /actuator/prometheus. The standard framework meters
(jvm.*, hikaricp.*, http.server.requests, system.cpu.usage)
are exposed too.
| Meter | Type | What it tells you |
|---|---|---|
| crossconnect.batfish.analyze | Timer | How long a per-config analyze takes end to end, including timeouts; the engine's latency profile |
| crossconnect.batfish.analyze.lockwait | Timer | Time spent waiting in line for the shared session before analyze; rising means the engine is contended, so move it to its own host |
| crossconnect.batfish.fallback | Counter | Analyses that fell back to the rule-of-thumb estimate (engine error or timeout); a sign of degradation, should be 0 |
| hikaricp.connections.pending | Gauge | Requests waiting for a database connection; above 0 for a sustained period means the pool is too small |
| hikaricp.connections.active | Gauge | Connections currently in use; at the max for a sustained period means the pool is full |
| jvm.memory.used / max | Gauge | How full the heap is; alert above 85% of -Xmx |