High Availability

ArcadeDB supports a High Availability (HA) mode where multiple servers share the same databases via Raft-based replication. All servers of a cluster serve the same set of databases.

Starting with v26.4.1, HA is powered by Apache Ratis, a production-grade implementation of the Raft consensus protocol. The old custom replication protocol has been removed. For the underlying concepts, see High Availability Concepts.

Quick Start

To start an ArcadeDB server with HA enabled, at minimum you need to:

  1. Set arcadedb.ha.enabled to true.

  2. Define the list of peers in the cluster via arcadedb.ha.serverList (if you are deploying on Kubernetes, see Kubernetes). The value is a comma-separated list of entries in the format [name@]host:raftPort:httpPort[:priority]. The optional name@ prefix is described in Named Peers.

  3. Optionally set the local server name via arcadedb.server.name. Each node must have a unique name. If not specified, the default is ArcadeDB_0.

Example: starting one node of a 3-node cluster that runs on a single host (each peer uses different ports):

$ bin/server.sh -Darcadedb.ha.enabled=true \
                -Darcadedb.server.name=node1 \
                -Darcadedb.ha.serverList=localhost:2434:2480,localhost:2435:2481,localhost:2436:2482

The Raft gRPC port is 2434 by default and is configurable via arcadedb.ha.raftPort. The HTTP port in each entry is required so that replicas can forward non-idempotent requests to the leader over HTTP.
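The entry format can be made concrete with a small parser. The following is a minimal Python sketch, not ArcadeDB's own parser: the Peer class and the parse_server_list function are hypothetical names. It assumes every entry carries both ports, as in the format above, and it does not handle IPv6 literals.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Peer:
    name: Optional[str]  # optional name@ prefix, None when absent
    host: str
    raft_port: int
    http_port: int
    priority: int = 0    # optional trailing field, defaults to 0


def parse_server_list(value: str) -> List[Peer]:
    """Parse a comma-separated list of [name@]host:raftPort:httpPort[:priority]."""
    peers = []
    for entry in value.split(","):
        entry = entry.strip()
        if not entry:
            continue
        # rpartition leaves 'at' empty when the entry has no name@ prefix.
        name, at, address = entry.rpartition("@")
        fields = address.split(":")
        if len(fields) not in (3, 4):
            raise ValueError(f"malformed serverList entry: {entry!r}")
        priority = int(fields[3]) if len(fields) == 4 else 0
        peers.append(Peer(name if at else None, fields[0],
                          int(fields[1]), int(fields[2]), priority))
    return peers


peers = parse_server_list(
    "localhost:2434:2480,localhost:2435:2481,localhost:2436:2482")
print([(p.host, p.raft_port, p.http_port) for p in peers])
```

Running a check like this against a configuration value before startup is a cheap way to catch a missing HTTP port or a stray separator.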

The cluster name is arcadedb by default; set arcadedb.ha.clusterName=<name> to run multiple independent clusters on the same network.

Named Peers (from v26.5.1)

By default, each node identifies its slot in arcadedb.ha.serverList from the numeric suffix of arcadedb.server.name (for example, ArcadeDB_0 maps to the first entry, ArcadeDB_1 to the second, and so on). Display names shown in logs and Studio are then synthesized from the local node’s prefix. This positional convention works well for Kubernetes StatefulSets but is awkward when nodes have human-readable names such as frankfurt, london, nyc.

You can add an optional name@ prefix to any entry in arcadedb.ha.serverList to give each peer a stable, human-readable name:

$ bin/server.sh -Darcadedb.ha.enabled=true \
                -Darcadedb.server.name=frankfurt \
                -Darcadedb.ha.serverList=frankfurt@10.0.0.1:2434:2480,london@10.0.0.2:2434:2480,nyc@10.0.0.3:2434:2480

When peer names are configured, each node identifies its slot by matching arcadedb.server.name against the configured peer names first; if no match is found, it falls back to the legacy prefix_N / prefix-N suffix resolution. This means:

  • Server names no longer require a numeric suffix when peer names are used.

  • Display names shown in logs and Studio reflect the configured peer name (e.g. frankfurt (10.0.0.1:2480) instead of ArcadeDB_0 (10.0.0.1:2480)).

  • Mixed clusters work: entries with name@ get their explicit name; entries without it fall back to the existing positional synthesis.

  • Peer names must be unique within the cluster.

The Raft peer ID is still derived from the address (e.g. 10.0.0.1_2434), so changing or omitting peer names does not affect Raft identity stability.
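The resolution order described in this section can be sketched in a few lines of Python. This is an illustration of the documented rules only, not the server's code: resolve_slot and raft_peer_id are hypothetical names, and raising ValueError when nothing matches is an assumption about the failure mode.

```python
import re
from typing import List, Optional


def resolve_slot(server_name: str, peer_names: List[Optional[str]]) -> int:
    """Return the index of the serverList entry that belongs to this node."""
    # 1. Prefer an exact match against the configured name@ prefixes.
    for index, name in enumerate(peer_names):
        if name is not None and name == server_name:
            return index
    # 2. Fall back to the legacy prefix_N / prefix-N numeric suffix.
    match = re.search(r"[_-](\d+)$", server_name)
    if match and int(match.group(1)) < len(peer_names):
        return int(match.group(1))
    raise ValueError(f"cannot map {server_name!r} to a serverList entry")


def raft_peer_id(host: str, raft_port: int) -> str:
    """Peer ID derived from the address, for example 10.0.0.1_2434."""
    return f"{host}_{raft_port}"


# Named peers: the server name matches a configured peer name.
print(resolve_slot("frankfurt", ["frankfurt", "london", "nyc"]))
# Unnamed peers: the numeric suffix selects the entry.
print(resolve_slot("ArcadeDB_1", [None, None, None]))
print(raft_peer_id("10.0.0.1", 2434))
```

Because the peer ID depends only on host and Raft port, renaming a peer changes what is displayed but not the identity Raft uses.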

Architecture

ArcadeDB uses a leader/replica model with Raft consensus. At any time one server holds leadership and accepts writes; the others are replicas that serve reads and stand by for failover.

[Figure: ArcadeDB_0 (Leader) sends append entries over gRPC to ArcadeDB_1 (Replica) and ArcadeDB_2 (Replica); each server keeps its own durable Raft log.]

Each server persists its own Raft log segments under <rootPath>/raft-storage/. The log is used for recovery after restart and to replicate state to peers that fall behind.

Any read (query) can execute on any server in the cluster. All writes must go through the leader: a replica that receives a write transparently forwards it to the current leader via HTTP.

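[Figure: read path. Client A sends (1) a read request to ArcadeDB_1 (Replica) and receives (2) the result.]

[Figure: write path. Client A sends (1) a write request to ArcadeDB_1 (Replica), which (2) proxies it over HTTP to ArcadeDB_0 (Leader); the leader performs (3) the 3-phase commit and (4) replication via Raft, then (5) responds to the replica, which (6) responds to the client.]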

Internally, every committed write follows a replicate-first, commit-after three-phase commit:

  1. Phase 1 (read lock): capture the WAL pages produced by the transaction.

  2. Phase 2 (no lock): append the payload to the Raft log and wait for quorum acknowledgment.

  3. Phase 3 (read lock): apply the pages to the local database and return to the client.

If Phase 2 times out or fails, Phase 3 never runs: no local writes, no divergence. Multiple concurrent transactions are batched into fewer Raft round-trips via group commit (HA_RAFT_GROUP_COMMIT_BATCH_SIZE).
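The ordering guarantee of the three phases can be illustrated with a toy model. This Python sketch is not the engine's implementation: commit, QuorumNotReached, and the replicate callback are hypothetical, a plain mutex stands in for the database read lock, and a list stands in for the local database.

```python
import threading
from typing import Callable, List


class QuorumNotReached(Exception):
    """Raised when Phase 2 fails or times out (hypothetical name)."""


def commit(pages: List[bytes], local_db: List[bytes], lock: threading.Lock,
           replicate: Callable[[List[bytes]], bool]) -> None:
    # Phase 1: capture the pages produced by the transaction under the lock.
    with lock:
        payload = list(pages)
    # Phase 2: no lock is held while waiting for the quorum acknowledgment.
    if not replicate(payload):
        raise QuorumNotReached("quorum not reached; nothing applied locally")
    # Phase 3: apply locally only after the quorum has acknowledged.
    with lock:
        local_db.extend(payload)


db: List[bytes] = []
commit([b"page-1"], db, threading.Lock(), replicate=lambda payload: True)
print(db)
```

The point of the model is the control flow: a failed replicate call raises before Phase 3, so the local list is never touched.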

Write Quorum

A quorum is the number of peers that must acknowledge a write before it commits. Configure it via arcadedb.ha.quorum:

  • majority (default) — a majority of peers must acknowledge (standard Raft).

  • all — every configured peer must acknowledge.

If the configured quorum is not met within arcadedb.ha.quorumTimeout milliseconds, the transaction is rolled back and an error is returned to the client.

Starting with v26.4.1, the legacy quorum values none, one, two, three are no longer supported and will cause a startup error. If you were using any of those values, update your configuration to majority or all before upgrading.
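The arithmetic behind the two quorum modes is small enough to show directly. This Python sketch uses a hypothetical required_acks helper and assumes, as in standard Raft, that the count includes the leader itself.

```python
def required_acks(quorum: str, cluster_size: int) -> int:
    """Number of peers that must acknowledge a write before it commits."""
    if quorum == "majority":
        return cluster_size // 2 + 1
    if quorum == "all":
        return cluster_size
    # Legacy values such as none, one, two, three are rejected.
    raise ValueError(f"unsupported quorum value: {quorum!r}")


for size in (3, 5):
    print(size, required_acks("majority", size), required_acks("all", size))
```

With three peers, majority needs two acknowledgments and therefore tolerates one unavailable peer, while all needs three and tolerates none.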

Read Consistency (from v26.4.1)

When a read runs on a replica, ArcadeDB offers three consistency levels, configurable per-server via arcadedb.ha.readConsistency:

Level | Behavior
eventual | Read locally without waiting. Fastest, but may return stale data: writes committed on the leader after the replica’s last apply are not yet visible.
read_your_writes (default) | The replica waits until the Raft log index corresponding to the client’s most recent write has been applied locally before serving the read.
linearizable | The replica issues a Raft ReadIndex request to the leader and waits until its local applied index reaches the leader’s committed index for that request. Strongest guarantee, highest latency.
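One way to read the three levels is as a rule for which log index a replica must have applied before it answers. The Python sketch below models only that rule; required_applied_index and can_serve are hypothetical names, and the real server obtains the leader's index through a Raft ReadIndex round-trip rather than a function argument.

```python
from typing import Optional


def required_applied_index(level: str, read_after: Optional[int],
                           leader_commit_index: Optional[int]) -> Optional[int]:
    """Index the replica must have applied before serving the read."""
    if level == "eventual":
        return None                 # serve immediately from local state
    if level == "read_your_writes":
        return read_after           # client bookmark; None if the client has none
    if level == "linearizable":
        if leader_commit_index is None:
            raise ValueError("linearizable reads need the leader's commit index")
        return leader_commit_index
    raise ValueError(f"unknown consistency level: {level!r}")


def can_serve(applied_index: int, required: Optional[int]) -> bool:
    return required is None or applied_index >= required


# Replica has applied up to index 40; the client last wrote at 42.
print(can_serve(40, required_applied_index("eventual", 42, 45)))
print(can_serve(40, required_applied_index("read_your_writes", 42, 45)))
```

In the second call the replica would wait until its applied index reaches 42 instead of returning immediately.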

Clients communicate the consistency contract through three HTTP headers:

Header | Direction | Purpose
X-ArcadeDB-Read-Consistency | request | Overrides the server default for this request. Accepts eventual, read_your_writes, or linearizable.
X-ArcadeDB-Read-After | request | Client-supplied bookmark: the Raft commit index the replica must have applied before serving the read. Used to implement read-your-writes across connections.
X-ArcadeDB-Commit-Index | response | Current last-applied commit index, echoed on every response. Clients capture this value and send it back as X-ArcadeDB-Read-After on subsequent reads.

Older clients that sent X-ArcadeDB-Commit-Index on requests are still accepted as a backward-compatible fallback.
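A client that wants read-your-writes across connections has to carry the bookmark itself. The following Python sketch shows the bookkeeping only, with no HTTP library involved: CommitBookmark is a hypothetical helper, and only the header names come from the table above.

```python
from typing import Dict, Mapping, Optional


class CommitBookmark:
    """Tracks the highest commit index seen and replays it as a bookmark."""

    def __init__(self) -> None:
        self.index: Optional[int] = None

    def observe(self, response_headers: Mapping[str, str]) -> None:
        # HTTP header names are case-insensitive, so compare in lower case.
        for key, value in response_headers.items():
            if key.lower() == "x-arcadedb-commit-index":
                seen = int(value)
                if self.index is None or seen > self.index:
                    self.index = seen

    def request_headers(self, consistency: str = "read_your_writes") -> Dict[str, str]:
        headers = {"X-ArcadeDB-Read-Consistency": consistency}
        if self.index is not None:
            headers["X-ArcadeDB-Read-After"] = str(self.index)
        return headers


bookmark = CommitBookmark()
bookmark.observe({"X-ArcadeDB-Commit-Index": "42"})  # headers of a write response
print(bookmark.request_headers())                     # headers for the next read
```

Keeping only the maximum index means responses that arrive out of order cannot move the bookmark backwards.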

Load Shedding and Backpressure (from v26.4.1)

The Raft group-commit queue is bounded (arcadedb.ha.groupCommitQueueSize, default 10000 pending transactions). When a write arrives but the queue is full, the server waits up to arcadedb.ha.groupCommitOfferTimeout ms (default 100) for a slot, and if still full throws ReplicationQueueFullException — a NeedRetryException that clients automatically retry with backoff. This protects the server from OOM under sustained write overload and lets clients back off gracefully.
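For callers that handle retries themselves, the backoff loop looks roughly like the Python sketch below. It is a generic pattern, not the ArcadeDB client's code: NeedRetry is a stand-in for whatever retryable error the client surfaces, and the attempt count and delays are arbitrary illustration values.

```python
import random
import time
from typing import Callable, TypeVar

T = TypeVar("T")


class NeedRetry(Exception):
    """Stand-in for a retryable server error (hypothetical client-side type)."""


def with_retries(operation: Callable[[], T], attempts: int = 5,
                 base_delay: float = 0.05,
                 sleep: Callable[[float], None] = time.sleep) -> T:
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    for attempt in range(attempts):
        try:
            return operation()
        except NeedRetry:
            if attempt == attempts - 1:
                raise  # out of attempts: surface the error to the caller
            # Exponential backoff plus jitter so clients do not retry in lockstep.
            sleep(base_delay * (2 ** attempt) + random.uniform(0, base_delay))
    raise AssertionError("unreachable")
```

The sleep parameter is injectable so the loop can be exercised in tests without real delays.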

Witness / Read-Scale Nodes (from v26.4.1)

Set arcadedb.ha.serverRole=replica to pin a node as a permanent replica. The node’s Raft priority is set to 0 at startup so it is never elected leader, while still receiving all writes and serving reads. Useful for read-scale deployments or for cross-DC witnesses that should not take over as leader.

Automatic Failover

If the leader becomes unreachable, replicas start a new Raft election. A replica with an up-to-date log is elected as the new leader and the cluster resumes serving writes. Pre-vote prevents partitioned nodes from triggering disruptive elections.

Common causes of leader unavailability include:

  • The ArcadeDB server process has been terminated.

  • The physical or virtual host has been shut down or rebooted.

  • Network issues prevent the leader from reaching a majority of peers.

When a replica rejoins after being offline, it catches up via the Raft log automatically; if it has fallen behind past the log purge boundary, the leader streams a full database snapshot over HTTP and the replica installs it atomically.

Offline Cluster Bootstrap (from v26.5.1)

By default, when an HA cluster forms for the first time and one or more peers already hold a non-empty database on local disk, the peers do not download the database over HTTP from a single source. Instead, every peer reports a (fingerprint, lastTxId) tuple per database, the cluster elects the peer with the highest lastTxId as the source via leadership transfer, and any peer whose fingerprint matches the source bootstraps locally in seconds with zero bytes transferred.

This is the recommended path for scaling a single-instance ArcadeDB deployment up to a multi-node HA cluster after a large dataset has already been imported. Operator workflow:

  1. Import the dataset into a single ArcadeDB instance with HA disabled.

  2. Tar the database directory, or take a regular full backup.

  3. Distribute the archive out-of-band to every peer’s filesystem (init container from S3, NFS read-only mount, baked image layer, kubectl cp).

  4. Start every peer with HA enabled (arcadedb.ha.enabled=true) and the usual arcadedb.ha.serverList. No special flag is required; the bootstrap path is on by default via arcadedb.ha.bootstrapFromLocalDatabase=true.

  5. The peers form the Raft group with everyone already at the same state.

Time-to-cluster goes from "minutes-to-hours of HTTP snapshot transfer per replica" to "seconds, because everyone already has the bytes". For the Kubernetes recipe, see Pre-staging the database on every pod.

The bootstrap protocol handles the following cases automatically:

  • Identical pre-staged databases: every peer’s fingerprint matches the elected source, all peers bootstrap locally with zero bytes transferred.

  • Different-age peers: the peer with the highest lastTxId is elected as the source. Older peers replace their local copy with the full snapshot shipped by the leader; subsequent transactions are picked up by native Raft replication.

  • Late newer joiner: a peer with a strictly newer lastTxId than the cluster’s chosen baseline refuses to start and logs a SEVERE message that describes the recovery procedure. This prevents a misconfigured rolling deploy from silently overwriting newer data with older data.

  • Cluster restart: the bootstrap path is gated on every peer’s Raft log being empty. After the cluster has committed at least one Raft entry, restarts use the regular Raft log replay or leader-shipped snapshot path; the bootstrap path does not engage, and stale local data on a single pod cannot cause divergence.
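The source selection for a single database can be sketched as follows. This Python example models only the decision described above: plan_bootstrap is a hypothetical name, the fingerprints are made-up strings, and the tie-break between peers with the same lastTxId (first one in iteration order) is an assumption.

```python
from typing import Dict, Tuple


def plan_bootstrap(reports: Dict[str, Tuple[str, int]]) -> Tuple[str, Dict[str, str]]:
    """reports maps peer name -> (fingerprint, lastTxId) for one database."""
    # The peer with the highest lastTxId becomes the bootstrap source.
    source = max(reports, key=lambda peer: reports[peer][1])
    source_fingerprint = reports[source][0]
    # Matching fingerprint: bootstrap from local files. Otherwise: full snapshot.
    plan = {peer: ("local" if fingerprint == source_fingerprint else "snapshot")
            for peer, (fingerprint, _last_tx_id) in reports.items()}
    return source, plan


source, plan = plan_bootstrap({
    "frankfurt": ("fp-a", 1200),
    "london": ("fp-a", 1200),
    "nyc": ("fp-old", 950),
})
print(source, plan)
```

In this example frankfurt and london start from their local copy, and only nyc needs a snapshot transfer.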

The committed bootstrap baseline (fingerprint and lastTxId per database) is visible in the response to GET /api/v1/cluster and in the Studio cluster dashboard.

Cluster Management REST API (from v26.4.1)

Cluster membership and leadership can be changed at runtime through dedicated REST endpoints. Unless noted otherwise, endpoints require authentication as the root user.

Method | Path | Description
GET | /api/v1/cluster | Return the current cluster status: current leader, peers and their roles, current term, commit and applied indices, and per-follower replication lag.
POST | /api/v1/cluster/peer | Add a peer to the running cluster. Body: {"peerId":"<id>","address":"host:raftPort:httpPort","name":"<friendlyName>"}. The optional name is the human-readable display name shown in logs and Studio. The new peer is seeded with the current user list so authentication works immediately.
DELETE | /api/v1/cluster/peer/{peerId} | Remove a peer from the cluster.
POST | /api/v1/cluster/leader | Transfer leadership. Body: {"peerId":"<target>","timeoutMs":30000}. If peerId is omitted, Raft picks a new leader automatically.
POST | /api/v1/cluster/stepdown | Make the current leader step down. The cluster elects a new leader.
POST | /api/v1/cluster/leave | Gracefully remove this server from the cluster. If the local server is the leader, leadership is transferred first. This endpoint is used by the Kubernetes preStop hook.
POST | /api/v1/cluster/verify/{database} | Compare component file checksums for the given database across all peers. Useful to confirm that all nodes have converged.
GET | /api/v1/ha/snapshot/{database} | Leader-only: stream a ZIP of the database for follower catch-up. Requires the cluster token and is used internally by snapshot recovery.

Example: add a new peer at runtime.

$ curl -u root:<password> \
       -X POST http://leader:2480/api/v1/cluster/peer \
       -H 'Content-Type: application/json' \
       -d '{"peerId":"node4","address":"10.0.0.4:2434:2480"}'

Adding a peer with a human-readable name (from v26.5.1):

$ curl -u root:<password> \
       -X POST http://leader:2480/api/v1/cluster/peer \
       -H 'Content-Type: application/json' \
       -d '{"peerId":"10.0.0.4_2434","address":"10.0.0.4:2434:2480","name":"frankfurt"}'
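The same endpoints can be driven from a script. The Python sketch below only builds the requests with the standard library and leaves sending to the caller; cluster_request is a hypothetical helper, leader:2480 and <password> are placeholders as in the curl examples, and the target peerId is an illustrative value.

```python
import base64
import json
import urllib.request
from typing import Optional


def cluster_request(base_url: str, method: str, path: str, user: str,
                    password: str, body: Optional[dict] = None) -> urllib.request.Request:
    """Build an authenticated request for a cluster management endpoint."""
    credentials = base64.b64encode(f"{user}:{password}".encode()).decode()
    headers = {"Authorization": f"Basic {credentials}"}
    data = None
    if body is not None:
        data = json.dumps(body).encode()
        headers["Content-Type"] = "application/json"
    return urllib.request.Request(base_url + path, data=data,
                                  headers=headers, method=method)


# Cluster status.
status = cluster_request("http://leader:2480", "GET", "/api/v1/cluster",
                         "root", "<password>")
# Leadership transfer to a specific peer (illustrative peerId).
transfer = cluster_request("http://leader:2480", "POST", "/api/v1/cluster/leader",
                           "root", "<password>",
                           {"peerId": "10.0.0.2_2434", "timeoutMs": 30000})
# To send: urllib.request.urlopen(status, timeout=10)
print(status.get_method(), status.full_url)
```

Separating request construction from sending keeps the example usable without a running cluster.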

Studio Cluster Dashboard (from v26.4.1)

ArcadeDB Studio exposes a Cluster tab (visible when HA is enabled) that displays the current leader, peer roles, Raft term and commit index, per-follower replication lag, and provides buttons for leadership transfer, peer add/remove, and database verification. See Studio.

Security

Cluster token

Inter-node HTTP forwarding (replica → leader proxy, snapshot downloads) is authenticated with a shared cluster token sent as the X-ArcadeDB-Cluster-Token HTTP header. If arcadedb.ha.clusterToken is not set explicitly, the token is auto-generated on first startup and persisted under raft-storage/; all subsequent nodes must start with the same token (or with arcadedb.ha.clusterToken unset so they can read it from the same storage). For production deployments, set the token explicitly so that every node starts with the same value.
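If you set the token explicitly, it should be a long random value. The Python sketch below generates one with the standard library; new_cluster_token is a hypothetical helper, and the assumption that any URL-safe random string is an acceptable token is not confirmed by this page.

```python
import secrets


def new_cluster_token(num_bytes: int = 32) -> str:
    """Return a URL-safe random string built from num_bytes of entropy."""
    return secrets.token_urlsafe(num_bytes)


token = new_cluster_token()
# Pass the same value to every node in the cluster.
print(f"-Darcadedb.ha.clusterToken={token}")
```

Generate the value once and distribute it through your secret management of choice rather than generating it per node.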

Peer allowlist

Inbound Raft gRPC connections are filtered against the DNS-resolved hosts in arcadedb.ha.serverList; connections from other addresses are rejected. This closes the "any host that knows the port can inject log entries" vector. It is not a substitute for mTLS: deploy HA behind a private network, NetworkPolicy, or service mesh on untrusted networks.
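The allowlist idea can be sketched in Python as a set of resolved addresses that inbound connections are checked against. This is an illustration of the concept rather than the server's filter: resolve_allowlist and is_allowed are hypothetical names, and the example uses a numeric address so that it does not depend on DNS.

```python
import socket
from typing import Iterable, Set


def resolve_allowlist(hosts: Iterable[str]) -> Set[str]:
    """Resolve each configured host to the set of IP addresses it maps to."""
    allowed: Set[str] = set()
    for host in hosts:
        for info in socket.getaddrinfo(host, None):
            allowed.add(info[4][0])  # sockaddr tuple: the address is first
    return allowed


def is_allowed(remote_address: str, allowed: Set[str]) -> bool:
    return remote_address in allowed


allowed = resolve_allowlist(["127.0.0.1"])
print(is_allowed("127.0.0.1", allowed), is_allowed("203.0.113.9", allowed))
```

A filter like this only checks the source address, which is why it complements network isolation or mTLS and does not replace them.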

Kubernetes

For Kubernetes deployments using the official Helm chart — including the StatefulSet + headless Service pattern, auto-join on scale-up, and the preStop auto-leave hook — see High Availability on Kubernetes in the Kubernetes guide.

Troubleshooting

Performance: insertion is slow

ArcadeDB uses an optimistic concurrency model: if two threads try to update the same page, the first wins, the second throws a ConcurrentModificationException and the client retries (configurable number of times). In HA mode the retry window is wider because file locks are held during the Raft round-trip.

If you are inserting many records in parallel, allocate one bucket per thread to eliminate contention. Example for the vertex type User:

ALTER TYPE User BucketSelectionStrategy `thread`

With enough buckets, parallel insertions avoid page contention entirely and do not hit the retry path.

Verbose HA logging

Enable detailed HA logging for diagnostics via arcadedb.ha.logVerbose:

  • 0 — off (default).

  • 1 — basic: elections, leader changes, peer membership.

  • 2 — detailed: commands, WAL replication, schema.

  • 3 — trace: every state machine apply.

HA Settings

The following settings control HA behavior. A complete list of all HA parameters is available in Server Settings.

Setting | Description | Default Value
arcadedb.ha.enabled | Enables HA for this server | false
arcadedb.ha.clusterName | Cluster name. Useful when running multiple clusters in the same network | arcadedb
arcadedb.ha.serverList | Comma-separated list of peers in the format [name@]host:raftPort:httpPort[:priority]. The optional name@ prefix (since v26.5.1) gives the peer a human-readable name used in logs and Studio; see Named Peers. The optional priority (integer, default 0) prefers higher-valued nodes during elections. Example: frankfurt@localhost:2434:2480:10,london@10.0.0.2:2434:2480:0 | (empty)
arcadedb.ha.raftPort | Default Raft gRPC port used when an entry in serverList does not specify one | 2434
arcadedb.ha.quorum | Write quorum: majority or all | majority
arcadedb.ha.quorumTimeout | Timeout in ms waiting for the quorum acknowledgment | 10000
arcadedb.ha.readConsistency | Default read consistency for follower reads: eventual, read_your_writes, linearizable | read_your_writes
arcadedb.ha.electionTimeoutMin / arcadedb.ha.electionTimeoutMax | Minimum and maximum Raft election timeout in ms. Increase for WAN clusters | 2000 / 5000
arcadedb.ha.serverRole | Node role: any (default) or replica (pinned follower, never elected leader) | any
arcadedb.ha.snapshotThreshold | Number of Raft log entries after which the leader takes a snapshot | 100000
arcadedb.ha.bootstrapFromLocalDatabase | When true (default), at first cluster formation peers exchange (fingerprint, lastTxId) per database and the highest-lastTxId peer is elected as the bootstrap source. See Offline Cluster Bootstrap. | true
arcadedb.ha.bootstrapTimeoutMs | Maximum time in ms the bootstrap leader waits for every peer in serverList to report its bootstrap state before proceeding. | 120000
arcadedb.ha.groupCommitQueueSize / arcadedb.ha.groupCommitOfferTimeout | Bounded queue size and offer timeout (ms) for Raft group-commit backpressure | 10000 / 100
arcadedb.ha.raftPersistStorage | If true, the Raft storage directory is preserved across restarts, enabling rejoin without a full snapshot resync | false
arcadedb.ha.stopServerOnReplicationFailure | If true, the JVM exits after exhausting step-down retries on a phase-2 replication failure. Default false keeps the server up and logs CRITICAL | false
arcadedb.ha.clusterToken | Shared secret for inter-node authentication. If empty, auto-generated at first startup and persisted under raft-storage/ | (empty)
arcadedb.ha.logVerbose | Verbose HA logging: 0=off, 1=basic, 2=detailed, 3=trace | 0
arcadedb.ha.k8s | The server is running inside Kubernetes (enables auto-join) | false
arcadedb.ha.k8sSuffix | DNS suffix used to reach the other servers in Kubernetes, e.g. .arcadedb.default.svc.cluster.local | (empty)