Indexes

An index is a secondary data structure that speeds up a specific lookup pattern (equality, range, similarity, text search, spatial predicate) on a property — or on a composite of properties — of a type. Without an index, the query engine has to scan every record of the target type and evaluate the filter on each one. With the right index, it jumps straight to the matching records.

Indexes are first-class schema objects in ArcadeDB: they live in their own files, are kept in sync transactionally with the data, and survive restart and replication.

Why (and when) to add an index

Indexes are not free. Each insert, update, and delete also touches every index defined on the affected property, so a type with five indexes pays roughly five times the write cost of a type with none. The break-even point is the read/write ratio of the workload: if you read a property far more than you write it, an index is almost always worth it; if you mostly append and rarely filter, you can skip it.

Add an index when:

You repeatedly filter by a property in WHERE, JOIN, or graph traversal (e.g. WHERE email = ?, MATCH (u {username: $u})).
You sort or paginate by a property and need natural ordering (ORDER BY created_at).
You enforce a uniqueness constraint (e.g. no two users with the same email).
You need a different access pattern than ordered scans — vector similarity, full-text search, or spatial containment.

Skip an index when:

The type is small enough that a full scan is already fast (a few thousand records).
The property is write-hot and almost never queried.
The selectivity is poor — indexing a boolean column with two values rarely beats a scan.

Index Types at a Glance

Index What it does Use it for SQL keyword

Index	What it does	Use it for	SQL keyword
LSM Tree (default)	Ordered key index on disk. O(log N) lookups, native range scans, natural `ORDER BY`.	Equality, range, `ORDER BY`, joins, uniqueness constraints.	`LSM_TREE`
Hash Index	Unordered hash index with O(1) point lookups, no range support.	Primary-key access, JOINs, edge traversal where ordering is irrelevant.	`HASH`
Dense Vector (HNSW)	HNSW graph over `float32` (or `int8`) embeddings; configurable similarity metric.	Semantic similarity search over text/image/audio embeddings (RAG, recommendation).	`LSM_VECTOR`
Sparse Vector	Inverted posting lists on `(dim_id, weight)` pairs with WAND retrieval.	Learned-sparse retrieval (SPLADE, BGE-M3) and BM25-style keyword recall.	`LSM_SPARSE_VECTOR`
Full-Text	Lucene-backed tokeniser + analyser pipeline.	Natural-language text search, fuzzy matching, "more like this", relevance scoring.	`FULL_TEXT`
Geospatial	Quad-key index over WKT geometries.	Spatial predicates (`ST_Within`, `ST_Intersects`, distance, bounding-box queries).	(geometry property)

LSM Tree (default)

Ordered key index on disk. O(log N) lookups, native range scans, natural ORDER BY.

Equality, range, ORDER BY, joins, uniqueness constraints.

LSM_TREE

Hash Index

Unordered hash index with O(1) point lookups, no range support.

Primary-key access, JOINs, edge traversal where ordering is irrelevant.

HASH

Dense Vector (HNSW)

HNSW graph over float32 (or int8) embeddings; configurable similarity metric.

Semantic similarity search over text/image/audio embeddings (RAG, recommendation).

LSM_VECTOR

Sparse Vector

Inverted posting lists on (dim_id, weight) pairs with WAND retrieval.

Learned-sparse retrieval (SPLADE, BGE-M3) and BM25-style keyword recall.

LSM_SPARSE_VECTOR

Full-Text

Lucene-backed tokeniser + analyser pipeline.

Natural-language text search, fuzzy matching, "more like this", relevance scoring.

FULL_TEXT

Geospatial

Quad-key index over WKT geometries.

Spatial predicates (ST_Within, ST_Intersects, distance, bounding-box queries).

(geometry property)

Each index can additionally be made unique (rejects duplicates), case-insensitive (see COLLATE CI), and configured with a null strategy (skip nulls, error on null, or index nulls).

LSM Tree (default)

ArcadeDB’s default index uses a Log-Structured Merge Tree. New keys land in an in-memory write buffer; when the buffer fills, it is flushed to disk as an immutable sorted page. Background compaction periodically merges these pages so that lookups visit only a handful of files even as the index grows. You can also trigger a compaction on demand with COMPACT INDEX, for example to settle an index right after a bulk load. When you need to build a large LSM index over data that is already loaded, the sorted bulk build constructs it directly in compacted form and is dramatically faster than the record-at-a-time build.

Why ArcadeDB picks LSM by default:

Write-friendly: inserts hit memory first, no in-place page updates, no random-write amplification.
Compaction is cheap: data is already sorted, so merging two pages is a linear scan rather than a B+Tree rebalance.
Range and ordered access are native: keys are stored in order, so ORDER BY and WHERE x BETWEEN ? walk the index directly without an in-memory sort.
Space-efficient: keys are packed and compressed, with no half-empty B+Tree pages.

The classic B+Tree (the default in most relational databases) trades these write-side wins for slightly faster random-access reads on a steady-state index. If your workload is read-only and fits in memory, B+Tree can win small constants; under ingest pressure, mixed read/write, or large datasets on slow storage, LSM almost always wins. For the on-disk format and compaction details, see LSM-Tree Internals in the reference.

LSM indexes come in four flavours, picked at CREATE INDEX time:

UNIQUE — rejects duplicate keys.
NOTUNIQUE — allows duplicates; the index returns every record with that key.
FULL_TEXT — see Full-Text Index.
Special variants for vector and spatial workloads — see the rest of this page.

Hash Index

When you only ever do point lookups (WHERE id = ?) and never sort or range-scan, a hash index is the right tool: each lookup is one or two page reads regardless of index size, with none of LSM’s compaction overhead.

ArcadeDB uses extendable hashing, a disk-oriented variant:

Each key is hashed to a binary code; the index maintains a global depth — the number of leading bits used as a prefix.
A directory maps every 2^globalDepth prefix to a bucket page. Multiple directory entries can point at the same bucket while the bucket’s local depth is shallower than the global depth.
A lookup hashes the key, reads the directory entry for the matching prefix, and binary-searches the target bucket page.
When a bucket overflows, only that bucket splits — the local depth bumps by one, entries are redistributed, and the directory doubles only when the local depth exceeds the global depth.

Use case Hash LSM Tree

Use case	Hash	LSM Tree
Point lookup (`WHERE id = ?`)	Best — O(1), 1–2 page reads	O(log N), several page reads
JOIN / edge traversal	Best — constant-time resolution	Good
Range scan (`WHERE age > 30`)	Not supported	Best — ordered iteration
`ORDER BY` on indexed field	Not supported	Best — natural ordering
Steady-state insert throughput	Consistent — no compaction	May dip during compaction

Point lookup (WHERE id = ?)

Best — O(1), 1–2 page reads

O(log N), several page reads

JOIN / edge traversal

Best — constant-time resolution

Good

Range scan (WHERE age > 30)

Not supported

Best — ordered iteration

ORDER BY on indexed field

Not supported

Best — natural ordering

Steady-state insert throughput

Consistent — no compaction

May dip during compaction

Two modes are available: UNIQUE_HASH (rejects duplicates) and NOTUNIQUE_HASH (allows duplicates). The same null strategies as LSM are supported (SKIP, ERROR, INDEX).

Dense Vector Index (LSM_VECTOR)

LSM_VECTOR indexes dense float32 (or quantized int8) embeddings using an HNSW graph layered on top of the LSM-Tree storage backbone. Use it for semantic similarity over text, image, or audio embeddings produced by a model — e.g. retrieval-augmented generation, "find similar products", or content-based recommendation.

What you get out of the box:

Similarity metrics: COSINE (default), DOT_PRODUCT, EUCLIDEAN.
Quantization: NONE, INT8, BINARY, or PRODUCT (index-internal compression that trades a bit of recall for a 4–8× memory reduction and a 2–3× search speed-up).
Wire encoding: store the property as ARRAY_OF_FLOATS or as a BINARY byte-per-dim payload (encoding: INT8) to cut HTTP and bucket size 4× when your model emits int8 natively.
Persistent, transactional, replicated: like every other ArcadeDB index, it survives restart, joins HA replication, and is updated inside the originating transaction.

See Vector Search for the full parameter table (dimensions, efSearch, maxConnections, beam width, multi-layer HNSW, on-graph storage) and worked examples in SQL, Cypher, and Java.

For wide vertices, declare the embedding property as EXTERNAL true so the bytes move out of the main bucket. Traversals that don’t project the vector stop paying for it in cache misses. See Store Embeddings in an EXTERNAL Property.

Sparse Vector Index (LSM_SPARSE_VECTOR)

Available since ArcadeDB v26.5.1.

Where dense vectors collapse meaning into a few hundred floats, sparse vectors keep only the non-zero positions of a high-dimensional vocabulary — typically tens of thousands of token slots with a handful of non-zero weights per document. They are the natural output of learned-sparse retrieval models (SPLADE, BGE-M3, OpenSearch sparse encoders) and of BM25-as-sparse-vector pipelines, and they excel at exact-term recall where dense embeddings get fuzzy.

LSM_SPARSE_VECTOR stores the data as an inverted index — a posting list per dimension keyed by (dim_id, rid, weight) — and retrieves with document-at-a-time WAND, pruning postings whose cumulative upper bound cannot beat the current top-K score.

Per-document storage is two parallel array properties on the same record:

ARRAY_OF_INTEGERS of non-zero dimension ids, and
ARRAY_OF_FLOATS of the matching weights (non-negative).

Pair a dense and a sparse index on the same record for hybrid retrieval: dense vectors recover paraphrased / semantically close matches, sparse vectors anchor on exact terms, and Reciprocal Rank Fusion combines both rankings without score-scale calibration. See Sparse Vector Search and the surrounding hybrid-retrieval examples for the full reference and worked queries.

Full-Text Index

FULL_TEXT is built on top of ArcadeDB’s LSM Tree — the index storage, ACID transactions, WAL, background compaction, replication, and HA all come for free from the underlying LSM implementation. Lucene is layered on top only for what it does best: the analyzer / tokenizer / stemmer pipeline that turns a free-text field into the searchable terms the LSM stores.

The result is a Lucene-quality search experience over text — tokenization, stemming, fuzzy match, "more like this" recommendations, relevance ranking — without the operational cost of running a separate Lucene index file format alongside the database.

Highlights:

LSM-Tree storage under the hood: same crash-safe writes, compaction, and replication path as every other ArcadeDB index — no separate Lucene segments to manage.
Pluggable analyzer per field (standard, stop-word, language-specific, custom) for tokenization and stemming.
Full Lucene query syntax at query time: phrase, proximity, fuzzy (~), wildcards, boolean operators, and caret term boosts (term^N).
BM25 relevance scoring (TF/IDF + document-length normalization, with tunable parameters and per-field boosts; legacy match-count CLASSIC scoring is still available) exposed as the $score query variable, so you can ORDER BY $score DESC and rank results.
"More like this" support to find documents similar to a given record by their token distribution.

For the full create-index syntax, analyzer configuration, query language, scoring, and the MLT (more-like-this) configuration, see Full-Text Index in the data-modeling guide.

Geospatial Index

A geospatial index sits on a property holding a WKT geometry string (Point, LineString, Polygon, MultiPolygon). The index encodes each shape’s bounding region with a configurable quad-key precision so spatial predicates touch only the candidate cells instead of every row.

What it accelerates:

ST_Within, ST_Contains, ST_Intersects and the other standard spatial predicates.
Bounding-box queries and "find everything within N km of point P".
Joins between two spatial types (which features lie inside which polygons).

For the full SQL surface, precision tuning, and worked queries, see Geospatial Index in the data-modeling guide.

Case-Insensitive Indexes (COLLATE CI)

The text indexes above are case-sensitive by default: "Hello" and "hello" are different keys. Add COLLATE CI to a property in the CREATE INDEX statement to fold case at index time. This affects LSM and Hash indexes; it does not apply to vector, sparse-vector, or full-text indexes (which have their own normalization story).

Effect:

Equality lookups fold case: WHERE Name = 'Hello World' matches "Hello World", "HELLO WORLD", "hello world".
Range queries compare the lowercased form.
Unique constraints fold case: a UNIQUE index with COLLATE CI prevents inserting both "Admin" and "admin".
Original values are preserved in the document; only the index key is lowercased (using the root locale).

CREATE INDEX ON <type> (<property> COLLATE CI) <index-type>

Composite indexes can mix policies per property:

CREATE INDEX ON Product (Name COLLATE CI, Code) UNIQUE

Here Name is case-insensitive while Code remains case-sensitive.

The query optimizer recognizes .toLowerCase() on a property with a COLLATE CI index and rewrites the query to use the index directly:

SELECT FROM Product WHERE Name.toLowerCase() = 'hello world'

So existing applications that fold case in queries get the speedup for free.

When to reach for it:

User-facing lookups: usernames, emails, product names — the end-user expects case-insensitive matching.
Deduplication: a uniqueness constraint that ignores case.
Replacing toLowerCase() patterns: the conversion happens once at insert time instead of on every query.

Building a large index over existing data (sorted bulk build)

Available since ArcadeDB v26.8.1. Currently exposed through the Java API only (no SQL keyword yet).

Creating an index on a type that already holds data has to read every existing record and add its key to the index. The default path inserts those keys one at a time through the normal write buffer, exactly as if the records were being indexed live. That is fine for small and medium types, but it does not scale linearly: as the index grows, each insert does more work and the buffer flushes and compacts repeatedly mid-build, so the effective build rate degrades as it runs. On tens of millions of records the index step, not the data load, becomes the bottleneck.

The sorted bulk build is an opt-in, offline build strategy for exactly this situation. Instead of inserting keys one at a time, it:

scans the records and extracts the same index keys the normal path would (scalars, composites, list/map expansions, nulls, and collation are all preserved);
builds bounded, sorted runs in memory and spills them to disk when a configurable memory budget is reached, so the build stays within RAM regardless of dataset size;
k-way merges the runs into one globally sorted key stream; and
writes that stream directly as compacted on-disk pages.

Because the keys arrive already sorted, this replaces a growing sequence of against-a-growing-tree inserts with a bounded external sort followed by a sequential page write. The output is already in compacted form, so you also skip the post-build COMPACT INDEX step. On a 10-million-key index this reaches equivalent compacted output several times faster than the default build followed by a compaction.

database.getSchema().buildTypeIndex("Document", new String[] { "uri" })
    .withType(Schema.INDEX_TYPE.LSM_TREE)
    .withUnique(true)                // both unique and non-unique are supported
    .withBuildMode(IndexBuildMode.SORTED)
    .withBuildMemoryBudget(1L << 30) // optional: RAM budget in bytes; 0 (default) auto-sizes from the heap
    .withBuildMergeFanIn(8)          // optional: number of runs merged per pass (default 8, min 2)
    .create();

All three build options are optional: withBuildMemoryBudget bounds the RAM used before spilling (defaults to a fraction of the heap), withBuildMergeFanIn sets how many sorted runs are merged per pass, and withBuildSpillDirectory(Path) chooses where the temporary spill files are written (defaults to the database directory). They are ignored under IndexBuildMode.DEFAULT.

For a unique index, uniqueness is enforced directly on the globally sorted stream: two records that produce the same key are adjacent after the sort, so a duplicate is detected in a single pass (across all buckets, with no extra sort or in-memory hash set) and the build throws DuplicatedKeyException and publishes nothing. As with online unique indexes, an all-null key is exempt (SQL NULL != NULL), while a partial-null composite key is still constrained; the null strategy (SKIP, ERROR, INDEX) behaves exactly as it does for a normal build.

The sorted mode is an explicit operator choice with strict preconditions. It is restricted to:

the LSM_TREE index type (unique or non-unique),
a single-server, non-replicated database (not part of an HA cluster), and
no active transaction on the calling thread (the build runs in its own exclusive window).

If any precondition is not met, the build fails immediately, before creating any file — it does not silently fall back to the slower default path, so you always get the strategy you asked for. Everything else (online reads and writes, HA replication) is unaffected because unsupported requests are rejected up front. Leave withBuildMode unset (or set it to IndexBuildMode.DEFAULT) to keep the standard record-at-a-time build.

Parallel construction, an automatic warn-and-fall-back mode, HA/replicated operation, and a SQL syntax are planned as follow-ups.

What can be indexed

ArcadeDB indexes can be defined on properties of nearly any type, with the right index choice per data shape:

Property kind Indexable Typical index

Property kind	Indexable	Typical index
Scalar (`STRING`, `INTEGER`, `LONG`, `FLOAT`, `DOUBLE`, `DATE`, `DATETIME`, `BOOLEAN`)	Yes	`LSM_TREE` for equality + range, `HASH` for point-only
Composite of N properties	Yes	`LSM_TREE` (ordered prefix scans) or `HASH`
`STRING` for natural language	Yes	`FULL_TEXT`
`ARRAY_OF_FLOATS` / `BINARY` embedding	Yes	`LSM_VECTOR`
`ARRAY_OF_INTEGERS` + `ARRAY_OF_FLOATS` (paired)	Yes	`LSM_SPARSE_VECTOR`
`STRING` containing WKT geometry	Yes	Geospatial index
`LIST` / `ARRAY_OF_*` (per element)	Yes	`LSM_TREE` indexed by values
`MAP`	Yes	`LSM_TREE` indexed by keys or by values

Scalar (STRING, INTEGER, LONG, FLOAT, DOUBLE, DATE, DATETIME, BOOLEAN)

Yes

LSM_TREE for equality + range, HASH for point-only

Composite of N properties

Yes

LSM_TREE (ordered prefix scans) or HASH

STRING for natural language

Yes

FULL_TEXT

ARRAY_OF_FLOATS / BINARY embedding

Yes

LSM_VECTOR

ARRAY_OF_INTEGERS + ARRAY_OF_FLOATS (paired)

Yes

LSM_SPARSE_VECTOR

STRING containing WKT geometry

Yes

Geospatial index

LIST / ARRAY_OF_* (per element)

Yes

LSM_TREE indexed by values

MAP

Yes

LSM_TREE indexed by keys or by values

Unique and non-unique modes are available for the scalar and composite cases. For the exact SQL command syntax, see CREATE INDEX; to drop or rebuild after a schema change, see DROP INDEX and REBUILD INDEX; to force a compaction on demand (for example after a bulk load), see COMPACT INDEX.

Indexes

Why (and when) to add an index

Index Types at a Glance

LSM Tree (default)

Hash Index

Dense Vector Index (LSM_VECTOR)

Sparse Vector Index (LSM_SPARSE_VECTOR)

Full-Text Index

Geospatial Index

Case-Insensitive Indexes (COLLATE CI)

Building a large index over existing data (sorted bulk build)

What can be indexed

See also