Indexes

An index is a secondary data structure that speeds up a specific lookup pattern (equality, range, similarity, text search, spatial predicate) on a property — or on a composite of properties — of a type. Without an index, the query engine has to scan every record of the target type and evaluate the filter on each one. With the right index, it jumps straight to the matching records.

Indexes are first-class schema objects in ArcadeDB: they live in their own files, are kept in sync transactionally with the data, and survive restart and replication.

Why (and when) to add an index

Indexes are not free. Each insert, update, and delete also touches every index defined on the affected property, so a type with five indexes pays roughly five times the write cost of a type with none. The break-even point is the read/write ratio of the workload: if you read a property far more than you write it, an index is almost always worth it; if you mostly append and rarely filter, you can skip it.

Add an index when:

  • You repeatedly filter by a property in WHERE, JOIN, or graph traversal (e.g. WHERE email = ?, MATCH (u {username: $u})).

  • You sort or paginate by a property and need natural ordering (ORDER BY created_at).

  • You enforce a uniqueness constraint (e.g. no two users with the same email).

  • You need a different access pattern than ordered scans — vector similarity, full-text search, or spatial containment.

Skip an index when:

  • The type is small enough that a full scan is already fast (a few thousand records).

  • The property is write-hot and almost never queried.

  • The selectivity is poor — indexing a boolean column with two values rarely beats a scan.

Index Types at a Glance

Index What it does Use it for SQL keyword

LSM Tree (default)

Ordered key index on disk. O(log N) lookups, native range scans, natural ORDER BY.

Equality, range, ORDER BY, joins, uniqueness constraints.

LSM_TREE

Hash Index

Unordered hash index with O(1) point lookups, no range support.

Primary-key access, JOINs, edge traversal where ordering is irrelevant.

HASH

Dense Vector (HNSW)

HNSW graph over float32 (or int8) embeddings; configurable similarity metric.

Semantic similarity search over text/image/audio embeddings (RAG, recommendation).

LSM_VECTOR

Sparse Vector

Inverted posting lists on (dim_id, weight) pairs with WAND retrieval.

Learned-sparse retrieval (SPLADE, BGE-M3) and BM25-style keyword recall.

LSM_SPARSE_VECTOR

Full-Text

Lucene-backed tokeniser + analyser pipeline.

Natural-language text search, fuzzy matching, "more like this", relevance scoring.

FULL_TEXT

Geospatial

Quad-key index over WKT geometries.

Spatial predicates (ST_Within, ST_Intersects, distance, bounding-box queries).

(geometry property)

Each index can additionally be made unique (rejects duplicates), case-insensitive (see COLLATE CI), and configured with a null strategy (skip nulls, error on null, or index nulls).

LSM Tree (default)

ArcadeDB’s default index uses a Log-Structured Merge Tree. New keys land in an in-memory write buffer; when the buffer fills, it is flushed to disk as an immutable sorted page. Background compaction periodically merges these pages so that lookups visit only a handful of files even as the index grows.

Why ArcadeDB picks LSM by default:

  • Write-friendly: inserts hit memory first, no in-place page updates, no random-write amplification.

  • Compaction is cheap: data is already sorted, so merging two pages is a linear scan rather than a B+Tree rebalance.

  • Range and ordered access are native: keys are stored in order, so ORDER BY and WHERE x BETWEEN ? walk the index directly without an in-memory sort.

  • Space-efficient: keys are packed and compressed, with no half-empty B+Tree pages.

The classic B+Tree (the default in most relational databases) trades these write-side wins for slightly faster random-access reads on a steady-state index. If your workload is read-only and fits in memory, B+Tree can win small constants; under ingest pressure, mixed read/write, or large datasets on slow storage, LSM almost always wins. For the on-disk format and compaction details, see LSM-Tree Internals in the reference.

LSM indexes come in four flavours, picked at CREATE INDEX time:

  • UNIQUE — rejects duplicate keys.

  • NOTUNIQUE — allows duplicates; the index returns every record with that key.

  • FULL_TEXT — see Full-Text Index.

  • Special variants for vector and spatial workloads — see the rest of this page.

Hash Index

When you only ever do point lookups (WHERE id = ?) and never sort or range-scan, a hash index is the right tool: each lookup is one or two page reads regardless of index size, with none of LSM’s compaction overhead.

ArcadeDB uses extendable hashing, a disk-oriented variant:

  1. Each key is hashed to a binary code; the index maintains a global depth — the number of leading bits used as a prefix.

  2. A directory maps every 2globalDepth prefix to a bucket page. Multiple directory entries can point at the same bucket while the bucket’s local depth is shallower than the global depth.

  3. A lookup hashes the key, reads the directory entry for the matching prefix, and binary-searches the target bucket page.

  4. When a bucket overflows, only that bucket splits — the local depth bumps by one, entries are redistributed, and the directory doubles only when the local depth exceeds the global depth.

Use case Hash LSM Tree

Point lookup (WHERE id = ?)

Best — O(1), 1–2 page reads

O(log N), several page reads

JOIN / edge traversal

Best — constant-time resolution

Good

Range scan (WHERE age > 30)

Not supported

Best — ordered iteration

ORDER BY on indexed field

Not supported

Best — natural ordering

Steady-state insert throughput

Consistent — no compaction

May dip during compaction

Two modes are available: UNIQUE_HASH (rejects duplicates) and NOTUNIQUE_HASH (allows duplicates). The same null strategies as LSM are supported (SKIP, ERROR, INDEX).

Dense Vector Index (LSM_VECTOR)

LSM_VECTOR indexes dense float32 (or quantized int8) embeddings using an HNSW graph layered on top of the LSM-Tree storage backbone. Use it for semantic similarity over text, image, or audio embeddings produced by a model — e.g. retrieval-augmented generation, "find similar products", or content-based recommendation.

What you get out of the box:

  • Similarity metrics: COSINE (default), DOT_PRODUCT, EUCLIDEAN.

  • Quantization: NONE, INT8, BINARY, or PRODUCT (index-internal compression that trades a bit of recall for a 4–8× memory reduction and a 2–3× search speed-up).

  • Wire encoding: store the property as ARRAY_OF_FLOATS or as a BINARY byte-per-dim payload (encoding: INT8) to cut HTTP and bucket size 4× when your model emits int8 natively.

  • Persistent, transactional, replicated: like every other ArcadeDB index, it survives restart, joins HA replication, and is updated inside the originating transaction.

See Vector Search for the full parameter table (dimensions, efSearch, maxConnections, beam width, multi-layer HNSW, on-graph storage) and worked examples in SQL, Cypher, and Java.

For wide vertices, declare the embedding property as EXTERNAL true so the bytes move out of the main bucket. Traversals that don’t project the vector stop paying for it in cache misses. See Store Embeddings in an EXTERNAL Property.

Sparse Vector Index (LSM_SPARSE_VECTOR)

Available since ArcadeDB v26.5.1.

Where dense vectors collapse meaning into a few hundred floats, sparse vectors keep only the non-zero positions of a high-dimensional vocabulary — typically tens of thousands of token slots with a handful of non-zero weights per document. They are the natural output of learned-sparse retrieval models (SPLADE, BGE-M3, OpenSearch sparse encoders) and of BM25-as-sparse-vector pipelines, and they excel at exact-term recall where dense embeddings get fuzzy.

LSM_SPARSE_VECTOR stores the data as an inverted index — a posting list per dimension keyed by (dim_id, rid, weight) — and retrieves with document-at-a-time WAND, pruning postings whose cumulative upper bound cannot beat the current top-K score.

Per-document storage is two parallel array properties on the same record:

  • ARRAY_OF_INTEGERS of non-zero dimension ids, and

  • ARRAY_OF_FLOATS of the matching weights (non-negative).

Pair a dense and a sparse index on the same record for hybrid retrieval: dense vectors recover paraphrased / semantically close matches, sparse vectors anchor on exact terms, and Reciprocal Rank Fusion combines both rankings without score-scale calibration. See Sparse Vector Search and the surrounding hybrid-retrieval examples for the full reference and worked queries.

Full-Text Index

FULL_TEXT is built on top of ArcadeDB’s LSM Tree — the index storage, ACID transactions, WAL, background compaction, replication, and HA all come for free from the underlying LSM implementation. Lucene is layered on top only for what it does best: the analyzer / tokenizer / stemmer pipeline that turns a free-text field into the searchable terms the LSM stores.

The result is a Lucene-quality search experience over text — tokenization, stemming, fuzzy match, "more like this" recommendations, relevance ranking — without the operational cost of running a separate Lucene index file format alongside the database.

Highlights:

  • LSM-Tree storage under the hood: same crash-safe writes, compaction, and replication path as every other ArcadeDB index — no separate Lucene segments to manage.

  • Pluggable analyzer per field (standard, stop-word, language-specific, custom) for tokenization and stemming.

  • Full Lucene query syntax at query time: phrase, proximity, fuzzy (~), wildcards, boolean operators.

  • Relevance score is exposed as a synthetic projection so you can ORDER BY score() and rank results.

  • "More like this" support to find documents similar to a given record by their token distribution.

For the full create-index syntax, analyzer configuration, query language, scoring, and the MLT (more-like-this) configuration, see Full-Text Index in the data-modeling guide.

Geospatial Index

A geospatial index sits on a property holding a WKT geometry string (Point, LineString, Polygon, MultiPolygon). The index encodes each shape’s bounding region with a configurable quad-key precision so spatial predicates touch only the candidate cells instead of every row.

What it accelerates:

  • ST_Within, ST_Contains, ST_Intersects and the other standard spatial predicates.

  • Bounding-box queries and "find everything within N km of point P".

  • Joins between two spatial types (which features lie inside which polygons).

For the full SQL surface, precision tuning, and worked queries, see Geospatial Index in the data-modeling guide.

Case-Insensitive Indexes (COLLATE CI)

The text indexes above are case-sensitive by default: "Hello" and "hello" are different keys. Add COLLATE CI to a property in the CREATE INDEX statement to fold case at index time. This affects LSM and Hash indexes; it does not apply to vector, sparse-vector, or full-text indexes (which have their own normalization story).

Effect:

  • Equality lookups fold case: WHERE Name = 'Hello World' matches "Hello World", "HELLO WORLD", "hello world".

  • Range queries compare the lowercased form.

  • Unique constraints fold case: a UNIQUE index with COLLATE CI prevents inserting both "Admin" and "admin".

  • Original values are preserved in the document; only the index key is lowercased (using the root locale).

CREATE INDEX ON <type> (<property> COLLATE CI) <index-type>

Composite indexes can mix policies per property:

CREATE INDEX ON Product (Name COLLATE CI, Code) UNIQUE

Here Name is case-insensitive while Code remains case-sensitive.

The query optimizer recognizes .toLowerCase() on a property with a COLLATE CI index and rewrites the query to use the index directly:

SELECT FROM Product WHERE Name.toLowerCase() = 'hello world'

So existing applications that fold case in queries get the speedup for free.

When to reach for it:

  • User-facing lookups: usernames, emails, product names — the end-user expects case-insensitive matching.

  • Deduplication: a uniqueness constraint that ignores case.

  • Replacing toLowerCase() patterns: the conversion happens once at insert time instead of on every query.

What can be indexed

ArcadeDB indexes can be defined on properties of nearly any type, with the right index choice per data shape:

Property kind Indexable Typical index

Scalar (STRING, INTEGER, LONG, FLOAT, DOUBLE, DATE, DATETIME, BOOLEAN)

Yes

LSM_TREE for equality + range, HASH for point-only

Composite of N properties

Yes

LSM_TREE (ordered prefix scans) or HASH

STRING for natural language

Yes

FULL_TEXT

ARRAY_OF_FLOATS / BINARY embedding

Yes

LSM_VECTOR

ARRAY_OF_INTEGERS + ARRAY_OF_FLOATS (paired)

Yes

LSM_SPARSE_VECTOR

STRING containing WKT geometry

Yes

Geospatial index

LIST / ARRAY_OF_* (per element)

Yes

LSM_TREE indexed by values

MAP

Yes

LSM_TREE indexed by keys or by values

Unique and non-unique modes are available for the scalar and composite cases. For the exact SQL command syntax, see CREATE INDEX; to drop or rebuild after a schema change, see DROP INDEX and REBUILD INDEX.

See also