Record
A record is the smallest unit you can load from and store in the database. Records come in three types:
- Document
- Vertex
- Edge
Document
Documents are softly typed and are defined by schema types, but you can also use them in schema-less mode. Documents handle fields in a flexible manner. You can easily import and export them in JSON format. For example:
{
"name":"Jay",
"surname":"Miner",
"job":"Developer",
"creations":[{
"name":"Amiga 1000",
"company":"Commodore Inc."
},{
"name":"Amiga 500",
"company":"Commodore Inc."
}]
}
Vertex
In graph databases, the vertices (also: vertexes), or nodes, represent the main entities that hold the information. A vertex can be a patient, a company or a product. Vertices are themselves documents with some additional features. This means they can contain embedded records and arbitrary properties exactly like documents. Vertices are connected with other vertices through edges.
Edge
An edge, or arc, is the connection between two vertices. Edges can be unidirectional or bidirectional. An edge connects exactly two vertices. Edges, like vertices, are also documents with additional features.
For more information on connecting vertices in general, see Relationships below.
Record ID
When ArcadeDB generates a record, it auto-assigns a unique identifier called a Record ID, RID for short.
The syntax for the RID is the pound symbol (#) with the bucket identifier, colon (:), and the position like so:
#<bucket-identifier>:<record-position>.
- bucket-identifier: This number indicates the bucket id to which the record belongs. Positive numbers in the bucket identifier indicate persistent records. You can have up to 2,147,483,648 buckets in a database.
- record-position: This number defines the absolute position of the record in the bucket.
A special case is #-1:-1 symbolizing the null RID.
The prefix character # is mandatory.
Each Record ID is immutable and universal, and is only reused when configured to do so; see bucketReuseSpaceMode.
Additionally, records can be accessed directly through their RIDs at O(1) complexity, which means the lookup speed is constant and unaffected by database size.
For this reason, you don’t need to create a field to serve as the primary key as you do in relational databases.
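The RID syntax described above can be illustrated with a small parser. This is only a sketch in Python: the `RID` tuple, `NULL_RID` and `parse_rid` are hypothetical names made up for this example and are not part of ArcadeDB's API.

```python
from typing import NamedTuple


class RID(NamedTuple):
    """Illustrative stand-in for a record ID; not ArcadeDB's own RID class."""
    bucket_id: int
    position: int

    def __str__(self) -> str:
        return f"#{self.bucket_id}:{self.position}"


NULL_RID = RID(-1, -1)  # the special null RID, written #-1:-1


def parse_rid(text: str) -> RID:
    """Parse '#<bucket-identifier>:<record-position>' into its two numbers."""
    if not text.startswith("#"):
        raise ValueError("the '#' prefix is mandatory")
    bucket, separator, position = text[1:].partition(":")
    if not separator:
        raise ValueError("expected '#<bucket-identifier>:<record-position>'")
    return RID(int(bucket), int(position))


rid = parse_rid("#12:1000000")
print(rid.bucket_id, rid.position)      # 12 1000000
print(parse_rid("#-1:-1") == NULL_RID)  # True
```

Because the two numbers fully identify a record, no separate primary-key field is needed to address it.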
Record Retrieval Complexity
Retrieving a record by RID is of complexity O(1).
This is possible because the RID itself encodes both the file a record is stored in and the position inside it.
In an RID, e.g. #12:1000000, the bucket identifier (here 12) specifies the record’s associated file, while the record position (here 1000000) describes the position inside the file.
Bucket files are organized in pages (with a default size of 64KB), each holding up to a maximum number of records (2048 by default).
To locate a record in a bucket file, divide the record position by the maximum number of records per page: the quotient, rounded down, yields the page (here ⌊1000000 / 2048⌋ = 488), and the remainder gives the position on the page (here 1000000 % 2048 = 576).
In pseudo-code this computation is given by:
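// Integer division rounds down for non-negative positions, so no explicit floor is needed.
int pageId = (int) (rid.getPosition() / maxRecordsInPage);
// The remainder is the record's slot within that page.
int positionInPage = (int) (rid.getPosition() % maxRecordsInPage);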
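As a worked example of the computation above, the following Python sketch locates the record at position 1000000 using the default of 2048 records per page quoted in the text. The function name `locate` is an assumption made for illustration.

```python
MAX_RECORDS_IN_PAGE = 2048  # default value quoted in the text


def locate(position: int, max_records_in_page: int = MAX_RECORDS_IN_PAGE):
    """Return (page id, position in page) for a record position."""
    # divmod yields the rounded-down quotient and the remainder in one step.
    page_id, position_in_page = divmod(position, max_records_in_page)
    return page_id, position_in_page


print(locate(1_000_000))  # (488, 576)
```

The cost of this arithmetic does not depend on how many records the bucket holds, which is why the lookup is O(1).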
Types
The concept of a type is taken from the object-oriented programming paradigm, where it is known as a "class". In ArcadeDB, types define records. A type is closest to the concept of a "Table" in relational databases and a "Class" in an object database.
Types can be schema-less, schema-full, or a mix. They can inherit from other types, creating a tree of types. Inheritance, in this context, means that a subtype extends a parent type, inheriting all of its properties and attributes. Practically, this is done by extending a type or setting a super-type.
Each type has its own buckets (data files). A type can support multiple buckets. When you execute a query against a type, it automatically fetches from all the buckets that are part of the type. When you create a new record, ArcadeDB selects the bucket to store it in using a configurable strategy — round-robin, per-thread, or hashed by a property.
By default, ArcadeDB creates one bucket per type, but it can be configured to create more, for example as many buckets as the host machine has cores (processors); see typeDefaultBuckets.
In this way, CRUD operations can go full speed in parallel with zero contention between CPUs and/or cores.
Having many buckets per type means having more files at file system level.
Check if your operating system has any limitation with the number of files supported and opened at the same time (ulimit for Unix-like systems).
You can query the defined types by executing the following SQL query: select from schema:types.
Buckets
Where types provide you with a logical framework for organizing data, buckets provide physical or in-memory space in which ArcadeDB actually stores the data. Each bucket is one file at file system level. It is comparable to the "collection" in Document databases, the "table" in Relational databases and the "cluster" in OrientDB. You can have up to 2,147,483,648 buckets in a database.
A bucket can only be part of one type. This means two types cannot share the same bucket. Also, sub-types have buckets separate from those of their super-types.
When you create a new type, the CREATE TYPE statement automatically creates the physical buckets (files) that serve as the default location in which to store data for that type.
ArcadeDB forms the bucket names by using the type name + underscore + a sequential number starting from 0. For example, the first bucket for the type Beer will be Beer_0 and the corresponding file in the file system will be Beer_0.31.65536.bucket.
By default ArcadeDB creates one bucket per type.
For massive inserts, performance can be improved by creating additional buckets and hence taking advantage of parallelism, for example by creating one bucket for each CPU core on the server.
Types vs. Buckets in Queries
The combination of types and buckets is very powerful and has a number of use cases. In most cases, you can work with types alone and you will be fine. But if you are able to split your database into multiple buckets, you can address a specific bucket instead of the entire type. Wisely dividing your database into buckets in a way that helps retrieval means you need fewer indexes, or none at all. Indexes slow down insertion and take space on disk and in RAM. In most cases you need indexes to speed up your queries, but in some use cases you can totally or partially avoid indexes and still get good query performance.
One bucket per period
Consider an example where you create a type Invoice with one bucket per year: Invoice_2015 and Invoice_2016.
You can query all invoices using the type as a target with the SELECT statement.
SELECT FROM Invoice
In addition to this, you can filter the result set by the year.
Because the type Invoice includes a year field, you can filter on it through the WHERE clause.
SELECT FROM Invoice WHERE year = 2016
You can also query specific records from a single bucket.
By splitting the type Invoice across multiple buckets (that is, one per year in our example), you can optimize the query by narrowing the potential result set.
SELECT FROM BUCKET:Invoice_2016
By using the explicit bucket instead of the logical type, this query runs significantly faster, because ArcadeDB can narrow the search to the targeted bucket.
No index is needed on the year, because all the invoices for year 2016 will be stored in the bucket Invoice_2016 by the application.
One bucket per location
Like with the example above, we could split our records by location creating one bucket per location. Example:
CREATE BUCKET Customer_Europe
CREATE BUCKET Customer_Americas
CREATE BUCKET Customer_Asia
CREATE BUCKET Customer_Other
CREATE VERTEX TYPE Customer BUCKET Customer_Europe,Customer_Americas,Customer_Asia,Customer_Other
Here we are using the graph model by creating a vertex type, but it’s the same with documents.
Use CREATE DOCUMENT TYPE instead.
Now in your application, store the vertices or documents in the right bucket, based on the customer’s location. You can use any API and set the bucket. If you’re using SQL, this is how you insert a new customer into a specific bucket.
INSERT INTO BUCKET:Customer_Europe CONTENT { firstName: 'Enzo', lastName: 'Ferrari' }
Since a bucket can only be part of one type, when you use the bucket notation with SQL, the type is inferred from the bucket, "Customer" in this case.
When you’re looking for customers based in Europe, you could execute this query:
SELECT FROM BUCKET:Customer_Europe
You can get even more specific by creating a bucket per country, not just per continent, and querying from that bucket. Example:
CREATE BUCKET 'Customer_Europe_Italy'
CREATE BUCKET 'Customer_Europe_Spain'
Now get all the customers that live in Italy:
SELECT FROM BUCKET:Customer_Europe_Italy
You can also specify a list of buckets in your query. This is the query to retrieve both Italian and Spanish customers.
SELECT FROM BUCKET:[Customer_Europe_Italy,Customer_Europe_Spain]
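On the application side, the bucket-targeted queries above are plain strings, so they can be assembled from a list of bucket names. The following Python sketch only builds the query text in the two forms shown in this section (single bucket and bracketed list). The helper names are hypothetical, and sending the query to the database is left out.

```python
def bucket_target(*bucket_names: str) -> str:
    """Build the BUCKET: target for one bucket or for a list of buckets."""
    if not bucket_names:
        raise ValueError("at least one bucket name is required")
    if len(bucket_names) == 1:
        return f"BUCKET:{bucket_names[0]}"
    return "BUCKET:[" + ",".join(bucket_names) + "]"


def select_from_buckets(*bucket_names: str) -> str:
    """Return a SELECT statement that reads only the given buckets."""
    return f"SELECT FROM {bucket_target(*bucket_names)}"


print(select_from_buckets("Customer_Europe_Italy"))
print(select_from_buckets("Customer_Europe_Italy", "Customer_Europe_Spain"))
```

The second call prints the same statement as the Italian and Spanish customers query above.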
Bucket Selection Strategy
When a type is backed by more than one bucket, ArcadeDB has to decide which bucket should receive each new record.
That decision is made by the type’s bucket selection strategy.
The strategy is set on the type (and inherited from the database default), so it is transparent to the application code that issues the INSERT or CREATE.
ArcadeDB ships with three built-in strategies, described in the sections below.
round-robin
The default strategy when a type has more than one bucket. Each new record is assigned to the next bucket in circular order; after the last bucket the counter wraps back to the first.
This is the right choice when you have no specific reason to choose otherwise. Writes are spread evenly across all buckets without any application-level coordination, and the algorithm has no per-record overhead.
The animation shows six successive inserts arriving at the producer and being routed to Bucket 0, Bucket 1, Bucket 2, Bucket 0, Bucket 1, Bucket 2.
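The circular assignment can be sketched in a few lines of Python. The class below is a simplified illustration of the idea, not ArcadeDB's implementation, and the name `RoundRobinSelector` is an assumption.

```python
class RoundRobinSelector:
    """Hand out bucket indexes in circular order."""

    def __init__(self, num_buckets: int) -> None:
        self.num_buckets = num_buckets
        self._next = 0

    def select(self) -> int:
        bucket = self._next
        # Wrap back to the first bucket after the last one.
        self._next = (self._next + 1) % self.num_buckets
        return bucket


selector = RoundRobinSelector(3)
print([selector.select() for _ in range(6)])  # [0, 1, 2, 0, 1, 2]
```

Six inserts over three buckets produce the same sequence as the animation described above.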
thread
The bucket is chosen as (threadId mod numBuckets), so each writer thread always lands in the same bucket. With one bucket per thread, two threads writing concurrently never touch the same bucket file or page, eliminating page-level lock contention between them.
This is the fastest strategy under heavy concurrent inserts. Best results come when the number of buckets is at least equal to the number of writer threads (e.g. one bucket per CPU core; see the typeDefaultBuckets setting).
The trade-off is that record distribution depends on how evenly threads insert: if one thread does most of the work, its bucket grows much faster than the others. Pick this when peak write throughput matters more than balanced bucket sizes.
The animation shows three threads firing inserts at the same instant; each thread’s record always lands in its own dedicated bucket.
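The (threadId mod numBuckets) rule can be tried out with a small Python sketch. To keep the result deterministic, each writer is given an explicit small integer id instead of an operating-system thread id; that choice, and all names below, are assumptions made for illustration.

```python
import threading

NUM_BUCKETS = 3
buckets = [[] for _ in range(NUM_BUCKETS)]  # in-memory stand-ins for bucket files


def select_bucket(thread_id: int, num_buckets: int) -> int:
    """Pick the bucket for a writer thread: threadId mod numBuckets."""
    return thread_id % num_buckets


def writer(thread_id: int, record_count: int) -> None:
    # With as many buckets as threads, this list is appended to by one thread only.
    target = buckets[select_bucket(thread_id, NUM_BUCKETS)]
    for n in range(record_count):
        target.append((thread_id, n))


threads = [threading.Thread(target=writer, args=(tid, 4)) for tid in range(3)]
for t in threads:
    t.start()
for t in threads:
    t.join()

for index, bucket in enumerate(buckets):
    print(index, sorted({tid for tid, _ in bucket}), len(bucket))
```

Each bucket ends up with records from exactly one thread. If one writer inserted far more records than the others, only its bucket would grow, which is the imbalance mentioned above.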
partitioned(<property>)
The bucket is chosen by hashing the value of a named property and mapping the hash to a bucket index. Two records that share the same key always land in the same bucket, deterministically.
The big win is on read. When you later query by that property, ArcadeDB knows exactly which bucket can possibly contain the row, so the index lookup hits a single sub-index instead of scanning every bucket’s index in turn. For point lookups by key this can be much faster than round-robin or thread.
The trade-offs:
- Inserts are slightly slower than the other strategies because of the extra hashing step.
- The distribution of records across buckets only stays balanced if the chosen property has high cardinality and reasonably uniform values. A low-cardinality key concentrates writes into a few buckets.
- The UPSERT clause on UPDATE only guarantees atomicity when the WHERE condition matches a UNIQUE index, so this strategy is most often paired with one.
The animation shows four records with id values 42, 17, 99, 33 flowing through hash(id) % 3 and landing in their deterministic destination buckets (Bucket 0, Bucket 2, Bucket 0, Bucket 0).
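The hash-then-modulo mapping can be sketched in Python. The hash function ArcadeDB actually uses is not stated here, so `stable_hash` below is a stand-in: integers hash to themselves (which happens to reproduce the figures from the animation), and other keys go through CRC32 so the result does not change between runs.

```python
import zlib


def stable_hash(key) -> int:
    """Stand-in hash: integers map to themselves, other keys go through CRC32."""
    if isinstance(key, int):
        return key
    # Python's built-in hash() of strings is salted per process, so it is avoided here.
    return zlib.crc32(str(key).encode("utf-8"))


def select_bucket(key, num_buckets: int) -> int:
    """Map a partition key to a bucket index."""
    return stable_hash(key) % num_buckets


print([select_bucket(record_id, 3) for record_id in (42, 17, 99, 33)])  # [0, 2, 0, 0]
print(select_bucket("acme", 3) == select_bucket("acme", 3))             # True
```

The same key always yields the same bucket index, so a lookup by that key only needs to consult one bucket.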
Setting and changing the strategy
The strategy is inherited from a database-wide default (changeable with ALTER DATABASE) and can be overridden per type with ALTER TYPE … BucketSelectionStrategy. See the SQL reference under Bucket Selection Strategies for the exact syntax.
Schema design 101: choosing a bucket strategy
The default round-robin strategy is safe but unconditionally generic: it never produces hot buckets, but it also gives the planner zero query-time leverage. Most production schemas have an obvious partition axis (tenant, customer, region, organisation) that the default ignores. Picking a partitioned strategy at schema-creation time is one of the highest-leverage decisions you can make for a single-node deployment.
Use this 3-question decision tree before you create a new type:
- Is the data scoped to a tenant, customer, organisation, or other high-cardinality categorical key? If yes, use partitioned(<that-key>). Multi-tenant SaaS workloads, customer-scoped records, and per-organisation document stores all fit here. The planner will skip every bucket that doesn’t match the scope predicate, so a WHERE tenant_id = X query touches one HNSW graph (or one LSM-Tree per-bucket sub-index) instead of all of them. This is the optimisation that makes filterable HNSW work for SaaS use cases (see concepts/vector-search.adoc#vector-search-filterable-hnsw).
- Is the data time-series with a hot recent / cold historical split? If yes, prefer the LSM_TIMESERIES type’s native partitioning (see Time series) - it understands time-window roll-off and downsampling out of the box. Generic types with partitioned(timestamp) can degenerate into a hot bucket for the current period.
- Is the workload single-tenant, mixed-cutting analytics, or a generic document store? Stick with round-robin (the default). The other strategies' query-time benefit doesn’t materialise without a stable partition axis.
Anti-patterns worth calling out:
needsRepartition flag
A schema mutation can invalidate the partition mapping for existing records:
- ALTER TYPE Doc BUCKET +newBucket on a partitioned type changes the modulus: hash(tenant_id) % oldCount no longer matches hash(tenant_id) % newCount for the same value. Existing records stay in their old buckets; new records go to the right bucket. The mapping is now stale.
- Same for removing a bucket.
- ALTER TYPE Doc BucketSelectionStrategy 'partitioned(…)' on a populated type that previously used round-robin puts the records under a hash they were never routed by.
- Changing the strategy’s property set (e.g. partitioned(orgId) → partitioned(orgId, region)) on a populated type has the same effect.
When this happens, ArcadeDB sets a needsRepartition flag on the type. The flag is exposed in SELECT FROM schema:types and shown as a yellow warning banner in Studio’s type-detail panel. While the flag is set, partition-aware bucket pruning is suppressed: queries fan out across every bucket and stay correct, but lose the optimisation. The flag is cleared by a successful REBUILD TYPE … WITH repartition = true, which walks every record, moves the misplaced ones to their hash-target bucket, and then clears the flag.
You can also chain the rebuild atomically with the schema change:
ALTER TYPE Doc BUCKET +newBucket WITH repartition = true
This runs the bucket addition and the rebuild as one logical operation; subsequent queries never see the flag set.
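Why a changed bucket count invalidates the mapping can be seen with a small numeric sketch in Python. It assumes integer keys that hash to themselves and models buckets as plain lists; the function names are hypothetical, and the real rebuild works on stored records rather than on lists of keys.

```python
def bucket_for(key: int, bucket_count: int) -> int:
    """Stand-in for hash(key) % bucketCount with integer keys."""
    return key % bucket_count


def misplaced(keys, old_count: int, new_count: int):
    """Keys whose target bucket differs after the bucket count changes."""
    return [k for k in keys if bucket_for(k, old_count) != bucket_for(k, new_count)]


def repartition(buckets, new_count: int):
    """Re-route every key to the bucket it hashes to under the new count."""
    rebuilt = [[] for _ in range(new_count)]
    for bucket in buckets:
        for key in bucket:
            rebuilt[bucket_for(key, new_count)].append(key)
    return rebuilt


keys = [42, 17, 99, 33, 12]
print(misplaced(keys, 3, 4))                        # [42, 17, 99, 33]
print(repartition([[42, 99, 33, 12], [], [17]], 4)) # [[12], [33, 17], [42], [99]]
```

Going from three to four buckets leaves only one of the five sample keys in the right place, which is why pruning is suppressed until the records have been moved.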
Vertex and edge types: a repartition rebuild deletes each misplaced record from its old bucket and re-inserts it, which gives it a new RID. For graph types this breaks integrity in two ways:
There is no safe in-place fix for either, so ArcadeDB refuses
Verifying it’s working
To check whether a query is benefiting from partition pruning, prefix it with EXPLAIN:
EXPLAIN SELECT FROM Doc WHERE tenant_id = 'acme'
In the resulting plan, look for a GetValueFromIndexEntryStep line with a narrow filtering buckets [N] value (one bucket id rather than the full set). For Cypher, the partition-pruned path is logged in the usedIndexName field as <label>[partition='<bucket>'].
Relationships
ArcadeDB supports three kinds of relationships: connections, referenced and embedded. It can manage relationships in a schema-full or schema-less scenario.
Graph Connections
In a graph database, spanning edges between vertices is one way to express a connection between records. This is the graph model’s natural way of expressing relationships, and it is traversable by the SQL, Gremlin, and Cypher query languages. Internally, ArcadeDB stores a direct (referenced) relationship between edge-connected vertices to ensure fast graph traversals.
Example
In ArcadeDB’s SQL, edges are created via the CREATE EDGE command.
Referenced Relationships
In Relational databases, tables are linked through JOIN commands, which can prove costly on computing resources.
ArcadeDB manages relationships natively without computing a JOIN, by storing a direct LINK to the target object of the relationship: a property on the source record holds the target’s RID, so following the relationship is a single random read rather than a join.
For example, a Customer record at #5:23 can have a property invoice whose value is the link #10:2.
The Invoice record exists independently in its own bucket and keeps its own RID; deleting the customer does not delete the invoice.
Note that referenced relationships differ from edges: references are properties connecting any record while edges are types connecting vertices, and particularly, graph traversal is only applicable to edges.
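The behaviour of a link can be modelled with an in-memory dictionary keyed by RID. This Python sketch is a conceptual model only: the records, the `total` and `name` values and the `follow` helper are made up for illustration and say nothing about ArcadeDB's storage format.

```python
# Records keyed by RID; the customer's "invoice" property holds the target's RID.
records = {
    "#5:23": {"@type": "Customer", "name": "Jay", "invoice": "#10:2"},
    "#10:2": {"@type": "Invoice", "total": 100},
}


def follow(source_rid: str, link_property: str) -> dict:
    """Resolve a link with one direct lookup by RID, with no join involved."""
    target_rid = records[source_rid][link_property]
    return records[target_rid]


print(follow("#5:23", "invoice")["@type"])  # Invoice

# Deleting the source record leaves the referenced record in place.
del records["#5:23"]
print("#10:2" in records)  # True
```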
Embedded Relationships
When using Embedded relationships, ArcadeDB stores the related record inside the container record, as the value of one of its properties. These relationships are stronger than referenced relationships. You can represent them as a UML Composition relationship.
Embedded records do not have their own RID, so they cannot be referenced from other records. They are only accessible by traversing the container record’s property, and they are stored inside the container’s record (not in a separate bucket for the embedded type). Consequently, deleting the container record also deletes its embedded records.
For example, an Account record at #5:23 can have a property address whose value is an embedded Address record.
The embedded record carries its own properties (city, street) but has no RID of its own; it lives entirely within the account.
Because the embedded Address is reachable as address on Account, you query its inner properties through that path.
The query below returns every account whose embedded address has city = 'Rome':
SELECT FROM Account WHERE address.city = 'Rome'
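The dotted path in the query above can be pictured as walking into a nested structure. The Python sketch below models accounts as dictionaries with the address nested inside; the sample data and the `get_path` helper are assumptions for illustration, not ArcadeDB code.

```python
# Each account carries its address inside itself; the address has no RID.
accounts = {
    "#5:23": {"@type": "Account",
              "address": {"@type": "Address", "city": "Rome", "street": "Via Appia"}},
    "#5:24": {"@type": "Account",
              "address": {"@type": "Address", "city": "Milan", "street": "Via Torino"}},
}


def get_path(record: dict, path: str):
    """Follow a dotted path such as 'address.city'; return None if it is missing."""
    value = record
    for part in path.split("."):
        if not isinstance(value, dict) or part not in value:
            return None
        value = value[part]
    return value


matches = [rid for rid, account in accounts.items()
           if get_path(account, "address.city") == "Rome"]
print(matches)  # ['#5:23']
```

Since the address exists only as a value inside the account, removing the account entry removes the address with it.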
1:1 and n:1 Embedded Relationships
ArcadeDB expresses relationships of these kinds using the EMBEDDED type.
1:n and n:n Embedded Relationships
ArcadeDB expresses relationships of these kinds using a list or a map of embedded records, such as:
- LIST: An ordered list of records.
- MAP: An ordered map with strings as keys and records as values; it doesn’t accept duplicate keys.
Inverse Relationships
In ArcadeDB, all edges in the graph model are bidirectional. This differs from the document model, where relationships are always unidirectional, requiring the developer to maintain data integrity. In addition, ArcadeDB automatically maintains the consistency of all bidirectional relationships.
Edge Constraints
ArcadeDB supports edge constraints, which means limiting the admissible vertex types that can be connected by an edge type.
To this end, the implicit metadata properties @in and @out need to be made explicit by creating them.
For example, an edge type HasParts that is supposed to connect only from vertices of type Product to vertices of type Component can be defined by:
CREATE EDGE TYPE HasParts;
CREATE PROPERTY HasParts.`@out` link OF Product;
CREATE PROPERTY HasParts.`@in` link OF Component;
Relationship Traversal Complexity
As a native graph database, ArcadeDB supports index-free adjacency. This means that following a relationship from one record to the next has constant complexity O(1), independent of the graph expanse (database size).
To traverse a graph structure, one needs to follow references stored by the current record. These references are always stored as RIDs, and point not only to incoming and outgoing edges, but also to connected vertices. Internally, references are managed by a stack (also known as LIFO), which returns the latest insertion first. Because not only edges but also connected vertices are stored, neighboring nodes can be reached directly, without going via the connecting edge. This is useful if edges are used purely to connect vertices and do not carry properties themselves.
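Index-free adjacency can be pictured with a toy graph in which every vertex stores the RIDs of its neighbours. The Python sketch below is a simplified model: the RIDs and the `out` property are made up, and the internal LIFO bookkeeping mentioned above is not modelled.

```python
from collections import deque

# Each vertex keeps the RIDs of the vertices it points to.
vertices = {
    "#1:0": {"name": "A", "out": ["#1:1", "#1:2"]},
    "#1:1": {"name": "B", "out": ["#1:3"]},
    "#1:2": {"name": "C", "out": ["#1:3"]},
    "#1:3": {"name": "D", "out": []},
}


def traverse(start: str):
    """Breadth-first walk; every hop is a direct lookup by RID, no index search."""
    visited, queue, order = {start}, deque([start]), []
    while queue:
        rid = queue.popleft()
        order.append(rid)
        for neighbour in vertices[rid]["out"]:
            if neighbour not in visited:
                visited.add(neighbour)
                queue.append(neighbour)
    return order


print(traverse("#1:0"))  # ['#1:0', '#1:1', '#1:2', '#1:3']
```

The cost of each hop depends only on the current vertex's own neighbour list, not on the total number of vertices in the graph.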
See Database for the database lifecycle, URL format, and how to create one.