On buckets, cpu cores, select query performances and isolation levels #994

vic0824 · 2023-04-02T09:44:43Z

When a Document Type is created, by default ArcadeDB creates as many buckets as there are CPU cores.
I have read that this speeds up parallel insertions, because two insert queries can be performed by two different threads that write to different buckets without contention. This is easy to understand if the bucket selection strategy is round-robin, because we can be sure that two consecutive inserts will write to two different buckets. With the partitioned strategy it is less obvious: can we still be sure that two consecutive inserts will write to different buckets? With the thread strategy it is hard to see how it works; I guess one needs to know the implementation details.
What I don't understand is whether something similar happens with select queries: when a select query is performed, does the engine scan all the buckets in parallel (assuming the type was created with the default of 1 bucket per core)? If insertions are guaranteed to always be performed from the same thread, would select performance improve if the type were created with only 1 bucket (which means that the engine has to open only 1 file instead of n)? Or is read performance always better when multiple buckets are used?
Finally, does it make sense to talk about isolation levels in ArcadeDB? Section 8.5, Batch, of the manual specifies that a batch can be executed with the READ_COMMITTED or REPEATABLE_READ isolation level, but isolation levels are not discussed anywhere else in the document, so it is not clear what the default isolation level is for normal queries (select, insert, update, delete).
The examples in the Isolation section seem to suggest that the default isolation level is REPEATABLE_READ, but this is not explicitly stated anywhere in the document.
In my project, I have several clients that need to execute some tasks stored in the database, and I have to make sure that only one client "books" any given task. With a relational database, I would execute the transaction (select the next available task and change its status to signify that it has been booked) at the SERIALIZABLE level, to prevent two different threads from booking the same task, but with ArcadeDB I have simulated this by explicitly synchronizing the read and write threads on a common object. I wonder if there is a "native" way of doing this with ArcadeDB?
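
For illustration, the synchronization workaround described above might look roughly like the sketch below. It is only a sketch under stated assumptions: the class, the `bookNext` method and the in-memory task list are all hypothetical stand-ins for the real select-and-update against the database, and a shared lock object like this only protects threads running inside the same JVM.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.OptionalInt;

// Hypothetical illustration: an in-memory list stands in for the task records in the database.
final class TaskBookingSketch {
    enum Status { AVAILABLE, BOOKED }

    static final class Task {
        final int id;
        Status status = Status.AVAILABLE;
        String bookedBy;

        Task(int id) { this.id = id; }
    }

    // The "common object" every reader/writer thread synchronizes on.
    private final Object lock = new Object();
    private final List<Task> tasks = new ArrayList<>();

    TaskBookingSketch(int taskCount) {
        for (int id = 1; id <= taskCount; id++) {
            tasks.add(new Task(id));
        }
    }

    // Select the next available task and mark it as booked, as one critical section.
    OptionalInt bookNext(String client) {
        synchronized (lock) {
            for (Task t : tasks) {
                if (t.status == Status.AVAILABLE) {
                    t.status = Status.BOOKED;
                    t.bookedBy = client;
                    return OptionalInt.of(t.id);
                }
            }
            return OptionalInt.empty(); // nothing left to book
        }
    }

    public static void main(String[] args) throws InterruptedException {
        TaskBookingSketch board = new TaskBookingSketch(3);
        Thread[] clients = new Thread[5];
        for (int i = 0; i < clients.length; i++) {
            String name = "client-" + i;
            clients[i] = new Thread(() ->
                    System.out.println(name + " booked " + board.bookNext(name)));
            clients[i].start();
        }
        for (Thread c : clients) {
            c.join();
        }
    }
}
```

Because the read and the status change happen inside the same `synchronized` block, no two threads can book the same task, but clients in separate processes would still need a guarantee from the database itself.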

lvca · 2023-04-03T19:19:36Z

lvca
Apr 3, 2023
Maintainer

@vic0824 All good points, we should expand our docs with such information!

I'll try to explain better in this thread, and perhaps update the documentation afterwards.

About the bucket selection strategy: round-robin picks the next bucket from the list, and after the last one it starts again from the first. This is the default strategy, but it doesn't work well with concurrent multi-threaded insertion, where the best strategy is thread. With thread, every thread picks the bucket based on its own thread id (bucket = threadId % number-of-buckets). With partitioned(<property>), the bucket is determined by the property value itself. For example, if you select partitioned(id), where id is a number, then the selected bucket will be bucket = id % number-of-buckets. Partitioned is the preferred method for fast lookups, because ArcadeDB doesn't have to look into all the indexes (one per bucket), but can go straight to the specific index for that bucket.

To summarize:

round-robin, the default, is good for mixed operations
thread works best when parallel insertion is executed by multiple threads, eliminating or reducing contention between pages
partitioned works best when you need a faster lookup from an index
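
To make the three formulas concrete, here is a minimal sketch that mimics them in plain Java. It is not ArcadeDB's actual code: the class and method names are hypothetical, and it only covers the numeric case of partitioned described above.

```java
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical illustration of the three strategies; not ArcadeDB's real classes.
final class BucketSelectionSketch {
    private final int bucketCount;
    private final AtomicInteger next = new AtomicInteger();

    BucketSelectionSketch(int bucketCount) {
        if (bucketCount <= 0) {
            throw new IllegalArgumentException("bucketCount must be positive");
        }
        this.bucketCount = bucketCount;
    }

    // round-robin: take the next bucket in the list, wrapping around after the last one.
    int roundRobin() {
        return Math.floorMod(next.getAndIncrement(), bucketCount);
    }

    // thread: bucket = threadId % number-of-buckets.
    int byThread(long threadId) {
        return (int) Math.floorMod(threadId, (long) bucketCount);
    }

    // partitioned(id) with a numeric id: bucket = id % number-of-buckets.
    int partitioned(long id) {
        return (int) Math.floorMod(id, (long) bucketCount);
    }

    public static void main(String[] args) {
        BucketSelectionSketch buckets = new BucketSelectionSketch(4);
        for (int i = 0; i < 5; i++) {
            System.out.println("round-robin insert " + i + " -> bucket " + buckets.roundRobin());
        }
        System.out.println("id 1 -> bucket " + buckets.partitioned(1));
        System.out.println("id 5 -> bucket " + buckets.partitioned(5));
        System.out.println("this thread -> bucket "
                + buckets.byThread(Thread.currentThread().getId()));
    }
}
```

With 4 buckets, ids 1 and 5 both map to bucket 1, so under partitioned two consecutive inserts are not guaranteed to land in different buckets; with thread, a given thread always writes to the same bucket, which is what keeps concurrent writers apart as long as their ids map to different buckets.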

About the transaction isolation level: I double-checked the code, and the docs about transaction batches are not correct. That piece came from OrientDB, which allowed some control over isolation. With ArcadeDB the level is always READ_COMMITTED, which means dirty reads are not possible, but you can still get non-repeatable reads and phantom reads. It would be pretty easy to support both REPEATABLE_READ and SERIALIZABLE; I'm opening an issue for that.

2 replies

lvca Apr 4, 2023
Maintainer

Created issue #1000 for Configurable Transaction Isolation Level

lvca Apr 6, 2023
Maintainer

This was implemented in #1000, which adds support for the REPEATABLE_READ isolation level.
