Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[no sq] Split blockpos into 3 different columns in sqlite3 map database #15768

Open
wants to merge 4 commits into
base: master
Choose a base branch
from

Conversation

sfan5
Copy link
Collaborator

@sfan5 sfan5 commented Feb 8, 2025

This changes the map database schema for sqlite to be (x, y, z, data) instead of (pos, data).

Advantages:

  • better range queries enables possible optimizations inside Luanti and for external tools
  • moving away from the "block as integer" code that is apparently so cursed

Disadvantages:

  • If this SO answer is to be believed, this might actually be minimally slower

To do

This PR is a Work in Progress

  • add simple unit tests
  • update docs

How to test

  1. create new world, inspect map.sqlite afterwards
  2. join old world (it should work)
  3. try migrating an old world to e.g. leveldb and back to sqlite (should change to new format)

@sfan5 sfan5 added @ Server / Client / Env. Feature ✨ PRs that add or enhance a feature labels Feb 8, 2025
@appgurueu
Copy link
Contributor

arbitrary range queries enables possible optimizations inside Luanti and for external tools

Note that this index will only allow SQLite to speed up the following (parts of) a query:

  • x BETWEEN ? AND ?
  • x = ? AND y BETWEEN ? AND ?
  • x = ? AND y = ? AND z BETWEEN ? AND ?

That is, a range query can essentially only be done along a single axis, and all preceding axes must be equality comparisons (or SQLite will effectively have to loop over all possible values for each), in the order the keys appear in the index.

@sfan5
Copy link
Collaborator Author

sfan5 commented Feb 8, 2025

We can add more indexing if needed tho.

Edit: we can however influence the order of the default index. (X, Z, Y) sounds most useful so I'll change that.

@Zughy Zughy added the Roadmap: supported by core dev PR not adhering to the roadmap, yet some core dev decided to take care of it label Feb 9, 2025
@lhofhansl
Copy link
Contributor

While you are at it, we could also split out a block's timestamp into a separate column.
This timestamp is a frequent cause for writing entire blocks back to the db, even when nothing else has changed.

@lhofhansl
Copy link
Contributor

lhofhansl commented Feb 9, 2025

Note that this index will only allow SQLite to speed up the following (parts of) a query

If all our databases support z-order we can do approximate spatial indexing that way. (It's not hard to add it via plpgsql to Postgres for example, and then index on a function.)

On the whole I would expect this change to slow things, but that's just a gut feeling.

@sorcerykid
Copy link
Contributor

Overall I'm in favour of this PR, as it will significantly improve performance of my map analysis scripts, in particularly RocketLib Toolkit which includes many functions for iterating mapblocks in certain areas.

https://bitbucket.org/sorcerykid/rocketlib/src/master/

Currently, I have to resort to generating a "Cache" database that just consists of a cross-reference table for indexing Block Position (X, Y, Z) to Block ID. This process can be extremely time-consuming, particularly on a map.sqlite that is several gigabytes. It would be so nice to finally eliminate that step entirely.

My map viewer application also makes extensive use of position queries for rendering and navigating a density map along the X, Y, or Z axis. Currently, I have to scan the entire database and manually total up the selected mapblocks. But with separate columns for Block Position, this could be streamlined.

image

@sfan5
Copy link
Collaborator Author

sfan5 commented Feb 10, 2025

While you are at it, we could also split out a block's timestamp into a separate column.
This timestamp is a frequent cause for writing entire blocks back to the db, even when nothing else has changed.

Would make sense to do at once, but extracting out the timestamp has a whole other bunch of different implications.

If all our databases support z-order we can do approximate spatial indexing that way.

I couldn't find much info on native support in databases for this but also I don't really see any use case.
Letting the database sort through the x,y,z columns should be fast enough if we need it.

@sorcerykid
Copy link
Contributor

This timestamp is a frequent cause for writing entire blocks back to the db, even when nothing else has changed.

Can you point to where this happens? I'm genuinely curious, because I was under the impression mapblocks were only written back to the database if a flag was set for one or more changes (aside from just a new timestamp). After all, if a mapblock is already loaded in memory, then I would think the timestamp is also in memory. So It seems the solution is simply not to write the mapblock to the database if only the timestamp has changed.

@sfan5
Copy link
Collaborator Author

sfan5 commented Feb 10, 2025

Can you point to where this happens?

/*
Handle removed blocks
*/
// Convert active objects that are no more in active blocks to static
deactivateFarObjects(false);
for (const v3s16 &p: blocks_removed) {
MapBlock *block = m_map->getBlockNoCreateNoEx(p);
if (!block)
continue;
// Set current time as timestamp (and let it set ChangedFlag)
block->setTimestamp(m_game_time);
}

Note that this only marks the block to be written when unloaded, not immediately.
One reason I can think of why we need this is the LBM dtime_s parameter.

Someone could go and test how many byte of disk writes it would save if we could update the timestamp independently, for a typical SP or MP session.

@sfan5 sfan5 marked this pull request as ready for review February 10, 2025 19:03
@lhofhansl
Copy link
Contributor

lhofhansl commented Feb 11, 2025

Created two worlds with same seed et al. Seen no measurable degradation in load speed. Good!
Also tried to access old db, works as expected.

@lhofhansl
Copy link
Contributor

Someone could go and test how many byte of disk writes it would save if we could update the timestamp independently, for a typical SP or MP session.

Just rememberd f1349be :), before we marked every loaded block dirty. Now it's just unloaded active blocks.

@lhofhansl
Copy link
Contributor

lhofhansl commented Feb 11, 2025

Now that I am looking a bit more, check this one: https://www.sqlite.org/lang_createtable.html#rowid, especially:

A PRIMARY KEY column only becomes an integer primary key if the declared type name is exactly "INTEGER". Other integer type names like "INT" or "BIGINT" or "SHORT INTEGER" or "UNSIGNED INTEGER" causes the primary key column to behave as an ordinary table column with integer affinity and a unique index, not as an alias for the rowid.

It seems all this time we had declared the sqlite table such that it had two INT keys. (Because we used INT instead of INTEGER to declared the PK).

And when I change that INTEGER and create a new world (with same seed) I see that pure DB load time of a block goes from 6-7us to about 5-6us. In the scheme of things it won't make a visible difference, though.

@sfan5
Copy link
Collaborator Author

sfan5 commented Feb 11, 2025

It seems all this time we had declared the sqlite table such that it had two INT keys. (Because we used INT instead of INTEGER to declared the PK).

Wow. I can't say I'm a fan of all the subtle behaviors SQLite has (often justified with backwards compatibility).

@lhofhansl
Copy link
Contributor

lhofhansl commented Feb 12, 2025

More of the sqlite quirkiness... According to this https://www.sqlite.org/rowidtable.html

The PRIMARY KEY constraint for a rowid table (as long as it is not the true primary key or INTEGER PRIMARY KEY) is really the same thing as a UNIQUE constraint. Because it is not a true primary key, columns of the PRIMARY KEY are allowed to be NULL, in violation of all SQL standards.

So:

CREATE TABLE `blocks` (
    `x` INT, `y` INT, `z` INT,
    `data` BLOB NOT NULL,
    PRIMARY KEY (`x`, `z`, `y`)
);

Is the same as:

CREATE TABLE `blocks` (
    // the implicit rowid INTEGER PRIMARY KEY,
    `x` INT, `y` INT, `z` INT,
    `data` BLOB NOT NULL,
    UNIQUE  (`x`, `z`, `y`)
);

And if so, we can replace the implicit rowid column with our pos column and maintain as before:

CREATE TABLE `blocks` (
    'pos' INTEGER PRIMARY KEY,
    `x` INT, `y` INT, `z` INT,
    `data` BLOB NOT NULL,
    UNIQUE  (`x`, `z`, `y`)
);

This way we get the benefit of the x, y, z columns and backwards compatibility. Need to be tested of course!
(And well, of course, an old server writing would now mess up our database, since it won't write the x,y, and z columns.)

And according to this: https://www.sqlite.org/lang_createtable.html it's identical to:

CREATE TABLE `blocks` (
    'pos' INTEGER PRIMARY KEY,
    `x` INT, `y` INT, `z` INT,
    `data` BLOB NOT NULL
);
CREATE UNIQUE INDEX xyz ON blocks(x, y, z);

This also makes it clear that by adding the x, y, and z columns we essentially have just added an index.

@sfan5
Copy link
Collaborator Author

sfan5 commented Feb 12, 2025

I'm sure that could be considered an elegant solution (by some measure), but I really don't want to make the "is this a new format or old format map" decision on a row-basis. It just complicates things for everyone and also literally breaks range queries.

@lhofhansl
Copy link
Contributor

lhofhansl commented Feb 13, 2025

Looked at some DB sizes. Sames seed, and camera position and direction. viewing_range = 1000, mostly an open scene with some ocean. Loads about 8000 blocks.

  • Master: 33.2MB
  • This PR: 33.2MB - no change in size interestingly and surprisingly
  • Master with INT replaced by INTEGER: 26.8MB - this saves the extra index, 20% space. This is actually what I'd prefer.
  • This PR (with WITHOUT ROWID) added: 33.2MB - so that doesn't help at all.

sqlite seems to do well with this change.

So while I'd prefer the old format (with the extra useless index removed for 20% space saving), I guess I'll be the only one. At least this one doesn't increase the size, and performance is on par as far as I can tell.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Feature ✨ PRs that add or enhance a feature Roadmap: supported by core dev PR not adhering to the roadmap, yet some core dev decided to take care of it @ Server / Client / Env.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants