How to reduce the size of a Jena TDB-backed dataset?

I work with a simple Jena dataset, backed by TDB, that contains only one ~30 MB RDF file. As part of the application, I want to let users query the default graph (or a named graph) and insert the resulting triples into a new named graph. To do this, I use a CONSTRUCT query to produce the resulting set of triples, put them into a new model (using QueryExecution.execConstruct()), and add that model to the dataset. This appears to work: the dataset gains a new named graph, and the on-disk size of the TDB database folder grows.
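For context, this is roughly the workflow in question. It is a minimal sketch using Jena's TDB1 API; the storage directory, the query, and the graph URI are placeholders rather than values from the question:

```java
import org.apache.jena.query.*;
import org.apache.jena.rdf.model.Model;
import org.apache.jena.tdb.TDB;
import org.apache.jena.tdb.TDBFactory;

public class ConstructIntoNamedGraph {
    public static void main(String[] args) {
        // Placeholder directory for the TDB files.
        Dataset dataset = TDBFactory.createDataset("tdb-data");
        String construct = "CONSTRUCT { ?s ?p ?o } WHERE { ?s ?p ?o }";
        try (QueryExecution qe = QueryExecutionFactory.create(construct, dataset.getDefaultModel())) {
            Model result = qe.execConstruct();        // materialize the query result as a model
            // Placeholder graph URI: add the result as a new named graph.
            dataset.addNamedModel("http://example.org/graph/result", result);
        }
        TDB.sync(dataset); // flush to disk; the TDB folder grows
    }
}
```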

The problem occurs when I try to remove a named graph from the dataset. Using the Dataset method removeNamedModel("graphName"), I remove the model from the dataset. Subsequent queries against that graph name confirm that it has been deleted. However, the on-disk size of the TDB database folder stays the same, even after syncing and shutting down.
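Continuing the sketch above, the removal step looks like this (again with the placeholder graph URI):

```java
// Remove the named graph added earlier.
dataset.removeNamedModel("http://example.org/graph/result");
// Queries confirm the graph is gone...
System.out.println(dataset.containsNamedModel("http://example.org/graph/result")); // false
TDB.sync(dataset);
dataset.close(); // ...but the size of the TDB directory on disk is unchanged
```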

At first I thought the database might simply mark the deleted space as free so it could be overwritten as new data arrives, but that does not seem to be the case. If I remove a named graph and immediately re-add the same data in the same run, the folder does not appear to grow; but if I add a new named graph and then remove it, the folder grows and removing the model does not reclaim the space. This means that after several runs, the database folder is five or ten times its original size without holding any additional data.

Any insights or help would be great, thanks again.

1 answer

You may get more information by asking on the Jena mailing list (users@jena.apache.org), but I will attempt an answer. You can also look at the TDB Architecture page on the website.

TDB stores data by building what it calls a node table, which maps RDF nodes to 64-bit integer identifiers and back. It then builds its individual indexes over these integer identifiers, which let it perform the various database scans needed to answer SPARQL queries.

Adding data potentially adds entries to both of these structures (node table and indexes), but deleting data only removes entries from the indexes. So over time the node table keeps growing even when you delete old data, because deleted nodes are never removed from the node table.
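To make that concrete, here is a deliberately simplified, hypothetical sketch of a dictionary-encoded store. This is not TDB's actual implementation; it only illustrates why deletes shrink the index but never the node table:

```java
import java.util.*;

// Toy dictionary-encoded triple store (NOT TDB's real code).
class TinyStore {
    final List<String> idToNode = new ArrayList<>();        // append-only "node table"
    final Map<String, Long> nodeToId = new HashMap<>();
    final NavigableSet<long[]> spo = new TreeSet<>(Arrays::compare); // SPO index over IDs

    long intern(String node) {                              // Node -> ID, creating if new
        return nodeToId.computeIfAbsent(node, n -> {
            idToNode.add(n);                                // the node table only ever grows
            return (long) (idToNode.size() - 1);
        });
    }

    void add(String s, String p, String o) {
        spo.add(new long[] { intern(s), intern(p), intern(o) });
    }

    void delete(String s, String p, String o) {
        Long si = nodeToId.get(s), pi = nodeToId.get(p), oi = nodeToId.get(o);
        if (si == null || pi == null || oi == null) return; // unknown node: nothing stored
        // Removes the triple from the index; the node-table entries stay,
        // because another triple might still reference them.
        spo.remove(new long[] { si, pi, oi });
    }
}
```

In this toy store, delete() shrinks the spo index, but idToNode never shrinks, which mirrors the growth behaviour described above.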

The practical reasons for this are twofold:

  • The integer identifiers partially encode file offsets, so looking up an ID in the node table is a fast file seek. The node table in the ID → Node direction is a sequential file (which makes inserts very fast), so when deleting data you cannot remove parts of the node table without rewriting all of the node IDs.
  • When data is deleted, you do not know whether a node is used elsewhere without doing a full database scan, so you cannot tell whether it is safe to delete a node table entry. The only viable way around this would be a full reference-counting scheme, which would itself add complexity to the system and slow down both adds and deletes.
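Given this design, the usual way to reclaim the space is to dump the dataset and reload it into a fresh TDB directory, which rebuilds the node table from only the nodes still referenced. Jena ships tdbdump and tdbloader command-line tools for this; the sketch below does the same thing programmatically (both directory paths and the dump file name are placeholders):

```java
import java.io.FileOutputStream;
import java.io.OutputStream;
import org.apache.jena.query.Dataset;
import org.apache.jena.riot.Lang;
import org.apache.jena.riot.RDFDataMgr;
import org.apache.jena.tdb.TDBFactory;

public class CompactByReload {
    public static void main(String[] args) throws Exception {
        // Dump everything still present in the old dataset to N-Quads...
        Dataset old = TDBFactory.createDataset("tdb-old");
        try (OutputStream out = new FileOutputStream("dump.nq")) {
            RDFDataMgr.write(out, old.asDatasetGraph(), Lang.NQUADS);
        }
        old.close();
        // ...then reload it into a fresh TDB directory, rebuilding the
        // node table with only the nodes that are actually referenced.
        Dataset fresh = TDBFactory.createDataset("tdb-new");
        RDFDataMgr.read(fresh.asDatasetGraph(), "dump.nq");
        fresh.close();
    }
}
```

After verifying the new directory, you can delete the old one and point the application at the new location.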

Disclaimer: I am a member of the Jena project but have never personally worked on the TDB component, so this reflects my best understanding and may not be entirely accurate.


Source: https://habr.com/ru/post/918388/

