SPARQL query to select / build the latest version from RDF data

I have an RDF file that is used to track changes to items. With this data, I can follow the changes made to an item over its lifetime. Whenever specific information changes, the relevant data is stored as a new revision. For example:

    @prefix rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#> .
    @prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .
    @prefix mymeta: <http://www.mymeta.com/meta/> .
    @prefix dc: <http://purl.org/dc/elements/1.1/> .

    <urn:ITEMID:12345> rdf:type mymeta:item .
    <urn:ITEMID:12345> mymeta:itemchange <urn:ITEMID:12345:REV-1> .
    <urn:ITEMID:12345:REV-1> dc:title "Product original name"@en .
    <urn:ITEMID:12345:REV-1> dc:issued "2006-12-01"@en .
    <urn:ITEMID:12345:REV-1> dc:format "4 x 6 x 1 in"@en .
    <urn:ITEMID:12345:REV-1> dc:extent "200"@en .

    <urn:ITEMID:12345> rdf:type mymeta:item .
    <urn:ITEMID:12345> mymeta:itemchange <urn:ITEMID:12345:REV-2> .
    <urn:ITEMID:12345:REV-2> dc:title "Improved Product Name"@en .
    <urn:ITEMID:12345:REV-2> dc:issued "2007-06-01"@en .

According to this data, the item was changed on "2007-06-01", and the only change was the title, which became "Improved Product Name". As you can see, dc:format and dc:extent are missing from the latest revision. This is intentional, to avoid millions of duplicate entries!

I can write a SPARQL query that shows me the information in the latest revision (REV-2: dc:title and dc:issued), but it is missing dc:format and dc:extent, which I want to carry over from the previous revision (REV-1).

How can I write a SPARQL query for this? Any help is much appreciated!

+4
3 answers

I'm not sure this can be done in a single query. I'll think about it some more, but the following two queries should get you started in the right direction:

1) Find changes that do not have a format

    PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
    PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
    PREFIX mymeta: <http://www.mymeta.com/meta/>
    PREFIX dc: <http://purl.org/dc/elements/1.1/>

    DESCRIBE ?change
    WHERE {
      ?item a mymeta:item ;
            mymeta:itemchange ?change .
      ?change ?p ?o .
      OPTIONAL { ?change dc:format ?format . }
      FILTER (!bound(?format))
    }

2) I think this will find the most recent change that has a format

    PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
    PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
    PREFIX mymeta: <http://www.mymeta.com/meta/>
    PREFIX dc: <http://purl.org/dc/elements/1.1/>

    SELECT DISTINCT ?format
    WHERE {
      ?item a mymeta:item ;
            mymeta:itemchange ?change .
      ?change dc:format ?format ;
              dc:issued ?issued .
      OPTIONAL {
        ?item mymeta:itemchange ?moreRecentChange .
        ?moreRecentChange dc:format ?moreRecentFormat ;
                          dc:issued ?moreRecentIssued .
        # The dates are language-tagged literals, so compare their lexical forms.
        FILTER (str(?moreRecentIssued) > str(?issued))
      }
      FILTER (!bound(?moreRecentIssued))
    }

With a bit of further work, it should be possible to limit the ?format in (2) to changes whose issue date is earlier than the issue date of the result from (1). So, for each row from (1), you would run (2) to find the format value to use. You might get better results, though, if you used a rule-based reasoner rather than SPARQL. I would recommend EulerSharp or Pellet.
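
The two-step procedure described above can also be sketched outside SPARQL. What follows is a minimal Python sketch, not part of the original queries: the `revisions` list and the `fill_missing` helper are hypothetical names, the data is copied from the question with the language tags dropped, and it assumes the issue dates are ISO 8601 strings so that plain string comparison orders them chronologically.

```python
# Revisions of one item, copied from the question's data (language tags dropped).
revisions = [
    {"id": "REV-1", "issued": "2006-12-01", "title": "Product original name",
     "format": "4 x 6 x 1 in", "extent": "200"},
    {"id": "REV-2", "issued": "2007-06-01", "title": "Improved Product Name"},
]


def fill_missing(revisions, prop):
    """Map each revision lacking `prop` to the value of the most recent
    earlier revision that has it (hypothetical helper)."""
    filled = {}
    # Step (1): revisions that do not carry the property.
    for change in (r for r in revisions if prop not in r):
        # Step (2): revisions that carry it and were issued earlier.
        earlier = [r for r in revisions
                   if prop in r and r["issued"] < change["issued"]]
        if earlier:
            # ISO dates sort lexicographically, so max() picks the most recent.
            donor = max(earlier, key=lambda r: r["issued"])
            filled[change["id"]] = donor[prop]
    return filled


print(fill_missing(revisions, "format"))  # {'REV-2': '4 x 6 x 1 in'}
```

Running step (2) once per row from step (1) is quadratic in the number of revisions, which is fine for a handful of revisions per item but is one reason a single grouped query may be preferable.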

+2

For a single item, this is a fairly simple query using SPARQL 1.1 subqueries. The trick is to order the revisions that have a given property by date and take the value from the most recent one. The values block is used only to specify the items you are selecting. If you need to query additional items, you can add them to the values block.

    prefix mymeta: <http://www.mymeta.com/meta/>
    prefix dc: <http://purl.org/dc/elements/1.1/>

    select ?item ?title ?format ?extent where {
      values ?item { <urn:ITEMID:12345> }

      #-- Get the title by examining all the revisions that specify a title,
      #-- ordering them by date, and taking the latest one. The same approach
      #-- is used for the format and extent.
      { select ?title { ?item mymeta:itemchange [ dc:title ?title ; dc:issued ?date ] . }
        order by desc(?date) limit 1 }
      { select ?format { ?item mymeta:itemchange [ dc:format ?format ; dc:issued ?date ] . }
        order by desc(?date) limit 1 }
      { select ?extent { ?item mymeta:itemchange [ dc:extent ?extent ; dc:issued ?date ] . }
        order by desc(?date) limit 1 }
    }

    $ sparql --data data.n3 --query query.rq
    ----------------------------------------------------------------------------------
    | item               | title                      | format            | extent   |
    ==================================================================================
    | <urn:ITEMID:12345> | "Improved Product Name"@en | "4 x 6 x 1 in"@en | "200"@en |
    ----------------------------------------------------------------------------------

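If you really need to do this for all items, note first that a subquery exposes only the variables it projects, so the ?item inside each limit 1 subquery above is not joined with the outer ?item. With more than one item in the data (whether selected as shown here or by adding entries to the values block), those subqueries return the latest value across all items rather than per item, and the max-based query further below is the safer choice. With that caveat, you can use another subquery to select the items. That is, instead of values ?item { ... } use: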

 { select ?item { ?item a mymeta:item } } 

Although it was not mentioned in the original question, it came up in the comments: if you are interested in getting the latest value of every property, you can use a subquery like the following, which is based on "How to limit the size of a SPARQL solution group?"

    select ?item ?property ?value {
      values ?item { <urn:ITEMID:12345> }

      ?item mymeta:itemchange [ ?property ?value ;
                                dc:issued ?date ]

      #-- This subquery finds the latest date for each property of
      #-- each item. Then, outside the subquery, we retrieve the
      #-- particular value associated with that date. ?item is
      #-- projected so that it joins with the outer ?item.
      { select ?item ?property (max(?date_) as ?date) {
          ?item mymeta:itemchange [ ?property [] ;
                                    dc:issued ?date_ ]
        }
        group by ?item ?property
      }
    }

    ---------------------------------------------------------------
    | item               | property  | value                      |
    ===============================================================
    | <urn:ITEMID:12345> | dc:issued | "2007-06-01"@en            |
    | <urn:ITEMID:12345> | dc:title  | "Improved Product Name"@en |
    | <urn:ITEMID:12345> | dc:extent | "200"@en                   |
    | <urn:ITEMID:12345> | dc:format | "4 x 6 x 1 in"@en          |
    ---------------------------------------------------------------
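
For readers who want to check the logic of the grouped subquery without a SPARQL engine, here is a rough Python equivalent. It is only a sketch: the `rows` tuples and the `latest_values` function are made-up names, the language tags are dropped, and it assumes dc:issued holds ISO 8601 dates so that string comparison matches date order.

```python
# (item, revision, property, value) rows flattened from the question's data.
rows = [
    ("urn:ITEMID:12345", "REV-1", "dc:title", "Product original name"),
    ("urn:ITEMID:12345", "REV-1", "dc:issued", "2006-12-01"),
    ("urn:ITEMID:12345", "REV-1", "dc:format", "4 x 6 x 1 in"),
    ("urn:ITEMID:12345", "REV-1", "dc:extent", "200"),
    ("urn:ITEMID:12345", "REV-2", "dc:title", "Improved Product Name"),
    ("urn:ITEMID:12345", "REV-2", "dc:issued", "2007-06-01"),
]


def latest_values(rows):
    """Return {(item, property): value}, taking each value from the latest
    revision of that item that sets the property (hypothetical helper)."""
    # Issue date of every revision, keyed by (item, revision).
    issued = {(item, rev): value
              for item, rev, prop, value in rows if prop == "dc:issued"}
    best = {}  # (item, property) -> (date, value)
    for item, rev, prop, value in rows:
        date = issued[(item, rev)]
        key = (item, prop)
        # Keep the value with the maximum date per group, like max(?date_).
        if key not in best or date > best[key][0]:
            best[key] = (date, value)
    return {key: value for key, (date, value) in best.items()}


for (item, prop), value in sorted(latest_values(rows).items()):
    print(item, prop, value)
```

Grouping on the (item, property) pair is what keeps the result correct when several items share the same data set.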
+1

I implemented this with RDF quads, storing each revision in a separate named graph and using a well-known named graph to keep track of the latest revision of each item, along with all the changes.
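
To make the layout concrete, here is a small Python sketch of the idea, with quads modelled as plain tuples. It is an illustration under assumptions rather than my actual implementation: the `INDEX_GRAPH` name, the `LATEST` predicate, and the `latest_snapshot` helper are all hypothetical, and it assumes each revision graph holds a full snapshot of the item.

```python
INDEX_GRAPH = "urn:graph:index"        # hypothetical well-known graph
LATEST = "urn:example:latestRevision"  # hypothetical predicate

ITEM = "urn:ITEMID:12345"
REV1 = "urn:ITEMID:12345:REV-1"
REV2 = "urn:ITEMID:12345:REV-2"

# Each quad is (subject, predicate, object, graph).
quads = [
    (ITEM, "dc:title", "Product original name", REV1),
    (ITEM, "dc:format", "4 x 6 x 1 in", REV1),
    (ITEM, "dc:title", "Improved Product Name", REV2),
    (ITEM, "dc:format", "4 x 6 x 1 in", REV2),
    # The index graph records which revision graph is current.
    (ITEM, LATEST, REV2, INDEX_GRAPH),
]


def latest_snapshot(quads, item):
    """Return {predicate: object} for `item`, read from the graph that the
    index graph marks as latest. Raises StopIteration if no entry exists."""
    latest = next(o for s, p, o, g in quads
                  if g == INDEX_GRAPH and s == item and p == LATEST)
    return {p: o for s, p, o, g in quads if g == latest and s == item}


print(latest_snapshot(quads, ITEM))
```

Because the current revision is looked up rather than inferred, a property that is absent from the latest graph is simply absent, with no fallback to older revisions.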

The theory behind your revision algorithm is currently flawed, since you have no method for determining the latest revision, and you cannot easily walk back through the revisions to find the last time a triple occurred. In addition, how do you know whether a triple was legitimately deleted in a revision if you always fall back to previous revisions whenever a triple is missing from the latest one?
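
The deletion problem is easy to reproduce. The following Python sketch uses a hypothetical `resolve` function and a made-up `history` in which the second revision is meant to have removed dc:extent; it assumes ISO 8601 date strings so that sorting them gives chronological order.

```python
def resolve(revisions, prop):
    """Fallback lookup: the value from the newest revision that mentions
    `prop`, or None if no revision does (hypothetical helper)."""
    for rev in sorted(revisions, key=lambda r: r["issued"], reverse=True):
        if prop in rev:
            return rev[prop]
    return None


history = [
    {"issued": "2006-12-01", "title": "Product original name", "extent": "200"},
    # Intended meaning: the extent was removed in this revision.
    {"issued": "2007-06-01", "title": "Improved Product Name"},
]

# The fallback cannot tell "unchanged" from "deleted", so the old value returns.
print(resolve(history, "extent"))  # prints 200
```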

The RDF database should be able to limit the amount of duplication on its own by storing each literal and URI only once and using pointers to build the triples or quads. You may well be able to make it work in the naive case, where everything is stored for every revision that you keep.
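
As a rough illustration of that storage idea, here is a Python sketch of a term dictionary. Real stores differ in the details, so treat this as an assumption about the general technique: the `TermDictionary` class and its methods are invented for the example.

```python
class TermDictionary:
    """Store each URI or literal once and hand out small integer ids."""

    def __init__(self):
        self._ids = {}    # term -> id
        self._terms = []  # id -> term

    def encode(self, term):
        # Reuse the existing id if the term has been seen before.
        if term not in self._ids:
            self._ids[term] = len(self._terms)
            self._terms.append(term)
        return self._ids[term]

    def decode(self, term_id):
        return self._terms[term_id]

    def __len__(self):
        return len(self._terms)


terms = TermDictionary()
# Two revisions repeat the same predicate and literal.
triples = [tuple(terms.encode(t) for t in triple) for triple in [
    ("urn:ITEMID:12345:REV-1", "dc:format", "4 x 6 x 1 in"),
    ("urn:ITEMID:12345:REV-2", "dc:format", "4 x 6 x 1 in"),
]]

print(triples)     # [(0, 1, 2), (3, 1, 2)]
print(len(terms))  # 4 distinct terms stored for 6 positions
```

The repeated predicate and literal cost only one integer each per extra triple, which is why storing full revisions may be less wasteful than it first appears.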

-1

Source: https://habr.com/ru/post/1302903/