SPARQL: how to find similar rows?

I use Jena to query data stored in an ontology. Some objects are identified by a string, but the exact string is not always available: I process scanned documents, so there may be OCR errors. I would therefore like to find the most similar rows. Is there a way to do this in SPARQL? Can I somehow calculate the Levenshtein distance in SPARQL?

If this is not possible, I can still calculate the Levenshtein distance in Java. However, an efficient solution would still need to filter out irrelevant rows in SPARQL first.

+4
3 answers

SPARQL cannot do this directly, but you can implement the Levenshtein distance function in Java and call it from a SPARQL FILTER clause. The ARQ documentation on extensions explains how to write and register extension functions.
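For reference, the distance itself is a short dynamic-programming routine. The following is only a minimal sketch: the class name `Levenshtein` and its `distance` method are hypothetical helpers, not part of Jena or ARQ, and a library implementation could be used instead.

```java
/** Hypothetical helper, not part of Jena: plain Levenshtein distance on Java chars. */
final class Levenshtein {
    private Levenshtein() {}

    /** Returns the minimum number of single-char insertions, deletions and substitutions. */
    static int distance(String a, String b) {
        // Only two rows of the DP table are needed at any time.
        int[] prev = new int[b.length() + 1];
        int[] curr = new int[b.length() + 1];

        // Distance from the empty prefix of a to each prefix of b.
        for (int j = 0; j <= b.length(); j++) {
            prev[j] = j;
        }

        for (int i = 1; i <= a.length(); i++) {
            curr[0] = i; // deleting i chars of a yields the empty prefix of b
            for (int j = 1; j <= b.length(); j++) {
                int cost = (a.charAt(i - 1) == b.charAt(j - 1)) ? 0 : 1;
                int insertion = curr[j - 1] + 1;
                int deletion = prev[j] + 1;
                int substitution = prev[j - 1] + cost;
                curr[j] = Math.min(Math.min(insertion, deletion), substitution);
            }
            // Swap rows so that prev always holds the last completed row.
            int[] tmp = prev;
            prev = curr;
            curr = tmp;
        }
        return prev[b.length()];
    }
}
```

For example, `Levenshtein.distance("kitten", "sitting")` evaluates to 3. A method like this could be called from inside the `exec` method of an ARQ extension function.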

+6

In case someone is interested, here is how I implemented it:

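    // Imports assume Jena 3+ package names and Apache Commons Lang 3.
    import org.apache.commons.lang3.StringUtils;
    import org.apache.jena.sparql.expr.NodeValue;
    import org.apache.jena.sparql.function.FunctionBase2;

    public class LevenshteinFilter extends FunctionBase2 {
        @Override
        public NodeValue exec(NodeValue value1, NodeValue value2) {
            int i = StringUtils.getLevenshteinDistance(value1.asString(), value2.asString());
            return NodeValue.makeInteger(i);
        }
    }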

using:

    String functionUri = "http://www.example.org/LevenshteinFunction";
    FunctionRegistry.get().put(functionUri, LevenshteinFilter.class);

    String s = "...";
    // "Something" and "hasString" stand in for your own class and property IRIs.
    String sparql = "SELECT ?x WHERE { ?x a Something . "
            + "?x hasString ?str . "
            + "FILTER(<" + functionUri + ">(?str, \"" + s + "\") < 5) }";

    QueryExecution qexec = QueryExecutionFactory.create(sparql, model);
    ResultSet rs = qexec.execSelect();
    while (rs.hasNext()) {
        ...
    }
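On the question of filtering out irrelevant rows before the distance function runs: one cheap option is a length check, since two strings whose lengths differ by more than k cannot be within Levenshtein distance k of each other. The sketch below builds such a clause with the SPARQL 1.1 `STRLEN` function. The class and method names are hypothetical, and it assumes that the endpoint supports SPARQL 1.1, that `?str` is always bound to a string literal, and that the strings contain no characters outside the Basic Multilingual Plane (so Java's `length()` and `STRLEN` agree).

```java
/** Hypothetical helper: builds a SPARQL 1.1 length pre-filter for a distance threshold. */
final class LengthPrefilter {
    private LengthPrefilter() {}

    /**
     * Returns a FILTER clause that keeps only values of the given variable whose
     * length is within maxDistance of the target's length. This is a necessary
     * (not sufficient) condition for the Levenshtein distance to be <= maxDistance.
     */
    static String lengthFilter(String variable, String target, int maxDistance) {
        // Assumption: no supplementary characters, so length() matches STRLEN.
        int len = target.length();
        int lo = Math.max(0, len - maxDistance); // a length can never be negative
        int hi = len + maxDistance;
        return "FILTER(STRLEN(" + variable + ") >= " + lo
                + " && STRLEN(" + variable + ") <= " + hi + ")";
    }
}
```

The returned clause could be concatenated into the query string ahead of the Levenshtein FILTER. Note that the usage code above compares with `< 5`, which corresponds to a maximum distance of 4 here. Whether the pre-filter is actually evaluated first depends on the query engine, so it is worth measuring.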
+4

For Sesame, there is fr/sparna/rdf/sesame/toolkit/functions/LevenshteinDistanceFunction, but I cannot find its source.

0

Source: https://habr.com/ru/post/1404032/