SPARQL query with multiple aggregates exceeds memory limit

Question

I am trying to create some user statistics from a triple store using SPARQL (see the query below). How can this query be improved? Am I doing something wrong here? Why does it consume so much memory? (See the background at the end of this post.)

I would prefer to do the aggregation and the joining entirely inside the triple store. Splitting the query would mean joining the results “manually”, outside the database, losing the efficiency and optimization of the triple store. There is no need to reinvent the wheel for no good reason.

Request

SELECT ?person
       (COUNT(DISTINCT ?sent_email) AS ?sent_emails)
       (COUNT(DISTINCT ?received_email) AS ?received_emails)
       (COUNT(DISTINCT ?receivedInCC_email) AS ?receivedInCC_emails)
       (COUNT(DISTINCT ?revision) AS ?commits)
WHERE {
  ?person rdf:type foaf:Person.
  OPTIONAL {
    ?sent_email rdf:type email:Email.
    ?sent_email email:sender ?person.
  }
  OPTIONAL {
    ?received_email rdf:type email:Email.
    ?received_email email:recipient ?person.
  }
  OPTIONAL {
    ?receivedInCC_email rdf:type email:Email.
    ?receivedInCC_email email:ccRecipient ?person.
  }
  OPTIONAL {
    ?revision rdf:type vcs:VcsRevision.
    ?revision vcs:committedBy ?person.
  }
}
GROUP BY ?person
ORDER BY DESC(?commits)

Background

The problem is that I get the error message "QUERY MEMORY LIMIT REACHED" in AllegroGraph (see also my related SO question). The repository contains only about 200 thousand triples, which fit easily into an N-Triples input file of approx. 60 MB, so I am wondering how executing the query can require more than 4 GB of RAM, which is about two orders of magnitude more.

optimization sparql

cyroxx Nov 23 '12 at 16:16

1 answer

enridaga · Accepted Answer · 2014-07-08T13:00:22+0000

Try splitting the calculation into subqueries, for example:

SELECT ?person
       (MAX(?sent_emails_) AS ?sent_emails)
       (MAX(?received_emails_) AS ?received_emails)
       (MAX(?receivedInCC_emails_) AS ?receivedInCC_emails)
       (MAX(?commits_) AS ?commits)
WHERE {
  {
    SELECT ?person
           (COUNT(DISTINCT ?sent_email) AS ?sent_emails_)
           (0 AS ?received_emails_)
           (0 AS ?receivedInCC_emails_)
           (0 AS ?commits_)
    WHERE {
      ?sent_email rdf:type email:Email.
      ?sent_email email:sender ?person.
      ?person rdf:type foaf:Person.
    }
    GROUP BY ?person
  }
  UNION
  {
    # (similar pattern for the others) ....
  }
}
GROUP BY ?person
ORDER BY DESC(?commits)

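The goals are:

avoid generating a huge number of intermediate rows that must be processed for the aggregation: with four independent OPTIONAL blocks, each person yields one row per combination of sent email, received email, CC'd email and revision, so the row count grows with the product of the four counts rather than their sum
avoid the OPTIONAL {} patterns, which also tend to hurt performance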
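
For illustration only, here is one way the elided branches of the query above might be spelled out. This is a sketch, not a tested query: it assumes the other three branches simply mirror the first one, that the PREFIX declarations are the same as for the original query, and that the helper variable names ending in an underscore may be chosen freely.

```sparql
# Assumption: rdf:, foaf:, email: and vcs: are declared as for the original query.
SELECT ?person
       (MAX(?sent_emails_) AS ?sent_emails)
       (MAX(?received_emails_) AS ?received_emails)
       (MAX(?receivedInCC_emails_) AS ?receivedInCC_emails)
       (MAX(?commits_) AS ?commits)
WHERE {
  { # Branch 1: emails sent by the person
    SELECT ?person
           (COUNT(DISTINCT ?sent_email) AS ?sent_emails_)
           (0 AS ?received_emails_)
           (0 AS ?receivedInCC_emails_)
           (0 AS ?commits_)
    WHERE {
      ?person rdf:type foaf:Person .
      ?sent_email rdf:type email:Email .
      ?sent_email email:sender ?person .
    }
    GROUP BY ?person
  }
  UNION
  { # Branch 2: emails received directly
    SELECT ?person
           (0 AS ?sent_emails_)
           (COUNT(DISTINCT ?received_email) AS ?received_emails_)
           (0 AS ?receivedInCC_emails_)
           (0 AS ?commits_)
    WHERE {
      ?person rdf:type foaf:Person .
      ?received_email rdf:type email:Email .
      ?received_email email:recipient ?person .
    }
    GROUP BY ?person
  }
  UNION
  { # Branch 3: emails received in CC
    SELECT ?person
           (0 AS ?sent_emails_)
           (0 AS ?received_emails_)
           (COUNT(DISTINCT ?receivedInCC_email) AS ?receivedInCC_emails_)
           (0 AS ?commits_)
    WHERE {
      ?person rdf:type foaf:Person .
      ?receivedInCC_email rdf:type email:Email .
      ?receivedInCC_email email:ccRecipient ?person .
    }
    GROUP BY ?person
  }
  UNION
  { # Branch 4: revisions committed by the person
    SELECT ?person
           (0 AS ?sent_emails_)
           (0 AS ?received_emails_)
           (0 AS ?receivedInCC_emails_)
           (COUNT(DISTINCT ?revision) AS ?commits_)
    WHERE {
      ?person rdf:type foaf:Person .
      ?revision rdf:type vcs:VcsRevision .
      ?revision vcs:committedBy ?person .
    }
    GROUP BY ?person
  }
}
GROUP BY ?person
ORDER BY DESC(?commits)
```

Each branch produces at most one row per person, so the outer aggregation only ever sees up to four rows per person. One difference from the OPTIONAL version to keep in mind: a person with no emails and no commits at all matches none of the branches and therefore drops out of the result, so a further branch that selects every foaf:Person with all four counts set to 0 would probably be needed if such people should still be listed.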
