I am trying to compute some per-user statistics from a triple repository using SPARQL; see the query below. How can it be improved? Am I doing something wrong here? Why does it consume so much memory? (See the background at the end of this post.)
I would prefer to do the aggregation entirely inside the triple store. Splitting the query would mean joining the results "manually" outside the database, losing the efficiency and optimizations of the triple store; there is no need to reinvent the wheel for no good reason.
Query
SELECT ?person
       (COUNT(DISTINCT ?sent_email) AS ?sent_emails)
       (COUNT(DISTINCT ?received_email) AS ?received_emails)
       (COUNT(DISTINCT ?receivedInCC_email) AS ?receivedInCC_emails)
       (COUNT(DISTINCT ?revision) AS ?commits)
WHERE {
  ?person rdf:type foaf:Person .
  OPTIONAL { ?sent_email rdf:type email:Email .
             ?sent_email email:sender ?person . }
  OPTIONAL { ?received_email rdf:type email:Email .
             ?received_email email:recipient ?person . }
  OPTIONAL { ?receivedInCC_email rdf:type email:Email .
             ?receivedInCC_email email:ccRecipient ?person . }
  OPTIONAL { ?revision rdf:type vcs:VcsRevision .
             ?revision vcs:committedBy ?person . }
}
GROUP BY ?person
ORDER BY DESC(?commits)
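For reference, one way to keep the aggregation inside the triple store while avoiding the join of all four OPTIONAL patterns is to compute each count in its own sub-SELECT (this assumes the engine supports SPARQL 1.1 subqueries; prefix declarations are omitted as in the query above, and I have not tested this against AllegroGraph):

```sparql
SELECT ?person ?sent_emails ?received_emails ?receivedInCC_emails ?commits
WHERE {
  ?person rdf:type foaf:Person .
  OPTIONAL {
    SELECT ?person (COUNT(DISTINCT ?sent_email) AS ?sent_emails) WHERE {
      ?sent_email rdf:type email:Email .
      ?sent_email email:sender ?person .
    } GROUP BY ?person
  }
  OPTIONAL {
    SELECT ?person (COUNT(DISTINCT ?received_email) AS ?received_emails) WHERE {
      ?received_email rdf:type email:Email .
      ?received_email email:recipient ?person .
    } GROUP BY ?person
  }
  OPTIONAL {
    SELECT ?person (COUNT(DISTINCT ?receivedInCC_email) AS ?receivedInCC_emails) WHERE {
      ?receivedInCC_email rdf:type email:Email .
      ?receivedInCC_email email:ccRecipient ?person .
    } GROUP BY ?person
  }
  OPTIONAL {
    SELECT ?person (COUNT(DISTINCT ?revision) AS ?commits) WHERE {
      ?revision rdf:type vcs:VcsRevision .
      ?revision vcs:committedBy ?person .
    } GROUP BY ?person
  }
}
ORDER BY DESC(?commits)
```

Each subquery is grouped and counted independently, so the outer join only combines one aggregated row per person and count, rather than every combination of matching emails and revisions.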
Background
The problem is that the query fails in AllegroGraph with the error "QUERY MEMORY LIMIT REACHED" (see also my related SO question). Since the repository contains only about 200 thousand triples, which fit into an N-Triples input file of roughly 60 MB, I'm wondering why more than 4 GB of RAM is needed to compute the query results, which is about two orders of magnitude more.
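My current guess at the cause (an assumption, since I have no query plan to confirm it): the four OPTIONAL patterns are all joined into one solution multiset before GROUP BY, so the number of intermediate rows per person is the product of that person's match counts, not their sum. With hypothetical counts for a single busy person:

```latex
% hypothetical per-person match counts, for illustration only
N_{\text{rows}}
  = 1000_{\text{sent}} \times 1000_{\text{received}}
    \times 200_{\text{cc}} \times 100_{\text{commits}}
  = 2 \times 10^{10}
```

which would easily dwarf the 200 thousand triples actually stored.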