How to parse large datasets using RDFLib?

I am trying to parse several large graphs with RDFLib 3.0. Apparently it processes the first one and dies on the second with a MemoryError. Since MySQL no longer seems to be supported as a storage backend, can you suggest a way to analyze these graphs somehow?

Traceback (most recent call last):
  File "names.py", line 152, in <module>
    main()
  File "names.py", line 91, in main
    locals()[graphname].parse(filename, format="nt")
  File "/usr/local/lib/python2.6/dist-packages/rdflib-3.0.0-py2.6.egg/rdflib/graph.py", line 938, in parse
    location=location, file=file, data=data, **args)
  File "/usr/local/lib/python2.6/dist-packages/rdflib-3.0.0-py2.6.egg/rdflib/graph.py", line 757, in parse
    parser.parse(source, self, **args)
  File "/usr/local/lib/python2.6/dist-packages/rdflib-3.0.0-py2.6.egg/rdflib/plugins/parsers/nt.py", line 24, in parse
    parser.parse(f)
  File "/usr/local/lib/python2.6/dist-packages/rdflib-3.0.0-py2.6.egg/rdflib/plugins/parsers/ntriples.py", line 124, in parse
    self.line = self.readline()
  File "/usr/local/lib/python2.6/dist-packages/rdflib-3.0.0-py2.6.egg/rdflib/plugins/parsers/ntriples.py", line 151, in readline
    m = r_line.match(self.buffer)
MemoryError
1 answer

How many triples are in these RDF files? I have tested rdflib and it won't scale much beyond a few tens of thousands of triples, if you are lucky. It definitely does not cope well with files containing millions of triples.

The best parser out there is rapper, which comes with the Redland libraries. My first tip is not to use RDF/XML and to switch to N-Triples, which is a lighter format than RDF/XML. You can convert from RDF/XML to N-Triples with rapper:

rapper -i rdfxml -o ntriples YOUR_FILE.rdf > YOUR_FILE.ntriples
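Since N-Triples is a line-based format with one triple per line, a quick wc -l YOUR_FILE.ntriples gives you a rough count of how many triples you are dealing with.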

If you like Python, you can use the Redland Python bindings:

import RDF

# Parse the N-Triples file into an in-memory Redland model
parser = RDF.Parser(name="ntriples")
model = RDF.Model()
parser.parse_into_model(model, "file://file_path", "http://your_base_uri.org")

for triple in model:
    print triple.subject, triple.predicate, triple.object
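If memory is still tight, the Redland bindings can also hand you statements one at a time as a stream instead of loading everything into a model first. A minimal sketch along these lines (the file path and base URI are placeholders):

import RDF

# Stream statements one by one; nothing is accumulated in memory
parser = RDF.Parser(name="ntriples")
stream = parser.parse_as_stream("file://file_path", "http://your_base_uri.org")

count = 0
for statement in stream:
    # each statement exposes .subject, .predicate and .object nodes
    count += 1
print count, "triples seen"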

I have parsed rather large files (a couple of gigabytes) with the Redland libraries without any problems.

Eventually, if you are working with large datasets, you may need to load your data into a scalable triple store; I normally use 4store for that. 4store internally uses Redland to parse RDF files. In the long run, I think going for a scalable triple store is what you will have to do. With one in place, you can use SPARQL to query your data and SPARQL/Update to insert and delete triples.
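For example, once your data is imported into 4store and its HTTP server is running (started with something like 4s-httpd -p 8000 your_kb; the port and knowledge-base name here are just for illustration), you can query it from Python over the standard SPARQL protocol:

import urllib
import urllib2

# Assumed endpoint: 4store's SPARQL HTTP server on localhost:8000
# (adjust host, port and path to your own setup)
endpoint = "http://localhost:8000/sparql/"

query = "SELECT ?s ?p ?o WHERE { ?s ?p ?o } LIMIT 10"
params = urllib.urlencode({"query": query})

# POST the query per the SPARQL protocol and print the raw response
response = urllib2.urlopen(endpoint, params)
print response.read()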
