Cython: How to move large objects without copying them?

I use Cython to wrap C++ code and expose it to Python for interactive work. My problem is that I need to read large graphs (several gigabytes) from a file, and they end up in memory twice. Can someone help me diagnose and solve this problem?

My Cython wrapper for the graph class is as follows:

    cdef extern from "../src/graph/Graph.h":
        cdef cppclass _Graph "Graph":
            _Graph() except +
            _Graph(count) except +
            count numberOfNodes() except +
            count numberOfEdges() except +

    cdef class Graph:
        """An undirected, optionally weighted graph"""
        cdef _Graph _this

        def __cinit__(self, n=None):
            if n is not None:
                self._this = _Graph(n)

        # any object which appears as a return type needs to implement setThis
        cdef setThis(self, _Graph other):
            #del self._this
            self._this = other
            return self

        def numberOfNodes(self):
            return self._this.numberOfNodes()

        def numberOfEdges(self):
            return self._this.numberOfEdges()

To return a Python Graph, it has to be created empty first, and then the setThis method is used to install the native _Graph object. This happens, for example, when a Graph is read from a file. That is the job of this class:

    cdef extern from "../src/io/METISGraphReader.h":
        cdef cppclass _METISGraphReader "METISGraphReader":
            _METISGraphReader() except +
            _Graph read(string path) except +

    cdef class METISGraphReader:
        """ Reads the METIS adjacency file format [1]
            [1]: http://people.sc.fsu.edu/~jburkardt/data/metis_graph/metis_graph.html
        """
        cdef _METISGraphReader _this

        def read(self, path):
            pathbytes = path.encode("utf-8")  # string needs to be converted to bytes, which are coerced to std::string
            return Graph(0).setThis(self._this.read(pathbytes))

Interactive use is as follows:

  >>> G = graphio.METISGraphReader().read("giant.metis.graph") 

After the read from the file finishes and X GB of memory is in use, there is a phase in which a copy is evidently made, and after that 2X GB of memory is in use. All of the memory is freed when del G is called.

Where is my mistake that leads to the graph being copied and existing twice in memory?

+6
3 answers

I don't have a definitive answer for you, but I have a theory.

The Cython wrappers you wrote are unusual in that they wrap the C++ object directly rather than a pointer to it.

The following code is particularly inefficient:

    cdef setThis(self, _Graph other):
        self._this = other
        return self

The reason is that your _Graph class contains several STL vectors, and those have to be copied. So when your other object is assigned to self._this, memory usage is effectively doubled (or worse, since STL allocators can over-allocate for performance reasons).
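To make the cost of that assignment concrete, here is a minimal C++ sketch. The `Graph` struct is a hypothetical stand-in with a single vector, not the real class, and the two helper functions are invented for illustration. It compares a copy assignment, which allocates a second buffer, with a C++11 move assignment, which hands the existing buffer over. Whether Cython emits a move for a given wrapper is not something this sketch shows; it only illustrates what happens at the C++ level.

```cpp
#include <cstddef>
#include <utility>
#include <vector>

// Hypothetical stand-in for the real Graph class: one large STL vector.
struct Graph {
    std::vector<int> adjacency;
    explicit Graph(std::size_t n = 0) : adjacency(n) {}
};

// Copy assignment: the target allocates its own buffer and duplicates the
// elements. Returns true if the target ended up with the source's buffer.
bool copySharesBuffer(std::size_t n) {
    Graph source(n);
    const int* before = source.adjacency.data();
    Graph target;
    target = source;  // deep copy: both buffers are alive at the same time
    return target.adjacency.data() == before;
}

// Move assignment (C++11): the target takes over the source's buffer.
// Returns true if the target ended up with the source's buffer.
bool moveSharesBuffer(std::size_t n) {
    Graph source(n);
    const int* before = source.adjacency.data();
    Graph target;
    target = std::move(source);  // buffer is handed over, no element copy
    return target.adjacency.data() == before;
}
```

With a non-empty graph, `copySharesBuffer` reports that a second buffer was allocated, which is the doubling described above, while `moveSharesBuffer` reports that the original buffer was reused.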

I wrote a simple test that mirrors yours and added logging to see how objects are created, copied, and destroyed. I can't find any problem there. Copies do occur, but once the assignment completes, only one object remains.

So my theory is that the extra memory you see is related to the STL allocator logic in vectors. All this extra memory must be attached to the final object after the copies.
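A test of that kind could look roughly like the following C++ sketch. It is not the test the answer refers to: `TracedGraph`, `Holder`, `readGraph`, and `countCopies` are hypothetical names, and a copy counter stands in for the log output. The class counts deep copies while a graph is returned by value and then installed through a by-value `setThis`.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical stand-in for the real Graph class that counts deep copies.
struct TracedGraph {
    static int copies;           // deep copies made since the last reset
    std::vector<int> adjacency;  // stands in for the graph's STL containers

    explicit TracedGraph(std::size_t n = 0) : adjacency(n) {}
    TracedGraph(const TracedGraph& other) : adjacency(other.adjacency) { ++copies; }
    TracedGraph& operator=(const TracedGraph& other) {
        adjacency = other.adjacency;
        ++copies;
        return *this;
    }
};
int TracedGraph::copies = 0;

// Mimics the by-value wrapper: the argument arrives by value and is then
// copy-assigned into a member.
struct Holder {
    TracedGraph held;
    void setThis(TracedGraph other) { held = other; }
};

// Mimics a reader that builds a graph and returns it by value.
TracedGraph readGraph(std::size_t n) {
    TracedGraph g(n);
    return g;
}

// Returns how many deep copies one read-and-install sequence performed.
int countCopies(std::size_t n) {
    TracedGraph::copies = 0;
    Holder h;
    h.setThis(readGraph(n));
    return TracedGraph::copies;
}
```

The assignment inside `setThis` always accounts for one deep copy. Whether the by-value return and the by-value argument add more depends on the compiler's copy elision and the language standard in use, so the exact count can differ between builds.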

My recommendation is that you switch to a more standard pointer-based wrapper. Your _Graph wrapper should be defined more or less as follows:

    cdef class Graph:
        """An undirected, optionally weighted graph"""
        cdef _Graph* _this

        def __cinit__(self, n=None):
            if n is not None:
                self._this = new _Graph(n)
            else:
                self._this = 0

        cdef setThis(self, _Graph* other):
            del self._this
            self._this = other
            return self

        def __dealloc__(self):
            del self._this

Note that _this now has to be deleted explicitly, because it is a pointer.

Then you need to change your METISGraphReader::read() method to return a heap-allocated Graph. The prototype of this method should be changed to:

    Graph* METISGraphReader::read(std::string path);

Then the Cython wrapper for it can be written as:

    def read(self, path):
        pathbytes = path.encode("utf-8")  # string needs to be converted to bytes, which are coerced to std::string
        return Graph().setThis(self._this.read(pathbytes))

If you do this, there is only one object, and it is created on the heap by read(). A pointer to this object is returned to the Cython read() wrapper, which then installs it in a new Graph() instance. The only thing that gets copied is the 4 or 8 bytes of the pointer.
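The same hand-off can be sketched in plain C++ to check that the graph data never moves. Everything here is illustrative: `Graph` is a hypothetical one-vector stand-in, `readGraph` plays the role of the revised read(), and `GraphHandle` plays the role of the pointer-based Cython wrapper.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical stand-in for the real Graph class.
struct Graph {
    std::vector<int> adjacency;
    explicit Graph(std::size_t n) : adjacency(n) {}
};

// Plays the role of the revised read(): the graph is built once on the heap
// and the caller takes ownership of the returned pointer.
Graph* readGraph(std::size_t n) {
    return new Graph(n);
}

// Plays the role of the pointer-based wrapper: it owns the pointer and
// deletes it on replacement and on destruction.
class GraphHandle {
public:
    GraphHandle() : ptr_(nullptr) {}
    ~GraphHandle() { delete ptr_; }

    // Copying the handle would lead to a double delete, so it is disabled.
    GraphHandle(const GraphHandle&) = delete;
    GraphHandle& operator=(const GraphHandle&) = delete;

    // Takes ownership of other; only the pointer value is copied.
    void setThis(Graph* other) {
        delete ptr_;
        ptr_ = other;
    }

    const Graph* get() const { return ptr_; }
    std::size_t numberOfNodes() const { return ptr_ ? ptr_->adjacency.size() : 0; }

private:
    Graph* ptr_;
};
```

After `setThis`, the handle refers to the very object that `readGraph` allocated, and the vector's buffer sits at the same address as before, so no graph data was duplicated along the way.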

Hope this helps!

+3

You will need to change the C++ class to store its data through a shared_ptr. Make sure you have a correct copy constructor and assignment operator:

    #define _CRT_SECURE_NO_WARNINGS
    #include <string.h>  // strncpy, memcpy, memset, NULL
    #include <memory>    // std::shared_ptr

    struct Data { // your graph data
        Data(const char* _d = NULL) {
            memset(d, 0, sizeof(d));  // zero first so d stays NUL-terminated
            if (_d) strncpy(d, _d, sizeof(d) - 1);
        }
        Data(const Data& rhs) { memcpy(d, rhs.d, sizeof(d)); }
        ~Data() { memset(d, 0, sizeof(d)); }
        void DoSomething() { /* do something */ } // a public method that was used in Python
        char d[1024];
    };

    class A { // the wrapper class
    public:
        A() {}
        A(const char* name) : pData(new Data(name)) {}
        A(const A& rhs) : pData(rhs.pData) {}  // shares the Data, no deep copy
        A& operator=(const A& rhs) { pData = rhs.pData; return *this; }
        ~A() {}
        // interface with Data
        void DoSomething() { if (pData.get() != NULL) pData->DoSomething(); }
    private:
        std::shared_ptr<Data> pData;
    };

    int main(int argc, char** argv) {
        A o1("Hello!");
        A o2(o1);
        A o3;
        o3 = o2;
        return 0;
    }
0

If your constraint/goal is "compute on graphs with billions of edges in a reasonable amount of time on a single PC", consider refactoring to use GraphChi.

If single-machine, in-memory processing is not a constraint, consider using a graph database such as Neo4j instead of pulling all the data into memory. There are also graph APIs layered on top of Hadoop (such as Apache Giraph).

-3

Source: https://habr.com/ru/post/958571/

