How can I reduce wand memory usage?

I use wand and pytesseract to extract the text of PDF files uploaded to a django site, for example:

```python
image_pdf = Image(blob=read_pdf_file, resolution=300)
image_png = image_pdf.convert('png')
req_image = []
final_text = []
for img in image_png.sequence:
    img_page = Image(image=img)
    req_image.append(img_page.make_blob('png'))
for img in req_image:
    txt = pytesseract.image_to_string(PI.open(io.BytesIO(img)).convert('RGB'))
    final_text.append(txt)
return " ".join(final_text)
```

This runs in celery on a separate EC2 server. However, since `image_pdf` grows to about 4 GB even for a 13.7 MB file, the OOM killer stops it. Instead of paying for a larger instance, I want to try to reduce the memory used by wand and ImageMagick. Since this is already async, I don't mind increasing the computation time. I saw this: http://www.imagemagick.org/Usage/files/#massive , but I'm not sure whether it can be implemented with wand. Another possible fix would be a way to open a PDF in wand one page at a time, instead of loading the full image into RAM at once. Alternatively, how could I interact with ImageMagick directly from Python to use those memory-restriction methods?
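As a partial answer to the question above: recent wand versions (0.5+) do expose ImageMagick's resource limits, the same knobs as the `-limit` options on the linked ImageMagick page, through `wand.resource.limits`. A minimal sketch; the helper names are mine, and the actual `wand` import is guarded since it requires ImageMagick to be installed:

```python
def mb(n):
    """Megabytes to bytes, the unit ImageMagick resource limits use."""
    return n * 1024 * 1024

def cap_imagemagick(limits, memory_mb=256, map_mb=512, disk_mb=4096):
    """Cap the pixel cache. Past 'memory' ImageMagick memory-maps the
    cache; past 'map' it spills to disk files instead of RAM."""
    limits['memory'] = mb(memory_mb)
    limits['map'] = mb(map_mb)
    limits['disk'] = mb(disk_mb)

if __name__ == "__main__":
    # Requires wand >= 0.5 with ImageMagick installed.
    from wand.resource import limits
    cap_imagemagick(limits)
    # Any Image(...) opened after this point respects the caps.
```

With the cache capped, pages that would otherwise balloon RAM get paged to disk by ImageMagick itself, trading speed for memory, which matches the "async, don't mind slower" constraint.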

3 answers

Remember that the wand library integrates with the MagickWand API, which in turn delegates PDF encoding/decoding to ghostscript. Both MagickWand and ghostscript allocate additional memory resources, and it is best to free them at the end of each task. However, if routines are initialized by Python and held in a variable, it is more than possible to introduce memory leaks.

Here are some tips to ensure proper memory management.

  1. Use `with` context management for all wand objects. This ensures every resource passes through `__enter__` & `__exit__`.

  2. Avoid creating blobs for data transfer. When creating a file-format blob, MagickWand allocates additional memory to copy and encode the image, and Python then holds the resulting data on top of the original wand instance. Usually fine in a development environment, but this can quickly get out of hand in production.

  3. Avoid `Image.sequence`. This is another copy-heavy routine that leaves Python holding a pile of memory resources. Remember, ImageMagick manages image stacks very well, so unless you are reordering or manipulating individual frames, it is best to use MagickWand methods without involving Python.

  4. Each task should be an isolated process that can shut down cleanly on completion. This shouldn't be a problem for you with celery as the queue worker, but double-check the thread/worker configuration and docs.

  5. Beware of resolution. A PDF rendered at 300 dpi on a Q16 ImageMagick build produces a huge bitmap. With many OCR pipelines (tesseract/opencv), the first step is to preprocess the incoming data to strip redundant or unneeded colors, channels, and data anyway.
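A back-of-the-envelope sketch of why tip 5 matters (the helper is mine; it assumes 4 channels and ImageMagick's quantum depth in bytes):

```python
def page_bytes(width_in, height_in, dpi, channels=4, bytes_per_quantum=2):
    """Approximate pixel-cache size for one rendered PDF page."""
    w = round(width_in * dpi)
    h = round(height_in * dpi)
    return w * h * channels * bytes_per_quantum

# A4 page at 300 dpi on a Q16 build (2 bytes per quantum), RGBA:
a4_q16_300 = page_bytes(8.27, 11.69, 300)                       # ~70 MB/page
# Same page at 100 dpi with 8-bit depth:
a4_q8_100 = page_bytes(8.27, 11.69, 100, bytes_per_quantum=1)   # ~4 MB/page
```

So dropping to 100 dpi and `depth = 8`, as the example below does, cuts per-page cache by roughly 18x before OCR even starts; a multi-page PDF multiplies that saving by the page count.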

Here is an example of how I would approach this. Note that I use ctypes to manage the image stack directly, without additional Python resources.

```python
import ctypes
from wand.image import Image
from wand.api import library

# Tell wand about the C-API methods
library.MagickNextImage.argtypes = [ctypes.c_void_p]
library.MagickNextImage.restype = ctypes.c_int

# ... skip to the calling method ...
final_text = []
with Image(blob=read_pdf_file, resolution=100) as context:
    context.depth = 8
    library.MagickResetIterator(context.wand)
    while library.MagickNextImage(context.wand) != 0:
        data = context.make_blob("RGB")
        text = pytesseract.image_to_string(data)
        final_text.append(text)
return " ".join(final_text)
```

Of course, your mileage may vary. If you are comfortable working outside Python, you can run gs and tesseract directly and drop the Python wrappers entirely.
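A sketch of that shell-level route, driven from Python via `subprocess` so it still fits a celery task. The flags are standard gs/tesseract options; the helper names and the page-file pattern are mine:

```python
import subprocess
import tempfile
from pathlib import Path

def gs_cmd(pdf_path, out_pattern, dpi=150):
    """Ghostscript command rendering each PDF page to its own PNG.
    Rendering page-by-page keeps only one page's raster alive at a time."""
    return [
        "gs", "-dNOPAUSE", "-dBATCH", "-sDEVICE=png16m",
        f"-r{dpi}", f"-sOutputFile={out_pattern}", str(pdf_path),
    ]

def tesseract_cmd(png_path):
    """Tesseract command printing recognized text to stdout."""
    return ["tesseract", str(png_path), "stdout"]

def ocr_pdf(pdf_path):
    """OCR a PDF one page at a time; temp PNGs are removed on exit."""
    with tempfile.TemporaryDirectory() as tmp:
        pattern = str(Path(tmp) / "page-%03d.png")
        subprocess.run(gs_cmd(pdf_path, pattern), check=True)
        texts = []
        for png in sorted(Path(tmp).glob("page-*.png")):
            result = subprocess.run(tesseract_cmd(png), check=True,
                                    capture_output=True, text=True)
            texts.append(result.stdout)
    return " ".join(texts)
```

Peak memory is then bounded by whatever gs needs for a single page plus one PNG on disk, instead of the whole decoded document in Python's heap.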


The code from @emcconville works, and my temporary folder is no longer filling up with magick-* files.

I needed to `import ctypes`, not `ctyles`.

I also got the error mentioned by @kerthik.

I solved it by saving the image and loading it again; it is also possible to keep it in memory:

```python
from PIL import Image as PILImage
...
context.save(filename="temp.jpg")
text = pytesseract.image_to_string(PILImage.open("temp.jpg"))
```

EDIT: I found the in-memory conversion in How to convert wand.image.Image to PIL.Image?

```python
img_buffer = np.asarray(bytearray(context.make_blob(format='png')), dtype='uint8')
bytesio = io.BytesIO(img_buffer)
text = pytesseract.image_to_string(PILImage.open(bytesio), lang="dan")
```

I also suffered from memory leak problems. After some research and code tuning, my problems were resolved. Basically, I solved it by using the `with` statement and the `destroy()` function correctly.

In some cases I could use `with` to open and read files, as in the example below:

```python
with Image(filename=pdf_file, resolution=300) as pdf:
```

In this case, using `with`, the memory and tmp files are properly managed.

In other cases I had to use the `destroy()` function, preferably inside a try/finally block, as shown below:

```python
try:
    for img in pdfImg.sequence:
        # your code
finally:
    img.destroy()
```

The second case is an example where I cannot use `with`, because I had to iterate over the pages via `sequence`, so the file was already open and I was iterating over its pages.

This combination of solutions fixed my memory leak problems.


Source: https://habr.com/ru/post/986741/

