How to read a group of shapes as an image from a Word document (.doc or .docx) using Apache POI?

I have a simple requirement: extract all images and diagrams from an MS Word file. I can extract images, but not groups of shapes (for example, a use case diagram or an activity diagram). I want to save every diagram as an image.

I am using Apache POI.

Here is the code I wrote:

    import java.awt.image.BufferedImage;
    import java.io.ByteArrayInputStream;
    import java.io.File;
    import java.io.FileInputStream;
    import java.util.List;
    import javax.imageio.ImageIO;
    import org.apache.poi.openxml4j.opc.PackagePart;
    import org.apache.poi.xwpf.usermodel.XWPFDocument;
    import org.apache.poi.xwpf.usermodel.XWPFPictureData;

    public class worddocreader {
        public static void main(String[] args) {
            try (FileInputStream fs = new FileInputStream("F:/1.docx")) {
                XWPFDocument docx = new XWPFDocument(fs);

                // Pass 1: pictures registered in the document
                int i = 0;
                for (XWPFPictureData pic : docx.getAllPictures()) {
                    BufferedImage imag = ImageIO.read(new ByteArrayInputStream(pic.getData()));
                    if (imag != null) { // null when ImageIO cannot decode the data
                        // ImageIO.write expects a format name such as "jpg", not a MIME type
                        ImageIO.write(imag, "jpg", new File("F:/docParsing/imagefromword" + i + ".jpg"));
                    }
                    i++;
                }

                // Pass 2: try to decode every part of the package as an image
                List<PackagePart> parts = docx.getPackage().getParts();
                System.out.println("Array List Size : " + parts.size());
                for (int size = parts.size() - 1; size >= 0; size--) {
                    PackagePart packagePart = parts.get(size);
                    System.out.println(packagePart.getContentType());
                    try {
                        BufferedImage bfrImage = ImageIO.read(packagePart.getInputStream());
                        if (bfrImage != null) { // skip parts that are not readable images
                            ImageIO.write(bfrImage, "png", new File("F:/docParsing_emb/size" + size + ".png"));
                        }
                    } catch (Exception e) {
                        e.printStackTrace();
                    }
                }
                System.out.println("Done");
            } catch (Exception e) {
                e.printStackTrace();
            }
        }
    }

It only extracts images, not shapes.
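
One detail in the ImageIO calls is worth checking separately from the shapes problem: ImageIO.write expects an informal format name such as "jpg" or "png", not a MIME type such as "image/jpeg". When no writer matches the name, it returns false and writes nothing. A minimal, self-contained sketch of that behaviour follows; ImageFormatCheck and canWrite are hypothetical names used only for illustration.

```java
import java.awt.image.BufferedImage;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import javax.imageio.ImageIO;

// Hypothetical helper class, not part of the code above.
final class ImageFormatCheck {

    // Encodes the image in memory and reports whether ImageIO
    // found a writer for the given format name.
    static boolean canWrite(BufferedImage image, String formatName) {
        ImageIO.setUseCache(false); // keep everything in memory, no temp files
        try {
            return ImageIO.write(image, formatName, new ByteArrayOutputStream());
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        BufferedImage image = new BufferedImage(4, 4, BufferedImage.TYPE_INT_RGB);
        System.out.println("image/png -> " + canWrite(image, "image/png"));
        System.out.println("png       -> " + canWrite(image, "png"));
    }
}
```

With a standard JDK this should report false for the MIME type and true for "png", so checking the boolean result of ImageIO.write is a cheap way to notice a wrong format name.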

Does anyone know how to do this?

2 answers

So you are after the material specified in [MS-ODRAW], i.e. the so-called OfficeDrawings that can be created directly in Word using the Drawing palette?

Unfortunately, POI offers little help here. With HWPF (the old binary *.doc format) you can get a handle to such data, for example:

    HWPFDocument document; // an already loaded .doc file
    OfficeDrawings officeDrawings = document.getOfficeDrawingsMain();
    // OFFSET is a global character offset describing the position of the drawing in question,
    // i.e. document.getRange().getStartOffset() + x
    OfficeDrawing drawing = officeDrawings.getOfficeDrawingAt(OFFSET);

This drawing can then be broken down further into its individual records:

    EscherRecordManager escherRecordManager = new EscherRecordManager(drawing.getOfficeArtSpContainer());
    EscherSpRecord escherSpRecord = escherRecordManager.getSpRecord();
    EscherOptRecord escherOptRecord = escherRecordManager.getOptRecord();

Using the data from all of these records, you can theoretically render the original drawing again. But it's pretty painful...

So far, I have only done this in one case, where I had a lot of simple arrows floating on the page. They had to be converted to a textual representation (something like: "Positions (x1, y1) and (x2, y2) are connected by an arrow"). Basically, this meant implementing the subset of [MS-ODRAW] that relates to these arrows, using the records above. Not a really nice task.
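
To make the target format concrete, here is a minimal sketch of the last step only, turning two already known end points into that sentence. ArrowDescription and describe are hypothetical names, and the coordinates are assumed to have been decoded from the shape's records beforehand; that decoding, which is the hard part, is not shown.

```java
import java.util.Locale;

// Hypothetical helper; not part of POI.
final class ArrowDescription {

    // Formats the two end points of an arrow as a plain-text sentence.
    // The coordinates are assumed to be extracted from the drawing already.
    static String describe(int x1, int y1, int x2, int y2) {
        return String.format(Locale.ROOT,
                "Positions (%d, %d) and (%d, %d) are connected by an arrow",
                x1, y1, x2, y2);
    }

    public static void main(String[] args) {
        System.out.println(describe(10, 20, 30, 40));
    }
}
```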

Fallback: MS Word

If using MS Word itself is an option for you, then there is another, more pragmatic way:

  • Retrieve all relevant offsets containing OfficeDrawings using POI.
  • Inside Word: navigate through the document using VBA and copy all the drawings at the given offsets to the clipboard.
  • Use another application (I chose Visio) to export the contents of the clipboard as PNG.

The check for a drawing that step 1 requires is very simple (see below). The rest can be fully automated in Word. If anyone needs this, I can share the appropriate VBA code.

    if (characterRun.isSpecialCharacter()) {
        for (char currentChar : characterRun.text().toCharArray()) {
            if ('\u0008' == currentChar) return true;
        }
    }
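
The same check can be sketched without POI to show how the offsets for step 1 might be collected. This is only an illustration under assumptions: DrawingOffsets and find are hypothetical names, the text is assumed to have been read from the document already, and each character of the string is assumed to correspond to exactly one document offset. The sketch also leaves out the isSpecialCharacter() test from the snippet above.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper; works on plain text instead of POI character runs.
final class DrawingOffsets {

    // Placeholder character that marks a drawing in the text.
    static final char DRAWING_PLACEHOLDER = '\u0008';

    // Returns the global offsets of all placeholder characters in the text.
    // startOffset is the global offset of the first character of the text.
    static List<Integer> find(String text, int startOffset) {
        List<Integer> offsets = new ArrayList<>();
        for (int i = 0; i < text.length(); i++) {
            if (text.charAt(i) == DRAWING_PLACEHOLDER) {
                offsets.add(startOffset + i);
            }
        }
        return offsets;
    }

    public static void main(String[] args) {
        System.out.println(find("ab\bcd\b", 100));
    }
}
```

Each offset found this way would then be the kind of value handed to the VBA step.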

If you mean Office Art objects:

The class org.apache.poi.hwpf.HWPFDocument has a field, _officeDrawingsMain, that contains the Office Art objects.

See https://poi.apache.org/apidocs/org/apache/poi/hwpf/HWPFDocument.html


Source: https://habr.com/ru/post/971549/
