How to read a text document with bold and italic formatting using POI

Question

I am using Apache POI.

I can read the text of a .doc file using "org.apache.poi.hwpf.extractor.WordExtractor".

I can also fetch tables using "org.apache.poi.hwpf.usermodel.Table".

But how can I get the bold / italic formatting of the text?

Thanks in advance.

+4

doc apache-poi italic bold hwpf

Sudeep nayak Jun 05 '13 at 10:40

2 answers

Instead of using WordExtractor, you can read the document's Range:

    ...
    HWPFDocument doc = new HWPFDocument(fis);
    Range r = doc.getRange();
    ...

Range is the central class of this model. Once you have a Range, you can inspect the formatting of the text: for example, iterate over all of its CharacterRuns and check whether each one is italic (.isItalic()), or make it italic (.setItalic(true)).

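    for (int i = 0; i < r.numCharacterRuns(); i++) {
        CharacterRun cr = r.getCharacterRun(i);
        // cr.isItalic() / cr.isBold() read the formatting; setItalic changes it
        cr.setItalic(true);
        ...
    }
    ...
    // write the modified document back out
    File fon = new File(yourFilePathOut);
    FileOutputStream fos = new FileOutputStream(fon);
    doc.write(fos);
    ...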

This works if you use HWPF. In practice, it is often more convenient to work paragraph by paragraph, using the Paragraph class.
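To show what you might do with the flags once you have read them, here is a minimal sketch that turns a list of runs into simple markup. StyledRun and RunMarkup are hypothetical names, not part of POI: StyledRun is a stand-in for the values you would copy out of each CharacterRun (cr.text(), cr.isBold(), cr.isItalic()), so the sketch runs without a .doc file. It does not escape special characters in the text.

```java
import java.util.ArrayList;
import java.util.List;

class RunMarkup {
    /** Hypothetical stand-in for the values read from one POI CharacterRun. */
    static final class StyledRun {
        final String text;
        final boolean bold;
        final boolean italic;

        StyledRun(String text, boolean bold, boolean italic) {
            this.text = text;
            this.bold = bold;
            this.italic = italic;
        }
    }

    /** Wraps each non-empty run in <b> / <i> tags according to its flags. */
    static String render(List<StyledRun> runs) {
        StringBuilder out = new StringBuilder();
        for (StyledRun run : runs) {
            if (run.text.isEmpty()) {
                continue; // nothing to emit for an empty run
            }
            if (run.bold) out.append("<b>");
            if (run.italic) out.append("<i>");
            out.append(run.text);
            if (run.italic) out.append("</i>");
            if (run.bold) out.append("</b>");
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // With POI, each entry would be built inside the CharacterRun loop:
        //   runs.add(new StyledRun(cr.text(), cr.isBold(), cr.isItalic()));
        List<StyledRun> runs = new ArrayList<>();
        runs.add(new StyledRun("Plain, ", false, false));
        runs.add(new StyledRun("bold", true, false));
        runs.add(new StyledRun(" and ", false, false));
        runs.add(new StyledRun("bold italic", true, true));
        System.out.println(render(runs));
    }
}
```

Copying the flags into a small value object like this keeps the POI-specific loop short and makes the formatting logic easy to test on its own.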

+1

Darius miliauskas Oct 27 '15 at 12:10

Gagravarr · Accepted Answer · 2013-06-05T15:25:07+0000

WordExtractor returns only text, nothing more.

The easiest way to get the text plus formatting of a Word document is to switch to Apache Tika. Apache Tika is built on top of Apache POI (among others) and offers both plain text extraction and rich extraction (XHTML with formatting).

Alternatively, if you want to write the code yourself, I suggest you look at the code of Tika's WordExtractor, which demonstrates how to use Apache POI to get the formatting information for text runs.
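If you go the Tika route, you end up with XHTML that you still have to interpret. The sketch below is not Tika code: it uses only the JDK's SAX parser and assumes that bold text arrives wrapped in <b> elements (check this against the XHTML you actually get). BoldTextCollector and the sample string are hypothetical and for illustration only.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

class BoldTextCollector extends DefaultHandler {
    private final List<String> boldText = new ArrayList<>();
    private final StringBuilder current = new StringBuilder();
    private int boldDepth = 0; // > 0 while inside at least one <b> element

    @Override
    public void startElement(String uri, String localName, String qName, Attributes atts) {
        if ("b".equals(qName)) {
            boldDepth++;
        }
    }

    @Override
    public void endElement(String uri, String localName, String qName) {
        // When the outermost <b> closes, store the text gathered inside it.
        if ("b".equals(qName) && --boldDepth == 0) {
            boldText.add(current.toString());
            current.setLength(0);
        }
    }

    @Override
    public void characters(char[] ch, int start, int length) {
        if (boldDepth > 0) {
            current.append(ch, start, length);
        }
    }

    /** Returns the text of every bold span in a well-formed XHTML string. */
    static List<String> collect(String xhtml) {
        BoldTextCollector handler = new BoldTextCollector();
        try {
            SAXParserFactory.newInstance().newSAXParser()
                    .parse(new InputSource(new StringReader(xhtml)), handler);
        } catch (Exception e) {
            throw new IllegalArgumentException("Could not parse XHTML", e);
        }
        return handler.boldText;
    }

    public static void main(String[] args) {
        // Hypothetical sample; real input would be the XHTML produced by Tika.
        String xhtml = "<html><body><p>Plain <b>bold</b> and "
                + "<b><i>bold italic</i></b></p></body></html>";
        System.out.println(collect(xhtml));
    }
}
```

The same handler can collect italic text by matching "i" instead of "b".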
