Apache Tika maxStringLength maximum length reached

I have thousands of PDF documents that are 11-15 MB each. When I parse one, my program reports that the document contains more than 100,000 characters.

Error output:

Exception in thread "main" org.apache.tika.sax.WriteOutContentHandler$WriteLimitReachedException: Your document contained more than 100000 characters, and so your requested limit has been reached. To receive the full text of the document, increase your limit.

How can I increase the limit to 10-15 MB?

I found a solution that uses the Tika facade class, but I could not find a way to integrate it with my code.

 Tika tika = new Tika();
 tika.setMaxStringLength(10*1024*1024);

Here is my code:

  BodyContentHandler handler = new BodyContentHandler();
  Metadata metadata = new Metadata();
  String location = "C:\\Users\\Laptop\\Dropbox\\MainTextbookTrappe2ndEd.pdf";
  FileInputStream inputstream = new FileInputStream(location);
  ParseContext pcontext = new ParseContext();
  PDFParser pdfparser = new PDFParser();
  pdfparser.parse(inputstream, handler, metadata, pcontext);

Output:

 System.out.println("Content of the PDF :" + pcontext); 
1 answer

Use

 BodyContentHandler handler = new BodyContentHandler(-1); 

to disable the limit. From the Javadoc:

The internal string buffer is bounded at the given number of characters. If this write limit is reached, then a SAXException is thrown.
Parameters: writeLimit - maximum number of characters to include in the string, or -1 to disable the write limit
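
To make the write-limit behaviour described in the Javadoc concrete, here is a minimal sketch that uses only the SAX classes shipped with the JDK. The classes LimitedTextHandler and WriteLimitSketch and the method hitsLimit are hypothetical names made up for this illustration; this is not Tika's implementation, only a stand-in that follows the same rule: collect characters until the limit is reached, then throw a SAXException, with -1 meaning no limit.

```java
import java.util.Arrays;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

/** Hypothetical stand-in for a write-limited content handler (not Tika code). */
class LimitedTextHandler extends DefaultHandler {
    private final int writeLimit; // maximum number of characters, or -1 for no limit
    private final StringBuilder text = new StringBuilder();

    LimitedTextHandler(int writeLimit) {
        this.writeLimit = writeLimit;
    }

    @Override
    public void characters(char[] ch, int start, int length) throws SAXException {
        if (writeLimit == -1 || text.length() + length <= writeLimit) {
            text.append(ch, start, length);
            return;
        }
        // Keep the part that still fits, then signal that the limit was reached.
        text.append(ch, start, writeLimit - text.length());
        throw new SAXException("write limit of " + writeLimit + " characters reached");
    }

    @Override
    public String toString() {
        return text.toString();
    }
}

class WriteLimitSketch {
    /** Feeds `total` characters to the handler in chunks; returns true if the limit was hit. */
    static boolean hitsLimit(LimitedTextHandler handler, int total) {
        char[] chunk = new char[1000];
        Arrays.fill(chunk, 'x');
        try {
            for (int sent = 0; sent < total; sent += chunk.length) {
                handler.characters(chunk, 0, Math.min(chunk.length, total - sent));
            }
            return false;
        } catch (SAXException e) {
            return true;
        }
    }

    public static void main(String[] args) {
        // A 100,000-character limit is exceeded by 150,000 characters of input.
        System.out.println(hitsLimit(new LimitedTextHandler(100_000), 150_000)); // prints "true"
        // With -1 the same input is accepted in full.
        System.out.println(hitsLimit(new LimitedTextHandler(-1), 150_000));      // prints "false"
    }
}
```

In the code from the question, the corresponding change is the argument passed to the BodyContentHandler constructor: -1 as shown above, or a larger count such as 10*1024*1024. Note that the limit counts extracted characters, not bytes of the PDF file, so the file size in MB is only a rough guide when choosing a value.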


Source: https://habr.com/ru/post/1243536/

