C/C++ alternative to Apache Tika

I am looking for a C/C++ alternative to the Java-based Apache Tika framework. In particular, I am looking for both file type detection and structured text extraction within a single framework. After some searching online, the closest things I have found are GNU libextractor and several separate file filters that parse documents to extract text data (pdftotext, xls2csv, etc.).

Can anyone recommend a good library comparable to Apache Tika?

thanks

+6
2 answers

Tika has a network server mode, so you could start Tika in that mode and then send documents to it from your C++ code.
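A minimal sketch of what talking to a running Tika server from C++ might look like, using plain POSIX sockets. It assumes the server listens on `localhost:9998` and that a `PUT` to `/tika` with the document as the request body returns the extracted text; check both against the Tika server version you run. The helper names `build_put_request` and `http_put` are hypothetical, and the response is returned raw (status line, headers, and possibly a chunked body), so a real client would more likely use an HTTP library such as libcurl.

```cpp
#include <cassert>
#include <cstddef>
#include <stdexcept>
#include <string>
#include <netdb.h>
#include <sys/socket.h>
#include <sys/types.h>
#include <unistd.h>

// Build a minimal HTTP/1.1 PUT request that carries `body` as the payload.
// `host_header` is the value of the Host header, e.g. "localhost:9998".
std::string build_put_request(const std::string& host_header,
                              const std::string& path,
                              const std::string& body) {
    assert(!path.empty() && path[0] == '/');
    std::string req = "PUT " + path + " HTTP/1.1\r\n";
    req += "Host: " + host_header + "\r\n";
    req += "Accept: text/plain\r\n";  // ask for plain-text output
    req += "Content-Length: " + std::to_string(body.size()) + "\r\n";
    req += "Connection: close\r\n\r\n";  // server closes when done
    req += body;
    return req;
}

// Connect to host:port, send the PUT request, and return the raw response.
std::string http_put(const std::string& host, const std::string& port,
                     const std::string& path, const std::string& body) {
    addrinfo hints{};
    hints.ai_family = AF_UNSPEC;
    hints.ai_socktype = SOCK_STREAM;
    addrinfo* res = nullptr;
    if (getaddrinfo(host.c_str(), port.c_str(), &hints, &res) != 0)
        throw std::runtime_error("getaddrinfo failed");

    // Try each resolved address until one accepts the connection.
    int fd = -1;
    for (addrinfo* p = res; p != nullptr; p = p->ai_next) {
        fd = socket(p->ai_family, p->ai_socktype, p->ai_protocol);
        if (fd < 0) continue;
        if (connect(fd, p->ai_addr, p->ai_addrlen) == 0) break;
        close(fd);
        fd = -1;
    }
    freeaddrinfo(res);
    if (fd < 0) throw std::runtime_error("connect failed");

    const std::string req = build_put_request(host + ":" + port, path, body);
    std::size_t sent = 0;
    while (sent < req.size()) {
        ssize_t n = send(fd, req.data() + sent, req.size() - sent, 0);
        if (n <= 0) { close(fd); throw std::runtime_error("send failed"); }
        sent += static_cast<std::size_t>(n);
    }

    // Read until the server closes the connection.
    std::string response;
    char buf[4096];
    ssize_t n;
    while ((n = recv(fd, buf, sizeof buf, 0)) > 0)
        response.append(buf, static_cast<std::size_t>(n));
    close(fd);
    return response;
}
```

With the server running, a call such as `http_put("localhost", "9998", "/tika", file_bytes)` would return the raw HTTP response, where `file_bytes` is the document content read into a `std::string`.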

Alternatively, Tika has a CLI mode, so you could start a new Tika process for each file and read its output back through a pipe.
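The CLI approach can be sketched roughly as follows, assuming a POSIX system with `popen`, a `java` binary on the `PATH`, and a Tika app jar whose `--text` option prints the extracted plain text to stdout. The jar location and the helper names (`shell_quote`, `run_and_capture`, `tika_extract_text`) are hypothetical and only serve the illustration.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdio>
#include <stdexcept>
#include <string>

// Wrap an argument in single quotes so the shell passes it through unchanged.
std::string shell_quote(const std::string& arg) {
    std::string out = "'";
    for (char c : arg) {
        if (c == '\'') out += "'\\''";  // close quote, escaped quote, reopen
        else out += c;
    }
    out += "'";
    return out;
}

// Run a shell command and return everything it wrote to stdout.
std::string run_and_capture(const std::string& command) {
    assert(!command.empty());
    FILE* pipe = popen(command.c_str(), "r");
    if (pipe == nullptr) throw std::runtime_error("popen failed");

    std::string output;
    std::array<char, 4096> buf;
    std::size_t n;
    while ((n = fread(buf.data(), 1, buf.size(), pipe)) > 0)
        output.append(buf.data(), n);

    // pclose waits for the child process and reports its exit status.
    if (pclose(pipe) != 0)
        throw std::runtime_error("command exited with a non-zero status");
    return output;
}

// Hypothetical helper: start one Tika process for `file` and read its text.
std::string tika_extract_text(const std::string& tika_jar,
                              const std::string& file) {
    return run_and_capture("java -jar " + shell_quote(tika_jar) +
                           " --text " + shell_quote(file));
}
```

Starting a JVM for every file is slow, so this fits occasional extraction better than bulk indexing; for many files, the server mode avoids the repeated startup cost.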

+2

KDE provides a library called KFileMetaData which they use internally for the file indexer.

It uses C++ and Qt5, and supports most common formats, such as MS Office 2007, ODF, PDF, images, video, audio and ebooks.

+1

Source: https://habr.com/ru/post/889742/

