Scrape structured information from hundreds of Word documents?

I have been tasked with extracting structured information from hundreds of human-readable documents (mainly MS Word) and putting it into a database. The data is mostly embedded in tables throughout each document, but there is a lot of text between the tables, and although the documents are very similar in structure, there are several differences between them. The documents also change quite often (we get an updated version every few months).

So far, the only viable option I can think of is to go through all the documents by hand and insert or update the information, but I thought I would ask here whether anyone thinks the documents could somehow be scraped automatically.

Oh, and the data needs to be reasonably accurate...

1 answer

I did a similar job (without tables) using an RTF-to-FO converter.

You convert the documents to RTF and then to FO, which gives you a well-structured XML document. Then you can easily parse it and extract the data.
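The parsing step could look roughly like the sketch below. It assumes that "FO" here means XSL-FO and that the converter emits tables as `fo:table` / `fo:table-row` / `fo:table-cell` elements in the standard XSL-FO namespace; real converter output may be structured differently. The function name `extract_tables` and the sample fragment (including its values) are made up for illustration, and nested tables are not handled.

```python
import xml.etree.ElementTree as ET

# Standard XSL-FO namespace (assumption: the converter output uses it).
FO = "http://www.w3.org/1999/XSL/Format"
FO_NS = {"fo": FO}


def extract_tables(fo_xml):
    """Return every table as a list of rows, each row a list of cell texts.

    Hypothetical helper: assumes flat (non-nested) fo:table elements.
    """
    root = ET.fromstring(fo_xml)
    tables = []
    for table in root.iter(f"{{{FO}}}table"):
        rows = []
        for row in table.iter(f"{{{FO}}}table-row"):
            cells = []
            for cell in row.findall("fo:table-cell", FO_NS):
                # Join all text inside the cell and collapse whitespace.
                text = " ".join("".join(cell.itertext()).split())
                cells.append(text)
            rows.append(cells)
        tables.append(rows)
    return tables


# Simplified, made-up fragment; a real FO file also carries layout elements.
SAMPLE_FO = """<fo:root xmlns:fo="http://www.w3.org/1999/XSL/Format">
  <fo:block>Some text between tables.</fo:block>
  <fo:table>
    <fo:table-body>
      <fo:table-row>
        <fo:table-cell><fo:block>Name</fo:block></fo:table-cell>
        <fo:table-cell><fo:block>Value</fo:block></fo:table-cell>
      </fo:table-row>
      <fo:table-row>
        <fo:table-cell><fo:block>Voltage</fo:block></fo:table-cell>
        <fo:table-cell><fo:block>12 V</fo:block></fo:table-cell>
      </fo:table-row>
    </fo:table-body>
  </fo:table>
</fo:root>"""

if __name__ == "__main__":
    for table in extract_tables(SAMPLE_FO):
        for row in table:
            print(row)
```

Because the documents differ slightly from one another, it is probably worth validating each extracted row (for example, checking the expected column count) before inserting it into the database, and flagging anything unexpected for manual review.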


Source: https://habr.com/ru/post/1775372/