I have been tasked with extracting structured information from hundreds of human-readable documents (mostly MS Word) and putting it into a database. The data is mostly embedded in tables throughout each document, but there is a lot of text between the tables, and although the documents are very similar in structure, they differ in several ways. The documents also change quite often (we get an updated version every few months).
So far, the only viable option I can think of is to go through all the documents manually and insert or update the information, but I thought I would ask here whether anyone thinks it is possible to somehow extract the data from the documents automatically?
Oh, and the extracted data needs to be fairly accurate ...