How to convert a .doc file to a string?

Is there a way to convert a Microsoft Word document to a string without using the Microsoft COM component? I hope there is another way that avoids dealing with all the redundant markup.

EDIT 12/13/13: We did not want to reference the COM component, because it did not work if the client had a different version of Office installed. Fortunately, Microsoft made the 2013 version of word.interop.dll backward compatible, so we no longer need to worry about this restriction. After referencing the DLL, we can do the following:

using System;
using System.IO;
using Microsoft.Office.Interop.Word;

/// <summary>Gets the content of the Word document.</summary>
/// <param name="filePath">The path to the Word document file.</param>
/// <returns>The content of the document.</returns>
public string ExtractText(string filePath)
{
    if (string.IsNullOrEmpty(filePath))
        throw new ArgumentNullException("filePath", "Input file path not specified.");
    if (!File.Exists(filePath))
        // The second argument is the file name that could not be found.
        throw new FileNotFoundException("Input file not found at specified path.", filePath);

    var resultText = string.Empty;
    Application wordApp = null;
    try
    {
        wordApp = new Application();
        // The third argument opens the document read-only.
        var doc = wordApp.Documents.Open(filePath, Type.Missing, true);
        if (doc != null)
        {
            if (doc.Content != null && !string.IsNullOrEmpty(doc.Content.Text))
                resultText = doc.Content.Text.Normalize();
            doc.Close();
        }
    }
    finally
    {
        // Always shut down the Word instance, even if opening or reading failed.
        if (wordApp != null)
            wordApp.Quit(false, Type.Missing, false);
    }
    return resultText;
}
3 answers

To achieve what you want, you need a library:

If you have a lot of time on your hands, you could consider writing a .DOC parser yourself; you can find the .DOC specification here.

BTW: Office Interop is not supported by Microsoft in server-like scenarios (e.g. ASP.NET, a Windows Service or similar); see http://support.microsoft.com/default.aspx?scid=kb;EN-US;q257757#kb2


Assuming you want to extract the text contents of a doc file, there are several command-line tools as well as commercial libraries. A fairly old tool we once used to index doc (not docx) files (combined with the Sphider search engine) was catdoc (also here). It is a DOS tool rather than a Windows one, but it worked for us as long as we met its prerequisite (8.3 file names).

There is also a commercial product, doc2txt, if you can afford $29.

For the new docx format, you can use the Perl-based tool docx2txt.

Of course, if you want to run these tools from C#, you need to call an external process; check here for a reasonable explanation.
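As a rough illustration of calling an external process from C#, here is a minimal sketch built on System.Diagnostics.Process. The class and method names (ExternalTextExtractor, RunAndCaptureOutput) are hypothetical, and it assumes that the tool writes the extracted text to standard output; check the documentation of the tool you actually use for its arguments and output encoding.

```csharp
using System;
using System.Diagnostics;

public static class ExternalTextExtractor
{
    // Hypothetical helper: runs an executable and returns whatever it wrote to stdout.
    public static string RunAndCaptureOutput(string exePath, string arguments)
    {
        var startInfo = new ProcessStartInfo
        {
            FileName = exePath,
            Arguments = arguments,
            UseShellExecute = false,        // required in order to redirect streams
            RedirectStandardOutput = true,
            CreateNoWindow = true
        };

        using (var process = Process.Start(startInfo))
        {
            // Read the output before waiting, so a full pipe buffer cannot block the child.
            string output = process.StandardOutput.ReadToEnd();
            process.WaitForExit();

            if (process.ExitCode != 0)
                throw new InvalidOperationException(
                    exePath + " exited with code " + process.ExitCode);

            return output;
        }
    }
}
```

Assuming, for example, that catdoc is on the PATH and accepts the file name as its only argument, a call could look like ExternalTextExtractor.RunAndCaptureOutput("catdoc", "\"report.doc\""). The file name here is only a placeholder.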

A fairly expensive but very powerful tool for accessing the contents of doc and docx files is Spire.doc, although it does much more than you need. It is more convenient to use, since it is a .NET library.


If you mean the older DOC file format, this is quite a problem, because it is a binary file format specified by Microsoft, and I have to say that I completely agree with RQDQ's comment.

But if you mean the DOCX file format, you can achieve this without the MS COM component or any other third-party component, using just pure .NET.

Check out the following solutions:

http://www.codeproject.com/Articles/20529/Using-DocxToText-to-Extract-Text-from-DOCX-Files
http://www.dotnetspark.com/kb/Content.aspx?id=5633
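To give an idea of what the pure .NET approach can look like, here is a minimal sketch. It is not the code from the linked articles. It relies on the fact that a DOCX file is a ZIP archive whose main body is stored in word/document.xml, and it assumes .NET 4.5 or later (on .NET Framework, with references to System.IO.Compression and System.IO.Compression.FileSystem). The class name DocxTextReader is hypothetical. The sketch only collects the text of w:t elements and starts a new line for each w:p paragraph; tabs, line breaks, headers, footers and footnotes are ignored.

```csharp
using System;
using System.IO;
using System.IO.Compression;
using System.Text;
using System.Xml.Linq;

public static class DocxTextReader
{
    // WordprocessingML main namespace used by elements in word/document.xml.
    private static readonly XNamespace W =
        "http://schemas.openxmlformats.org/wordprocessingml/2006/main";

    // Opens the DOCX (a ZIP archive) and extracts text from its main document part.
    public static string ExtractText(string docxPath)
    {
        using (ZipArchive archive = ZipFile.OpenRead(docxPath))
        {
            ZipArchiveEntry entry = archive.GetEntry("word/document.xml");
            if (entry == null)
                throw new InvalidDataException("word/document.xml not found; is this a DOCX file?");

            using (Stream stream = entry.Open())
                return ExtractText(stream);
        }
    }

    // Extracts text from the XML of the main document part.
    public static string ExtractText(Stream documentXml)
    {
        XDocument xml = XDocument.Load(documentXml);
        var text = new StringBuilder();

        // Walk all elements in document order, so nested content is visited only once.
        foreach (XElement element in xml.Descendants())
        {
            if (element.Name == W + "p")
            {
                // Start a new line for every paragraph except the first.
                if (text.Length > 0)
                    text.AppendLine();
            }
            else if (element.Name == W + "t")
            {
                text.Append(element.Value);
            }
        }
        return text.ToString();
    }
}
```

The Stream overload is kept separate so that the XML handling can be exercised without a file on disk. For real documents you would call DocxTextReader.ExtractText with the path of the .docx file.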


Source: https://habr.com/ru/post/1389551/

