Is there a way to convert a Microsoft Word document to a string without using the Microsoft COM component? I am hoping there is some other way that deals with all the extra markup.
EDIT 12/13/13: We did not want to reference the COM component because it did not work if the client had a different version of Office installed. Fortunately, Microsoft made the 2013 version of the Word interop assembly (Microsoft.Office.Interop.Word.dll) backward compatible, so we no longer need to worry about this restriction. After referencing the DLL, we can do the following:
    using System;
    using System.IO;
    using Microsoft.Office.Interop.Word;

    /// <summary>Gets the content of the word document</summary>
    /// <param name="filePath">The path to the word document file</param>
    /// <returns>The content of the document</returns>
    public string ExtractText(string filePath)
    {
        if (string.IsNullOrEmpty(filePath))
            throw new ArgumentNullException("filePath", "Input file path not specified.");
        if (!File.Exists(filePath))
            // The second argument is the file name, so pass the actual path rather than a literal.
            throw new FileNotFoundException("Input file not found at specified path.", filePath);

        var resultText = string.Empty;
        Application wordApp = null;
        try
        {
            wordApp = new Application();
            // Third argument is ReadOnly: open the document without locking it for editing.
            var doc = wordApp.Documents.Open(filePath, Type.Missing, true);
            if (doc != null)
            {
                if (doc.Content != null && !string.IsNullOrEmpty(doc.Content.Text))
                    resultText = doc.Content.Text.Normalize();
                doc.Close();
            }
        }
        finally
        {
            // Always shut down the Word process, even if opening or reading failed.
            if (wordApp != null)
                wordApp.Quit(false, Type.Missing, false);
        }
        return resultText;
    }
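As for the original question of avoiding COM entirely: for .docx files (not the older binary .doc format), one possible approach is to read the package directly, since a .docx file is a ZIP archive whose main text normally lives in word/document.xml. The following is only a rough sketch under those assumptions, not a full replacement for the interop code above. The class name `DocxTextReader` is made up for illustration, the code assumes .NET 4.5 or later with a reference to System.IO.Compression.FileSystem, and it ignores headers, footers, footnotes and anything else stored outside the main document part.

```csharp
using System;
using System.IO;
using System.IO.Compression;
using System.Text;
using System.Xml.Linq;

// Hypothetical helper: extracts plain text from a .docx without Word installed.
public static class DocxTextReader
{
    // WordprocessingML namespace used by the elements in word/document.xml.
    private static readonly XNamespace W =
        "http://schemas.openxmlformats.org/wordprocessingml/2006/main";

    public static string ExtractText(string filePath)
    {
        using (ZipArchive archive = ZipFile.OpenRead(filePath))
        {
            // Assumption: the main part sits at its conventional location.
            ZipArchiveEntry entry = archive.GetEntry("word/document.xml");
            if (entry == null)
                throw new InvalidDataException("word/document.xml not found; is this a .docx file?");

            using (Stream stream = entry.Open())
            {
                XDocument xml = XDocument.Load(stream);
                var sb = new StringBuilder();

                // Walk all elements in document order.
                foreach (XElement el in xml.Descendants())
                {
                    if (el.Name == W + "p")
                    {
                        // Start each paragraph after the first on a new line.
                        if (sb.Length > 0) sb.AppendLine();
                    }
                    else if (el.Name == W + "t")
                    {
                        // w:t elements hold the actual text of a run.
                        sb.Append(el.Value);
                    }
                    else if (el.Name == W + "tab")
                    {
                        sb.Append('\t');
                    }
                }
                return sb.ToString();
            }
        }
    }
}
```

Calling `DocxTextReader.ExtractText(path)` returns the paragraph text separated by line breaks, with all formatting markup dropped. If the older .doc format also has to be supported, this sketch will not help and the interop approach (or a third-party library) is still needed.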