My PHP script successfully reads all the text from a .docx file, but I cannot figure out where the line breaks should go, so the extracted text runs together and is hard to read (one huge paragraph). I looked through all the XML files in the archive by hand, but I couldn't work out how the paragraph breaks are marked.
Here are the functions that I use to extract file data and return plain text.
    public function read($FilePath)
    {
        // Save name of the file
        parent::SetDocName($FilePath);

        $Data = $this->docx2text($FilePath);

        // Escape angle brackets so the text can be displayed as HTML
        $Data = str_replace("<", "&lt;", $Data);
        $Data = str_replace(">", "&gt;", $Data);

        // Convert line breaks to <br /> tags
        $Breaks = array("\r\n", "\n", "\r");
        $Data = str_replace($Breaks, '<br />', $Data);

        $this->Content = $Data;
    }

    function docx2text($filename)
    {
        return $this->readZippedXML($filename, "word/document.xml");
    }

    function readZippedXML($archiveFile, $dataFile)
    {
        // Create new ZIP archive
        $zip = new ZipArchive;

        // Open received archive file
        if (true === $zip->open($archiveFile)) {
            // If done, search for the data file in the archive
            if (($index = $zip->locateName($dataFile)) !== false) {
                // If found, read it to the string
                $data = $zip->getFromIndex($index);

                // Close archive file
                $zip->close();

                // Load XML from a string
                // Skip errors and warnings
                // (loadXML is called on an instance; the static call fails on PHP 8)
                $xml = new DOMDocument();
                $xml->loadXML($data, LIBXML_NOENT | LIBXML_XINCLUDE | LIBXML_NOERROR | LIBXML_NOWARNING);
                $xmldata = $xml->saveXML();

                //$xmldata = str_replace("</w:t>", "\r\n", $xmldata);

                // Return data without XML formatting tags
                return strip_tags($xmldata);
            }
            $zip->close();
        }

        // In case of failure return empty string
        return "";
    }
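For reference, in WordprocessingML each paragraph is a `w:p` element, and `w:t` only wraps a run of text inside it, which is probably why breaking on `</w:t>` did not give sensible results. Below is a minimal sketch of reading one line per paragraph instead of calling `strip_tags()`. The function name `documentXmlToText` is hypothetical (it is not part of the class above), and the sketch assumes the document uses the usual transitional namespace URI. It ignores explicit `w:br` breaks and would repeat text from paragraphs nested inside text boxes.

```php
<?php
// Hypothetical helper, not part of the original class: turns the raw
// contents of word/document.xml into plain text, one line per paragraph.
function documentXmlToText(string $xmlData): string
{
    $dom = new DOMDocument();
    // Suppress parser errors and warnings, as the original code does.
    if (!$dom->loadXML($xmlData, LIBXML_NOERROR | LIBXML_NOWARNING)) {
        return "";
    }

    // Assumption: the transitional WordprocessingML namespace is in use.
    $ns = 'http://schemas.openxmlformats.org/wordprocessingml/2006/main';

    $lines = array();
    // Each <w:p> element is one paragraph; textContent joins its text runs.
    foreach ($dom->getElementsByTagNameNS($ns, 'p') as $paragraph) {
        $lines[] = $paragraph->textContent;
    }

    return implode("\n", $lines);
}
```

If this fits, `readZippedXML()` could return `documentXmlToText($data)` in place of `strip_tags($xmldata)`, and the existing `str_replace($Breaks, '<br />', $Data)` in `read()` would then turn each paragraph boundary into a `<br />`.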