Finding line breaks in a .docx file using PHP

My PHP script successfully reads all the text from a .docx file, but I cannot figure out where the line breaks should go, so the text comes out compressed and hard to read (one huge paragraph). I have manually looked through all the XML files in the archive, but I still can't work it out.

Here are the functions that I use to extract file data and return plain text.

    public function read($FilePath)
    {
        // Save name of the file
        parent::SetDocName($FilePath);

        $Data = $this->docx2text($FilePath);
        $Data = str_replace("<", "&lt;", $Data);
        $Data = str_replace(">", "&gt;", $Data);

        $Breaks = array("\r\n", "\n", "\r");
        $Data = str_replace($Breaks, '<br />', $Data);

        $this->Content = $Data;
    }

    function docx2text($filename)
    {
        return $this->readZippedXML($filename, "word/document.xml");
    }

    function readZippedXML($archiveFile, $dataFile)
    {
        // Create new ZIP archive
        $zip = new ZipArchive;

        // Open received archive file
        if (true === $zip->open($archiveFile)) {
            // If done, search for the data file in the archive
            if (($index = $zip->locateName($dataFile)) !== false) {
                // If found, read it to the string
                $data = $zip->getFromIndex($index);

                // Close archive file
                $zip->close();

                // Load XML from a string, skipping errors and warnings.
                // loadXML() is called on an instance: the static call fails in PHP 8.
                $xml = new DOMDocument();
                $xml->loadXML($data, LIBXML_NOENT | LIBXML_XINCLUDE | LIBXML_NOERROR | LIBXML_NOWARNING);
                $xmldata = $xml->saveXML();
                //$xmldata = str_replace("</w:t>", "\r\n", $xmldata);

                // Return data without XML formatting tags
                return strip_tags($xmldata);
            }
            $zip->close();
        }

        // In case of failure return empty string
        return "";
    }
2 answers

The fix is actually pretty simple. All you have to do is add this line to readZippedXML(), right before the strip_tags() call:

 $xmldata = str_replace("</w:p>", "\r\n", $xmldata); 

This works because Word uses the closing tag </w:p> to mark the end of a paragraph. For example:

 <w:p>This is a paragraph.</w:p> <w:p>And a second one.</w:p> 
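To make the idea concrete, here is a minimal sketch of the same replacement applied to a raw XML string. The helper name documentXmlToText() is hypothetical (it is not part of the script in the question), and the sample markup is a simplified assumption of what word/document.xml contains; a real file has more attributes and namespaces. It also turns the explicit line break element <w:br/> into a newline, which the one-line fix above does not cover.

```php
<?php
// Hypothetical helper, not part of the original script: converts the raw
// contents of word/document.xml into plain text with one line per paragraph.
function documentXmlToText(string $xml): string
{
    // </w:p> closes a paragraph; <w:br/> is an explicit line break inside one.
    // Only the bare self-closing form of <w:br/> is handled in this sketch.
    $xml = str_replace(array('</w:p>', '<w:br/>'), "\n", $xml);

    // Drop every remaining tag, then trim the newline left by the last paragraph.
    return rtrim(strip_tags($xml), "\n");
}

// Simplified sample input (an assumption; real documents are more verbose).
$sample = '<w:body>'
    . '<w:p><w:r><w:t>This is a paragraph.</w:t></w:r></w:p>'
    . '<w:p><w:r><w:t>And a second</w:t><w:br/><w:t>one.</w:t></w:r></w:p>'
    . '</w:body>';

echo documentXmlToText($sample), "\n";
```

The sketch uses "\n" where the line above uses "\r\n"; either works with read() from the question, since it replaces all three newline variants with <br />.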

Actually, why don't you use OpenXML? I think it works with PHP as well, and then you would not need to dig into the low-level details of the XML files yourself.

Here's a link:
http://openxmldeveloper.org/articles/4606.aspx


Source: https://habr.com/ru/post/1347383/

