I'm pasting a PHP class below. I wrote it a long time ago, but I know that it works. It's not quite what you need, since it truncates by number of words rather than number of characters, but I find it pretty close, and someone might find it useful.
    class HtmlWordManipulator
    {
        // Names of the tracked tags that are still open after truncate().
        public $stack = array();

        // Keep the first $num words of $text; tags are not counted as words.
        public function truncate($text, $num = 50)
        {
            // Start fresh so tags from an earlier call are not closed again.
            $this->stack = array();

            // Fewer than $num whitespace runs: nothing to cut.
            if (preg_match_all('/\s+/', $text, $junk) < $num) return $text;

            // Hide whitespace inside tags so each tag stays a single token.
            // The callback must point at the method, not a global function.
            $text = preg_replace_callback('/(<\/?[^>]+\s+[^>]*>)/', array($this, '_truncateProtect'), $text);

            $words = 0;
            $out = array();
            // Pad tags with spaces so they split off from adjacent words.
            $text = str_replace('<', ' <', str_replace('>', '> ', $text));
            $toks = preg_split('/\s+/', $text);
            foreach ($toks as $tok) {
                // Update the stack of open tags for every tag in this token.
                if (preg_match_all('/<(\/?[^\x01>]+)([^>]*)>/', $tok, $matches, PREG_SET_ORDER))
                    foreach ($matches as $tag) $this->_recordTag($tag[1], $tag[2]);

                $out[] = trim($tok);

                // Count the token as a word unless it is made up of tags only.
                // strpos() can return 0, so compare against false explicitly.
                if (!preg_match('/^(<[^>]+>)+$/', $tok)
                    && strpos($tok, '=') === false
                    && strpos($tok, '<') === false
                    && strlen(trim(strip_tags($tok))) > 0) {
                    ++$words;
                }

                // Stop as soon as $num words have been collected.
                if ($words >= $num) break;
            }
            return $this->_truncateRestore(implode(' ', $out));
        }

        // Append closing tags for everything still open, innermost first.
        public function restoreTags($text)
        {
            foreach (array_reverse($this->stack) as $tag) $text .= "</$tag>";
            return $text;
        }

        private function _truncateProtect($match)
        {
            return preg_replace('/\s/', "\x01", $match[0]);
        }

        private function _truncateRestore($strings)
        {
            return preg_replace('/\x01/', ' ', $strings);
        }

        private function _recordTag($tag, $args)
        {
            // XHTML self-closing tag such as <br />: nothing to track.
            if (strlen($args) and $args[strlen($args) - 1] == '/') return;

            if ($tag[0] == '/') {
                // Closing tag: remove the most recent matching open tag.
                $tag = substr($tag, 1);
                for ($i = count($this->stack) - 1; $i >= 0; $i--) {
                    if ($this->stack[$i] == $tag) {
                        array_splice($this->stack, $i, 1);
                        return;
                    }
                }
                return;
            }

            // Opening tag: only these elements are tracked.
            if (in_array($tag, array('p', 'li', 'ul', 'ol', 'div', 'span', 'a')))
                $this->stack[] = $tag;
        }
    }
truncate is the method you want: pass it the HTML and the number of words you want to keep. It ignores HTML tags when counting words, but it still walks over every tag in the markup and keeps track of which ones are open, so that the tags left unclosed by the truncation can be closed afterwards.
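To make the tag-tracking idea concrete, here is a minimal standalone sketch, separate from the class above. The function name closingTagsFor and the list of void elements are my own assumptions for illustration. It assumes reasonably well-formed markup with no ">" inside attribute values, and unlike the class it tracks every non-void tag rather than a fixed list.

```php
<?php
// Hypothetical helper, not part of HtmlWordManipulator: returns the closing
// tags needed to balance an HTML fragment that was cut off part-way through.
function closingTagsFor(string $html): string
{
    // Assumed (incomplete) list of elements that never take a closing tag.
    $void = ['br', 'hr', 'img', 'input', 'meta', 'link'];
    $stack = [];

    // Capture, for each tag: optional slash, tag name, rest up to '>'.
    preg_match_all('/<(\/?)([a-zA-Z][a-zA-Z0-9]*)([^>]*)>/', $html, $matches, PREG_SET_ORDER);

    foreach ($matches as $m) {
        $name = strtolower($m[2]);
        $selfClosing = substr($m[3], -1) === '/';

        if ($selfClosing || in_array($name, $void, true)) {
            continue; // e.g. <br /> or <img ...>: nothing to balance
        }

        if ($m[1] === '/') {
            // Closing tag: drop the most recently opened tag of that name.
            for ($i = count($stack) - 1; $i >= 0; $i--) {
                if ($stack[$i] === $name) {
                    array_splice($stack, $i, 1);
                    break;
                }
            }
        } else {
            $stack[] = $name; // opening tag: remember it
        }
    }

    // Close innermost tags first, i.e. in reverse order of opening.
    $closing = '';
    foreach (array_reverse($stack) as $name) {
        $closing .= "</$name>";
    }
    return $closing;
}

echo closingTagsFor('<div><p>Hello <a href="#">wor'); // prints </a></p></div>
```

Appending the returned string to the truncated fragment gives balanced markup again. The reverse order matters: closing the tags in the order they were opened would produce badly nested HTML.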
Please don't judge me for the complete lack of OOP principles. I was young and stupid.
Edit:
It turns out that the usage is actually more like this:
$content = $manipulator->restoreTags($manipulator->truncate($myHtml,$numOfWords));
A stupid design decision, but splitting it into two calls allowed me to inject HTML inside the tags left unclosed by the truncation, before they get closed.
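As an illustration of injecting HTML before the open tags are closed, here is a small sketch. The function truncateWithSuffix and its parameter names are hypothetical. It only assumes that $manipulator has the truncate() and restoreTags() methods of the class above, and it appends the suffix unconditionally, even when the text was short enough to be returned uncut.

```php
<?php
// Hypothetical wrapper: $manipulator is assumed to be an HtmlWordManipulator
// (any object with compatible truncate() and restoreTags() methods works).
function truncateWithSuffix($manipulator, string $html, int $numOfWords, string $suffix): string
{
    // Step 1: cut the HTML; some tags may be left open at this point.
    $truncated = $manipulator->truncate($html, $numOfWords);

    // Step 2: add the extra markup while those tags are still open,
    // then let restoreTags() close them around it.
    return $manipulator->restoreTags($truncated . $suffix);
}
```

For example, passing ' <a href="/full-article">read more</a>' as $suffix (the URL is made up) would place the link inside the last open paragraph instead of after it.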