I wrote a small PHP script; this is the main part:
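$toparse = "htmltext"; // placeholder for the HTML source to measure
// Remove <script> and <style> blocks with their contents, every other tag, and CR/LF/tab characters.
$toparse = preg_replace('/(<script.*?>.*?<\/script>|<style.*?>.*?<\/style>|<.*?>|\r|\n|\t)/ms', '', $toparse);
// Collapse runs of spaces into a single space.
$toparse = preg_replace('/ +/ms', ' ', $toparse);
// Length in bytes of the text that remains.
$textlen = strlen($toparse);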
After that, there are several calculations.
This regex could probably be shorter, but it works. The only requirement is that every < is paired with a matching >.
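
To make the behaviour concrete, here is a minimal sketch that wraps the two preg_replace calls in a function and runs them on sample input. The function name strip_markup and the sample strings are my own, chosen for illustration only; they are not part of the original script. The second call shows why the < and > need to be paired: a stray < followed later by a > is treated as a tag.

```php
<?php
// Hypothetical helper: applies the same two replacements as the snippet above.
function strip_markup(string $html): string
{
    // Drop script/style blocks with their contents, all other tags, and CR/LF/tab.
    $text = preg_replace('/(<script.*?>.*?<\/script>|<style.*?>.*?<\/style>|<.*?>|\r|\n|\t)/ms', '', $html);
    // Collapse runs of spaces into one space.
    return preg_replace('/ +/ms', ' ', $text);
}

// Made-up sample page, for illustration only.
$sample = "<html><head><style>p { color: red; }</style></head>\n"
        . "<body><p>Hello   <b>world</b></p><script>var x = 1;</script></body></html>";

echo strip_markup($sample), "\n";          // Hello world
echo strlen(strip_markup($sample)), "\n";  // 11

// Everything from the "<" to the next ">" is removed as if it were a tag.
echo strip_markup("1 < 2 and 3 > 2"), "\n"; // 1 2
```

Note that strlen counts bytes, so text with multibyte characters (UTF-8, for example) will report a larger number than its character count; mb_strlen from the mbstring extension may be a better fit there.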