Is there a PHP function like ucfirst() that ignores HTML?

I am programmatically cleaning up some basic grammar in comments and other user-submitted content: capitalizing the letter I, the first letter of each sentence, and so on. The comments and content are mixed with HTML, since users have some options for formatting their text.

This is proving a bit more complex than expected, especially for someone new to PHP and regex.

Is there a function like ucfirst that ignores HTML, to help with capitalization?

Any links or guides on cleaning up text like this inside HTML would also be appreciated. Please leave anything you think will help in the comments. Thanks!

EDIT: Sample text:

<div><p>i wuz walkin thru the PaRK and found <strong>ur dog</strong>. <br />i hoPe to get a reward.<br /> plz call or text 7zero4 8two8 49 sevenseven</div> 

Ultimately, I need it to become:

 <div><p>I was walking through the park and found <strong>your dog</strong>.</p><p>I hope to get a reward.</p><p>Please call or text (704) 828-4977.</p></div>

I know this goes a little further than the original question, but I got here gradually. ucfirst() is just one of many functions I use to do a little cleanup at a time on each pass. Even if I had to run the text through the filter 100 times, that would be fine: it runs from cron when there is no traffic on the site. I would like a discussion forum where this could continue, because there are obviously great ideas about how to take this approach further. If you have thoughts on how to approach this as a community project, please leave a comment.

In the spirit of the question itself, I think ucfirst would not be the best function for this, since it cannot take a list of things to ignore. An IGNORE_HTML flag would be great!

Given that this is a PHP question, does the DOM parser recommended below sound like the best answer? Thoughts?

+4
4 answers

You should probably use a DOM parser (either a built-in one or, for example, Simple HTML DOM, which is very easy to use).

Go through all the text nodes in your HTML and clean each one up using preg_replace_callback, ucfirst and a regular expression like this:

 '/(\s*)([^.?!]*)/' 

This matches a run of whitespace followed by as many characters as possible that are not sentence-ending punctuation. The actual sentence (starting with a letter, unless your sentence begins with a quotation mark, which complicates things a bit) ends up in the second capture group; the first holds the leading whitespace.
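As a rough illustration of that step on plain text with no tags, here is a minimal sketch. The function name `capitalizeSentences` is made up for this example; the pattern is the one shown above, and the callback re-emits the leading whitespace and passes the sentence itself to `ucfirst`.

```php
<?php
// Hypothetical helper: capitalize the first letter of each sentence in plain text.
function capitalizeSentences(string $text): string
{
    return preg_replace_callback(
        '/(\s*)([^.?!]*)/',
        function (array $m): string {
            // $m[1] is the leading whitespace, $m[2] the sentence without its punctuation.
            return $m[1] . ucfirst($m[2]);
        },
        $text
    );
}

echo capitalizeSentences('i wuz walkin. i hoPe to get a reward? yes'), "\n";
// prints "I wuz walkin. I hoPe to get a reward? Yes"
```

Note that `ucfirst` only touches the first character, so mixed case such as `hoPe` is left alone, and a sentence that opens with a quotation mark is not capitalized.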

But from your question, I suppose you are already doing something like the latter, and your code is simply choking on the HTML tags. Here is some example code that gets all text nodes with the second DOM parser I linked:

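 require 'simple_html_dom.php';

 $html = new simple_html_dom();
 $html->load($fullHtmlStr);

 // Clean the content of every text node; the tags themselves are left untouched.
 // Assigning to $textNode alone would not change the document, so write back to innertext.
 foreach ($html->find('text') as $textNode) {
     $textNode->innertext = cleanupFunction($textNode->innertext);
 }

 $cleanedHtmlStr = $html->save();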
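For comparison, the same idea can be sketched with PHP's built-in DOM extension instead of a third-party parser. The helper name `mapTextNodes` and the sample markup are assumptions for illustration. The sketch also assumes the fragment has a single root element and contains only ASCII text, since `loadHTML` needs extra care with other encodings.

```php
<?php
// Hypothetical helper: apply $fn to the content of every text node, leaving tags alone.
function mapTextNodes(string $html, callable $fn): string
{
    $doc = new DOMDocument();
    libxml_use_internal_errors(true); // collect warnings about sloppy markup instead of printing them
    // Assumption: $html has exactly one root element, so no <html>/<body> wrapper is needed.
    $doc->loadHTML($html, LIBXML_HTML_NOIMPLIED | LIBXML_HTML_NODEFDTD);
    libxml_clear_errors();

    $xpath = new DOMXPath($doc);
    foreach ($xpath->query('//text()') as $textNode) {
        $textNode->nodeValue = $fn($textNode->nodeValue);
    }

    return $doc->saveHTML($doc->documentElement);
}

// Example cleanup: capitalize the standalone pronoun "i".
$fixed = mapTextNodes(
    '<div><p>i think <i>i</i> saw ur dog</p></div>',
    function (string $text): string {
        return preg_replace('/\bi\b/', 'I', $text);
    }
);
echo $fixed, "\n";
```

The callback only ever sees text, so the `<i>` tags should come through untouched while the standalone lowercase `i` is capitalized. Sentence-level rules are harder to apply this way, because a sentence can span several text nodes (for example around a `<strong>` element).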
+1

You can also apply a CSS pseudo-element to your elements, like this:

 div:first-letter { text-transform: uppercase; } 

But you will probably have to change the way you print your messages (if you currently print them all in one huge tag), because CSS cannot detect the beginning of a new sentence inside a single tag :(

+4

Doing this on the HTML will be very difficult, since you would essentially be writing some kind of HTML parser. My suggestion would be to clean the text before it is converted to HTML; it looks like you are currently pulling it from a database. Or even better, clean up the database once.

0

This should do it:

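 function html_ucfirst($s) {
     // Group 1 captures any run of leading tags; the last group captures the rest of the string.
     return preg_replace_callback('#^((<(.+?)>)*)(.*?)$#', function ($c) {
         // Re-emit the leading tags unchanged and capitalize the first character after them.
         return $c[1].ucfirst(array_pop($c));
     }, $s);
 }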

It converts:

  • <b>foo</b> to <b>Foo</b>,
  • <div><p>test</p></div> to <div><p>Test</p></div>,
  • but also bar to Bar.

Edit: given your more detailed question, you probably want to apply this function to every sentence. You will need to split the text into sentences first (e.g. by splitting on periods).
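One way to apply it per sentence, as a rough sketch: split on whitespace that follows `.`, `?` or `!`, run `html_ucfirst` on each piece, and join the pieces again. `html_ucfirst_sentences` is a hypothetical name. The split is deliberately naive: it collapses the whitespace between sentences to a single space and would misfire on abbreviations or on punctuation inside attribute values.

```php
<?php
// The function from this answer, repeated so the example is self-contained.
function html_ucfirst($s)
{
    return preg_replace_callback('#^((<(.+?)>)*)(.*?)$#', function ($c) {
        return $c[1] . ucfirst(array_pop($c));
    }, $s);
}

// Hypothetical wrapper: capitalize the first letter of every sentence,
// skipping over any tags that open the sentence.
function html_ucfirst_sentences($s)
{
    // Split after sentence-ending punctuation; the lookbehind keeps the punctuation.
    $sentences = preg_split('/(?<=[.?!])\s+/', $s, -1, PREG_SPLIT_NO_EMPTY);

    return implode(' ', array_map('html_ucfirst', $sentences));
}

echo html_ucfirst_sentences('<b>foo</b> bar. <i>baz</i> qux! done'), "\n";
// prints "<b>Foo</b> bar. <i>Baz</i> qux! Done"
```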

0

Source: https://habr.com/ru/post/1441768/

