Finding a date in a long string using PHP

I have a collection of documents and am trying to extract dates from them. They are basically plain text and HTML, but the date formats they use vary widely (although they are all English dates). How can I find and parse dates like these in a long string of text?

    updated 2011-03-21T00:43:14
    Sunday, March 20, 2011
    Wednesday, March 16, 2011 | 11:25 AM
    March 20, 2011 @ 12:21 pm
    May 5, 2011
    Published March 19, 2011
    Some text here (March 19, 2011)
    10/28/2011 21:16
    <a href="#">Author Name</a> on Mar 17th 2011 ...
    Location, ABBR., Jan. 8, 2008
    01/07/2008 (6:00 pm)
    By Author Name and Company 03/19/2011 09:59
    Posted by Author Name on March 16, 2011 at 03:20 PM EDT
2 answers

Take a look at the strtotime function.

    // Output: March 20th, 2011 12:00:00 AM
    echo date( 'F jS, Y h:i:s A', strtotime( "Sunday, March 20, 2011"));

Edit: Here is a more complete example showing how to parse a bunch of dates.

    <?php
    $dates = array(
        '03/19/2011 09:59',
        'Wednesday, March 16, 2011 | 11:25 AM',
        'Sunday, March 20, 2011',
        'March 20, 2011 @ 12:21 pm',
        'May 5, 2011'
    );

    foreach( $dates as $date) {
        echo $date . ' ---- ' . date( 'F jS, Y h:i:s A', strtotime( str_replace( array( '@', '|'), '', $date))) . "<br />\n";
    }


Of course, some dates will not parse as they are, because they are not among the date formats that strtotime supports. For those, you will need additional filtering or parsing, either to extract the date or to reshape it into a string that strtotime accepts.
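The kind of filtering described above can be sketched as a small wrapper around strtotime. The function name `parseLooseDate` and the exact list of stripped characters are assumptions made for illustration, not part of the original answer:

```php
<?php
// Hypothetical helper: clean up a candidate string and return a Unix
// timestamp, or null when strtotime() cannot make sense of it.
function parseLooseDate(string $candidate): ?int
{
    // Drop markup, then replace separators that strtotime() does not understand.
    $clean = strip_tags($candidate);
    $clean = str_replace(['@', '|', '(', ')'], ' ', $clean);

    // Collapse the whitespace left behind by the replacements.
    $clean = trim(preg_replace('/\s+/', ' ', $clean));

    if ($clean === '') {
        return null;
    }

    $time = strtotime($clean);
    return $time === false ? null : $time;
}

// Usage: print a normalised date, or a note when parsing fails.
foreach (['March 20, 2011 @ 12:21 pm', '01/07/2008 (6:00 pm)', 'no date here'] as $input) {
    $time = parseLooseDate($input);
    echo $input, ' => ', $time === null ? 'could not parse' : date('Y-m-d H:i', $time), "\n";
}
```

Returning null instead of false keeps the failure case explicit, which matters because strtotime signals failure with false and a careless comparison could treat that as timestamp 0.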

Edit: Since there is interest in processing the input string further, here is an example of how you can parse the text and get dates out of it without using a regular expression. Please note that some of the dates cannot be retrieved this way; for those you will need either more string processing or a regular expression.

As a side note, I would investigate using a regex if the provided string is just one of many variations of strings containing dates. However, if the string provided is the only format in which dates will be found, string processing should be sufficient.

    // One entry per line of the input text.
    $str = implode( "\n", array(
        'updated 2011-03-21T00:43:14',
        'Sunday, March 20, 2011',
        'Wednesday, March 16, 2011 | 11:25 AM',
        'March 20, 2011 @ 12:21 pm',
        'May 5, 2011',
        'Published March 19, 2011',
        'Some text here (March 19, 2011)',
        '10/28/2011 21:16',
        '<a href="#">Author Name</a> on Mar 17th 2011 ...',
        'Location, ABBR., Jan. 8, 2008',
        '01/07/2008 (6:00 pm)',
        'By Author Name and Company 03/19/2011 09:59',
        'Posted by Author Name on March 16, 2011 at 03:20 PM EDT',
    ));

    foreach( explode( "\n", $str) as $line) {
        // Remove separators and markup that strtotime() does not understand.
        $line = str_replace( array( '@', '|', '(', ')'), '', trim( $line));
        $line = strip_tags( $line);

        if( ($time = strtotime( $line)) === false) {
            echo "Could not parse line - '" . $line . "'\n"; // Need additional processing / regex here
            continue;
        }

        echo "Converted '" . $line . "' to '" . date( 'F jS, Y h:i:s A', $time) . "'\n";
    }


Final Edit:

Finally, an example of how to do some extra text processing so that more of the lines can be parsed.

    foreach( explode( "\n", $str) as $line) {
        $line = str_replace( array( '@', '|', '(', ')', 'Published', '...'), '', trim( $line));
        $line = strip_tags( trim( $line));

        if( ($time = strtotime( $line)) === false) {
            // Fall back to the text after the word "on", e.g. "Author Name on Mar 17th 2011".
            // Searching for ' on ' with spaces avoids matching inside words such as "Location".
            if( ($on_position = stripos( $line, ' on ')) === false) {
                echo "Could not parse line - '" . $line . "'\n";
                continue;
            }

            $line = trim( substr( $line, $on_position + 4));
            if( ($time = strtotime( $line)) === false) {
                echo "Could not parse line that contains 'on' - '" . $line . "'\n";
                continue;
            }
        }

        echo "Converted '" . $line . "' to '" . date( 'F jS, Y h:i:s A', $time) . "'\n";
    }



I didn't have much time tonight, so I played with a regex, knowing that I was looking for a run of digits. The following parses every line of the sample below. The foreach is just an example; the regular expression is built so that it can also be used with preg_match_all() to pull all the dates out of the string in one go.

    // One entry per line of the input text.
    $str = implode( "\n", array(
        'updated 2011-03-21T00:43:14',
        'Sunday, March 20, 2011',
        'Wednesday, March 16, 2011 | 11:25 AM',
        'March 20, 2011 @ 12:21 pm',
        'May 5, 2011',
        'Published March 19, 2011',
        'Some text here (March 19, 2011)',
        '10/28/2011 21:16',
        '<a href="#">Author Name</a> on Mar 17th 2011 ...',
        'Location, ABBR., Jan. 8, 2008',
        '01/07/2008 (6:00 pm)',
        'Published under recent news one March 17, 2011. Now onto other things!',
        'By Author Name and Company 03/19/2011 09:59',
        'Posted by Author Name on March 16, 2011 at 03:20 PM EDT',
    ));

    $months = array(
        'jan', 'january', 'feb', 'february', 'mar', 'march', 'apr', 'april',
        'may', 'june', 'july', 'aug', 'august', 'sept', 'september',
        'oct', 'october', 'nov', 'november', 'dec', 'december',
    );

    header('Content-Type: text/plain');

    foreach(explode( "\n", $str) as $line) {
        // Lower-case the line and drop separators and filler words.
        // This is a plain substring replacement, so 'on', 'at', 'am' and 'pm'
        // are also removed from inside longer words.
        $line = str_replace(array('@', '|', '(', ')', 'at', 'on', 'am', 'pm'), '', mb_strtolower(trim($line)));

        // Group 1: an optional word directly before the first digit (possibly a month name).
        // Group 2: the text from that digit through the last digit on the line.
        if(preg_match('/([a-z]+[, .]+)?(\d.+?)\D*?$/m', $line, $match)) {
            $date = '';

            // Is that word a valid month?
            if(in_array(trim($match[1], ',. '), $months)) {
                $date = $match[1];
            }
            $date .= $match[2];

            if( ($date = strtotime($date)) !== false) {
                echo "Converted '" . $line . "' to '" . date( 'F jS, Y h:i:s A', $date) . "'\n";
                continue;
            }
        } else {
            print "Failed to find anything\n";
        }

        echo "Could not parse line - '" . $line . "'\n"; // Need additional processing / regex here
    }

This feels pretty hacky; maybe someone can respond with a better parser.
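For comparison, here is a minimal sketch of the preg_match_all() route mentioned above. It uses a different, deliberately narrow pattern rather than the expression from this answer, and the helper name `extractDateStrings` is hypothetical. The pattern is an assumption that covers only three shapes (ISO 8601, `m/d/yyyy`, and month name followed by day and year) and ignores times of day:

```php
<?php
// Hypothetical helper: return every date-like substring found in $text.
function extractDateStrings(string $text): array
{
    // Assumed pattern, written in extended (x) mode so it can carry comments.
    $pattern = '~
        \b\d{4}-\d{2}-\d{2}(?:T\d{2}:\d{2}:\d{2})?     # 2011-03-21 or 2011-03-21T00:43:14
      | \b\d{1,2}/\d{1,2}/\d{4}\b                       # 10/28/2011
      | \b(?:Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec)[a-z]*\.?
        \s+\d{1,2}(?:st|nd|rd|th)?,?\s+\d{4}\b          # March 20, 2011 / Mar 17th 2011 / Jan. 8, 2008
    ~xi';

    // Strip HTML first so tags cannot end up inside a match.
    preg_match_all($pattern, strip_tags($text), $matches);

    return $matches[0];
}

$text = 'updated 2011-03-21T00:43:14 Sunday, March 20, 2011 10/28/2011 21:16 '
      . '<a href="#">Author Name</a> on Mar 17th 2011 ... Location, ABBR., Jan. 8, 2008';

// Hand each candidate to strtotime() and report the result.
foreach (extractDateStrings($text) as $found) {
    $time = strtotime($found);
    echo $found, ' => ', $time === false ? 'unparsed' : date('Y-m-d H:i:s', $time), "\n";
}
```

Because the month alternative accepts any word that merely starts with a month abbreviation, it can produce false positives on ordinary words followed by numbers, so each match is still checked with strtotime before being trusted.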


Source: https://habr.com/ru/post/1380954/

