Extract image paths

I need to extract all image paths from HTML, not only from <img> tags but from anywhere in the document, including relative paths. I tried this regex:

 ([a-z\-_0-9\/\:\.]*\.(jpg|jpeg|png|gif)) 

... but it fails on paths that contain special characters. For example, in this case .

How can I capture a path so that it starts with ' (single quote), " (double quote) or / , between spaces and ends with a jpg|jpeg|png|gif image extension?

Edit: I use a DOM parser where possible, but here I have to use a regex to extract paths from almost anywhere, including inline CSS and JS.

+5
2 answers

You can use a lookbehind:

 (?<=['"])[^'"\s]*\.(jpg|jpeg|png|gif) 

This matches any path that contains no quotation marks or whitespace and is immediately preceded by a quotation mark.

The (minor) advantage of a lookbehind over matching the quote itself is that the entire match is the path, so you do not have to strip the quote in post-processing. Not all regex libraries support lookbehind, for complexity reasons, but in this case it is no slower than the alternative.
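As a rough illustration of the lookbehind approach, here is a minimal Python sketch. Python's re module is an assumption (the question does not name a language), the sample markup is hypothetical, and the extension group has been made non-capturing so that the whole match can be used directly.

```python
import re

# Lookbehind pattern from the answer above. The only change is that the
# extension group is non-capturing, so group(0) is the whole path.
IMAGE_PATH = re.compile(r"""(?<=['"])[^'"\s]*\.(?:jpg|jpeg|png|gif)""")

# Hypothetical sample mixing an <img> tag, inline CSS, and inline JS.
sample = '''
<img src="images/photo-1.jpg">
<div style="background: url('/static/bg_tile.png')"></div>
<script>var icon = "../icons/close.gif";</script>
'''

# Each match is already the bare path: the quote is only looked at,
# never consumed, so no post-processing is needed.
paths = [m.group(0) for m in IMAGE_PATH.finditer(sample)]
print(paths)
```

On this sample the list contains the three quoted paths without their quotes. An unquoted path, such as url(/static/bg_tile.png), would not be found, because the pattern requires a preceding quotation mark.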

+2

This works with your test data:

 ['"\/]([^\s'"]+?\.(jpg|jpeg|png|gif)) 

It matches a single quote, double quote, or slash, and then captures everything except whitespace, single quotes, and double quotes, up to the nearest image extension. The path is stored in the first capture group (often $1).

This solution has the advantage (or perhaps the disadvantage) of not requiring a lookbehind.
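A small Python sketch of the capture-group variant may make the trade-off easier to see. The pattern is taken unchanged from this answer; the use of Python and the sample markup are assumptions made for illustration.

```python
import re

# Capture-group pattern from the answer, unchanged: group 1 holds the path,
# group 2 holds the extension.
IMAGE_PATH = re.compile(r"""['"\/]([^\s'"]+?\.(jpg|jpeg|png|gif))""")

# Hypothetical sample: a quoted attribute and an unquoted CSS url().
sample = '''
<img src="images/photo-1.jpg">
<div style="background: url(/static/bg_tile.png)"></div>
'''

# The leading delimiter (quote or slash) is consumed by the match,
# so the path has to be read from group 1 rather than group 0.
captured = [m.group(1) for m in IMAGE_PATH.finditer(sample)]
print(captured)
```

Here the unquoted url() path is found as well, but because the slash acts as the delimiter it is consumed and the captured value is static/bg_tile.png, without the leading slash. Whether that matters depends on how the paths are resolved afterwards.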

+1

Source: https://habr.com/ru/post/1261529/

