Unable to get a regex to match in Java

This is an example of the string I want to parse:

<span style='display:block;margin-bottom:3px;'><a style='margin:4px;color:#B82933;font-size:120%' href='/cartelera/pelicula/18312'>Español </a></span><br><span style='display:block;margin-bottom:3px;'><a style='margin:4px;color:#FBEBC4;font-size:120%' href='/cartelera/pelicula/18313'>Subtitulada </a></span><br> </div> 

And this is the regular expression that I use for it:

 "pelicula/([0-9]*)'>([\\w\\s]*)</a>" 

I tested this regular expression on RegexPlanet and it gave me the expected result:

 group(1) = 18313
 group(2) = Subtitulada

But when I try to implement this regular expression in Java, it will not match anything. Here is the code:

 Pattern pattern = Pattern.compile("pelicula/([0-9]*)'>([\\w\\s]*)</a>");
 Matcher matcher = pattern.matcher(inputLine);
 while (matcher.find()) {
     version = matcher.group(2);
 }

What is the problem? The regex has already been tested. In the same code I search for several other patterns, and only two of them fail (I am showing just one here). Thank you in advance!

EDIT

I found the problem. When I view the page source in a browser, it shows everything, but when I fetch the page from Java, I get different source code. Why? Because the page asks for your city so that it can show information for that city. I do not know whether there is a workaround that would let me access the information I want, but that is the cause.

2 answers

Your regular expression is correct, but it seems that \w does not match ñ.
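The observation above can be checked with a small sketch. It assumes a shortened version of the HTML from the question, and the class and method names (`UnicodeWordDemo`, `linkText`) are made up for illustration. By default `\w` in Java covers only `[a-zA-Z_0-9]`; one possible alternative to changing the regex is the `Pattern.UNICODE_CHARACTER_CLASS` flag (Java 7 and later), which makes `\w` and `\s` follow Unicode definitions.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class UnicodeWordDemo {
    // Returns the link text captured by group 2, or null when nothing matches.
    static String linkText(String html, int flags) {
        Pattern p = Pattern.compile("pelicula/([0-9]*)'>([\\w\\s]*)</a>", flags);
        Matcher m = p.matcher(html);
        return m.find() ? m.group(2) : null;
    }

    public static void main(String[] args) {
        // Shortened, hypothetical input; \u00f1 is the letter n with tilde.
        String html = "<a href='/cartelera/pelicula/18312'>Espa\u00f1ol </a>";

        // Default flags: \w is ASCII-only, so the match stops at the accented
        // letter and the overall pattern fails (prints null).
        System.out.println(linkText(html, 0));

        // With UNICODE_CHARACTER_CLASS, \w also accepts non-ASCII letters,
        // so the full link text is captured.
        System.out.println(linkText(html, Pattern.UNICODE_CHARACTER_CLASS));
    }
}
```

Whether the flag or a looser pattern is the better fit depends on how strictly the link text should be validated.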

I changed the regex to

"pelicula/([0-9]*)'>(.*?)</a>"

and both occurrences now match. I used the reluctant quantifier *? so that .* does not consume every character between the first <a> and the last </a>. See "What is the difference between `Greedy` and `Reluctant` regular expression quantifiers?" for an explanation.
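The difference between the greedy and reluctant forms can be seen in a minimal sketch. The input is a shortened version of the HTML from the question, and the names `ReluctantDemo` and `linkTexts` are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class ReluctantDemo {
    // Collects group 2 (the link text) of every match of the given regex.
    static List<String> linkTexts(String regex, String html) {
        List<String> texts = new ArrayList<>();
        Matcher m = Pattern.compile(regex).matcher(html);
        while (m.find()) {
            texts.add(m.group(2));
        }
        return texts;
    }

    public static void main(String[] args) {
        // Shortened, hypothetical input with two links on one line.
        String html = "<a href='/cartelera/pelicula/18312'>Espa\u00f1ol </a></span><br>"
                + "<span><a href='/cartelera/pelicula/18313'>Subtitulada </a></span>";

        // Greedy: .* runs to the LAST </a>, so there is a single match whose
        // group 2 swallows the markup between the two links.
        System.out.println(linkTexts("pelicula/([0-9]*)'>(.*)</a>", html).size());

        // Reluctant: .*? stops at the FIRST </a>, so each link matches separately.
        System.out.println(linkTexts("pelicula/([0-9]*)'>(.*?)</a>", html));
    }
}
```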

@Bohemian correctly points out that you may need to enable the Pattern.DOTALL flag if the text inside <a> contains line breaks.


If your input spans several lines (i.e. contains newline characters), you need to enable "dot matches newline" mode.

There are two ways to do this:

Use the embedded flag (?s) at the start of the regular expression:

 Pattern pattern = Pattern.compile("(?s)pelicula/([0-9]*)'>([\\w\\s]*)</a>"); 

or use the Pattern.DOTALL flag when calling Pattern.compile() :

 Pattern pattern = Pattern.compile("pelicula/([0-9]*)'>([\\w\\s]*)</a>", Pattern.DOTALL); 
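As a rough check of what the flag changes, here is a sketch with a hypothetical input in which the link text ends with a line break (the names `DotallDemo` and `firstLinkText` are invented for the example). Note that "dot matches newline" only affects the `.` metacharacter, so it matters for the `(.*?)` variant of the pattern. A character class such as `[\w\s]` contains no dot, and `\s` already matches newline characters.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class DotallDemo {
    // Returns group 2 of the first match, or null when the pattern does not match.
    static String firstLinkText(Pattern pattern, String html) {
        Matcher m = pattern.matcher(html);
        return m.find() ? m.group(2) : null;
    }

    public static void main(String[] args) {
        // Hypothetical input: the link text is followed by a line break.
        String html = "<a href='/cartelera/pelicula/18313'>Subtitulada\n</a>";
        String dotRegex = "pelicula/([0-9]*)'>(.*?)</a>";

        // Without DOTALL, '.' refuses to cross the newline: no match.
        System.out.println(firstLinkText(Pattern.compile(dotRegex), html) == null);

        // Both ways of enabling "dot matches newline" behave the same.
        System.out.println(firstLinkText(Pattern.compile(dotRegex, Pattern.DOTALL), html) != null);
        System.out.println(firstLinkText(Pattern.compile("(?s)" + dotRegex), html) != null);

        // The character-class version needs no flag, because \s matches '\n'.
        Pattern classPattern = Pattern.compile("pelicula/([0-9]*)'>([\\w\\s]*)</a>");
        System.out.println(firstLinkText(classPattern, html) != null);
    }
}
```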

Source: https://habr.com/ru/post/1446363/

