Question about HTML parsing using Regex and Java

I have a question about finding HTML tags using Java and Regex.

I use the code below to find all tags in HTML; documentURL holds the HTML content.

The find() method returns true, which means that it can find something in the HTML, but the matches() method always returns false, and I'm completely puzzled by this.

I also referenced the Java documentation, but could not find the answer.

What is the correct way to use matcher?

    Pattern keyLineContents = Pattern.compile("(<.*?>)");
    Matcher keyLineMatcher = keyLineContents.matcher(documentURL);
    boolean result = keyLineMatcher.find();
    boolean matchFound = keyLineMatcher.matches();

Doing something like this throws an exception:

  String abc = keyLineMatcher.group(0); 

Thanks.

3 answers

The correct way to iterate over the matches is:

    Pattern p = Pattern.compile("<.*?>");
    Matcher m = p.matcher(htmlString);
    while (m.find()) {
        System.out.println(m.group());
    }
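To illustrate why find() and matches() disagree in the question: find() looks for the next part of the input that matches the pattern, while matches() succeeds only if the entire input matches from start to end. A failed matches() also leaves the matcher without a current match, so a later group(0) throws IllegalStateException. The following is a minimal sketch; the class name, method names, and sample input are made up for illustration.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class FindVsMatches {
    private static final Pattern TAG = Pattern.compile("<.*?>");

    /** True if any part of the input matches the pattern. */
    static boolean containsTag(String input) {
        return TAG.matcher(input).find();
    }

    /** True only if the entire input, start to end, matches the pattern. */
    static boolean isEntirelyOneMatch(String input) {
        return TAG.matcher(input).matches();
    }

    /** Repeats the question's sequence: find(), then matches(), then group(0). */
    static boolean groupThrowsAfterFailedMatches(String input) {
        Matcher m = TAG.matcher(input);
        m.find();
        if (m.matches()) {
            return false; // the whole input matched, so group(0) would be valid
        }
        try {
            m.group(0);
            return false;
        } catch (IllegalStateException e) {
            return true; // no current match after the failed matches()
        }
    }

    public static void main(String[] args) {
        String html = "Click <b>here</b> now"; // hypothetical sample input
        System.out.println(containsTag(html));                   // true
        System.out.println(isEntirelyOneMatch(html));            // false
        System.out.println(groupThrowsAfterFailedMatches(html)); // true
    }
}
```

For tag extraction, find() in a loop is the call that fits; matches() is meant for validating a whole string against a pattern.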

Regular expressions are an extremely poor tool for parsing HTML. The reason is this: regular expressions work well for parsing regular languages, but HTML is a context-free language. Regular expressions fall down on things like nested tags, > characters inside attribute values, and so on.
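As a small illustration of the attribute problem, the sketch below runs the same "<.*?>" pattern over a tag whose title attribute contains a > character. The class name, method name, and sample markup are invented for the example.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class RegexTagPitfall {
    private static final Pattern TAG = Pattern.compile("<.*?>");

    /** Collects every substring the naive tag pattern matches. */
    static List<String> findTags(String html) {
        List<String> tags = new ArrayList<>();
        Matcher m = TAG.matcher(html);
        while (m.find()) {
            tags.add(m.group());
        }
        return tags;
    }

    public static void main(String[] args) {
        // Hypothetical input: the title attribute contains a '>' character.
        String html = "<a title=\"x > y\" href=\"#\">link</a>";
        // The first match stops at the '>' inside the attribute value,
        // so the opening tag is cut short:
        System.out.println(findTags(html)); // [<a title="x >, </a>]
    }
}
```

The pattern has no notion of quoted attribute values, so it treats the first > it sees as the end of the tag.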

Instead, use a dedicated HTML parser, such as HTML Parser.
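To give a rough idea of what parser-based tag listing looks like, here is a sketch that uses the HTML parser bundled with the JDK (javax.swing.text.html) rather than the HTML Parser library named above, purely so the example needs no extra dependency. The class and method names are made up, and this is not a recommendation of that particular parser.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;
import javax.swing.text.MutableAttributeSet;
import javax.swing.text.html.HTML;
import javax.swing.text.html.HTMLEditorKit;
import javax.swing.text.html.parser.ParserDelegator;

class SwingTagLister {
    /** Returns the names of the start tags and empty tags the parser reports. */
    static List<String> tagNames(String html) {
        final List<String> names = new ArrayList<>();
        HTMLEditorKit.ParserCallback callback = new HTMLEditorKit.ParserCallback() {
            @Override
            public void handleStartTag(HTML.Tag tag, MutableAttributeSet attrs, int pos) {
                names.add(tag.toString());
            }

            @Override
            public void handleSimpleTag(HTML.Tag tag, MutableAttributeSet attrs, int pos) {
                names.add(tag.toString()); // empty elements such as <br> or <meta>
            }
        };
        try {
            new ParserDelegator().parse(new StringReader(html), callback, true);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return names;
    }

    public static void main(String[] args) {
        // Hypothetical input with a '>' inside an attribute value.
        // The parser may also report tags it implies, such as html or body.
        System.out.println(tagNames("<a title=\"x > y\" href=\"#\">link</a>"));
    }
}
```

The callback receives whole tags with their attributes already parsed, so quoting and nesting are handled by the parser rather than by the pattern.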


Why don't you try looking at the source code of some open source HTML parsers? HtmlCleaner, Tags, etc.

The overall strategy seems to be to parse and clean up the HTML and return an XML tree.

Personally, I read HTML by pushing opening tags onto a LIFO queue (a stack) and removing the matching opening tag from the top of the queue when a closing tag is detected, scanning further down the queue to allow for mismatched tags.
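A minimal sketch of that stack idea follows. The way mismatches are handled here (pop until the matching opener is found, and ignore closing tags that have no opener) is an assumption, since the post does not spell it out. The tag pattern is deliberately simplified, and all names are hypothetical.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class TagStack {
    // Simplified tag pattern: group 1 is "/" for a closing tag, group 2 is the tag name.
    // It does not handle comments, void elements, or '>' inside attribute values.
    private static final Pattern TAG =
            Pattern.compile("<(/?)([A-Za-z][A-Za-z0-9]*)[^>]*>");

    /** Returns the names of tags that were opened but never properly closed. */
    static List<String> unclosedTags(String html) {
        Deque<String> open = new ArrayDeque<>(); // used as a LIFO stack
        List<String> unclosed = new ArrayList<>();
        Matcher m = TAG.matcher(html);
        while (m.find()) {
            String name = m.group(2).toLowerCase();
            if (m.group(1).isEmpty()) {
                open.push(name); // opening tag
            } else if (open.contains(name)) {
                // Closing tag: pop until its opener is found; anything popped
                // on the way was left unclosed (assumed mismatch policy).
                String top;
                while (!(top = open.pop()).equals(name)) {
                    unclosed.add(top);
                }
            }
            // A closing tag with no matching opener is ignored.
        }
        unclosed.addAll(open); // whatever is still open at the end
        return unclosed;
    }

    public static void main(String[] args) {
        System.out.println(unclosedTags("<div><b><i>text</b></div>")); // [i]
    }
}
```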


I want to get the keywords content from an HTML tag, so I wrote:

    Pattern keyLineContents = Pattern.compile("<(.[^<]*)(keywords)(.[^<]*)>");
    Matcher keyLineMatcher = keyLineContents.matcher(documentURL);
    boolean result = keyLineMatcher.find();
    while (result) {
        String metaTagContent = keyLineMatcher.group(1) + " " + keyLineMatcher.group(3);
        Pattern kcontent = Pattern.compile("(.*?content=\")(.[^<]*?)(\".[^<]*?)");
        Matcher keyLineMatcher2 = kcontent.matcher(metaTagContent);
        boolean result2 = keyLineMatcher.find();
        while (result2) {
            String metaTagContent2 = keyLineMatcher.group(1);
            result2 = keyLineMatcher.find();
        }
    }

But I do not understand why my result2 is false. The first result is fine: it contains the entire contents of the keywords tag.
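One likely cause, judging from the code as posted: the inner loop calls keyLineMatcher.find() and keyLineMatcher.group(1) on the outer matcher instead of on keyLineMatcher2, so result2 only reports whether the document contains another keywords tag. The outer loop also never updates result. Below is a sketch of the same two-step idea with each matcher used on its own input. The patterns are simplified replacements rather than the originals, the names are hypothetical, and the approach still has the usual regex limits (for example a > inside an attribute value).

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class MetaKeywords {
    // Step 1: a <meta ...> tag that mentions "keywords" (simplified pattern).
    private static final Pattern META_KEYWORDS =
            Pattern.compile("<meta[^>]*keywords[^>]*>", Pattern.CASE_INSENSITIVE);
    // Step 2: the double-quoted value of the content attribute inside that tag.
    private static final Pattern CONTENT =
            Pattern.compile("content=\"([^\"]*)\"", Pattern.CASE_INSENSITIVE);

    /** Returns the content value of every keywords meta tag found. */
    static List<String> extractKeywords(String html) {
        List<String> found = new ArrayList<>();
        Matcher tagMatcher = META_KEYWORDS.matcher(html);
        while (tagMatcher.find()) {              // advances the outer matcher
            Matcher contentMatcher = CONTENT.matcher(tagMatcher.group());
            if (contentMatcher.find()) {         // inner matcher, not the outer one
                found.add(contentMatcher.group(1));
            }
        }
        return found;
    }

    public static void main(String[] args) {
        // Hypothetical sample document.
        String html = "<head><meta name=\"keywords\" content=\"java, regex\">"
                + "<title>t</title></head>";
        System.out.println(extractKeywords(html)); // [java, regex]
    }
}
```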

thanks


Source: https://habr.com/ru/post/1303357/

