iOS: HTML parsing - how to ignore tags like a, li, etc. inside <p>

I am currently using Hpple to parse HTML, for example:

    TFHpple *htmlParser = [TFHpple hppleWithHTMLData:[currentString dataUsingEncoding:NSUTF8StringEncoding]];
    NSString *paragraphsXpathQuery = @"//p//text()";
    NSArray *paragraphNodes = [htmlParser searchWithXPathQuery:paragraphsXpathQuery];
    if ([paragraphNodes count] > 0) {
        NSMutableArray *tempArray = [NSMutableArray array];
        for (TFHppleElement *element in paragraphNodes) {
            [tempArray addObject:[element content]];
        }
        article.paragraphs = tempArray;
    }

Thus, I get an array of paragraphs, and I can use NSString *result = [myArray componentsJoinedByString:@"\n\n"]; to join it into a single text with line breaks between the paragraphs.

However, if the HTML contains tags inside a paragraph, the text pieces around them are treated as separate entries and each ends up on a line of its own, so from markup like this:

 <p>I went to the <a href="blablabla.html">shop</a> to get some milk!</p> <p>It was awesome.</p> 

I get this:

 I went to the

 shop

 to get some milk!

 It was awesome.

And of course, I would like to get this (ignore the other tags inside the p tag):

 I went to the shop to get some milk!

 It was awesome.

Can you help me?

+4
2 answers
    NSString *HTMLTags = @"<[^>]*>"; // regex that matches any HTML tag
    NSString *htmlString = @"<html>bla bla</html>";
    NSString *stringWithoutHTML = [htmlString stringByReplacingOccurrencesOfRegex:HTMLTags withString:@""];

Don't forget to add this import to your code: #import "RegexKitLite.h". The library can be downloaded here: http://regexkit.sourceforge.net/#Downloads
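If adding a third-party library is not desirable, the same replacement can probably be done with Foundation alone. The sketch below swaps RegexKitLite for the NSRegularExpressionSearch option of NSString; the helper name StringByStrippingTags is made up for illustration. Note that this approach only deletes the tags: it does not decode entities such as &amp;, and paragraph boundaries are lost unless each paragraph is stripped separately.

```objc
#import <Foundation/Foundation.h>

// Hypothetical helper (not part of any library): removes every substring
// that looks like an HTML tag, using Foundation's built-in regex search.
static NSString *StringByStrippingTags(NSString *html) {
    return [html stringByReplacingOccurrencesOfString:@"<[^>]*>"
                                           withString:@""
                                              options:NSRegularExpressionSearch
                                                range:NSMakeRange(0, [html length])];
}
```

Applied to a single paragraph such as <p>I went to the <a href="blablabla.html">shop</a> to get some milk!</p>, this should leave only the sentence text on one line.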

+2

In XPath 1.0, you can do this in two steps:

  • Select all p elements: //p

  • For each selected p element (used as the context node), evaluate: string()

Explanation

By definition, the result of applying the standard XPath string() function to an element is the concatenation (in document order) of all its descendant text nodes.
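A minimal sketch of these two steps, written against libxml2 directly (the C library Hpple is built on, so a project using Hpple should already link it). The function name ParagraphStringsFromHTML is an assumption for illustration; xmlXPathCastNodeToString is libxml2's way of computing the string() value of a node.

```objc
#import <Foundation/Foundation.h>
#import <libxml/HTMLparser.h>
#import <libxml/xpath.h>

// Hypothetical helper: returns one string per <p> element, each being the
// concatenation of all text nodes below that element (its XPath string-value).
static NSArray *ParagraphStringsFromHTML(NSString *html) {
    NSMutableArray *paragraphs = [NSMutableArray array];
    NSData *data = [html dataUsingEncoding:NSUTF8StringEncoding];

    htmlDocPtr doc = htmlReadMemory([data bytes], (int)[data length], NULL, "UTF-8",
                                    HTML_PARSE_NOERROR | HTML_PARSE_NOWARNING);
    if (doc == NULL) {
        return paragraphs;
    }

    // Step 1: select all p elements.
    xmlXPathContextPtr context = xmlXPathNewContext(doc);
    xmlXPathObjectPtr result = NULL;
    if (context != NULL) {
        result = xmlXPathEvalExpression((const xmlChar *)"//p", context);
    }

    // Step 2: take the string-value of each selected element.
    if (result != NULL && result->nodesetval != NULL) {
        for (int i = 0; i < result->nodesetval->nodeNr; i++) {
            xmlChar *text = xmlXPathCastNodeToString(result->nodesetval->nodeTab[i]);
            if (text != NULL) {
                NSString *paragraph = [NSString stringWithUTF8String:(const char *)text];
                if (paragraph != nil) {
                    [paragraphs addObject:paragraph];
                }
                xmlFree(text);
            }
        }
    }

    if (result != NULL) xmlXPathFreeObject(result);
    if (context != NULL) xmlXPathFreeContext(context);
    xmlFreeDoc(doc);
    return paragraphs;
}
```

The returned array can then be joined as in the question, for example [ParagraphStringsFromHTML(currentString) componentsJoinedByString:@"\n\n"], so that a link inside a paragraph no longer splits it into several lines.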

+2

Source: https://habr.com/ru/post/1434192/

