IOS: Html parsing - how to ignore tags like a, li, etc. inside <p>

Question

I am currently using Hpple to parse HTML, for example:

    TFHpple *htmlParser = [TFHpple hppleWithHTMLData:[currentString dataUsingEncoding:NSUTF8StringEncoding]];
    NSString *paragraphsXpathQuery = @"//p//text()";
    NSArray *paragraphNodes = [htmlParser searchWithXPathQuery:paragraphsXpathQuery];
    if ([paragraphNodes count] > 0) {
        NSMutableArray *tempArray = [NSMutableArray array];
        for (TFHppleElement *element in paragraphNodes) {
            [tempArray addObject:[element content]];
        }
        article.paragraphs = tempArray;
    }

This gives me an array of paragraphs, and I can use NSString *result = [myArray componentsJoinedByString:@"\n\n"]; to join them into a single text with line breaks between the paragraphs.

However, if a paragraph contains other tags, the text inside each tag is treated as a separate node and ends up on a line of its own. So from HTML like this:

 <p>I went to the <a href="blablabla.html">shop</a> to get some milk!</p> <p>It was awesome!</p>

I get this:

 I went to the

 shop

 to get some milk!

 It was awesome!

What I would like to get instead, ignoring the other tags inside the p tag, is this:

 I went to the shop to get some milk!

 It was awesome!

Can you help me?

ios objective-c html-parsing xpath hpple

Zoltán Matók Sep 14 '12 at 12:48

2 answers

In XPath 1.0, you can do this in two steps:

1. Select all p elements: //p
2. For each selected p element (used as the initial context node), evaluate: string()

Explanation

By definition, the result of applying the standard XPath string() function to an element is the concatenation (in document order) of all its descendant text nodes.
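
As a rough illustration of the two steps above, here is a minimal sketch that calls libxml2 directly (the C library Hpple is built on) instead of going through Hpple. The function name ParagraphStrings is made up for this example, and the sketch assumes libxml2 is already linked and its headers are on the search path, as Hpple itself requires. xmlXPathCastNodeToString is libxml2's equivalent of string() for a single node.

```objc
#import <Foundation/Foundation.h>
#import <libxml/HTMLparser.h>
#import <libxml/xpath.h>

// Hypothetical helper: returns the XPath string-value of every <p> element.
static NSArray *ParagraphStrings(NSString *html) {
    NSMutableArray *paragraphs = [NSMutableArray array];
    NSData *data = [html dataUsingEncoding:NSUTF8StringEncoding];

    // Parse with libxml2's lenient HTML parser, silencing parse warnings.
    htmlDocPtr doc = htmlReadMemory((const char *)data.bytes, (int)data.length,
                                    NULL, "UTF-8",
                                    HTML_PARSE_NOERROR | HTML_PARSE_NOWARNING);
    if (doc == NULL) {
        return paragraphs;
    }

    // Step 1: select all p elements.
    xmlXPathContextPtr context = xmlXPathNewContext(doc);
    xmlXPathObjectPtr result = NULL;
    if (context != NULL) {
        result = xmlXPathEvalExpression((const xmlChar *)"//p", context);
    }

    // Step 2: take string() of each selected element, which concatenates
    // all descendant text nodes in document order.
    if (result != NULL && result->nodesetval != NULL) {
        for (int i = 0; i < result->nodesetval->nodeNr; i++) {
            xmlChar *value = xmlXPathCastNodeToString(result->nodesetval->nodeTab[i]);
            if (value != NULL) {
                NSString *text = [NSString stringWithUTF8String:(const char *)value];
                if (text != nil) {
                    [paragraphs addObject:text];
                }
                xmlFree(value);
            }
        }
    }

    if (result != NULL) xmlXPathFreeObject(result);
    if (context != NULL) xmlXPathFreeContext(context);
    xmlFreeDoc(doc);
    return paragraphs;
}
```

Joining the returned array with componentsJoinedByString:@"\n\n" should then give one block of text per p element, with the text of nested a or li elements kept inline.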

Dimitre Novatchev Sep 14 '12 at 13:23

AppleDelegate · Accepted Answer · 2012-09-14T13:48:22+0000

    NSString *HTMLTags = @"<[^>]*>"; // regex matching any HTML tag
    NSString *htmlString = @"<html>bla bla</html>";
    NSString *stringWithoutHTML = [htmlString stringByReplacingOccurrencesOfRegex:HTMLTags withString:@""];

Don't forget to import the header in your code: #import "RegexKitLite.h". The library can be downloaded here: http://regexkit.sourceforge.net/#Downloads
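
If adding RegexKitLite is not an option, the same tag-stripping idea can probably be expressed with Foundation's own NSRegularExpression. This is only a sketch of that substitution, not part of the answer above; the function name StripTags is hypothetical.

```objc
#import <Foundation/Foundation.h>

// Hypothetical helper: removes anything that looks like an HTML tag.
static NSString *StripTags(NSString *html) {
    NSError *error = nil;
    NSRegularExpression *regex =
        [NSRegularExpression regularExpressionWithPattern:@"<[^>]*>"
                                                  options:0
                                                    error:&error];
    if (regex == nil) {
        // The pattern failed to compile; return the input unchanged.
        return html;
    }
    return [regex stringByReplacingMatchesInString:html
                                           options:0
                                             range:NSMakeRange(0, html.length)
                                      withTemplate:@""];
}
```

Note that a pattern like this removes the p tags as well, so paragraph boundaries are lost unless it is applied to each paragraph separately, and it does not decode entities such as &amp;.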
