Using Regex to Find HTML Links Next to Keywords

Question

If I search for the keyword "sales", I want to get the nearest link, e.g. "http://www.somewebsite.com", even if there are several links in the file. I want the nearest link, not the first one. In other words, I need to find the link that appears immediately before the keyword match.

This does not work...

regex = (http|https)://[-A-Za-z0-9./]+.*(?!((http|https)://[-A-Za-z0-9./]+))sales

What is the best way to find the link closest to the keyword?

+4

python regex negative-lookahead

htmlfarmer Jan 23 '12 at 1:05

4 answers

It is generally much easier and more reliable to use an HTML parser rather than a regular expression.

Using the third-party lxml module:

    import lxml.html as LH

    content = '''<html><a href="http://www.not-this-one.com"></a>
    <a href="http://www.somewebsite.com"></a><p>other stuff</p><p>sales</p>
    </html>
    '''

    doc = LH.fromstring(content)
    for url in doc.xpath('''
        //*[contains(text(),"sales")]
        /preceding::*[starts-with(@href,"http")][1]/@href'''):
        print(url)

gives

 http://www.somewebsite.com

I find lxml (and XPath) a convenient way to express which elements I'm looking for. However, if installing a third-party module is not an option, you can also accomplish this particular job with HTMLParser from the standard library:

    # Python 3: the class lives in html.parser (it was the HTMLParser module in Python 2)
    from html.parser import HTMLParser
    import contextlib

    class MyParser(HTMLParser):
        def __init__(self):
            super().__init__()
            self.last_link = None

        def handle_starttag(self, tag, attrs):
            # Remember the most recent href seen on any start tag
            attrs = dict(attrs)
            if 'href' in attrs:
                self.last_link = attrs['href']

    content = '''<html><a href="http://www.not-this-one.com"></a>
    <a href="http://www.somewebsite.com"></a><p>other stuff</p><p>sales</p>
    </html>
    '''

    # Only parse the part of the document that precedes the first "sales"
    idx = content.find('sales')

    with contextlib.closing(MyParser()) as parser:
        parser.feed(content[:idx])
        print(parser.last_link)
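Slicing the content at the first "sales" only covers the first occurrence of the keyword. A minimal sketch of one way to handle every occurrence, assuming Python 3: watch the text nodes in handle_data and note which href was seen last. The names KeywordLinkParser and links_before_keyword are made up for this illustration and are not part of the original answer.

```python
from html.parser import HTMLParser


class KeywordLinkParser(HTMLParser):
    """For each text node containing the keyword, record the last href seen before it."""

    def __init__(self, keyword):
        super().__init__()
        self.keyword = keyword.lower()
        self.last_link = None
        self.hits = []  # one entry per text node that contains the keyword

    def handle_starttag(self, tag, attrs):
        href = dict(attrs).get('href')
        if href is not None:
            self.last_link = href

    def handle_data(self, data):
        # Case-insensitive substring check on the text between tags
        if self.keyword in data.lower():
            self.hits.append(self.last_link)


def links_before_keyword(html, keyword):
    parser = KeywordLinkParser(keyword)
    parser.feed(html)
    parser.close()
    return parser.hits


content = '''<html><a href="http://www.not-this-one.com"></a>
<a href="http://www.somewebsite.com"></a><p>other stuff</p><p>sales</p>
<a href="http://www.third.com"></a><p>more sales</p></html>'''

print(links_before_keyword(content, 'sales'))
# ['http://www.somewebsite.com', 'http://www.third.com']
```

An entry is None when the keyword appears before any link. Like the version above, this only looks backwards; it never considers a link that follows the keyword.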

The XPath used in the lxml solution has the following meaning:

    //*                            # find all elements
    [contains(text(),"sales")]     # whose text content contains "sales"
    /preceding::*                  # search the preceding elements
    [starts-with(@href,"http")]    # that have an href attribute starting with "http"
    [1]                            # keep only the first such element, i.e. the nearest preceding one
    /@href                         # return the value of the href attribute

+3

unutbu Jan 23 '12 at 2:43

I don't think you can do this with a regular expression alone (especially when looking before the keyword match), since a regex has no notion of comparing distances.

I think you are best off doing something like this:

find all occurrences of sales and record their substring indices, called salesIndex
find all occurrences of https?://[-A-Za-z0-9./]+ and record their substring indices, called urlIndex
loop through salesIndex. For each location i in salesIndex, find the urlIndex that is closest.

Depending on how you want to judge "nearest", you may need to compare the start and end indices of the sales and http... occurrences, i.e. find the end index of the URL that is closest to the start index of the current sales occurrence, find the start index of the URL that is closest to the end index of the current sales occurrence, and pick whichever of the two is closer.

You can use matches = re.finditer(pattern,string,re.IGNORECASE) to get an iterator over the matches, and then match.span() to get the start/end substring indices for each match in matches.
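As a rough sketch of the steps above (not a definitive implementation): the helper names gap and nearest_urls are invented for this example, the keyword is treated as a literal string via re.escape, and "nearest" is taken to mean the smallest number of characters between the two spans. On a tie, the earlier URL wins.

```python
import re

URL_PATTERN = r'https?://[-A-Za-z0-9./]+'


def gap(span_a, span_b):
    """Characters between two (start, end) spans; 0 if they touch or overlap."""
    return max(span_b[0] - span_a[1], span_a[0] - span_b[1], 0)


def nearest_urls(text, keyword):
    """For each keyword occurrence, return the URL whose span is closest to it."""
    url_matches = list(re.finditer(URL_PATTERN, text, re.IGNORECASE))
    results = []
    for key_match in re.finditer(re.escape(keyword), text, re.IGNORECASE):
        if not url_matches:
            results.append(None)
            continue
        # min() keeps the first URL when two are equally close
        best = min(url_matches, key=lambda m: gap(m.span(), key_match.span()))
        results.append(best.group())
    return results


text = ("http://www.first.com intro text http://www.second.com sales "
        "and then http://www.third.com")
print(nearest_urls(text, "sales"))
# ['http://www.second.com']
```

This compares every URL with every keyword occurrence, which is fine for a single page but grows with the product of the two counts.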

0

mathematical.coffee Jan 23 '12 at 2:21

Based on what mathematical.coffee suggested, you could try something along these lines:

    import re

    myString = ""  ## the string you want to search

    link_matches = re.finditer('(http|https)://[-A-Za-z0-9./]+', myString, re.IGNORECASE)
    sales_matches = re.finditer('sales', myString, re.IGNORECASE)

    link_locations = []
    for match in link_matches:
        link_locations.append([match.span(), match.group()])

    for match in sales_matches:
        match_loc = match.span()
        distances = []
        for link_loc in link_locations:
            if match_loc[0] > link_loc[0][1]:  ## if the link is before your keyword
                ## append the distance between the END of the link and the START of the keyword
                distances.append(match_loc[0] - link_loc[0][1])
            else:
                ## append the distance between the END of the keyword and the START of the link
                distances.append(link_loc[0][0] - match_loc[1])

        if distances:
            ## index() returns the first link at the minimum distance
            closest = distances.index(min(distances))
            print("Closest Link: " + link_locations[closest][1] + "\n")

0

Moritz Jan 25 '12 at 20:42

htmlfarmer · Accepted Answer · 2012-01-29T21:23:28+0000

I tested this code and it seemed to work ...

    import re

    def closesturl(keyword, website):
        urlregex = r"(http|https)://[-A-Za-z0-9\./]+"
        # Record the (start, end) span of every keyword and URL match
        keylist = [m.span() for m in re.finditer(keyword, website, re.IGNORECASE)]
        urllist = [m.span() for m in re.finditer(urlregex, website, re.IGNORECASE)]
        if not keylist or not urllist:
            return ""

        urls = []
        for keystart, _ in keylist:
            # Start with the first URL, then scan forward for a closer one
            best = urllist[0]
            closest = abs(best[0] - keystart)
            for candidate in urllist[1:]:
                distance = abs(candidate[0] - keystart)
                if distance < closest:
                    closest, best = distance, candidate
                elif distance > closest:
                    # local minimum: URLs are in document order, so later ones are farther away
                    break
            urls.append(website[best[0]:best[1]])
        return urls

    somestring = "hey whats up... http://www.firstlink.com some other test http://www.secondlink.com then mykeyword"
    keyword = "mykeyword"
    print(closesturl(keyword, somestring))

When run, the code above reports http://www.secondlink.com as the closest link.

If anyone has ideas on how to speed up this code, that would be awesome!
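One possible speed-up, offered as a sketch and not as a benchmarked result: since re.finditer yields URLs in document order, their start indices are already sorted, so a binary search with the standard bisect module can find the two neighbouring URLs for each keyword match without scanning the list. The function name closest_urls_fast is made up here. Unlike the code above, it treats the keyword as a literal string (re.escape) and returns an empty list when nothing is found.

```python
import bisect
import re

URL_REGEX = r'https?://[-A-Za-z0-9./]+'


def closest_urls_fast(keyword, website):
    """Return, for each keyword match, the URL whose start index is nearest."""
    urls = [(m.start(), m.group())
            for m in re.finditer(URL_REGEX, website, re.IGNORECASE)]
    if not urls:
        return []
    starts = [start for start, _ in urls]  # sorted, because finditer scans left to right
    result = []
    for key in re.finditer(re.escape(keyword), website, re.IGNORECASE):
        pos = bisect.bisect_left(starts, key.start())
        # Only the URLs on either side of the insertion point can be nearest
        candidates = urls[max(pos - 1, 0):pos + 1]
        best = min(candidates, key=lambda item: abs(item[0] - key.start()))
        result.append(best[1])
    return result


somestring = ("hey whats up... http://www.firstlink.com some other test "
              "http://www.secondlink.com then mykeyword")
print(closest_urls_fast("mykeyword", somestring))
# ['http://www.secondlink.com']
```

Each keyword lookup then costs a binary search over the URL list. Whether that matters in practice depends on how many links and keyword matches a page contains.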

Thanks, V$H.
