Problem with re.findall (duplicates)

I am trying to fetch the source of a 4chan board and extract the links to threads.

My regular expression does not work the way I expect. Here is the code:

import urllib2, re

req = urllib2.Request('http://boards.4chan.org/wg/')
resp = urllib2.urlopen(req)
html = resp.read()

print re.findall("res/[0-9]+", html)
#print re.findall("^res/[0-9]+$", html)

The problem is that:

print re.findall("res/[0-9]+", html)

gives duplicates.

I cannot use the anchored version:

print re.findall("^res/[0-9]+$", html)

I read the Python docs, but they didn't help.

1 answer

This is because the page source contains each link more than once.

You can easily make them unique by putting them in a set.

>>> print set(re.findall("res/[0-9]+", html))
set(['res/3833795', 'res/3837945', 'res/3835377', 'res/3837941', 'res/3837942',
'res/3837950', 'res/3100203', 'res/3836997', 'res/3837643', 'res/3835174'])
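A set does not keep the order in which the links appear on the page. If the order matters, the duplicates can be dropped while keeping the first occurrence of each link. The following is only a minimal sketch: the `html` string is a made-up stand-in for the downloaded page, and `unique_in_order` is a hypothetical helper name, not something from the `re` module.

```python
import re

# Hypothetical stand-in for the downloaded page source (not real 4chan markup).
html = (
    '<a href="res/3833795">Thread</a> '
    '<a href="res/3837945">Thread</a> '
    '<a href="res/3833795#q1">Reply</a>'
)

def unique_in_order(items):
    """Return the items without duplicates, keeping first-seen order."""
    seen = set()
    result = []
    for item in items:
        if item not in seen:
            seen.add(item)
            result.append(item)
    return result

links = unique_in_order(re.findall(r"res/[0-9]+", html))
print(links)
```

With the sample string above, `links` is `['res/3833795', 'res/3837945']`.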

But if you are going to do something more complex, I recommend using a library that can parse HTML, such as BeautifulSoup or lxml.
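To give a rough idea of what parsing buys you, here is a sketch that reads only the `href` attributes of `<a>` tags, so text that merely looks like a link elsewhere in the page is ignored. It swaps in the standard library's `html.parser` instead of BeautifulSoup or lxml so that it runs without extra installs, and it is written for Python 3 (in Python 2 the module is named `HTMLParser`). The class name `ThreadLinkParser` and the sample markup are assumptions made for illustration.

```python
import re
from html.parser import HTMLParser

class ThreadLinkParser(HTMLParser):
    """Collects href values of <a> tags that start with res/<digits>."""

    def __init__(self):
        super().__init__()
        self.links = set()

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        for name, value in attrs:
            if name == "href" and value:
                # Keep only the res/<digits> part, dropping any #anchor suffix.
                match = re.match(r"res/[0-9]+", value)
                if match:
                    self.links.add(match.group(0))

# Hypothetical sample markup; the real page structure may differ.
sample = (
    '<div><a href="res/3833795">Thread</a>'
    '<a href="res/3833795#p1">Reply</a>'
    '<a href="/wg/2">Next</a>'
    '<span>res/9999999</span></div>'
)

parser = ThreadLinkParser()
parser.feed(sample)
print(sorted(parser.links))
```

Here the `res/9999999` inside the `<span>` is skipped because it is plain text rather than a link target, and the two links to the same thread collapse into one entry.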


Source: https://habr.com/ru/post/1782946/

