Return specific content

Question

I only need the IP addresses. How can I extract just those from the output? My code right now is:

import urllib.request
from bs4 import BeautifulSoup

# Fetch the archive page and parse it
x = urllib.request.urlopen('http://bannedhackersips.blogspot.com/2014_08_04_archive.html')
soup = BeautifulSoup(x, "html.parser")
# Every <ul class="posts"> element on the page
data = soup.find_all("ul", {"class": "posts"})

for content in data:
    print(content.text)

Output:

[Fail2Ban] SSH: banned 116.10.191.162
[Fail2Ban] SSH: banned 116.10.191.204
[Fail2Ban] SSH: banned 61.174.51.232
[Fail2Ban] SSH: banned 61.174.51.224
[Fail2Ban] SSH: banned 116.10.191.225
[Fail2Ban] SSH: banned 200.162.47.130
[Fail2Ban] SSH: banned 116.10.191.175
[Fail2Ban] SSH: banned 61.174.51.223
[Fail2Ban] SSH: banned 61.174.51.234
[Fail2Ban] SSH: banned 61.174.51.209
[Fail2Ban] SSH: banned 116.10.191.165
[Fail2Ban] SSH: banned 106.240.247.220

python web-scraping beautifulsoup

KK Sep 23 '15 at 17:07

1 answer

Padraic Cunningham · Accepted Answer · 2015-09-23T17:15:52+0000

You can extract the IPs from the text with a regular expression:

import re

data = soup.find("ul", {"class": "posts"})

# Raw string so the backslashes reach the regex engine unchanged
r = re.compile(r"\d+\.\d+\.\d+\.\d+")

print(r.findall(data.text))
['116.10.191.162', '116.10.191.204', '61.174.51.232', '61.174.51.224', '116.10.191.225', '200.162.47.130', '116.10.191.175', '61.174.51.223', '61.174.51.234', '61.174.51.209', '116.10.191.165', '106.240.247.220']
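
The pattern above matches any four dot-separated runs of digits, so it would also accept something like 999.1.1.1. If that matters, one option is to check each match with the standard library's ipaddress module. The following is only a sketch: SAMPLE_TEXT is a made-up excerpt in the same shape as the question's output (the 999.1.1.1 line is invented for illustration), and extract_valid_ips is a hypothetical helper name.

```python
import ipaddress
import re

# Hypothetical sample in the same shape as the question's output;
# the last line is deliberately not a valid address.
SAMPLE_TEXT = "\n".join([
    "[Fail2Ban] SSH: banned 116.10.191.162",
    "[Fail2Ban] SSH: banned 61.174.51.232",
    "[Fail2Ban] SSH: banned 999.1.1.1",
])

# Four groups of one to three digits separated by dots.
IP_CANDIDATE = re.compile(r"\b\d{1,3}(?:\.\d{1,3}){3}\b")


def extract_valid_ips(text):
    """Return regex matches that ipaddress accepts as real addresses."""
    valid = []
    for candidate in IP_CANDIDATE.findall(text):
        try:
            ipaddress.ip_address(candidate)
        except ValueError:
            # Looks like an IP but is out of range, e.g. 999.1.1.1
            continue
        valid.append(candidate)
    return valid


print(extract_valid_ips(SAMPLE_TEXT))
# ['116.10.191.162', '61.174.51.232']
```

With the real page you would pass data.text instead of SAMPLE_TEXT.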

Or, since every line follows the same pattern, you can split the text into lines and split each line once from the right to take the IP:

data = soup.find("ul", {"class": "posts"})

ips = [line.rsplit(None, 1)[1] for line in data.text.splitlines() if line]

print(ips)
['116.10.191.162', '116.10.191.204', '61.174.51.232', '61.174.51.224', '116.10.191.225', '200.162.47.130', '116.10.191.175', '61.174.51.223', '61.174.51.234', '61.174.51.209', '116.10.191.165', '106.240.247.220']
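
One caveat with the split approach: rsplit(None, 1)[1] raises IndexError on a line that is only whitespace or has a single token. If the page might contain such lines, a slightly more defensive variant could look like the sketch below. SAMPLE_TEXT and last_tokens are made up for illustration and are not part of the original answer.

```python
# Hypothetical sample: two normal lines plus a blank line, a
# whitespace-only line and a one-word line that break rsplit(...)[1].
SAMPLE_TEXT = "\n".join([
    "[Fail2Ban] SSH: banned 116.10.191.162",
    "",
    "[Fail2Ban] SSH: banned 61.174.51.232",
    "   ",
    "malformed",
])


def last_tokens(text):
    """Return the last whitespace-separated token of each multi-token line."""
    ips = []
    for line in text.splitlines():
        parts = line.split()
        if len(parts) < 2:
            # Skip blank, whitespace-only and single-token lines
            continue
        ips.append(parts[-1])
    return ips


print(last_tokens(SAMPLE_TEXT))
# ['116.10.191.162', '61.174.51.232']
```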

There is only one element with the class posts on the page, so find is enough. When you iterate over find_all, you are actually iterating over a list that contains a single element.
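
To see the difference between find and find_all without hitting the network, here is a small sketch against an inline HTML string. The markup is an assumption made for illustration (the real page's structure inside the ul may differ, for example each entry may be wrapped in a link), and HTML and ips are hypothetical names.

```python
from bs4 import BeautifulSoup

# Assumed, simplified markup: a single <ul class="posts"> with one <li> per ban.
HTML = (
    '<ul class="posts">'
    "<li>[Fail2Ban] SSH: banned 116.10.191.162</li>"
    "<li>[Fail2Ban] SSH: banned 61.174.51.232</li>"
    "</ul>"
)

soup = BeautifulSoup(HTML, "html.parser")

# find_all returns a list-like result, here with exactly one element
all_lists = soup.find_all("ul", {"class": "posts"})
# find returns the first match directly (or None if nothing matches)
first = soup.find("ul", {"class": "posts"})

print(len(all_lists))                  # 1
print(all_lists[0].text == first.text)  # True

# Taking the last token of each list item gives the IPs
ips = [li.get_text().rsplit(None, 1)[-1] for li in first.find_all("li")]
print(ips)
# ['116.10.191.162', '61.174.51.232']
```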
