Scraping a series of tables with BeautifulSoup

Question

I am trying to learn about web scraping and python (and programming, for that matter), and have found the BeautifulSoup library, which seems to offer a lot of features.

I am trying to figure out how best to get the relevant information from this page:

http://www.aidn.org.au/Industry-ViewCompany.asp?CID=3113

There is more I could say about it, but basically I want the company name, its description, contact details, the various company details and statistics, and so on.

At this stage I am looking at how to capture this data cleanly, with a view to putting it all into a CSV file or something similar later.

I am confused about how to use BS to grab the different table data. There are many tags, and I am not sure how to anchor on anything unique.

The best I have come up with is the following code as a start:

from bs4 import BeautifulSoup
import urllib2

html = urllib2.urlopen("http://www.aidn.org.au/Industry-ViewCompany.asp?CID=3113")
soup = BeautifulSoup(html)
soupie = soup.prettify()
print soupie

and then from there use regex etc. to pull the data out of the cleaned-up text.

But there must be a better way to do this using the BS tree? Or is this site formatted in such a way that BS won't provide much more help?

I am not looking for a full solution, as that is a big ask and I want to learn, but any code snippets to get me on my way would be much appreciated.

Update

Thanks to @ZeroPiraeus below, I'm starting to understand how to parse the tables. Here is the output from his code:

=== Personnel ===
bodytext    Ms Gail Morgan CEO
bodytext    Phone: +61.3. 9464 4455 Fax: +61.3. 9464 4422
bodytext    Lisa Mayoh Sales Manager
bodytext    Phone: +61.3. 9464 4455 Fax: +61.3. 9464 4422 Email: bob@aerospacematerials.com.au

=== Company Details ===
bodytext    ACN: 007 350 807 ABN: 71 007 350 807 Australian Owned Annual Turnover: $5M - $10M Number of Employees: 6-10 QA: ISO9001-2008, AS9120B, Export Percentage: 5 % Industry Categories: AerospaceLand (Vehicles, etc)LogisticsMarineProcurement Company Email: lisa@aerospacematerials.com.au Company Website: http://www.aerospacematerials.com.au Office: 2/6 Ovata Drive Tullamarine VIC 3043 Post: PO Box 188 TullamarineVIC 3043 Phone: +61.3. 9464 4455 Fax: +61.3. 9464 4422
paraheading ACN:
bodytext    007 350 807
paraheading ABN:
bodytext    71 007 350 807
paraheading 
bodytext    Australian Owned
paraheading Annual Turnover:
bodytext    $5M - $10M
paraheading Number of Employees:
bodytext    6-10
paraheading QA:
bodytext    ISO9001-2008, AS9120B,
paraheading Export Percentage:
bodytext    5 %
paraheading Industry Categories:
bodytext    AerospaceLand (Vehicles, etc)LogisticsMarineProcurement
paraheading Company Email:
bodytext    lisa@aerospacematerials.com.au
paraheading Company Website:
bodytext    http://www.aerospacematerials.com.au
paraheading Office:
bodytext    2/6 Ovata Drive Tullamarine VIC 3043
paraheading Post:
bodytext    PO Box 188 TullamarineVIC 3043
paraheading Phone:
bodytext    +61.3. 9464 4455
paraheading Fax:
bodytext    +61.3. 9464 4422

My next question is: what is the best way to put this data in a CSV that is suitable for import into a spreadsheet? For example, things like "ABN", "ACN", "Company website" etc as column headers and then the corresponding data in the form of row information.


python beautifulsoup

Fusilli Jerry, Nov 12 '12 at 14:02


Zero Piraeus · Accepted Answer · 2012-11-12T19:32:31+0000

This snippet should give you an idea of how to get the relevant information out of the page:

import requests

from bs4 import BeautifulSoup

url = "http://www.aidn.org.au/Industry-ViewCompany.asp?CID=3113"
html = requests.get(url).text
soup = BeautifulSoup(html)

# Each section of the page starts with a "Feature-Heading" cell.
for feature_heading in soup.find_all("td", {"class": "Feature-Heading"}):
    print("\n=== %s ===" % feature_heading.text)
    # The section's content lives in the sibling cell next to the heading.
    details = feature_heading.find_next_sibling("td")
    for item in details.find_all("td", {"class": ["bodytext", "paraheading"]}):
        # Print the cell's class, then its text with whitespace collapsed.
        print("\t".join([item["class"][0], " ".join(item.text.split())]))

I have used requests rather than urllib2 here.
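To turn the alternating paraheading / bodytext cells into key/value pairs, one option is to walk the heading cells and look up the next sibling of each. The sketch below runs against a small inline HTML fragment that only imitates the page's class names; the fragment and the function name pair_headings are made up for illustration, and the real page's markup may well differ. It passes "html.parser" explicitly so that no extra parser needs to be installed.

```python
from bs4 import BeautifulSoup

# Hypothetical fragment that mimics the class names seen on the page.
SAMPLE_HTML = """
<table>
  <tr><td class="paraheading">ACN:</td><td class="bodytext">007 350 807</td></tr>
  <tr><td class="paraheading">ABN:</td><td class="bodytext">71 007 350 807</td></tr>
  <tr><td class="paraheading">Annual Turnover:</td><td class="bodytext">$5M -
      $10M</td></tr>
</table>
"""


def pair_headings(html):
    """Return {heading: value} for each paraheading cell followed by a bodytext cell."""
    soup = BeautifulSoup(html, "html.parser")
    pairs = {}
    for heading in soup.find_all("td", class_="paraheading"):
        # Collapse whitespace and drop the trailing colon, e.g. "ACN:" -> "ACN".
        key = " ".join(heading.get_text().split()).rstrip(":")
        value_cell = heading.find_next_sibling("td", class_="bodytext")
        if not key or value_cell is None:
            continue  # skip blank headings and headings with no value cell
        pairs[key] = " ".join(value_cell.get_text().split())
    return pairs


details = pair_headings(SAMPLE_HTML)
print(details)
```

Note that skipping blank headings would drop a value such as "Australian Owned", which appears in the output above under an empty paraheading; if you need it, give such cells a placeholder key instead.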

EDIT:

To write the results to CSV, you could do something like this:

import csv
import requests

from bs4 import BeautifulSoup

columns = ["ACN", "ABN", "Annual Turnover", "QA"]
urls = ["http://www.aidn.org.au/Industry-ViewCompany.asp?CID=3113", ] # ... etc.

# newline="" stops the csv module's line endings being translated (Python 3).
with open("data.csv", "w", newline="") as csv_file:
    writer = csv.DictWriter(csv_file, columns)
    writer.writeheader()
    for url in urls:
        soup = BeautifulSoup(requests.get(url).text)
        row = {}
        for heading in soup.find_all("td", {"class": "paraheading"}):
            # Collapse whitespace and drop the trailing colon, e.g. "ACN:" -> "ACN".
            key = " ".join(heading.text.split()).rstrip(":")
            if key in columns:
                next_td = heading.find_next_sibling("td", {"class": "bodytext"})
                if next_td is None:
                    continue  # heading without a value cell
                value = " ".join(next_td.text.split())
                row[key] = value
        # Columns missing from this page are written as empty cells.
        writer.writerow(row)
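If the set of fields varies from company to company, the column list could be built from the scraped rows instead of being fixed up front. Here is a minimal sketch using only the standard library; the rows data and the helper name rows_to_csv are invented for illustration, and each dict stands in for the heading/value pairs from one page.

```python
import csv
import io


def rows_to_csv(rows):
    """Serialise a list of dicts to CSV text, using the union of keys as columns."""
    columns = []
    for row in rows:
        for key in row:
            if key not in columns:
                columns.append(key)  # keep first-seen order
    buffer = io.StringIO()
    # restval fills in cells for columns a given row does not have.
    writer = csv.DictWriter(buffer, fieldnames=columns, restval="")
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


# Hypothetical rows, one per company page.
rows = [
    {"ACN": "007 350 807", "ABN": "71 007 350 807"},
    {"ACN": "000 000 000", "QA": "ISO9001-2008, AS9120B"},
]
csv_text = rows_to_csv(rows)
print(csv_text)
```

The trade-off is that every page has to be scraped before the header can be written, so this suits a modest number of URLs better than a very long crawl.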
