I am trying to learn about web scraping and Python (and programming in general, for that matter), and have found the BeautifulSoup library, which seems to offer a lot of features.
I am trying to figure out how best to get the relevant information from this page:
http://www.aidn.org.au/Industry-ViewCompany.asp?CID=3113
There is more to it than this, but basically I want the name of the company, its description, contact details, the various company details / statistics etc.
Later on I'll look at how to clean the data up and write it out as CSV or some other format.
I am confused about how to use BS to capture the various table data. There are many tags and so on, and I'm not sure how to latch onto something unique.
The best I have come up with is the following code as a start:
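from bs4 import BeautifulSoup
import urllib2  # Python 2 module; urllib.request replaces it in Python 3

# Fetch the page and parse it into a BeautifulSoup tree
html = urllib2.urlopen("http://www.aidn.org.au/Industry-ViewCompany.asp?CID=3113")
soup = BeautifulSoup(html)

# Print the indented markup to inspect the page structure
soupie = soup.prettify()
print soupie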
and then from there use regex etc. to extract the data from the cleaned-up text.
But surely there must be a better way to do this using the BS tree? Or is this site formatted in such a way that BS won't be much more help?
I'm not looking for a complete solution, as that is a big ask and I want to learn, but any code snippets that help me along the way would be gratefully received.
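To make the question concrete, here is a minimal sketch of what navigating the tree instead of regexing the prettified text could look like. It is written for Python 3 and runs against a small inline sample rather than the live page. The `td` tags and the table layout in `SAMPLE_HTML` are assumptions for illustration; only the class names `paraheading` and `bodytext` come from the output shown in the update below, and `extract_pairs` is a hypothetical helper name.

```python
from bs4 import BeautifulSoup

# Hypothetical markup: the real page's tags and nesting may differ.
# Only the class names "paraheading" and "bodytext" are taken from the
# output in the update below.
SAMPLE_HTML = """
<table>
  <tr><td class="paraheading">ACN:</td><td class="bodytext">007 350 807</td></tr>
  <tr><td class="paraheading">ABN:</td><td class="bodytext">71 007 350 807</td></tr>
  <tr><td class="paraheading">Annual Turnover:</td><td class="bodytext">$5M - $10M</td></tr>
</table>
"""


def extract_pairs(html):
    """Return (label, value) tuples for each heading cell and the value cell after it."""
    soup = BeautifulSoup(html, "html.parser")
    pairs = []
    for heading in soup.find_all("td", class_="paraheading"):
        # Next "bodytext" cell in document order after this heading.
        value = heading.find_next("td", class_="bodytext")
        if value is None:
            continue
        label = heading.get_text(strip=True).rstrip(":")
        # A separator keeps text from nested tags from running together.
        pairs.append((label, value.get_text(" ", strip=True)))
    return pairs


pairs = extract_pairs(SAMPLE_HTML)
for label, value in pairs:
    print(label, "=", value)
```

Passing a separator to `get_text` might also help with values whose parts are split across child tags and would otherwise be joined with no space between them.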
Update
Thanks to @ZeroPiraeus below, I'm starting to understand how to parse the tables. Here is the output from his code:
=== Personnel ===
bodytext Ms Gail Morgan CEO
bodytext Phone: +61.3. 9464 4455 Fax: +61.3. 9464 4422
bodytext Lisa Mayoh Sales Manager
bodytext Phone: +61.3. 9464 4455 Fax: +61.3. 9464 4422 Email: bob@aerospacematerials.com.au
=== Company Details ===
bodytext ACN: 007 350 807 ABN: 71 007 350 807 Australian Owned Annual Turnover: $5M - $10M Number of Employees: 6-10 QA: ISO9001-2008, AS9120B, Export Percentage: 5 % Industry Categories: AerospaceLand (Vehicles, etc)LogisticsMarineProcurement Company Email: lisa@aerospacematerials.com.au Company Website: http://www.aerospacematerials.com.au Office: 2/6 Ovata Drive Tullamarine VIC 3043 Post: PO Box 188 TullamarineVIC 3043 Phone: +61.3. 9464 4455 Fax: +61.3. 9464 4422
paraheading ACN:
bodytext 007 350 807
paraheading ABN:
bodytext 71 007 350 807
paraheading
bodytext Australian Owned
paraheading Annual Turnover:
bodytext $5M - $10M
paraheading Number of Employees:
bodytext 6-10
paraheading QA:
bodytext ISO9001-2008, AS9120B,
paraheading Export Percentage:
bodytext 5 %
paraheading Industry Categories:
bodytext AerospaceLand (Vehicles, etc)LogisticsMarineProcurement
paraheading Company Email:
bodytext lisa@aerospacematerials.com.au
paraheading Company Website:
bodytext http://www.aerospacematerials.com.au
paraheading Office:
bodytext 2/6 Ovata Drive Tullamarine VIC 3043
paraheading Post:
bodytext PO Box 188 TullamarineVIC 3043
paraheading Phone:
bodytext +61.3. 9464 4455
paraheading Fax:
bodytext +61.3. 9464 4422
My next question is: what is the best way to get this data into a CSV file that is suitable for importing into a spreadsheet? For example, having things like "ABN", "ACN", "Company Website" etc. as column headings, with the corresponding data as the row values.
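As a rough sketch of the layout I have in mind, using only the standard library `csv` module (Python 3): each company becomes one dict keyed by the heading text, and `csv.DictWriter` writes the headings as the first row. The `pairs` list is a hand-copied subset of the output above, and `pairs_to_row` and `write_csv` are hypothetical helper names, not part of any library.

```python
import csv
import io

# Label/value pairs copied by hand from the output above (a subset).
pairs = [
    ("ACN", "007 350 807"),
    ("ABN", "71 007 350 807"),
    ("Annual Turnover", "$5M - $10M"),
    ("Company Website", "http://www.aerospacematerials.com.au"),
]


def pairs_to_row(pairs):
    """Turn (label, value) pairs into one dict, i.e. one CSV row.

    Pairs with an empty label (such as the "Australian Owned" line) are skipped.
    """
    return {label: value for label, value in pairs if label}


def write_csv(rows, fieldnames, out):
    """Write a header line followed by one line per company dict."""
    writer = csv.DictWriter(
        out,
        fieldnames=fieldnames,
        restval="",             # blank cell when a company lacks a field
        extrasaction="ignore",  # drop fields that are not in the header
    )
    writer.writeheader()
    writer.writerows(rows)


row = pairs_to_row(pairs)
fieldnames = ["ACN", "ABN", "Annual Turnover", "Company Website"]

# StringIO stands in for a real file here; for a file on disk,
# open(path, "w", newline="") is what the csv docs recommend.
buffer = io.StringIO()
write_csv([row], fieldnames, buffer)
csv_text = buffer.getvalue()
print(csv_text)
```

Fixing the column list up front, rather than taking it from the first company scraped, should keep the columns aligned when a later page is missing a field or has an extra one.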