Pairing content on an external website with records in a MySQL database

tl;dr: I am looking for a way to find records in our database that lack information, fetch that information from a website, and add it to the database record.


We have a media management program that uses a MySQL table to store information. When employees upload media files (video files, images, audio files) and import them into the media manager, they are also supposed to copy the media description (from the source site) and add it to the description field in the media manager. However, this was not done for thousands of files.

The file name (for example, file123.mov) is unique, and the details page for the file can be reached via a URL on the original website:

website.com/content/file123

The information we want to scrape from this page sits in an element whose ID is always the same.

As I see it, the process would be as follows:

  • Connect to the database and load the table
  • Filter: "format" is "Still Image (JPEG)"
  • Filter: "description" is NULL
  • Get the first result
  • Get "FILENAME" (without extension)
  • Fetch the URL: website.com/content/FILENAME
  • Copy the contents of the "description" element (on the website)
  • Paste the contents into "description" (SQL record)
  • Get the second result
  • Rinse and repeat until the last result
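One detail worth flagging in the two filter steps above: in SQL, a missing value must be matched with IS NULL, never with = NULL (which matches nothing). A minimal sketch of the filtering query, using an in-memory SQLite table as a hypothetical stand-in for the real MySQL table (table and column names are made up for illustration):

```python
import sqlite3

# In-memory stand-in for the real MySQL table (hypothetical schema).
con = sqlite3.connect(':memory:')
con.execute("CREATE TABLE media (filename TEXT, format TEXT, description TEXT)")
con.executemany("INSERT INTO media VALUES (?, ?, ?)", [
    ('file123.mov', 'Video (QuickTime)',  'already described'),
    ('img001.jpg',  'Still Image (JPEG)', None),
    ('img002.jpg',  'Still Image (JPEG)', 'already described'),
])

# Filter steps: match the format, and require a missing description.
# Note IS NULL -- "description = NULL" would silently match zero rows.
rows = con.execute(
    "SELECT filename FROM media "
    "WHERE format = ? AND description IS NULL",
    ('Still Image (JPEG)',)
).fetchall()
print(rows)  # [('img001.jpg',)]
```

Only img001.jpg comes back: it is the only JPEG row whose description is missing.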

My question(s):

  • Is there software that could do this, or is it something that needs to be scripted?
  • If a script, what would be the best type of script? (For example, could I achieve this with AppleScript, or would it need to be done in Java or PHP, etc.?)
3 answers

I don't know of any existing software package that will do everything you're looking for, either. However, Python can connect to your database, make web requests easily, and handle messy HTML. Assuming you already have Python installed, you will need three packages: MySQLdb, Requests, and BeautifulSoup (the three imported in the script below).

You can install these packages with pip or with the Windows installers; the relevant instructions are on each project's site. The whole process should take no more than 10 minutes.

    import MySQLdb as db
    import os.path

    import requests
    from bs4 import BeautifulSoup

    # Connect to the database. Fill in these fields as necessary.
    con = db.connect(host='hostname', user='username',
                     passwd='password', db='dbname')

    # Create and execute our SELECT sql statement. Note that MySQLdb
    # uses %s placeholders, and missing values must be matched with
    # IS NULL rather than "= NULL".
    select = con.cursor()
    select.execute('SELECT filename FROM table_name '
                   'WHERE format = %s AND description IS NULL',
                   ('Still Image (JPEG)',))

    # A second cursor, on the same connection, for the updates.
    update = con.cursor()

    while True:
        # Fetch a row from the result of the SELECT statement.
        row = select.fetchone()
        if row is None:
            break

        # Use the built-in os.path.splitext to strip the extension
        # and get the url_name.
        filename = row[0]
        url_name = os.path.splitext(filename)[0]
        url = 'http://www.website.com/content/' + url_name

        # Make the web request. You may want to rate-limit your requests
        # so that the website doesn't get angry. You can slow down the
        # rate by inserting a pause with:
        #
        #   import time   # put this at the top with the other imports
        #   time.sleep(1) # this will wait 1 second
        response = requests.get(url)
        if response.status_code != 200:
            # Don't worry about skipped urls. Just re-run this script
            # on spurious or network-related errors.
            print('Error accessing:', url, 'SKIPPING')
            continue

        # Parse the result. BeautifulSoup does a great job handling
        # malformed input.
        soup = BeautifulSoup(response.content, 'html.parser')
        description = soup.find('div', {'id': 'description'}).get_text(strip=True)

        # And finally, update the database with another query.
        update.execute('UPDATE table_name SET description = %s '
                       'WHERE filename = %s',
                       (description, filename))

    # Make the updates permanent.
    con.commit()

Fair warning: I have done my best to make this code correct, but I have not tested it. You will need to fill in your own connection details.
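A small detail from the script above worth knowing: `os.path.splitext` splits on the last dot only, which is exactly what is needed to turn a filename like `file123.mov` into the URL fragment.

```python
import os.path

# splitext returns a (root, extension) pair.
name, ext = os.path.splitext('file123.mov')
print(name)  # file123
print(ext)   # .mov

# Dots earlier in the name are preserved; only the final extension is split off.
print(os.path.splitext('my.file123.mov')[0])  # my.file123
```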

  • Is there any software that could accomplish such a task, or is it something that needs to be scripted?

    I don't know of anything that will do what you want out of the box (and even if there were, the configuration required would probably not be much less work than scripting your own solution).

  • If a script, what would be the best type of script? (For example, could I achieve this with AppleScript, or would it need to be done in Java or PHP, etc.?)

    AppleScript cannot connect to databases, so you would definitely need to throw something else into the mix. If the choice is between Java and PHP (and you are equally familiar with both), I would definitely recommend PHP for this purpose, since considerably less code will be involved.

    Your PHP script will look something like this:

        <?php
        $BASEURL = 'http://website.com/content/';

        // connect to the database
        $dbh = new PDO($DSN, $USERNAME, $PASSWORD);

        // query for files without descriptions
        $qry = $dbh->query("
            SELECT FILENAME
            FROM mytable
            WHERE format = 'Still Image (JPEG)'
              AND description IS NULL
        ");

        // prepare an update statement
        $update = $dbh->prepare('
            UPDATE mytable
            SET description = :d
            WHERE FILENAME = :f
        ');
        $update->bindParam(':d', $DESCRIPTION);
        $update->bindParam(':f', $FILENAME);

        // suppress parser warnings from real-world (malformed) HTML
        libxml_use_internal_errors(true);

        // loop over the files
        while ($FILENAME = $qry->fetchColumn()) {
            // construct URL, stripping the extension if there is one
            $i = strrpos($FILENAME, '.');
            $url = $BASEURL . (($i === false) ? $FILENAME : substr($FILENAME, 0, $i));

            // fetch the document
            $doc = new DOMDocument();
            $doc->loadHTMLFile($url);

            // get the description
            $DESCRIPTION = $doc->getElementById('description')->nodeValue;

            // update the database
            $update->execute();
        }

PHP is good for scraping. I made a class that wraps PHP's cURL extension here:

http://semlabs.co.uk/journal/object-oriented-curl-class-with-multi-threading

You will probably need to set some of the cURL options:

http://www.php.net/manual/en/function.curl-setopt.php

To scrape the HTML, I usually use regular expressions, but here is a class I made that should be able to query HTML without problems:

http://pastebin.com/Jm9jKjAU

Usage:

    $h = new HTMLQuery();
    $h->load( $string_containing_html );
    $h->getElements( 'p', 'id' ); // Returns all p tags with an id attribute

XPath is the best option for scraping, but it cannot handle dirty HTML. You can use it to do things like:

    //div[@class='itm']/p[last() and text()='Hello World']  <- selects the last p from div elements whose innerHTML is 'Hello World'

You can use XPath in PHP via the built-in DOM classes.
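For comparison with the PHP approach, here is a hedged sketch of the same kind of XPath query in Python, using the standard library's ElementTree (which supports only a limited XPath subset and, like DOM-based parsers, expects well-formed markup; the sample markup is invented for illustration):

```python
import xml.etree.ElementTree as ET

# Well-formed markup only: ElementTree will reject dirty HTML outright.
markup = "<root><div class='itm'><p>Hi</p><p>Hello World</p></div></root>"
root = ET.fromstring(markup)

# Select the last <p> inside <div class='itm'> elements.
# ElementTree supports [@attr='value'] and [last()] predicates.
last_p = root.find(".//div[@class='itm']/p[last()]")
print(last_p.text)  # Hello World
```

For dirty real-world HTML you would first need a forgiving parser (such as BeautifulSoup, used in the Python answer above) rather than feeding the page to ElementTree directly.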


Source: https://habr.com/ru/post/916420/

