Clearing Blog Content

Question


After collecting URLs for various blogs (Tumblr and WordPress pages among them), I ran into problems processing the HTML. I want to extract the content, title and date of each blog post. I could get the date with a regex, but people use so many custom themes and scripts that the classes and HTML structure differ widely from site to site.

Does anyone have a solution that might help?

+3

python

goh Jun 17 '10 at 2:41


3 answers

If the blog offers an RSS or Atom feed, use it: XML is easier to parse than HTML, and Universal Feed Parser handles feeds in Python.

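If there is no feed and you have to parse HTML (!), BeautifulSoup (the 3.0.* series rather than 3.1, which copes worse with malformed markup) is designed for the kind of HTML found in the wild (broken, unclosed, or otherwise invalid HTML). lxml, which @Hank recommends, is the alternative to BeautifulSoup. ;-)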

+3

Alex Martelli Jun 17 '10 at 2:58

I think you should change your approach. Instead of parsing the HTML page, why not parse the RSS feed? WordPress has feeds built in, and they already contain the information you need, such as titles, authors, dates, etc.

You can use regex to parse the RSS feed, or use an existing Python module such as Universal Feed Parser.
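As a rough illustration of why a feed is easier to work with than a page, here is a minimal sketch that pulls title, date and content out of an RSS 2.0 document. It uses only the standard library (`xml.etree.ElementTree`) rather than Universal Feed Parser, and the sample feed and the function name `parse_rss_items` are made up for the example. It assumes well-formed RSS 2.0; Atom feeds and broken feeds need more handling, which is what a dedicated feed library is for.

```python
import xml.etree.ElementTree as ET
from email.utils import parsedate_to_datetime

# Hypothetical sample feed; a real one would be downloaded from the blog's feed URL.
SAMPLE_RSS = """<rss version="2.0">
  <channel>
    <title>Example Blog</title>
    <item>
      <title>First post</title>
      <pubDate>Thu, 17 Jun 2010 02:41:00 +0000</pubDate>
      <description>Hello, world.</description>
    </item>
    <item>
      <title>Second post</title>
      <pubDate>Fri, 18 Jun 2010 09:00:00 +0000</pubDate>
      <description>More text.</description>
    </item>
  </channel>
</rss>"""


def parse_rss_items(xml_text):
    """Return one dict (title, date, content) per RSS <item> element."""
    root = ET.fromstring(xml_text)
    posts = []
    for item in root.iter("item"):
        raw_date = item.findtext("pubDate")
        posts.append({
            "title": item.findtext("title", default=""),
            # RSS 2.0 dates use the RFC 822 format, which email.utils can parse.
            "date": parsedate_to_datetime(raw_date) if raw_date else None,
            "content": item.findtext("description", default=""),
        })
    return posts


for post in parse_rss_items(SAMPLE_RSS):
    print(post["date"], post["title"])
```

No per-site class names or regexes are involved: the same three element names work for any feed that follows the format.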

+1

Benjamin intal Jun 17 '10 at 2:59


Hank Gay · Accepted Answer · 2010-06-17T02:46:38+0000

Do not use regex. Use a parser. lxml is very fast.

Also, look for an Atom or RSS feed first; it will be more structured and easier to parse than the HTML.

UPDATE:

Look for a <link> element in the HTML page. It will look something like this (the type attribute differs for Atom, RSS, etc.):

<link rel="alternate" type="application/atom+xml" title="My Weblog feed" href="/feed/" />

It should be in the <head> of the page. Then use Universal Feed Parser, as @Alex Martelli suggests.
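Finding that <link> element can be sketched with a parser instead of a regex. The version below uses the standard library's `html.parser` as a stand-in for lxml so that it runs without extra installs. The class name `FeedLinkFinder`, the helper `find_feed_links` and the sample page are assumptions made for illustration.

```python
from html.parser import HTMLParser

# MIME types commonly used to advertise Atom and RSS feeds.
FEED_TYPES = {"application/atom+xml", "application/rss+xml"}


class FeedLinkFinder(HTMLParser):
    """Collect href values of <link rel="alternate"> tags that advertise a feed."""

    def __init__(self):
        super().__init__()
        self.feed_links = []

    def handle_starttag(self, tag, attrs):
        if tag != "link":
            return
        attr = dict(attrs)
        rel = (attr.get("rel") or "").lower().split()
        feed_type = (attr.get("type") or "").lower()
        if "alternate" in rel and feed_type in FEED_TYPES and attr.get("href"):
            self.feed_links.append(attr["href"])


def find_feed_links(html_text):
    """Return the feed URLs (possibly relative) advertised by an HTML page."""
    finder = FeedLinkFinder()
    finder.feed(html_text)
    return finder.feed_links


# Hypothetical page containing the <link> element shown above.
SAMPLE_HTML = """<html><head>
<title>My Weblog</title>
<link rel="stylesheet" type="text/css" href="/style.css" />
<link rel="alternate" type="application/atom+xml" title="My Weblog feed" href="/feed/" />
</head><body><p>Post text</p></body></html>"""

print(find_feed_links(SAMPLE_HTML))
```

The href is often relative, as it is here, so it would need to be resolved against the page URL (for example with `urllib.parse.urljoin`) before fetching the feed.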

, PyCon.
