Clearing Blog Content

After collecting the URLs for various blogs (Tumblr and WordPress pages), I ran into some HTML processing issues. I want to separate the content, title and date of each blog post. I could get the date with a regex, but people use so many custom themes and scripts that the class names and HTML structure differ widely from one site to the next.

Does anyone have a solution that might help?

+3
3 answers

Do not use regex. Use a parser. lxml is very fast.

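Also, if the blogs offer Atom or RSS feeds, use those; they are more uniform than the generated HTML, and the title, date and content are already separated for you.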

UPDATE:

The <link> element in the HTML can be used to discover the feed. It looks something like this (the type differs for Atom, RSS, etc.):

<link rel="alternate" type="application/atom+xml" title="My Weblog feed" href="/feed/" />

It lives in the <head> of the page. Once you have the feed URL, you can parse it with Universal Feed Parser, as mentioned by @Alex Martelli.
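As a rough illustration of feed auto-discovery, here is a minimal sketch that scans a page for <link rel="alternate"> tags with an Atom or RSS type. It uses only the standard library's html.parser rather than lxml, and the names FeedLinkFinder and discover_feeds as well as the example.com base URL are made up for this example.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin

# MIME types that commonly advertise a feed (assumption: Atom and RSS only).
FEED_TYPES = {"application/atom+xml", "application/rss+xml"}


class FeedLinkFinder(HTMLParser):
    """Collect href values of <link rel="alternate"> tags that point to a feed."""

    def __init__(self):
        super().__init__()
        self.feeds = []

    def handle_starttag(self, tag, attrs):
        # Also called for self-closing tags such as <link ... />.
        if tag != "link":
            return
        attr = dict(attrs)
        rel = (attr.get("rel") or "").lower().split()
        mime = (attr.get("type") or "").lower()
        if "alternate" in rel and mime in FEED_TYPES and attr.get("href"):
            self.feeds.append(attr["href"])


def discover_feeds(html, base_url):
    """Return absolute feed URLs advertised in the given HTML text."""
    finder = FeedLinkFinder()
    finder.feed(html)
    # hrefs are often relative ("/feed/"), so resolve them against the page URL.
    return [urljoin(base_url, href) for href in finder.feeds]


page = (
    '<html><head><link rel="alternate" type="application/atom+xml" '
    'title="My Weblog feed" href="/feed/" /></head><body></body></html>'
)
print(discover_feeds(page, "http://example.com/blog/"))
```

The resulting URL can then be handed to a feed parser. A page with no feed link simply yields an empty list, in which case you are back to parsing the HTML itself.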

This topic has also come up at PyCon.

+2

If they are available, RSS and Atom feeds are XML and far more structured than HTML; parse them with Universal Feed Parser for Python.

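If there is no feed and you are forced to parse the HTML (!), use BeautifulSoup (the 3.0.* series rather than 3.1, which copes worse with malformed markup); it is built to handle the messy HTML found on real sites. lxml, which @Hank recommends, is faster than BeautifulSoup, but is it as forgiving of broken markup? ;-)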

+3

I think you should change your approach. Instead of parsing the HTML page, why not parse the RSS feed? WordPress has a built-in feed system, and the feed already contains the necessary information, such as titles, authors, dates, etc.

You can still use regex to parse RSS feeds, or use an existing Python module like Universal Feed Parser.
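To show why the feed is easier to work with than the page, here is a minimal sketch that pulls title, date and content out of an RSS 2.0 document using only the standard library. It is not Universal Feed Parser, which handles many more formats and edge cases; the sample feed and the parse_rss name are invented for the example.

```python
import xml.etree.ElementTree as ET
from email.utils import parsedate_to_datetime

# Made-up RSS 2.0 document standing in for a real blog feed.
SAMPLE_RSS = """<rss version="2.0">
  <channel>
    <title>Example Blog</title>
    <item>
      <title>First post</title>
      <pubDate>Mon, 06 Sep 2010 16:20:00 +0000</pubDate>
      <description>Hello, world.</description>
    </item>
  </channel>
</rss>"""


def parse_rss(xml_text):
    """Return a list of dicts with title, date and content for each <item>."""
    root = ET.fromstring(xml_text)
    posts = []
    for item in root.iter("item"):
        date_text = item.findtext("pubDate")
        posts.append({
            "title": item.findtext("title", default=""),
            # RSS 2.0 dates use the RFC 822 format, which email.utils can parse.
            "date": parsedate_to_datetime(date_text) if date_text else None,
            "content": item.findtext("description", default=""),
        })
    return posts


for post in parse_rss(SAMPLE_RSS):
    print(post["title"], post["date"].isoformat())
```

Atom feeds use different element names and XML namespaces, so this sketch would need a second code path for them; that is one reason a dedicated feed parser is the more practical choice.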

+1

Source: https://habr.com/ru/post/1750416/

