URL tree walker in Python?

For URLs that display file trees, such as PyPI packages, is there a small, solid module to walk the URL tree and list it like ls -lR?
I gather (correct me if I'm wrong) that there is no standard encoding of file attributes, link types, size, date, etc. in HTML <a> attributes, so building a solid URLtree module on such shifting sands is tough.
But surely this wheel (Unix file tree -> HTML -> treewalk API -> ls -lR or find) has been invented already?
(There seem to be a few spiders / web crawlers / scrapers out there, but so far they look ugly and ad hoc, despite having BeautifulSoup for parsing.)

+3
3 answers

Apache servers are very common, and they have a relatively standard way of listing file directories.
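
For reference, here's roughly what an Apache (mod_autoindex) listing row looks like and what the script's regex pulls out of it. The markup below is an illustrative sample, not output from any particular server:

import re

# An illustrative Apache-style listing row (assumed format, not real output):
sample = ('<tr><td><a href="Achoo-1.0-py2.5.egg">Achoo-1.0-py2.5.egg</a></td>'
          '<td align="right">11-Aug-2008 07:40  </td>'
          '<td align="right">8.9K</td></tr>')

# Same pattern as the script below: a link + a timestamp + a size ('-' for dir)
parse_re = re.compile(r'href="([^"]*)".*(..-...-.... ..:..).*?(\d+[^\s<]*|-)')
print(parse_re.findall(sample))
# -> [('Achoo-1.0-py2.5.egg', '11-Aug-2008 07:40', '8.9K')]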

Here's a fairly simple script that handles them; you should be able to adapt it to your needs.

Usage: python list_apache_dir.py <url> [<url> ...]

import sys
import urllib.request
import re

# look for a link + a timestamp + a size ('-' for a directory)
parse_re = re.compile(r'href="([^"]*)".*(..-...-.... ..:..).*?(\d+[^\s<]*|-)')

def list_apache_dir(url):
    try:
        html = urllib.request.urlopen(url).read().decode('utf-8', 'replace')
    except IOError as e:
        print('error fetching %s: %s' % (url, e))
        return
    if not url.endswith('/'):
        url += '/'
    files = parse_re.findall(html)
    dirs = []
    print(url + ' :')
    print('%4d file' % len(files) + 's' * (len(files) != 1))
    for name, date, size in files:
        if size.strip() == '-':
            size = 'dir'
        if name.endswith('/'):
            dirs += [name]
        print('%5s  %s  %s' % (size, date, name))

    # recurse into any subdirectories collected above
    for dir in dirs:
        print()
        list_apache_dir(url + dir)

for url in sys.argv[1:]:
    print()
    list_apache_dir(url)
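
A hypothetical invocation (the URL is illustrative; real output depends on the server's listing format, and the script recurses into any subdirectories it finds):

python list_apache_dir.py http://example.org/packages/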
+3

Others have recommended BeautifulSoup, but it is much better to use lxml. Despite its name, lxml is also designed to parse and clean HTML. It is much, much faster than BeautifulSoup. It even has a compatibility API for BeautifulSoup if you don't want to learn the lxml API.

Ian Bicking agrees.

There's no reason to use BeautifulSoup anymore, unless you're on Google App Engine or somewhere that anything not purely Python isn't allowed.

It has CSS selectors as well, so this sort of thing is trivial.
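
For instance, pulling the rows out of a listing table with CSS selectors takes just a few lines. A minimal sketch, assuming lxml plus the cssselect package are installed and Apache-style table markup (the sample HTML here is illustrative; a real page would be fetched first):

import lxml.html

# Illustrative Apache-style listing markup (an assumption, not real output):
html = '''<table>
  <tr><th>Name</th><th>Last modified</th><th>Size</th></tr>
  <tr><td><a href="Achoo-1.0-py2.5.egg">Achoo-1.0-py2.5.egg</a></td>
      <td>11-Aug-2008 07:40</td><td>8.9K</td></tr>
</table>'''

doc = lxml.html.fromstring(html)
for row in doc.cssselect("tr"):
    # text_content() flattens each cell, links included
    print([cell.text_content().strip() for cell in row.cssselect("th, td")])
# -> ['Name', 'Last modified', 'Size']
#    ['Achoo-1.0-py2.5.egg', '11-Aug-2008 07:40', '8.9K']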

+1

It turns out that BeautifulSoup one-liners like these can turn <table> rows into Python lists --

from bs4 import BeautifulSoup

def trow_cols( trow ):
    """ one row of soup.table( "tr" ) -> <td> column strings like
        [None, 'Achoo-1.0-py2.5.egg', '11-Aug-2008 07:40  ', '8.9K']
    """
    return [td.next.string for td in trow( "td" )]

def trow_headers( trow ):
    """ one row of soup.table( "tr" ) -> <th> table header strings like
        [None, 'Name', 'Last modified', 'Size', 'Description']
    """
    return [th.next.string for th in trow( "th" )]

if __name__ == "__main__":
    ...  # fetch the listing page into `html` first
    soup = BeautifulSoup( html, "html.parser" )
    if soup.table:
        trows = soup.table( "tr" )
        print( "headers:", trow_headers( trows[0] ) )
        for row in trows[1:]:
            print( trow_cols( row ) )
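
To run it end to end, html just needs to hold the listing page; a minimal sketch, assuming Python 3's standard library and an illustrative URL:

import urllib.request

html = urllib.request.urlopen("http://example.org/packages/").read()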

Compared to sysrqb's one-line regex above, this is ... longer; who said

"You can parse some of the html all of the time, or all of the html some of the time, but not ..."

0

Source: https://habr.com/ru/post/1705414/

