Spider a website and return URLs only

I am looking for a way to pseudo-spider a website. The key is that I don't actually want the content, but rather a simple list of URIs. I can get reasonably close to this with Wget using --spider, but when I pipe that output through grep, I can't seem to find the right magic to make it work:

 wget --spider --force-html -r -l1 http://somesite.com | grep 'Saving to:' 

The grep filter seems to have no effect on the wget output at all. Have I got something wrong, or is there another tool I should try that is more geared towards producing this kind of limited result set?

UPDATE

So, I just found out offline that, by default, wget writes to stderr. I missed that in the man pages (in fact, I still haven't found it, if it is there). Once I redirected stderr to stdout, I got closer to what I need:

 wget --spider --force-html -r -l1 http://somesite.com 2>&1 | grep 'Saving to:' 

I would still be interested in other/better means of doing this kind of thing, if any exist.

+44
uri grep web-crawler wget
May 10 '10 at 4:37 p.m.
4 answers

The absolute last thing I want to do is download and parse all of the content myself (i.e. create my own spider). Once I learned that Wget writes to stderr by default, I was able to redirect it to stdout and filter the output appropriately.

 wget --spider --force-html -r -l2 $url 2>&1 \
   | grep '^--' | awk '{ print $3 }' \
   | grep -v '\.\(css\|js\|png\|gif\|jpg\)$' \
   > urls.m3u

This gives me a list of the content resource URIs (resources that are not images, CSS or JS source files) that get spidered. From there, I can send the URIs off to a third-party tool for processing to suit my needs.

The output still needs to be cleaned up a bit (it produces duplicates as shown above), but it is almost there, and I did not have to do any parsing myself.
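For what it's worth, a minimal way to tidy up those duplicates (my own sketch, not part of the original answer, using the same $url placeholder) is to append sort -u to the pipeline before writing the list out:

 # sort -u removes duplicate URIs before they reach urls.m3u
 wget --spider --force-html -r -l2 $url 2>&1 \
   | grep '^--' | awk '{ print $3 }' \
   | grep -v '\.\(css\|js\|png\|gif\|jpg\)$' \
   | sort -u \
   > urls.m3u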

+36
May 14 '10 at 11:52 a.m.

Create multiple regular expressions to extract addresses from all

 <a href="(ADDRESS_IS_HERE)">. 

Here is the solution I would use:

 wget -q http://example.com -O - | \
     tr "\t\r\n'" ' "' | \
     grep -i -o '<a[^>]\+href[ ]*=[ \t]*"\(ht\|f\)tps\?:[^"]\+"' | \
     sed -e 's/^.*"\([^"]\+\)".*$/\1/g'

This will extract all http, https, ftp and ftps links from the web page. It will not give you relative URLs, only full URLs (see the sketch after the example output below for one rough way to pick those up).

Explanation of the options used in the series of piped commands:

wget -q suppresses excessive output (quiet mode). wget -O - makes the downloaded file echo to stdout rather than be saved to disk.

tr is the unix character translator, used in this example to translate newlines and tabs to spaces, and to convert single quotes to double quotes so that we can simplify our regular expressions.

grep -i gives us a case-insensitive search; grep -o prints only the matching parts.

sed is the Stream EDitor unix utility, which allows for filtering and transformation operations.

sed -e just lets you feed it an expression.

Running this little script on http://craigslist.org gave a pretty long list of links:

 http://blog.craigslist.org/
 http://24hoursoncraigslist.com/subs/nowplaying.html
 http://craigslistfoundation.org/
 http://atlanta.craigslist.org/
 http://austin.craigslist.org/
 http://boston.craigslist.org/
 http://chicago.craigslist.org/
 http://cleveland.craigslist.org/
 ...
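As a rough follow-on sketch (mine, not the answer's): if you also want the relative links that the pipeline above drops, you can loosen the grep so it captures every href value, then prefix root-relative paths with a base URL of your choosing (http://example.com here is just a placeholder). This only handles links that start with /, not ones like page.html or ../page.html:

 base="http://example.com"
 wget -q "$base" -O - | \
     tr "\t\r\n'" ' "' | \
     grep -i -o 'href[ ]*=[ ]*"[^"]\+"' | \
     sed -e 's/^.*"\([^"]\+\)".*$/\1/g' | \
     sed -e "s|^/|$base/|"   # turn /path into http://example.com/path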
+19
May 10 '10 at 17:06

I used a tool called xidel

 xidel http://server -e '//a/@href' | grep -v "http" | sort -u \
     | xargs -L1 -I {} xidel http://server/{} -e '//a/@href' \
     | grep -v "http" | sort -u

A bit hackish, but it gets you closer! This only goes one level deep, but imagine packing it into a self-recursive script!
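To sketch that out (my own rough take, assuming xidel is on the PATH and that the site's internal links are root-relative paths, just as the one-liner above already assumes), a self-recursive wrapper could look something like this:

 #!/usr/bin/env bash
 # Rough sketch: no visited-URL bookkeeping, so pages may be fetched more than once.
 crawl() {
     local base="$1" path="$2" depth="$3"
     [ "$depth" -le 0 ] && return
     xidel "$base$path" -e '//a/@href' 2>/dev/null \
         | grep -v "http" | sort -u \
         | while read -r link; do
               echo "$base$link"
               crawl "$base" "$link" $((depth - 1))
           done
 }

 crawl "http://server" "/" 2 | sort -u

Like the one-liner, this filters out absolute (http...) links, so off-site URLs will not show up in the output.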

+4
May 9 '13 at 16:37

See this question/answer for another way to do this with a Python script: How to use the Python Scrapy module to list all the URLs from my website?

+1
Mar 06


