Create one or more regular expressions to extract the addresses from all
<a href="(ADDRESS_IS_HERE)"> tags.
Here is the solution I would use:
wget -q http://example.com -O - | \
    tr "\t\r\n'" ' "' | \
    grep -i -o '<a[^>]\+href[ ]*=[ \t]*"\(ht\|f\)tps\?:[^"]\+"' | \
    sed -e 's/^.*"\([^"]\+\)".*$/\1/g'
This will extract all http, https, ftp and ftps links from the web page. It will not give you relative URLs, only full URLs.
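If you also need relative URLs, one option is to loosen the grep so it accepts any quoted href value instead of only absolute http/https/ftp/ftps ones. This is an untested sketch, not part of the original answer:

wget -q http://example.com -O - | \
    tr "\t\r\n'" ' "' | \
    grep -i -o '<a[^>]\+href[ ]*=[ \t]*"[^"]\+"' | \
    sed -e 's/^.*"\([^"]\+\)".*$/\1/g'

You would then still have to resolve each relative path against the page's base URL yourself.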
An explanation of the options used in each command of the pipeline:
wget -q suppresses wget's usual output (quiet mode). wget -O - writes the downloaded file to stdout instead of saving it to disk.
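As a quick sanity check that the page really is streamed to stdout rather than written to a file (example.com here is just a placeholder URL):

wget -q -O - http://example.com | head -n 3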
tr is the Unix character translator, used in this example to translate newlines and tabs to spaces, and to convert single quotes into double quotes, so that the regular expressions can be kept simple.
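For instance, an anchor tag written with single quotes gets normalized so that the later stages only ever see double quotes (the input line is made up for illustration):

printf "<a\thref='x'>" | tr "\t\r\n'" ' "'

This prints <a href="x">: the tab became a space and both single quotes became double quotes.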
grep -i makes the search case-insensitive. grep -o prints only the matching parts.
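Taken on its own, the grep stage pulls out the opening tag up to and including the closing quote of the URL. With an invented sample line:

echo '<A HREF="http://example.com/page">here</a>' | grep -i -o '<a[^>]\+href[ ]*=[ \t]*"\(ht\|f\)tps\?:[^"]\+"'

this prints <A HREF="http://example.com/page" (note that -i let it match the uppercase tag).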
sed is the Stream EDitor, a Unix utility for filtering and transforming text.
sed -e simply passes sed the expression to apply.
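The sed stage then throws away everything except what sits between the quotes. Feeding it a line shaped like the grep output above:

echo '<A HREF="http://example.com/page"' | sed -e 's/^.*"\([^"]\+\)".*$/\1/g'

prints just http://example.com/page.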
Running this little script against http://craigslist.org gave quite a long list of links:
http://blog.craigslist.org/
http://24hoursoncraigslist.com/subs/nowplaying.html
http://craigslistfoundation.org/
http://atlanta.craigslist.org/
http://austin.craigslist.org/
http://boston.craigslist.org/
http://chicago.craigslist.org/
http://cleveland.craigslist.org/
...
Jay Taylor May 10 '10 at 17:06