Using awk, sed, or grep to parse URLs from a web page source

I am trying to parse the source of a downloaded web page to get a list of links. A one-liner would work fine. Here is what I have tried so far:

It looks like some URLs are missing from the results.

$ cat file.html | grep -oP '\b(([\w-]+://?|domain[.]org)[^\s()<>]+(?:\([\w\d]+\)|([^[:punct:]\s]|/)))' | sort -ut/ -k3

This gets the whole URL, but I don't want to include anchor links. I also want to be able to restrict matches to domain.org/folder/:

 $ awk 'BEGIN{ RS="</a>"; IGNORECASE=1 } { for(o=1;o<=NF;o++){ if ( $o ~ /href/){ gsub(/.*href=\042/,"",$o); gsub(/\042.*/,"",$o); print $o } } }' file.html 
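One way to fold both requirements into the awk attempt above is to test the extracted href before printing. This is only a sketch: domain.org/folder/ stands in for whatever prefix you want, and IGNORECASE is gawk-specific:

 $ awk 'BEGIN{ RS="</a>"; IGNORECASE=1 } { for(o=1;o<=NF;o++){ if ($o ~ /href/){ gsub(/.*href=\042/,"",$o); gsub(/\042.*/,"",$o); if ($o !~ /^#/ && $o ~ /^(https?:\/\/)?domain\.org\/folder\//) print $o } } }' file.html 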
2 answers

If you are only parsing something like <a> tags, you can simply match the href attribute as follows:

 $ cat file.html | grep -o -E 'href="([^"#]+)"' | cut -d'"' -f2 | sort | uniq 

This ignores anchors and also ensures uniqueness. It assumes the page has well-formed (X)HTML, but you can pass it through Tidy first.
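For example, a minimal sketch assuming GNU grep and HTML Tidy are installed (-q suppresses Tidy's messages, -asxhtml emits well-formed XHTML, and warnings are discarded to /dev/null):

 $ tidy -q -asxhtml file.html 2>/dev/null | grep -o -E 'href="([^"#]+)"' | cut -d'"' -f2 | sort | uniq 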

 $ lynx -dump http://www.ibm.com 

Look for the "References" section in the output. Post-process with sed if you need to.
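For instance, a minimal sed post-process, assuming the dump ends with lynx's numbered References list (exact formatting varies by lynx version):

 $ lynx -dump http://www.ibm.com | sed -n '/^References$/,$s/^ *[0-9][0-9]*\. //p' 

If your lynx build supports it, lynx -dump -listonly http://www.ibm.com prints just the link list without the page text.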

From time to time, using another tool simplifies the job. This is one of those cases.


Source: https://habr.com/ru/post/1344466/

