Using awk, sed, or grep to parse URLs from a web page source

I am trying to parse the source of a downloaded web page to get a list of links. A one-liner would work fine. Here is what I have tried so far:

It looks like some URLs are missing from the results.

$ cat file.html | grep -oP '\b(([\w-]+://?|domain[.]org)[^\s()<>]+(?:\([\w\d]+\)|([^[:punct:]\s]|/)))' | sort -ut/ -k3

This gets the whole URL, but I don't want to include anchor links. I also want to be able to restrict matches to domain.org/folder/:

 $ awk 'BEGIN{ RS="</a>"; IGNORECASE=1 } { for(o=1;o<=NF;o++){ if ( $o ~ /href/){ gsub(/.*href=\042/,"",$o); gsub(/\042.*/,"",$o); print $o } } }' file.html 
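One way to fold both requirements into the awk attempt above is to test the extracted href before printing. This is only a sketch: domain.org/folder/ stands in for whatever prefix you want, and IGNORECASE is gawk-specific:

 $ awk 'BEGIN{ RS="</a>"; IGNORECASE=1 } { for(o=1;o<=NF;o++){ if ($o ~ /href/){ gsub(/.*href=\042/,"",$o); gsub(/\042.*/,"",$o); if ($o !~ /^#/ && $o ~ /^(https?:\/\/)?domain\.org\/folder\//) print $o } } }' file.html 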
2 answers

If you are only parsing something like <a> tags, you can simply match the href attribute as follows:

 $ cat file.html | grep -o -E 'href="([^"#]+)"' | cut -d'"' -f2 | sort | uniq 

This ignores anchors and also ensures uniqueness. It assumes the page has well-formed (X)HTML, but you can pass it through Tidy first.
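For example, a minimal sketch assuming GNU grep and HTML Tidy are installed (-q suppresses Tidy's messages, -asxhtml emits well-formed XHTML, and warnings are discarded to /dev/null):

 $ tidy -q -asxhtml file.html 2>/dev/null | grep -o -E 'href="([^"#]+)"' | cut -d'"' -f2 | sort | uniq 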

 $ lynx -dump http://www.ibm.com 

Look for the "References" section in the output. Post-process with sed if you need to.
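For instance, a minimal sed post-process, assuming the dump ends with lynx's numbered References list (exact formatting varies by lynx version):

 $ lynx -dump http://www.ibm.com | sed -n '/^References$/,$s/^ *[0-9][0-9]*\. //p' 

If your lynx build supports it, lynx -dump -listonly http://www.ibm.com prints just the link list without the page text.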

From time to time, using another tool simplifies the job. This is one of those cases.


Source: https://habr.com/ru/post/1344466/

