I am trying to parse the source of a loaded web page to get a list of links. A one-liner would be fine. Here is what I have tried so far:
$ grep -oP '\b(([\w-]+://?|domain[.]org)[^\s()<>]+(?:\([\w\d]+\)|([^[:punct:]\s]|/)))' file.html | sort -u -t/ -k3

(The regex uses Perl constructs like `(?:...)`, so it needs `grep -P` rather than `-E`.) However, some URLs seem to be missing from the output.
This gets the whole URL, but I don't want to include links that contain a `#fragment` or are pure in-page anchors. I also want to be able to restrict matches to a prefix such as domain.org/folder/. Here is another attempt:
$ awk 'BEGIN{ RS="</a>"; IGNORECASE=1 } { for(o=1;o<=NF;o++){ if ($o ~ /href/) { gsub(/.*href=\042/,"",$o); gsub(/\042.*/,"",$o); print $o } } }' file.html

(Note: the statements inside `BEGIN` and the loop need semicolons, and both the multi-character `RS` and `IGNORECASE` are gawk extensions.)
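To show the kind of result I'm after, here is a minimal sketch that combines the two ideas: pull out `href` values, drop fragments and in-page anchors, and keep only URLs under a given prefix. It assumes GNU grep (for `-P`) and GNU sed; `file.html` here is a made-up sample standing in for the real page.

```shell
# Hypothetical sample input standing in for the real page source.
cat > file.html <<'EOF'
<a href="https://domain.org/folder/page1">one</a>
<a href="#top">anchor</a>
<a href="https://other.org/x">other</a>
<a href="https://domain.org/folder/page2#frag">two</a>
EOF

# Extract href values (\K discards the matched 'href="' prefix),
# strip any #fragment, keep only URLs under the wanted prefix,
# and de-duplicate the result.
grep -oP 'href="\K[^"]+' file.html \
  | sed 's/#.*//' \
  | grep '^https\?://domain\.org/folder/' \
  | sort -u
# → https://domain.org/folder/page1
# → https://domain.org/folder/page2
```

Pure anchors like `#top` become empty lines after the `sed` step and are discarded by the prefix filter, so no separate pass is needed for them.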