Method 1
Step 1:
wget "http://www.cse.psu.edu/~groenvel/urls.html"
Step 2:
perl -0ne 'print "$1\n" while (/a href=\"(.*?)\">.*?<\/a>/igs)' /PATH_TO_YOUR/urls.html | grep 'http://' > /PATH_TO_YOUR/urls.txt
Just replace "/PATH_TO_YOUR/" with your file path. This will give you a text file containing only the URLs.
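If you'd rather skip the intermediate file, the two steps can be combined by piping wget straight into perl (a sketch of the same idea, using the -qO- flags from Method 4 to write the page to stdout):
wget -qO- "http://www.cse.psu.edu/~groenvel/urls.html" | perl -0ne 'print "$1\n" while (/a href=\"(.*?)\">.*?<\/a>/igs)' | grep 'http://' > /PATH_TO_YOUR/urls.txt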
Method 2
If you have lynx installed, you can just do it in one step:
Step 1:
lynx --dump http://www.cse.psu.edu/~groenvel/urls.html | awk '/(http|https):\/\// {print $2}' > /PATH_TO_YOUR/urls.txt
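If your lynx build supports the -listonly option, it drops the rendered page text and prints only the numbered references list, so awk only ever sees link lines (a variant sketch, assuming -listonly is available):
lynx -dump -listonly http://www.cse.psu.edu/~groenvel/urls.html | awk '/(http|https):\/\// {print $2}' > /PATH_TO_YOUR/urls.txt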
Method 3
Using curl:
Step 1:
curl -s http://www.cse.psu.edu/~groenvel/urls.html | egrep -o "(http|https):.*\">" | awk 'BEGIN {FS="\""};{print $1}' > /PATH_TO_YOUR/urls.txt
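A tighter pattern that stops at the closing quote can replace the awk step entirely; a minimal sketch, assuming a grep that supports -E and -o (equivalent to the egrep -o used above):
curl -s http://www.cse.psu.edu/~groenvel/urls.html | grep -Eo '(http|https)://[^"]+' > /PATH_TO_YOUR/urls.txt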
Method 4
Using wget:
wget -qO- http://www.cse.psu.edu/~groenvel/urls.html | egrep -o "(http|https):.*\">" | awk 'BEGIN {FS="\""};{print $1}' > /PATH_TO_YOUR/urls.txt
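If the page links to the same address more than once, appending sort -u gives a de-duplicated (and sorted) list; this is an optional extra step, not part of the original command:
wget -qO- http://www.cse.psu.edu/~groenvel/urls.html | egrep -o "(http|https):.*\">" | awk 'BEGIN {FS="\""};{print $1}' | sort -u > /PATH_TO_YOUR/urls.txt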