How to extract URLs from an HTML file?

I need a long list of valid URLs for testing my DNS server. I found a web page with a lot of links on it that should yield quite a few good ones (http://www.cse.psu.edu/~groenvel/urls.html), and I figured the easiest way would be to download the HTML file and simply grep for the URLs. However, I can't get it to output just the URLs.

I know there are many ways to do this, and I'm not picky about which one is used.

Given the above URL, I need a list of all the URLs (one per line), for example:

http://www.cse.psu.edu/~groenvel/
http://www.acard.com/
http://www.acer.com/
...
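
For reference, this is the direction I was trying with grep; a rough sketch assuming GNU grep and a local copy of the page saved as urls.html, though I'm not sure it catches everything:

grep -oE 'href="https?://[^"]*"' urls.html | cut -d'"' -f2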

2 answers

Method 1

Step 1:

wget "http://www.cse.psu.edu/~groenvel/urls.html"

Step 2:

perl -0ne 'print "$1\n" while (/a href=\"(.*?)\">.*?<\/a>/igs)' /PATH_TO_YOUR/urls.html | grep 'http://' > /PATH_TO_YOUR/urls.txt

Just replace /PATH_TO_YOUR/ with the path to your file. This gives you a text file containing only the URLs.
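
If you don't want to keep the downloaded file around, the two steps can be combined into a single pipeline; this is the same idea as above, just a sketch assuming wget and perl are available:

wget -qO- "http://www.cse.psu.edu/~groenvel/urls.html" | perl -0ne 'print "$1\n" while (/a href=\"(.*?)\">.*?<\/a>/igs)' | grep 'http://' > /PATH_TO_YOUR/urls.txt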

Method 2

If you have lynx installed, you can just do it in one step:

Step 1:

lynx --dump http://www.cse.psu.edu/~groenvel/urls.html | awk '/(http|https):\/\// {print $2}' > /PATH_TO_YOUR/urls.txt
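
If your lynx build supports the -listonly option, you can also have it print only the reference list rather than the whole rendered page; a variant sketch of the same command:

lynx -dump -listonly "http://www.cse.psu.edu/~groenvel/urls.html" | awk '/(http|https):\/\// {print $2}' > /PATH_TO_YOUR/urls.txt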

Method 3

Using curl:

Step 1:

curl http://www.cse.psu.edu/~groenvel/urls.html 2>&1 | egrep -o  "(http|https):.*\">" | awk  'BEGIN {FS="\""};{print $1}' > /PATH_TO_YOUR/urls.txt
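
If you want the list deduplicated as well, a slightly different sketch (assuming GNU grep for the -o option) lets grep pull out the URL itself and pipes the result through sort -u:

curl -s "http://www.cse.psu.edu/~groenvel/urls.html" | grep -oE '(http|https)://[^"]*' | sort -u > /PATH_TO_YOUR/urls.txt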

Method 4

Using wget:

wget -qO- http://www.cse.psu.edu/~groenvel/urls.html 2>&1 | egrep -o  "(http|https):.*\">" | awk  'BEGIN {FS="\""};{print $1}' > /PATH_TO_YOUR/urls.txt
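
Since the end goal is DNS testing, you may only need the hostnames rather than full URLs. A small follow-up sketch, assuming urls.txt was produced by any of the methods above (hosts.txt is just an example output name):

awk -F/ '{print $3}' /PATH_TO_YOUR/urls.txt | sort -u > /PATH_TO_YOUR/hosts.txt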

Another approach: you need wget, grep, and sed. I'll try the solution out and update this answer later.

Update:

wget [the_url];

cat urls.html | egrep -i '<a href=".*">' | sed -e 's/.*<A HREF="\(.*\)">.*/\1/i'
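
Note that the greedy .* in these patterns can misbehave if a line contains more than one link. A variant sketch that sidesteps that, assuming GNU grep:

egrep -io 'href="[^"]*"' urls.html | egrep -o 'https?://[^"]*'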
