Using wget to mirror a website where a path and a subfolder have the same name

I'm trying to mirror a website, but its URLs include several paths that overlap when wget saves them to disk in the usual way. The problem shows up with URLs such as http://example.com/news and http://example.com/news/article1.

Wget saves these URLs as /news and /news/article1, but this means the file /news gets overwritten by the folder with the same name.
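
The collision can be reproduced without wget at all. The following is a minimal sketch, assuming a throwaway directory (the variable names demo_dir and collision are made up for illustration); it only shows that a regular file and a directory cannot share the name news in the same parent folder:

```shell
# Create a scratch directory so nothing in the real mirror is touched.
demo_dir=$(mktemp -d)

# Stand-in for the page wget would save for http://example.com/news
touch "$demo_dir/news"

# Stand-in for the folder wget needs for http://example.com/news/article1
if mkdir "$demo_dir/news" 2>/dev/null; then
  collision=no
else
  collision=yes   # mkdir fails because a file named "news" already exists
fi

echo "collision: $collision"
rm -rf "$demo_dir"
```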

For a proper static mirror, these two URLs would need to be saved as /news/index.html and /news/article1 instead.

I tried to work around this problem by running wget twice and moving the files accordingly, but this did not work for me. The page at /news has links to /news/article1 that need to be converted. I use the -k option to convert links, but if I run wget twice, it does not convert links between the separately downloaded files.

Here is my command:

wget -p -r -l4 -k -d -nH http://example.com

Here is an example of the workaround I tried:

# wget once at first level (gets /news path but not /news/*)
wget -p -r -l1 -k -nH http://example.com

# move /news file to temp path
mv news /tmp/news.html

# wget again to get everything else (notice the different level value)
wget -p -r -l4 -k -nH http://example.com

# move temp path back to /news/index.html
mv /tmp/news.html news/index.html

In the above example, the links on the /news page that should point to /news/article1 were not converted.

Does anyone know how to get around this with wget? Is there any other tool that will work better?


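Unfortunately, you cannot make wget save that URL as /news/index.html. However, the man page describes the -E (--adjust-extension) option, which makes wget append .html to downloaded HTML files whose names do not already end in it.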

The -k option works with it as well, so link conversion is not a problem.

With this option you will get:

http://example.com/news           -->  /news.html
http://example.com/news/article1  -->  /news/article1.html
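
The mapping above can be approximated with a few lines of shell. This is only a rough sketch for HTML pages: adjusted_name is a hypothetical helper, not part of wget, and it ignores the other cases wget handles (content types, query strings, other suffixes):

```shell
# Hypothetical helper: approximate the local file name that -E would give
# to an HTML page, by appending .html unless the path already ends in
# .htm or .html (in any letter case).
adjusted_name() {
  case "$1" in
    *.[Hh][Tt][Mm]|*.[Hh][Tt][Mm][Ll]) printf '%s\n' "$1" ;;
    *) printf '%s.html\n' "$1" ;;
  esac
}

adjusted_name /news
adjusted_name /news/article1
adjusted_name /index.html
```

Because /news becomes the file news.html, it no longer competes with the folder news/ for the same name.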

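If you serve the mirror from a web server (for example, Apache), you will need a rewrite rule so that http://sitemirror.com/news/article1 serves /news/article1.html. Without such a rule, requests for the original URLs, such as http://sitemirror.com/news, may end in a 404.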
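
Paths like /news, which end up in the mirror both as a file (news.html) and as a folder (news/), are the ones most likely to need special handling on the server. A small sketch for listing them, assuming the mirror was downloaded with -E; mirror_dir and overlaps are made-up names, and the directory layout below is a fake one created only for the demonstration:

```shell
# Hypothetical mirror location: a scratch directory with a fake layout.
mirror_dir=$(mktemp -d)
mkdir -p "$mirror_dir/news"
touch "$mirror_dir/news.html" "$mirror_dir/news/article1.html" "$mirror_dir/about.html"

# Collect every directory that also has a sibling "<name>.html" file,
# printed relative to the mirror root.
overlaps=$(find "$mirror_dir" -type d | while IFS= read -r dir; do
  if [ -f "$dir.html" ]; then
    printf '%s\n' "${dir#"$mirror_dir"}"
  fi
done)

echo "$overlaps"
rm -rf "$mirror_dir"
```

For a real mirror, drop the setup lines and point mirror_dir at the directory wget wrote to.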

So, the wget command becomes:

wget -p -r -l4 -E -k -nH http://example.com

Source: https://habr.com/ru/post/1569475/
