How do you archive an entire site for offline viewing?

We have actually burned static/archived copies of our ASP.NET sites for clients many times. We have used WebZip until now, but we have had endless problems with crashes, downloaded pages that were not re-linked correctly, and so on.

We basically need an application that crawls and downloads static copies of everything on our ASP.NET website (pages, images, documents, CSS, etc.) and then processes the downloaded pages so they can be browsed locally without an Internet connection (getting rid of absolute URLs in links, etc.). The more idiot-proof, the better. This seems like a fairly common and (relatively) simple process, but I have tried several other applications and have been really unimpressed.

Does anyone have archiving software that they would recommend? Does anyone have a really simple process to share?

+49
html web-crawler archive
Feb 11 '09 at 21:22
10 answers

On Windows, you can look at HTTrack. It is highly configurable and lets you limit the download speed, but you can also just point it at a website and run it with no configuration at all.

In my experience it has been a really good tool and works well. Some of the things I like about HTTrack are listed below, followed by a sketch of a typical command line:

  • Open source license
  • Can resume interrupted downloads
  • Can update an existing archive
  • You can configure it to be non-aggressive when it downloads, so it does not waste your bandwidth or the site's bandwidth.
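
For reference, a minimal HTTrack command line might look like the sketch below. The output folder, filter pattern, and rate cap are illustrative assumptions rather than required settings, and the Windows GUI exposes the same options.

# mirror the site into ./archive, stay on example.com, and cap the transfer rate (example values)
httrack "http://www.example.com/" -O ./archive "+*.example.com/*" --max-rate=25000 -v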
+35
Feb 11 '09 at 21:34

You can use wget:

wget -m -k -K -E http://url/of/web/site 
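
For reference, those short flags expand to the long options below; this is standard wget behaviour, though older wget releases spell -E as --html-extension.

# -m  --mirror             recurse with infinite depth and keep timestamps
# -k  --convert-links      rewrite links in the saved pages so they work offline
# -K  --backup-converted   keep a .orig copy of each page before its links are rewritten
# -E  --adjust-extension   save pages with an .html extension
wget --mirror --convert-links --backup-converted --adjust-extension http://url/of/web/site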
+64
Feb 11 '09 at 21:25

Hartator's Wayback Machine Downloader is simple and fast.

Install it through Ruby, then run it with the desired domain and an optional timestamp from the Internet Archive.

sudo gem install wayback_machine_downloader
mkdir example
cd example
wayback_machine_downloader http://example.com --timestamp 19700101000000
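
If you only need part of a site, the gem also accepts filtering flags. The ones below existed at the time of writing, but check wayback_machine_downloader --help for the version you install.

# restrict the download to matching URLs and skip image files (example filters)
wayback_machine_downloader http://example.com --only "/blog/" --exclude "/\.(gif|jpg|png)$/"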
+5
Nov 02 '15 at 1:07

I use Blue Crab on OSX and WebCopier on Windows.

+4
Feb 11 '09 at 21:26

wget -r -k

... and explore the rest of the options. I hope you have followed these guidelines: http://www.w3.org/Protocols/rfc2616/rfc2616-sec9.html so that all of your resources can be retrieved safely with GET requests.

+2
Feb 11 '09 at 21:26

I just use: wget -m <url> .

+1
Feb 11 '09 at 21:25

For OS X users, I have found that the SiteSucker application (found here) works well without configuring anything except how deeply it follows links.

+1
Apr 24 '13 at 14:24

If your clients are archiving for compliance reasons, you want to make sure the content can be authenticated. The options listed here are fine for simple viewing, but they are not legally admissible. In that case you are looking for timestamps and digital signatures, which is much harder to do yourself. I would suggest a service like PageFreezer.

+1
Mar 09 '15 at 18:23

I have been using HTTrack for several years now. It handles all of the inter-page re-linking and so on just fine. My only complaint is that I have not found a good way to keep it limited to a sub-site. For example, if there is a site www.foo.com/steve that I want to archive, it will most likely follow links to www.foo.com/rowe and archive that too. Otherwise it's great: highly configurable and reliable. A possible workaround using HTTrack's filters is sketched below.
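
On the sub-site point, HTTrack's scan rules (filters) can usually rein the crawl in. A hedged sketch, reusing the example paths above; the exact patterns may need tuning:

# exclude everything by default, then allow only the /steve sub-tree (the last matching filter wins)
httrack "http://www.foo.com/steve/" -O ./steve-archive "-*" "+www.foo.com/steve/*"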

0
Feb 11 '09 at 21:58

Also check out ArchiveBox (formerly Bookmark Archiver).

It is a self-hosted, open-source web archiving tool that can save pages from your bookmarks, browser history, RSS feeds, and more.
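
As a rough sketch of getting started, assuming a recent ArchiveBox release installed from PyPI (see the project's README for current instructions):

pip install archivebox
mkdir my-archive && cd my-archive
archivebox init                        # create the archive data folder in the current directory
archivebox add 'https://example.com'   # snapshot a single URL into the archive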

0
Dec 21 '18 at 23:31


