[ In a hurry? Then go straight to usage! ]
The Internet Archive provides at web.archive.org an exceedingly popular service for accessing web-specific content, such as an article, that is no longer available on its original website. Furthermore, the service supports browsing (playback of) sites as they were, in some cases going back as far as 1996, which can be a source of considerable historical interest. For example, the first capture of the Museum of the History of Science (MHS) website was on 19 December 1996, available from the Internet Archive at https://web.archive.org/web/19961219005900/http://www.mhs.ox.ac.uk/.
There is, however, no option to ‘download this website’ beyond the browser’s options of saving individual pages. Nevertheless, the assistance of appropriate tools opens up the possibility of recovering larger amounts of information, even entire sites. This angle is reflected in a list of available solutions compiled by the ArchiveTeam, which uses the term restoration in its page title, clearly implying a motivation, such as recovery from accidental data loss or the desire to revisit or renew a project.
As this enters the realm of website (re-)creation, particularly from a static source, it naturally falls within the remit of MakeStaticSite. But if the Wayback Machine and these other solutions already provide access, why bother with another tool? Even for a long-established service such as the Internet Archive, there is no guarantee that it will continue to operate to the same degree. There are various associated risks: funding (to maintain the high-performance systems and data centres), legal (particularly around copyright), and security (breaches may have knock-on effects, making services vulnerable indirectly). So it may well be a race against time for individuals and organisations to preserve sites in such a way that they continue to be available.
Native Support for Wayback
Given the solutions already on offer, as mentioned above, why is there a need for native support in MakeStaticSite? Having explored some of the options, we found reasons that include the following:
- There are few free and currently maintained software solutions that are amenable to incorporation in or linking from MakeStaticSite.
- From the selection available, integration has been achieved for Hartator’s Wayback Machine Downloader, as described in the context of Wayback Machine as a service. However, the results have been mixed, generally not as complete as expected, even after downloading numerous snapshots. Furthermore, this tool is restricted to the Internet Archive.
- Most of the available solutions generally download a series of snapshots in a specific archive format. There remains the considerable task of reassembling these snapshots as a coherent, navigable and fairly complete website.
- An alternative approach is a paid service that uses web scraping. The provider states that its output needs to be deployed on a service, whereas for MakeStaticSite it is important that the output works offline. Even so, a small free sample showed clean output, suggesting that querying the Wayback Machine through its Playback Web interface is a promising approach.
As the Playback system carries out the complex assembly work, it makes a lot of sense to piggyback off that. Accordingly, since version 0.30, MakeStaticSite development effort in relation to the Wayback Machine is being concentrated on a native approach that queries the Playback service. Furthermore, the particular challenges facing Wget around URL formats (using the -np flag) can be circumvented by an iterative process of parsing pages, determining lists of further URLs to fetch, and then postprocessing to gather the downloads in a sensible fashion. Such URLs generally contain both the original URL and the timestamp of a capture. We can filter what further to download based on the timestamp without needing to understand in detail how it was assembled. Mirroring such content with Wget should also enable the retrieved files to work offline as well as online.
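The timestamp-based filtering described above can be sketched in shell as follows. This is a minimal illustration, not MakeStaticSite's actual code: the function names are hypothetical, and the pattern assumes the URL layout used by web.archive.org, namely a /web/ path segment, a 14-digit timestamp, an optional two-letter modifier such as im_, and then the original URL. Other Wayback Machine installations may use a different path prefix.

```shell
#!/bin/sh
# Hypothetical helpers for illustration; these are not MakeStaticSite functions.
# Assumed layout: <scheme>://<host>/web/<14-digit timestamp>[<xx>_]/<original URL>
WAYBACK_RE='^https?://[^/]+/web/([0-9]{14})([a-z]{2}_)?/(.+)$'

# Print the capture timestamp; print nothing if the URL does not match.
wayback_timestamp() {
  printf '%s\n' "$1" | sed -nE "s#${WAYBACK_RE}#\1#p"
}

# Print the original (archived) URL; print nothing if the URL does not match.
wayback_original_url() {
  printf '%s\n' "$1" | sed -nE "s#${WAYBACK_RE}#\3#p"
}

# Succeed only if the URL refers to a capture at exactly the wanted timestamp.
same_capture() {
  [ "$(wayback_timestamp "$1")" = "$2" ]
}
```

A crawl loop could call same_capture on each link found in a page to decide whether to queue it as a further page, while fetching assets regardless of their timestamp.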
So far, MakeStaticSite provides limited support for fetching web content from any site running a Wayback Machine, not just web.archive.org. At present, it can only retrieve pages captured at a precise timestamp, albeit with associated assets retrieved from any date. The output may therefore be just a single page, or dozens of pages with supporting page elements.
In principle this can be extended; what’s available in the current version is just a start.
MakeStaticSite Configuration
For native handling of Wayback Machine content, in constants.sh, set the client switch to ‘no’:
wayback_cli=no
Usage
Having found the required Wayback URL, as described above, usage is the same as with any other URL. For a shortcut using default settings, you can run the following:
$ ./setup.sh -u https://web.archive.org/web/19961219005900/http://www.mhs.ox.ac.uk/
MakeStaticSite will proceed to check the response header from the site and report if it is serving mementos, which is assumed to indicate a Wayback Machine.
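To give a feel for this kind of check, here is a minimal sketch. It assumes that a server serving mementos includes a Memento-Datetime response header, as defined by the Memento protocol (RFC 7089). The function name is hypothetical and MakeStaticSite's own test may differ.

```shell
#!/bin/sh
# Hypothetical helper, not MakeStaticSite's implementation.
# Reads HTTP response headers on stdin and succeeds if a
# Memento-Datetime header is present (matched case-insensitively).
serves_mementos() {
  grep -qi '^memento-datetime:'
}

# Example use (requires network access, so it is left commented out):
# curl -sI 'https://web.archive.org/web/19961219005900/http://www.mhs.ox.ac.uk/' \
#   | serves_mementos && echo "Site appears to serve mementos"
```

Keeping the header test separate from the network request means it can be checked against saved header text without contacting the archive.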
The mirror is generated as with other URLs, with a few additional options:
- reverting to the domain of the captured site instead of retaining the Wayback Machine host name (this is not yet reflected in the zip file name);
- removal of the (Playback) code inserted by the Wayback Machine;
- removal of comments inserted by the Wayback Machine.
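As a rough illustration of what such clean-up involves, the sketch below is a hypothetical stdin-to-stdout filter, not the code MakeStaticSite runs. It makes two assumptions: that the inserted toolbar sits between the comment markers BEGIN WAYBACK TOOLBAR INSERT and END WAYBACK TOOLBAR INSERT, each on its own line, and that rewritten links follow the /web/ plus 14-digit timestamp layout of web.archive.org. It does not attempt to remove other inserted scripts or trailing comments.

```shell
#!/bin/sh
# Hypothetical filter for illustration: reads HTML on stdin, writes cleaned HTML to stdout.
# $1 is the Wayback Machine host name, e.g. web.archive.org
strip_wayback() {
  # Escape dots so the host name is matched literally in the regular expression.
  host_re=$(printf '%s' "$1" | sed 's/\./\\./g')
  sed -E \
    -e '/<!-- BEGIN WAYBACK TOOLBAR INSERT -->/,/<!-- END WAYBACK TOOLBAR INSERT -->/d' \
    -e "s#((https?:)?//${host_re})?/web/[0-9]{14}([a-z]{2}_)?/(https?://)#\4#g"
}
```

For example, strip_wayback web.archive.org < index.html > index.clean.html would delete the toolbar block and turn links such as /web/19961219005900im_/http://www.mhs.ox.ac.uk/logo.gif back into http://www.mhs.ox.ac.uk/logo.gif (the file names here are placeholders).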
To illustrate the kind of output that MakeStaticSite generates, compare the Wayback URL https://web.archive.org/web/19961219005900/http://www.mhs.ox.ac.uk/ with the corresponding MakeStaticSite output: MHS site mirror (zip file)