The Internet Archive provides at web.archive.org an exceedingly popular service for accessing web-specific content, such as an article, that is no longer available on its original website. Furthermore, the service supports browsing (playback of) sites as they were, in some cases going back as far as 1996, which can be a source of considerable historical interest. For example, the first capture of the Museum of the History of Science (MHS) website was on 19 December 1996, available from the Internet Archive at https://web.archive.org/web/19961219005900/http://www.mhs.ox.ac.uk/.
There is, however, no option to ‘download this website’ beyond the browser’s options of saving individual pages. Nevertheless, the assistance of appropriate tools opens up the possibility of recovering larger amounts of information, even entire sites. This angle is reflected in a list of available solutions compiled by the ArchiveTeam, which uses the term restoration in its page title, clearly implying a motivation, such as recovery from accidental data loss or the desire to revisit or renew a project.
As this is entering the realms of website (re-)creation, particularly from a static source, it naturally falls within the remit of MakeStaticSite. But if the Wayback Machine and these other solutions already provide access, why bother with another tool? Even for a long-established service such as the Internet Archive, there is no guarantee it will continue to operate to the same degree. There are various associated risks: funding (to maintain the high performance systems and data centres), legal (particularly around copyright), and security (there may be knock-on effects from security breaches, making services vulnerable indirectly). So, it may well be a race against time for individuals and organisations to preserve sites in such a way that they continue to be available.
Native Support for Wayback
Given the solutions already offered, as mentioned above, why the need for native support in MakeStaticSite? After exploring some of the options, the reasons include the following:
- There are few free and currently maintained software solutions that are amenable to incorporation in or linking from MakeStaticSite.
- From the selection available, integration has been achieved for Hartator’s Wayback Machine Downloader, as described in the context of Wayback Machine as a service. However, the results have been mixed, generally not as complete as expected, even after downloading numerous snapshots. Furthermore, this tool is restricted to the Internet Archive.
- Most of the available solutions generally download a series of snapshots in a specific archive format. There remains the considerable task of reassembling these snapshots as a coherent, navigable and fairly complete website.
- An alternative approach is a paid service that uses Web scraping. It states that the output needs to be deployed on a service, but for MakeStaticSite it is important that it works offline. Even so, a small free sample showed clean output, suggesting that querying the Wayback Machine through its Playback Web interface is a promising approach.
As the Playback system carries out the complex assembly work, it makes a lot of sense to piggyback off that. Accordingly, since version 0.30, MakeStaticSite development effort in relation to the Wayback Machine is being concentrated on a native approach that queries the Playback service. Furthermore, the particular challenges facing Wget around URL formats (using the -np flag) can be circumvented by an iterative process of parsing pages, determining lists of further URLs to fetch, and then postprocessing to gather the downloads in a sensible fashion. Such URLs generally contain both the original URL and the timestamp of a capture. We can filter what further to download based on the timestamp without needing to understand in detail how it was assembled. Mirroring such content with Wget should also enable the retrieved files to work offline as well as online.
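As a rough illustration of the point that such URLs carry both the capture timestamp and the original URL, the following POSIX shell sketch splits a Wayback URL into those two parts and filters a list of URLs by timestamp. The function names are hypothetical and the URL layout (host, then /web/, then timestamp with an optional modifier, then the original URL) is an assumption based on the examples on this page; it is not MakeStaticSite's own code.

```shell
# Assumed URL layout: https://<host>/web/<timestamp>[modifier]/<original URL>
# All function names here are hypothetical, for illustration only.

wayback_timestamp() {
    # Print the run of digits that follows "/web/" (the 'from' timestamp).
    printf '%s\n' "$1" |
        sed -n 's#^https\{0,1\}://[^/]*/web/\([0-9][0-9]*\).*#\1#p'
}

wayback_original_url() {
    # Print everything after the timestamp segment, i.e. the original URL.
    printf '%s\n' "$1" |
        sed -n 's#^https\{0,1\}://[^/]*/web/[^/]*/\(.*\)$#\1#p'
}

filter_by_timestamp() {
    # Read Wayback URLs on stdin; print those captured at exactly timestamp $1.
    while IFS= read -r url; do
        [ "$(wayback_timestamp "$url")" = "$1" ] && printf '%s\n' "$url"
    done
    return 0
}
```

For instance, wayback_timestamp applied to the MHS URL above prints 19961219005900, and wayback_original_url prints http://www.mhs.ox.ac.uk/.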
So far, MakeStaticSite provides limited support for fetching web content from any site running a Wayback Machine, not just web.archive.org. However, at present, it can only retrieve pages captured at a precise timestamp, albeit with associated assets retrieved from any date. This may mean the output is just a single page or dozens of pages with supporting page elements.
However, in principle this can be extended; what’s available in the current version is just a start.
Configuration
For native handling of Wayback Machine content, in constants.sh, set the client switch to ‘no’:
wayback_cli=no
Usage
Having found the required Wayback URL, as described above, usage is the same as with any other URL. For a shortcut using default settings, you can run the following:
$ ./setup.sh -u https://web.archive.org/web/19961219005900/http://www.mhs.ox.ac.uk/
MakeStaticSite will proceed to check the response header from the site and report if it is serving mementos, which is assumed to indicate a Wayback Machine.
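The kind of header check described here might look like the sketch below. It assumes that the signal is the Memento-Datetime response header defined by the Memento protocol (RFC 7089); which header MakeStaticSite actually inspects is not stated on this page, and the function name is hypothetical.

```shell
# Read HTTP response headers on stdin and succeed (exit status 0)
# if a Memento-Datetime header is present (case-insensitive match).
# Assumption: this header is what marks a response as a memento.
serves_mementos() {
    grep -qi '^memento-datetime:'
}

# Possible use against a live host (needs network access and curl):
#   curl -sI 'https://web.archive.org/web/19961219005900/http://www.mhs.ox.ac.uk/' \
#       | serves_mementos && echo 'Looks like a Wayback Machine'
```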
The mirror is generated as with other URLs, with a few additional options:
- reverting to the domain of the captured site instead of retaining the Wayback Machine host name. (This is not yet reflected in the zip file name.)
- removal of the (Playback) code inserted by the Wayback Machine
- removal of comments inserted by the Wayback Machine.
To illustrate the kind of output that MakeStaticSite generates, compare the Wayback URL https://web.archive.org/web/19961219005900/http://www.mhs.ox.ac.uk/ with the corresponding MakeStaticSite output.
Date Ranges
Date ranges limit the output to content that was fetched by the Wayback Machine between those dates (by ‘date’ we mean date and time).
This can be specified using the same commands with just a slight variation, replacing a single timestamp with two timestamps separated by a hyphen. Hence:
$ ./setup.sh -u https://web.archive.org/web/19961219005900-19970412232929/http://www.mhs.ox.ac.uk/sphaera/issue1/text.htm
The initial run of MakeStaticSite will be carried out on the from date (i.e. 19961219005900 above). As this uses the ‘page requisites’ option of Wget, it may fetch content from other timestamps, but subsequent runs will constrain content fetches to the range.
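Constraining fetches to a range amounts to a simple comparison of timestamps. The following is a minimal sketch, assuming full 14-digit timestamps and a range written as from-to; the function name is hypothetical and this is not taken from MakeStaticSite's source.

```shell
# Succeed if timestamp $1 lies within the range $2.
# $2 is either "from-to" or a single timestamp (treated as from = to).
# Assumes 14-digit timestamps (YYYYMMDDhhmmss), compared as integers.
in_wayback_range() {
    ts=$1
    from=${2%%-*}   # text before the first hyphen
    to=${2##*-}     # text after the last hyphen (whole string if no hyphen)
    [ "$ts" -ge "$from" ] && [ "$ts" -le "$to" ]
}
```

With the range in the example above, a capture at 19970101000000 would be accepted and one at 19980101000000 rejected.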
Archive Directory Naming
With Wayback URLs, there are options to specify precisely the name format of the mirror archive directory (stored in the variable, mirror_archive_dir). It is possible to record the primary snapshot generated by the Wayback Machine, as specified in url, the timestamp of MakeStaticSite’s capture, along with the Wayback host domain and the target primary domain.
These are determined by two constants: wayback_sitename_hosts and wayback_sitename_timestamps. The options selected will likely depend on usage scenarios.
The host options are:
- [default] wayback_sitename_hosts=wayback (or empty string) sets the stem based on the Wayback host name.
  E.g., web.archive.org
- wayback_sitename_hosts=primary sets the stem based on the target primary domain.
  E.g., www.example.org
- wayback_sitename_hosts=both sets the stem based on both the Wayback host name and the target primary domain.
  E.g., web.archive.org-www.example.org
The timestamp options specify timestamps according to the Wayback URL entered (which can include date ranges) and the run of MakeStaticSite. Note that MakeStaticSite timestamps may be distinguished by their use of the underscore character.
They are tied to the host options above. Hence:
- [default] wayback_sitename_timestamps=wayback (or empty string) appends the ‘from’[-‘to’] Wayback timestamp[s].
  E.g., web.archive.org19970412232929
  web.archive.org-www.mhs.ox.ac.uk19970412232929
  I.e., Wayback Machine archive of www.mhs.ox.ac.uk, with primary snapshot on 12 April 1997 at 23:29:29, captured by MakeStaticSite from web.archive.org (no timestamp).
- wayback_sitename_timestamps=mss appends the MakeStaticSite timestamp.
  E.g., web.archive.org20240912_160154
  I.e., Wayback Machine archive captured by MakeStaticSite from web.archive.org on 12 September 2024 at 16:01:54 (local time).
- wayback_sitename_timestamps=both appends the MakeStaticSite timestamp to the Wayback host portion and the ‘from’[-‘to’] Wayback timestamp[s] to the target domain portion.
  E.g., web.archive.org20240912_160154-www.mhs.ox.ac.uk19970412232929
  I.e., Wayback Machine archive of www.mhs.ox.ac.uk, with primary snapshot on 12 April 1997 at 23:29:29, captured by MakeStaticSite from web.archive.org on 12 September 2024 at 16:01:54 (local time).
These options will likely prove more useful when carrying out more captures and repeat captures.
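When reading such directory names later, the two kinds of timestamp can be told apart by the underscore, as noted above. The following sketch (hypothetical function name, and assuming the two formats are exactly those shown in the examples: 14 digits for Wayback, 8 digits plus underscore plus 6 digits for MakeStaticSite) classifies a timestamp and prints it in a readable form.

```shell
# Classify a timestamp string and print "<kind> YYYY-MM-DD hh:mm:ss".
# Returns 1 if the string matches neither assumed format.
describe_timestamp() {
    case $1 in
        [0-9][0-9][0-9][0-9][0-9][0-9][0-9][0-9]_[0-9][0-9][0-9][0-9][0-9][0-9])
            kind='MakeStaticSite'
            digits=$(printf '%s' "$1" | tr -d '_') ;;
        [0-9][0-9][0-9][0-9][0-9][0-9][0-9][0-9][0-9][0-9][0-9][0-9][0-9][0-9])
            kind='Wayback'
            digits=$1 ;;
        *)
            return 1 ;;
    esac
    # Break the 14 digits into date and time fields.
    pretty=$(printf '%s\n' "$digits" |
        sed 's/^\(....\)\(..\)\(..\)\(..\)\(..\)\(..\)$/\1-\2-\3 \4:\5:\6/')
    printf '%s %s\n' "$kind" "$pretty"
}
```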
Wayback Tidy Options
There are several options that determine the extent to which output from the Wayback Machine is cleaned.
- wayback_code_clean: when set to yes, deletes the JavaScript Playback code inserted by the Wayback Machine in the head of the document.
- wayback_comments_clean: when set to yes, deletes HTML comments inserted by the Wayback Machine in the footer.
- wayback_links_clean: when set to yes, restores the original URL links in web pages, removing Wayback prefixes in these URLs.
To restore a website to its original domain, set all three options to yes. This is more appropriate for newly archived sites since older sites will likely have links to expired content, in which case, for continuity of browsing, it is recommended to set wayback_links_clean=no.
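To give a feel for what removing Wayback prefixes involves, here is a minimal sed-based sketch. It assumes the host is web.archive.org and that prefixes take the form /web/ followed by a timestamp and an optional modifier such as im_; the function name is hypothetical and MakeStaticSite's own processing may well differ.

```shell
# Filter HTML on stdin, stripping Wayback prefixes from absolute URLs so
# that links point at the original addresses again.
# Assumptions: host is web.archive.org; prefix is /web/<digits>[modifier]/.
strip_wayback_prefix() {
    sed 's#https\{0,1\}://web\.archive\.org/web/[0-9][0-9]*[a-z_]*/##g'
}
```

For example, a link to https://web.archive.org/web/19961219005900/http://www.mhs.ox.ac.uk/sphaera/ would be rewritten to http://www.mhs.ox.ac.uk/sphaera/.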
Splitting a mirror into multiple runs
(Please note that this is putative and not yet tested.)
For larger sites, it may be possible to split the task of site mirroring by directory, with separate runs for each directory based on Wget’s --no-parent option. Each run will output a site to the original path (so you will need to drill down to access it) and generate a sitemap that reflects it. This task can then be distributed to a group of users, each user apportioned a particular run (or runs). On completion, the outputs can be merged, with some care.
As a prerequisite, change the default settings in constants.sh to ensure that MakeStaticSite doesn’t move the output to the root directory:
mss_cut_dirs=no
Each run will require its own configuration file, with the assigned URL. A straightforward arrangement is to select top-level directories, such as:
https://example.com/directory1/
https://example.com/directory2/
...
https://example.com/directoryn/
For these directories, the default or existing configuration settings may usually be left as they are.
The remaining content is in the root directory. Here, the crawl should be limited (or else why divvy up the task?), so that Wget runs non-recursively or only to a shallow depth. To do this, remove or limit the recursive option of Wget by changing the wget_mirror_options constant. For example, to allow a crawl of one further step, set:
wget_mirror_options=(--recursive --timestamping --level=1 --no-remove-listing)
The value of level will depend on how the navigation is set up, with the aim of reaching every page in the root directory (the --page-requisites option in wget_core_options should ensure that supporting assets are fetched automatically).
Also, for the root directory, there is probably no need to run phase 3, so set it to not carry out further crawling:
wget_extra_urls_depth=0
Site re-assembly
An outline of a re-assembly process:
- Create a working directory for the re-assembly.
- Within that, create a directory named sources, and extract or copy the individual mirror outputs inside, one directory for each output.
- In the root directory, create a sitemap.xml as a merge of the individual sitemap.xml files in the respective output directories; copy the robots.txt file from the original root directory.
- Choose a fresh directory, where the site is to be re-assembled.
- Copy over root directory output followed by the other directory outputs. It is likely that you will encounter duplication.
- Copy over the merged sitemap.xml file.
- Tidy up.
All being well, this will yield a fully reassembled and navigable static site.
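The copy and sitemap-merge steps above could be sketched in POSIX shell as follows. Like the rest of this section, it is untested against real MakeStaticSite output. The function names and directory layout are hypothetical, and the sitemap merge assumes that each <loc> element sits on a line of its own.

```shell
# Copy several mirror outputs into one destination, in the order given
# (root output first); later copies overwrite duplicates from earlier ones.
reassemble() {
    dest=$1
    shift
    mkdir -p "$dest"
    for src in "$@"; do
        cp -R "$src"/. "$dest"/
    done
}

# Merge the <loc> entries of several sitemap.xml files into a single
# sitemap on stdout, dropping duplicates. Other elements (lastmod etc.)
# are discarded in this simplified sketch.
merge_sitemaps() {
    printf '%s\n' '<?xml version="1.0" encoding="UTF-8"?>'
    printf '%s\n' '<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">'
    cat "$@" |
        sed -n 's#.*<loc>\([^<]*\)</loc>.*#\1#p' |
        sort -u |
        while IFS= read -r loc; do
            printf '  <url><loc>%s</loc></url>\n' "$loc"
        done
    printf '%s\n' '</urlset>'
}

# Possible use, with hypothetical directory names:
#   reassemble site sources/root sources/directory1 sources/directory2
#   merge_sitemaps sources/*/sitemap.xml > site/sitemap.xml
```

Copying the root output first and the directory outputs afterwards follows the order given in the outline; whether overwriting duplicates is the right policy will depend on the site.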