The Wayback Machine


The Internet Archive provides at web.archive.org an exceedingly popular service for accessing web-specific content, such as an article, that is no longer available on its original website. Furthermore, the service supports browsing (playback of) sites as they were, in some cases going back as far as 1996, which can be a source of considerable historical interest. For example, the first capture of the Museum of the History of Science (MHS) website was on 19 December 1996, available from the Internet Archive at https://web.archive.org/web/19961219005900/http://www.mhs.ox.ac.uk/.

At the same time, with the assistance of appropriate tools, this opens up the possibility of recovering larger amounts of information, even entire sites. This angle is reflected in a list of available solutions compiled by the ArchiveTeam, which uses the term restoration in its page title, clearly implying a motivation, such as recovery from accidental data loss or the desire to revisit or renew a project. As this is entering the realms of website (re-)creation, particularly from a static source, this naturally falls within the remit of MakeStaticSite.

But if the Wayback Machine already provides access, why bother with another tool? For even a long-established service such as the Internet Archive, there is no guarantee it will continue to operate to the same degree. There are various associated risks: funding (to maintain the high performance systems and data centres), legal (particularly around copyright), and security (there may be knock-on effects for security breaches, making services vulnerable indirectly). So, it may well be a race against time for individuals and organisations to preserve sites in such a way that they continue to be accessible in the event of such issues.

First we provide some orientation to see what we are dealing with.

Wayback as software

The specific Web-based service is called the Wayback Machine, which is what most people remember and how they identify it. However, it is, more generally, a term for a type of software package that enables this kind of web memory storage and retrieval. In its long history (by Web standards), there have been various open source projects delivering such software. We’ll just mention one: Wayback, produced by The Internet Archive, available from GitHub. Written in Java, it comprises three components:

  • A core application which accesses and makes available data retrieved by the Heritrix web crawler and archived in standard formats such as ARC or WARC.
  • A Content Index (CDX) server, accessible via API calls, which returns archive content index (CDX) files, in response to site queries subject to constraints such as URL and date range.
  • A Web app, whose playback interface is what most users experience, at web.archive.org. It also serves snapshots via an API; basic usage is helpfully summarised on Wikipedia.

From this (and analogously with other projects), there are basically two approaches to generate snapshots of web sites.

  1. Use an API to query the archive indexes to produce a list of the requisite snapshots. Then download and assemble the snapshots for playback.
  2. Use a web crawler to scrape the playback user interface.

There are pros and cons to both. The first option is more rigorous, providing direct access to indexes of the source archives and endpoints for retrieving original snapshots. There are some open source tools available to work with these archives using various APIs available, including support for the Memento framework, particularly its “Time Travel for the Web” protocol, recognisable by its distinct URLs, where a timestamp is sandwiched by an archive service host domain and the captured domain respectively, as shown for MHS above. However, the archives themselves have evolved over time, along with how they were sourced and indexed.

Especially relevant for offline web mirrors, is that by design, Web pages on the Wayback Machine are based on snapshots, gathered at various times, delivered within a calendar framework, whose playback may be viewed as a kind of virtual machine configured to maintain as much online navigation as possible. Reproducing a comparable playback experience from the components thus downloaded requires complex assembly. All in all, it means that no matter how good the API and the client implementation, queries can yield inconsistent or partial content and the resulting sites may be incomplete or not function as expected.

The second option, which requires a suitable tool to crawl the Wayback Machine’s web site thoroughly, largely avoids the need to assemble components. albeit is rudimentary in comparison. Whereas the Wayback Machine’s web interface provides a straightforward means to download a single page or file via a desktop browser (using a specially crafted URL), it doesn’t offer an extended facility for interlinked pages. Furthermore, there is no freely available crawler that is specifically designed to generate offline snapshots. So, custom software development or the use of other third-party tools or services is needed.

Both options dependent on service availability (via http); the second being more widely used, is probably more vulnerable.

Support in MakeStaticSite

MakeStaticSite is in its early stages of supporting the retrieval and restoration of content using either method. It is relatively straightforward to invoke command-line clients and, with cURL and Wget, it is also feasible to crawl Wayback Machine sites. So far, the coverage is partial, but the goal is to handle entire sites (with content retrieved constrained to a particular date range).

Wayback Machine Downloader

Support for the API approach was introduced in version 0.27, initially leveraging Wayback Machine Downloader, a tool written in Ruby. Note that it only supports the Internet Archive’s service, i.e. web.archive.org

We are interested to know: To what extent does querying the CDX server provide a solution? In particular, how complete is its coverage?

Installation

Follow the instructions provided to install the binary, making a note of its name (the default is wayback_machine_downloader).

MakeStaticSite Configuration

In constants.sh, set the client switch to ‘yes’ and specify the filename of the binary:

wayback_cli=yes
wayback_machine_downloader_cmd=wayback_machine_downloader

Usage

The simplest way to create a mirror of a Wayback Machine archive is to browse the Wayback Machine web interface and copy the URL of a particular archive snapshot, which is usually available in the Memento format. For example, the first snapshot for the Museum of the History of Science (as it was then known) can be viewed at: https://web.archive.org/web/19961219005900/http://www.mhs.ox.ac.uk/.

Then run the MakeStaticSite setup script with this URL as parameter.

$ ./setup.sh -u https://web.archive.org/web/19961219005900/http://www.mhs.ox.ac.uk/

The -u flag specifies ‘run unattended’, i.e. non-interactively. Various assumptions will be made as it creates and then uses a configuration file to build a mirror, which in this case comprises precisely one file — the index page. It’s not necessary to supply a complete timestamp with all 14 numerals. A stem suffices, e.g. 1996.

At the other end of the scale, an attempt to download all files without date restrictions can result in a network socket error or else yield tens of thousands of files. This can be checked by using the Wayback Machine Downloader directly:

$ wayback_machine_downloader -l -s http://www.mhs.ox.ac.uk > wayback_cdx_mhs.txt

(The -l flag specifies listing only; -s flag requests all all snapshots/timestamps.)

Date Ranges

A balance may be struck in how much is downloaded by specifying a date range. This can be specified in constants.sh by setting the values for wayback_date_from and wayback_date_to respectively.

For convenience, a date range may be specified directly for processing by setup.sh and makestaticsite.sh. This uses a custom URL that extends the Wayback URLs by including two timestamps separated by a hyphen: https://web.archive.org/web/timestamp_from-timestamp_to/http://www.example.com/.

For example, to confine results for the Museum of the History of Science between 2009 and July 2012, use:

$ ./setup.sh -u https://web.archive.org/web/2009-201207/http://www.mhs.ox.ac.uk/

If in doubt, cast the net wider.

Findings

For sites with more than a few pages, browsing the output will typically be less satisfactory compared with the original. As remarked above, the Wayback Machine’s playback is designed to piece together snapshots gathered at different dates, seeking to maintaining an online user experience as close to the original as it can. The Wayback Machine Downloader doesn’t offer equivalent functionality and it’s not known if other command-line clients provide this, particularly for offline working.

In handing over the site capture to a third party, makestaticsite.sh is ignorant of that process. In terms of phases, it effectively means bypassing phases 2 and 3, so that only in phase 4 are the rains handed back to the MakeStaticSite script. At this stage, MakeStaticSite carries out a few post-crawl refinements, such as the conversion of some internal links to maintain navigation and refinement of pages with HTML Tidy. However, it does not retrieve any further content, so it can’t fill in the gaps.

On the other hand, if a solution is based on Wget, then much of the mirroring functionality that is core to Wget, such as the handling of non-HTML file extensions with corresponding link conversion and downloading page requisites, could fill in gaps. This motivates an extension to MakeStaticSite to incorporate these features natively.

Native Support for Wayback

MakeStaticSite provides native support for any site running a Wayback Machine, not just web.archive.org. However, at present, it can only retrieve pages captured at a precise timestamp, so it cannot yet handle date ranges. This may mean the output is just a single page or dozens of pages with supporting page elements.

However, in principle this can be extended; what’s available in the current version it is just a start.

MakeStaticSite Configuration

For native handling of Wayback Machine content, in constants.sh, set the client switch to ‘no’:

wayback_cli=no

Usage

Having found the required Wayback URL, as described above, usage is the same as with any other URL. MakeStaticSite will proceed to check the response header from the site and report if it is serving mementos, which is assumed to indicate a Wayback Machine.

The mirror is generated as with other URLs, except that currently the name of the archive directory is based on the Wayback Machine host name. An option should be introduced to revert to the domain of the captured site.

Further Development

A general Web crawler such as Wget is able to download (or scrape) the contents of the Wayback Machine subject to guidance on ‘Where to go next?’ (crawling everything under the archive.org domain is clearly not feasible). The answer to that question is determined by the canonical URLs for timestamped sites, i.e. the Wayback URL contains in its path both the original URL and the timestamp of a capture. It is not necessary to understand in detail how it was assembled. Furthermore, mirroring such content with Wget should enable the retrieved files to work offline as well as online.

This page was published on 4 May 2024 and last updated on 22 May 2024.