The specific Web-based service is called the Wayback Machine, which is what most people remember and how they identify it. More generally, however, the term has come to denote a type of software package that enables this kind of web memory storage and retrieval. In its long history (by Web standards), various open source projects have delivered such software. We’ll mention just one: Wayback, produced by the Internet Archive and available from GitHub. Written in Java, it comprises three components:
- A core application which accesses and makes available data retrieved by the Heritrix web crawler and archived in standard formats such as ARC or WARC.
- A Content Index (CDX) server, accessible via API calls, which returns index (CDX) records in response to site queries, subject to constraints such as URL and date range.
- A Web app, whose playback interface is what most users experience, at web.archive.org. It also serves snapshots via an API; basic usage is helpfully summarised on Wikipedia.
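The snapshot API mentioned above includes an availability endpoint, which returns the closest snapshot to a given timestamp. The following sketch merely assembles such a query — the target site and timestamp are examples — and leaves the actual fetch, which requires network access, commented out:

```shell
# Build a query for the Wayback Machine availability API.
# The endpoint is documented by the Internet Archive; the target
# site and timestamp below are illustrative.
target="www.mhs.ox.ac.uk"
timestamp="19970101"
api_url="https://archive.org/wayback/available?url=${target}&timestamp=${timestamp}"
echo "$api_url"

# To run the query (network access required):
# curl -s "$api_url"
```

The response is a small JSON document indicating whether an archived snapshot is available and, if so, its URL and timestamp.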
From this (and analogously with other projects), there are basically two approaches to generate snapshots of web sites.
- Use an API to query the archive indexes to produce a list of the requisite snapshots. Then download and assemble the snapshots for playback.
- Use a web crawler to scrape the playback user interface.
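To illustrate the first approach, the CDX server at web.archive.org can be queried over HTTP and, by default, returns one space-separated index line per capture (urlkey, timestamp, original URL, MIME type, status code, digest, length). This sketch assembles a query and parses a sample line; the digest and length values are made up for illustration:

```shell
# Assemble a CDX server query (the parameters are examples).
target="www.mhs.ox.ac.uk"
cdx_url="https://web.archive.org/cdx/search/cdx?url=${target}&from=1996&to=1997&limit=10"
# curl -s "$cdx_url"   # fetch the index (network access required)

# Parse a sample index line (digest and length here are illustrative).
sample='uk,ac,ox,mhs)/ 19961219005900 http://www.mhs.ox.ac.uk/ text/html 200 ABC123 1234'
timestamp=$(echo "$sample" | awk '{print $2}')
original=$(echo "$sample" | awk '{print $3}')
echo "$timestamp $original"
```

The timestamp and original URL extracted this way are precisely the ingredients needed to construct snapshot retrieval URLs for assembly into a mirror.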
There are pros and cons to both. The first option is more rigorous, providing direct access to indexes of the source archives and endpoints for retrieving original snapshots. Some open source tools are available for working with these archives through various APIs, including support for the Memento framework, particularly its “Time Travel for the Web” protocol. Its distinctive URLs sandwich a timestamp between an archive service host domain and the captured domain respectively, as shown for MHS above. However, the archives themselves have evolved over time, along with how they were sourced and indexed.
Especially relevant for offline web mirrors is that, by design, Web pages on the Wayback Machine are based on snapshots, gathered at various times and delivered within a calendar framework. Their playback may be viewed as a kind of virtual machine, configured to maintain as much of the original online navigation as possible. Reproducing a comparable playback experience from the components thus downloaded requires complex assembly. All in all, it means that no matter how good the API and the client implementation, queries can yield inconsistent or partial content, and the resulting sites may be incomplete or not function as expected.
The second option, which is the native solution for MakeStaticSite, albeit rudimentary in comparison, largely avoids the need to assemble components. The assumption is that repeated use of Wget enables the Wayback Machine’s web site to be crawled thoroughly and that further processing can generate output in a way that is amenable to a standard web browser. It also overcomes a limitation of the playback service, which restricts downloads (via web browsers) to individual pages (using a specially crafted URL).
We consider below the use of Hartator’s Wayback Machine Downloader as an example of option one, i.e. using a third-party tool or service. We just remark that all such solutions depend on service availability (via HTTP).
Wayback Machine Downloader
Support for the API approach was introduced in version 0.27, initially leveraging Wayback Machine Downloader, a tool written in Ruby. Note that it only supports the Internet Archive’s service, i.e. web.archive.org.
We are interested to know: To what extent does querying the CDX server provide a solution? In particular, how complete is its coverage?
Installation
Follow the instructions provided to install the binary, making a note of its name (the default is wayback_machine_downloader).
MakeStaticSite Configuration
In constants.sh, set the client switch to ‘yes’ and specify the filename of the binary:
wayback_cli=yes
wayback_machine_downloader_cmd=wayback_machine_downloader
Usage
The simplest way to create a mirror of a Wayback Machine archive is to browse the Wayback Machine web interface and copy the URL of a particular archive snapshot, which is usually available in the Memento format. For example, the first snapshot for the Museum of the History of Science (as it was then known) can be viewed at: https://web.archive.org/web/19961219005900/http://www.mhs.ox.ac.uk/.
Then run the MakeStaticSite setup script with this URL as parameter.
$ ./setup.sh -u https://web.archive.org/web/19961219005900/http://www.mhs.ox.ac.uk/
The -u flag specifies ‘run unattended’, i.e. non-interactively. Various assumptions will be made as it creates and then uses a configuration file to build a mirror, which in this case comprises precisely one file — the index page. It’s not necessary to supply a complete timestamp with all 14 numerals. A stem suffices, e.g. 1996.
At the other end of the scale, an attempt to download all files without date restrictions can result in a network socket error or else yield tens of thousands of files. This can be checked by using the Wayback Machine Downloader directly:
$ wayback_machine_downloader -l -s http://www.mhs.ox.ac.uk > wayback_cdx_mhs.txt
(The -l flag specifies listing only; the -s flag requests all snapshots/timestamps.)
Date Ranges
A balance may be struck in how much is downloaded by restricting to a date range. This can be set in constants.sh through the values of wayback_date_from and wayback_date_to respectively.
For convenience, a date range may be specified directly for processing by setup.sh and makestaticsite.sh. This uses a custom URL that extends the Wayback URLs by including two timestamps separated by a hyphen: https://web.archive.org/web/timestamp_from-timestamp_to/http://www.example.com/.
For example, to confine results for the Museum of the History of Science between 2009 and July 2012, use:
$ ./setup.sh -u https://web.archive.org/web/2009-201207/http://www.mhs.ox.ac.uk/
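As a sketch of how such a custom URL decomposes into the two date constants, the timestamps can be extracted with standard shell parameter expansion (using the example URL above):

```shell
# Extract the two timestamps from the extended Wayback URL form,
# https://web.archive.org/web/FROM-TO/ORIGINAL_URL
url="https://web.archive.org/web/2009-201207/http://www.mhs.ox.ac.uk/"
range="${url#*/web/}"    # strip everything up to and including /web/
range="${range%%/*}"     # keep the segment before the next slash: 2009-201207
wayback_date_from="${range%-*}"   # part before the hyphen
wayback_date_to="${range#*-}"     # part after the hyphen
echo "$wayback_date_from $wayback_date_to"
```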
If in doubt, cast the net wider.
Findings
For sites with more than a few pages, browsing the output will typically be less satisfactory compared with the original. As remarked above, the Wayback Machine’s playback is designed to piece together snapshots gathered at different dates, seeking to maintain an online user experience as close to the original as it can. The Wayback Machine Downloader doesn’t offer equivalent functionality, and it’s not known whether other command-line clients provide this, particularly for offline working.
In handing over the site capture to a third party, makestaticsite.sh is ignorant of that process. In terms of phases, it effectively means bypassing phases 2 and 3, so that only in phase 4 are the reins handed back to the MakeStaticSite script. At this stage, MakeStaticSite carries out a few post-crawl refinements, such as the conversion of some internal links to maintain navigation and the tidying of pages with HTML Tidy. However, it does not retrieve any further content, so it can’t fill in the gaps.
On the other hand, if a solution is based on Wget, then much of the mirroring functionality that is core to Wget, such as the handling of non-HTML file extensions with corresponding link conversion and downloading page requisites, could fill in gaps. This motivates further work on MakeStaticSite to incorporate these features natively.
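As an indication of the Wget functionality being alluded to, the following sketch assembles a typical mirroring invocation. The options shown are standard Wget flags, though the exact set MakeStaticSite employs may differ; the command is printed as a dry run rather than executed, since the actual fetch requires network access:

```shell
# Assemble an illustrative Wget mirroring command (these are standard
# Wget options; MakeStaticSite's actual invocation may differ).
url="https://web.archive.org/web/19961219005900/http://www.mhs.ox.ac.uk/"
wget_opts="--mirror --page-requisites --convert-links --adjust-extension --no-parent"
cmd="wget $wget_opts $url"
echo "$cmd"   # dry run: print the command rather than fetch
```

Here --page-requisites pulls in images and stylesheets, --convert-links rewrites links for offline browsing, and --adjust-extension normalises file extensions — the gap-filling features mentioned above.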