Digital Preservation


MakeStaticSite is primarily about supporting the ongoing development of live sites in a secure form. However, it can also be used to create archives and thus act as a tool for digital preservation, which we reflect on here. The discussion is in two parts: first, a general treatment, with particular attention to site layout and web page content; and second, specific consideration of retrieving and working with archives from the Wayback Machine.

Layout and Content

A static snapshot of what is usually a dynamic source needs to be faithful chiefly in terms of the information and the intended user experience; the code underlying the generated web pages and their components does not need to be identical to the source. In fact, there are many forms such a snapshot can take. MakeStaticSite’s output depends on the underlying mirroring tool, Wget, but, depending on the configuration options, there can be significant differences in both the directory layout and the content. The snapshot is shaped by the way MakeStaticSite orchestrates various tools and by the additional functionality provided through custom shell scripting. Most of the modifications to web pages are generated by HTML Tidy, but further changes help, inter alia, to support newsfeeds and to ensure that canonical URLs correspond to the specification in the sitemap file.
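
By way of illustration, a canonical URL is declared in a page’s head element; a page generated for the example in the next section might contain an element along these lines (values illustrative):

<link rel="canonical" href="https://www.w3.org/WAI/roles/">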

Custom cut directories

We illustrate what this means for digital preservation through the use case of archiving just part of a website, i.e. capturing a URL with a path. Here we choose Example 2 from the Getting Started section.

The URL is https://www.w3.org/WAI/roles/, which we capture with the default Wget options. For convenience, we ensure that mss_cut_dirs=yes is set, to avoid drilling down through many directories to reach the top-level index page.
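
In outline, the relevant setting and a corresponding invocation look as follows (a sketch; this reuses the unattended -u option described later in connection with the Wayback Machine):

# constants.sh
mss_cut_dirs=yes    # use MakeStaticSite's custom 'cut directories' logic

$ ./setup.sh -u https://www.w3.org/WAI/roles/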

MakeStaticSite then generates the following output:

$ tree -L 3
.
└── www.w3.org
    ├── assets20230615_104224
    │   ├── analytics
    │   ├── assets20230615_104224
    │   └── WAI
    ├── designers
    │   └── index.html
    ├── developers
    │   └── index.html
    ├── index.html
    ├── managers
    │   └── index.html
    ├── new
    │   └── index.html
    ├── policy-makers
    │   └── index.html
    ├── robots.txt
    ├── sitemap.xml
    ├── testers
    │   └── index.html
    ├── trainers
    │   └── index.html
    ├── users
    │   └── index.html
    └── writers
        └── index.html

14 directories, 12 files

All the web pages are saved with the file name index.html. The layout of the section, an index page with various subsections devoted to particular topics, is evident in the folder hierarchy. Supplementary files have been moved from their original locations and are stored under the assets20230615_104224/ folder (the timestamp is appended because an existing assets/ folder was detected). The resulting layout is clean and easy to interpret.

The original folder path is preserved in sitemap.xml, where it is currently retained for provenance, even though it would be invalid if these folders were deployed to the root of a website.
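
For instance, an entry in sitemap.xml may be expected to read along these lines (illustrative), with the original /WAI/roles/ path intact even though the mirrored folders sit at the root:

<url>
  <loc>https://www.w3.org/WAI/roles/designers/</loc>
</url>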

The full tree allows one to view all the components that make up the page:

$ tree
.
└── www.w3.org
    ├── assets20230615_104224
    │   ├── analytics
    │   │   └── piwik
    │   │       └── piwik.php?idsite=328&rec=1
    │   ├── assets20230615_104224
    │   └── WAI
    │       ├── assets
    │       │   ├── ableplayer
    │       │   │   ├── button-icons
    │       │   │   │   └── fonts
    │       │   │   │       ├── able.eot?dqripi
    │       │   │   │       ├── able.svg?dqripi
    │       │   │   │       ├── able.ttf?dqripi
    │       │   │   │       └── able.woff?dqripi
    │       │   │   └── images
    │       │   │       └── wingrip.png
    │       │   ├── css
    │       │   │   └── style.css?1686182872756421274.css
    │       │   ├── fonts
    │       │   │   ├── notonaskh
    │       │   │   │   ├── bold-minimal.woff
    │       │   │   │   ├── bold-minimal.woff2
    │       │   │   │   ├── bold.woff
    │       │   │   │   ├── bold.woff2
    │       │   │   │   ├── regular-minimal.woff
    │       │   │   │   ├── regular-minimal.woff2
    │       │   │   │   ├── regular.woff
    │       │   │   │   └── regular.woff2
    │       │   │   ├── notosans
    │       │   │   │   ├── notosans-bolditalic.woff
    │       │   │   │   ├── notosans-bolditalic.woff2
    │       │   │   │   ├── notosans-bold.woff
    │       │   │   │   ├── notosans-bold.woff2
    │       │   │   │   ├── notosans-italic.woff
    │       │   │   │   ├── notosans-italic.woff2
    │       │   │   │   ├── notosans-regular.woff
    │       │   │   │   └── notosans-regular.woff2
    │       │   │   └── notosansmono
    │       │   │       ├── notosansmono-bold.woff
    │       │   │       ├── notosansmono-bold.woff2
    │       │   │       ├── notosansmono-regular.woff
    │       │   │       └── notosansmono-regular.woff2
    │       │   ├── images
    │       │   │   ├── checkbox.svg
    │       │   │   └── social-sharing-default.jpg
    │       │   └── scripts
    │       │       ├── details4everybody.js?1686182872756421274
    │       │       └── svg4everybody.js?1686182872756421274
    │       └── roles
    ├── designers
    │   └── index.html
    ├── developers
    │   └── index.html
    ├── index.html
    ├── managers
    │   └── index.html
    ├── new
    │   └── index.html
    ├── policy-makers
    │   └── index.html
    ├── robots.txt
    ├── sitemap.xml
    ├── testers
    │   └── index.html
    ├── trainers
    │   └── index.html
    ├── users
    │   └── index.html
    └── writers
        └── index.html

28 directories, 43 files

The directory structure of the assets is retained and the original layout can be deduced. For example, supporting files that were originally located under https://www.w3.org/WAI/ are stored in assets20230615_104224/WAI, and the folder structure underneath remains intact. Thus the integrity of the original information architecture is retained; there is no loss of information.

Wget --cut-dirs

Alternatively, Wget provides the --cut-dirs option, which can be used in MakeStaticSite (provided mss_cut_dirs is set to no or off). A caveat is that it does not sit well with some of MakeStaticSite’s custom functionality, which therefore gets deactivated. Furthermore, as we argue below, this option is better avoided.

By way of comparison, we have created a snapshot of the same WAI URL with just two changes: in constants.sh, we set mss_cut_dirs=off and in wget_extra_options, we inserted --cut-dirs=2. In terms of user experience, the output is almost identical: visually, it’s the same as above.
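
In outline, the two settings are (a sketch; the exact syntax may vary between versions):

# constants.sh
mss_cut_dirs=off

# site configuration
wget_extra_options="--cut-dirs=2"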

However, running the tree command, as above, shows noticeable differences in output:

$ tree -L 3
.
└── www.w3.org
    ├── ableplayer
    │   ├── button-icons
    │   └── images
    ├── css
    │   └── style.css?1686182872756421274.css
    ├── designers
    │   └── index.html
    ├── developers
    │   └── index.html
    ├── fonts
    │   ├── notonaskh
    │   ├── notosans
    │   └── notosansmono
    ├── images
    │   ├── checkbox.svg
    │   └── social-sharing-default.jpg
    ├── index.html
    ├── managers
    │   └── index.html
    ├── new
    │   └── index.html
    ├── piwik.php?idsite=328&rec=1
    ├── policy-makers
    │   └── index.html
    ├── robots.txt
    ├── scripts
    │   ├── details4everybody.js?1686182872756421274
    │   └── svg4everybody.js?1686182872756421274
    ├── sitemap.xml
    ├── testers
    │   └── index.html
    ├── trainers
    │   └── index.html
    ├── users
    │   └── index.html
    └── writers
        └── index.html

20 directories, 18 files

Compared with MakeStaticSite’s custom ‘cut directories’ code, there are many more folders, including five new folders at the root (ableplayer, css, fonts, images, and scripts), plus one file (piwik.php?idsite=328&rec=1: what’s that doing there?!). The layout is generally more complex, making the web page hierarchy harder to scan. The confusion is largely due to the way --cut-dirs truncates folder paths, so we have lost the parentage of those five folders. For example, with reference to the full tree above, we see that ableplayer/ is a supporting component that originally came from WAI/assets/ableplayer/.

From a digital preservation perspective, this loss of information is significant. Folder names for web pages are typically entered manually to reflect a particular context, i.e. they have semantic value. A simple process for gleaning metadata about a website is to crawl the site and extract folder names from paths. Whilst this might not be needed now, it may prove valuable for long-term archival.
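
As a minimal sketch, such folder names can be gleaned from a mirror using standard shell tools:

# List the distinct folder names in the mirror, a rough proxy for the
# site's information architecture
$ find www.w3.org -type d | sed 's|.*/||' | sort -u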

The Wayback Machine

MakeStaticSite has some support, still at an early stage, for retrieving and restoring content from the Wayback Machine.

The Wayback Machine, with which most users are familiar through its web interface at web.archive.org, is an exceedingly popular service for accessing specific web content, such as an article that is no longer available from its original website. It also supports browsing sites as they were, a source of considerable fascination.

However, there may also be a broader need to recover larger amounts of information, even entire sites. In such cases, it’s necessary to be able to download a copy of the site that can be browsed offline. Whereas the Wayback Machine’s web interface provides a straightforward means (via the browser) to download a single page or file (when using a specially crafted URL), no such option is available for a site as a whole. Downloading an entire site requires additional effort, through custom software development and/or the use of a third-party tool or service. By design, the Web pages on web.archive.org are based on snapshots, gathered at various times, delivered within a calendar framework, configured to maintain as much online navigation as possible.
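
For example, a single page can be fetched in its original, unmodified form by inserting the id_ modifier after the timestamp in the URL, which instructs the Wayback Machine to serve the capture without its navigation framework:

$ curl -o index.html "https://web.archive.org/web/19961219005900id_/http://www.mhs.ox.ac.uk/"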

Mirroring such content, at least in our context, should retrieve the original files behind those snapshots and process them so that they work offline as well as online. Using a general crawler such as Wget will download (or scrape) elements of that framework, along with some snapshots and not others. Without further processing, the result will generally not be navigable offline and will not be fit for purpose. However, it remains an option pending further investigation of available solutions. This list, compiled by the ArchiveTeam, has a particularly interesting page title, restoration, implying a motivation such as recovery from accidental data loss or the desire to revisit or renew a project. As this enters the realms of website (re-)creation, particularly from a static source, it naturally falls within the remit of MakeStaticSite.

According to Wayback’s documentation, proper retrieval requires using the Wayback APIs in conjunction with the Content Index (CDX) server, whose usage is helpfully summarised on Wikipedia. Accordingly, support was introduced in version 0.27, initially leveraging Wayback Machine Downloader, a tool written in Ruby. It remains to be determined to what extent querying the CDX server provides a solution; in particular, how complete is its coverage?
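
To give a flavour, the CDX server can be queried directly over HTTP; for example, to list the first few captures recorded for the museum’s host (discussed below):

$ curl "https://web.archive.org/cdx/search/cdx?url=www.mhs.ox.ac.uk&limit=5"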

Usage

The simplest way to create a mirror of a Wayback Machine archive is to browse the Wayback Machine web interface and copy the URL of a particular archive snapshot. For example, the first snapshot for the Museum of the History of Science (as it was then known) can be viewed at: https://web.archive.org/web/19961219005900/http://www.mhs.ox.ac.uk/.

Then run the MakeStaticSite setup script with this URL as a parameter:

$ ./setup.sh -u https://web.archive.org/web/19961219005900/http://www.mhs.ox.ac.uk/

The -u flag specifies ‘run unattended’, i.e. non-interactively. Various assumptions are made as MakeStaticSite creates and then uses a configuration file to build a mirror, which in this case comprises precisely one file, the index page. It is not necessary to supply a complete timestamp with all 14 numerals; a stem suffices, e.g. 1996, in which case it will look for snapshots whose timestamps begin with 1996.
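
For example, substituting the stem for the full timestamp in the command above:

$ ./setup.sh -u https://web.archive.org/web/1996/http://www.mhs.ox.ac.uk/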

At the other end of the scale, an attempt to download all files without date restrictions can result in a network socket error or else yield tens of thousands of files. This can be checked by using the Wayback Machine Downloader directly:

$ wayback_machine_downloader -l -s http://www.mhs.ox.ac.uk > wayback_cdx_mhs.txt

(The -l flag specifies listing only; the -s flag requests all snapshots/timestamps.)

Date Ranges

A balance may be struck in how much is downloaded by specifying a date range. This can be set in constants.sh through the values of wayback_date_from and wayback_date_to respectively.
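
For example (a sketch, using the same timestamp-stem format as Wayback URLs):

# constants.sh
wayback_date_from=2009
wayback_date_to=201207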

For ease of use, a date range may also be specified directly in MakeStaticSite (respected by both setup.sh and makestaticsite.sh). This uses a custom URL that extends Wayback URLs by including two timestamps separated by a hyphen: https://web.archive.org/web/timestamp_from-timestamp_to/http://www.example.com/.

For example, to confine results for the Museum of the History of Science between 2009 and July 2012, use:

$ ./setup.sh -u https://web.archive.org/web/2009-201207/http://www.mhs.ox.ac.uk/

If in doubt, cast the net wider.

Issues

For sites of more than a few pages, browsing the output will typically be less satisfactory than browsing the original. This may be especially noticeable in navigation, owing to the design of the Wayback Machine: its crawler usually captures only particular snapshots at a time, which its server subsequently pieces together to maintain online navigation. At present, it is not known whether the custom download tools offer functionality analogous to Wget’s for fetching page requisites, adjusting links, and so on.

Whilst MakeStaticSite offers a few post-crawl refinements, such as the conversion of some internal links to maintain navigation and the tidying of pages with HTML Tidy, it lacks much of the mirroring functionality that is core to Wget, such as the handling of non-HTML file extensions with corresponding link conversion and the downloading of page requisites. This motivates extending MakeStaticSite to incorporate these features.
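
For reference, the Wget capabilities in question correspond to options such as the following (as used on an ordinary, non-Wayback site):

$ wget --mirror --page-requisites --convert-links --adjust-extension https://www.example.com/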

This page was published on 16 June 2023 and last updated on 18 July 2023.