Digital Preservation


MakeStaticSite is primarily about supporting the ongoing development of live sites in a secure form. However, it can also be used to create archives and thus act as a tool for digital preservation, which we reflect on here. The discussion is in two parts: first, a general treatment, with particular attention to site layout and web page content; and second, specific consideration of retrieving and working with archives from the Wayback Machine.

Layout and Content

A static snapshot of what is usually a dynamic source needs to be faithful chiefly in terms of the information and the intended user experience; the code underlying the generated web pages and their components does not need to be identical to the source. In fact, there are many forms such a snapshot can take. MakeStaticSite’s output depends on the underlying mirroring tool, Wget, but, depending on the configuration options, there can be significant differences in both the directory layout and the content. The snapshot is shaped by the way MakeStaticSite orchestrates various tools and by the additional functionality provided through custom shell scripting. Most of the modifications to web pages are generated by HTML Tidy, but further changes help, inter alia, to support newsfeeds and to ensure that canonical URLs correspond to the specification in the sitemap file.
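
By way of illustration, a canonical URL is declared in a page’s head element; a page generated for the example in the next section might contain an element along these lines (values illustrative):

<link rel="canonical" href="https://www.w3.org/WAI/roles/">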

Custom cut directories

We illustrate what this means for digital preservation through the use case of archiving just part of a website, i.e. capturing a URL with a path. Here we choose Example 2 from the Getting Started section.

The URL is https://www.w3.org/WAI/roles/, which we capture with the default Wget options. For convenience, we ensure that mss_cut_dirs=yes is set, to avoid drilling down through many directories to reach the top-level index page.
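
In outline, the relevant setting and a corresponding invocation look as follows (a sketch; this reuses the unattended -u option described later in connection with the Wayback Machine):

# constants.sh
mss_cut_dirs=yes    # use MakeStaticSite's custom 'cut directories' logic

$ ./setup.sh -u https://www.w3.org/WAI/roles/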

MakeStaticSite then generates the following output:

$ tree -L 3
.
└── www.w3.org
    ├── assets20230615_104224
    │   ├── analytics
    │   ├── assets20230615_104224
    │   └── WAI
    ├── designers
    │   └── index.html
    ├── developers
    │   └── index.html
    ├── index.html
    ├── managers
    │   └── index.html
    ├── new
    │   └── index.html
    ├── policy-makers
    │   └── index.html
    ├── robots.txt
    ├── sitemap.xml
    ├── testers
    │   └── index.html
    ├── trainers
    │   └── index.html
    ├── users
    │   └── index.html
    └── writers
        └── index.html

14 directories, 12 files

All the web pages are saved with the file name index.html. The layout of the section, an index page with various subsections devoted to particular topics, is evident in the folder hierarchy. Supplementary files have been moved from their original locations and are stored under the assets20230615_104224/ folder (the timestamp is appended because an existing assets/ folder was detected). The resulting layout is clean and easy to interpret.

The original folder path is preserved in sitemap.xml, where it is currently retained for provenance, even though it would be invalid if these folders were deployed to the root of a website.
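
For instance, an entry in sitemap.xml may be expected to read along these lines (illustrative), with the original /WAI/roles/ path intact even though the mirrored folders sit at the root:

<url>
  <loc>https://www.w3.org/WAI/roles/designers/</loc>
</url>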

The full tree allows one to view all the components that make up the page:

$ tree
.
└── www.w3.org
    ├── assets20230615_104224
    │   ├── analytics
    │   │   └── piwik
    │   │       └── piwik.php?idsite=328&rec=1
    │   ├── assets20230615_104224
    │   └── WAI
    │       ├── assets
    │       │   ├── ableplayer
    │       │   │   ├── button-icons
    │       │   │   │   └── fonts
    │       │   │   │       ├── able.eot?dqripi
    │       │   │   │       ├── able.svg?dqripi
    │       │   │   │       ├── able.ttf?dqripi
    │       │   │   │       └── able.woff?dqripi
    │       │   │   └── images
    │       │   │       └── wingrip.png
    │       │   ├── css
    │       │   │   └── style.css?1686182872756421274.css
    │       │   ├── fonts
    │       │   │   ├── notonaskh
    │       │   │   │   ├── bold-minimal.woff
    │       │   │   │   ├── bold-minimal.woff2
    │       │   │   │   ├── bold.woff
    │       │   │   │   ├── bold.woff2
    │       │   │   │   ├── regular-minimal.woff
    │       │   │   │   ├── regular-minimal.woff2
    │       │   │   │   ├── regular.woff
    │       │   │   │   └── regular.woff2
    │       │   │   ├── notosans
    │       │   │   │   ├── notosans-bolditalic.woff
    │       │   │   │   ├── notosans-bolditalic.woff2
    │       │   │   │   ├── notosans-bold.woff
    │       │   │   │   ├── notosans-bold.woff2
    │       │   │   │   ├── notosans-italic.woff
    │       │   │   │   ├── notosans-italic.woff2
    │       │   │   │   ├── notosans-regular.woff
    │       │   │   │   └── notosans-regular.woff2
    │       │   │   └── notosansmono
    │       │   │       ├── notosansmono-bold.woff
    │       │   │       ├── notosansmono-bold.woff2
    │       │   │       ├── notosansmono-regular.woff
    │       │   │       └── notosansmono-regular.woff2
    │       │   ├── images
    │       │   │   ├── checkbox.svg
    │       │   │   └── social-sharing-default.jpg
    │       │   └── scripts
    │       │       ├── details4everybody.js?1686182872756421274
    │       │       └── svg4everybody.js?1686182872756421274
    │       └── roles
    ├── designers
    │   └── index.html
    ├── developers
    │   └── index.html
    ├── index.html
    ├── managers
    │   └── index.html
    ├── new
    │   └── index.html
    ├── policy-makers
    │   └── index.html
    ├── robots.txt
    ├── sitemap.xml
    ├── testers
    │   └── index.html
    ├── trainers
    │   └── index.html
    ├── users
    │   └── index.html
    └── writers
        └── index.html

28 directories, 43 files

The directory structure of the assets is retained and the original layout can be deduced. For example, supporting files that were originally located under https://www.w3.org/WAI/ are stored in assets20230615_104224/WAI, and the folder structure underneath remains intact. Thus the integrity of the original information architecture is retained; there is no loss of information.

Wget --cut-dirs

Alternatively, Wget provides the --cut-dirs option, which can be used in MakeStaticSite (provided mss_cut_dirs is set to no or off). A caveat is that it does not sit well with some of MakeStaticSite’s custom functionality, which therefore gets deactivated. Furthermore, as we argue below, this option is better avoided.

By way of comparison, we have created a snapshot of the same WAI URL with just two changes: in constants.sh, we set mss_cut_dirs=off and in wget_extra_options, we inserted --cut-dirs=2. In terms of user experience, the output is almost identical: visually, it’s the same as above.
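
In outline, the two settings are (a sketch; the exact syntax may vary between versions):

# constants.sh
mss_cut_dirs=off

# site configuration
wget_extra_options="--cut-dirs=2"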

However, running the tree command, as above, shows noticeable differences in output:

$ tree -L 3
.
└── www.w3.org
    ├── ableplayer
    │   ├── button-icons
    │   └── images
    ├── css
    │   └── style.css?1686182872756421274.css
    ├── designers
    │   └── index.html
    ├── developers
    │   └── index.html
    ├── fonts
    │   ├── notonaskh
    │   ├── notosans
    │   └── notosansmono
    ├── images
    │   ├── checkbox.svg
    │   └── social-sharing-default.jpg
    ├── index.html
    ├── managers
    │   └── index.html
    ├── new
    │   └── index.html
    ├── piwik.php?idsite=328&rec=1
    ├── policy-makers
    │   └── index.html
    ├── robots.txt
    ├── scripts
    │   ├── details4everybody.js?1686182872756421274
    │   └── svg4everybody.js?1686182872756421274
    ├── sitemap.xml
    ├── testers
    │   └── index.html
    ├── trainers
    │   └── index.html
    ├── users
    │   └── index.html
    └── writers
        └── index.html

20 directories, 18 files

Compared with MakeStaticSite’s custom ‘cut directories’ code, there are many more folders, including five new folders at the root (ableplayer, css, fonts, images, and scripts), plus one file (piwik.php?idsite=328&rec=1: what’s that doing there?!). The layout is generally more complex, making the web page hierarchy harder to scan. The confusion is largely due to the way --cut-dirs truncates folder paths, so we have lost the parentage of those five folders. For example, with reference to the full tree above, we see that ableplayer/ is a supporting component that originally came from WAI/assets/ableplayer/.

From a digital preservation perspective, this loss of information is significant. Folder names for web pages are typically entered manually to reflect a particular context, i.e. they have semantic value. A simple process for gleaning metadata about a website is to crawl the site and extract folder names from paths. Whilst this might not be needed now, it may prove valuable for long-term archival.
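
As a minimal sketch, such folder names can be gleaned from a mirror using standard shell tools:

# List the distinct folder names in the mirror, a rough proxy for the
# site's information architecture
$ find www.w3.org -type d | sed 's|.*/||' | sort -u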

The Wayback Machine

MakeStaticSite has some support, still at an early stage, for retrieving and restoring content from the Wayback Machine.

The Wayback Machine, with which most users are familiar through its web interface at web.archive.org, is an exceedingly popular service for accessing specific web content, such as an article that is no longer available from its original website. It also supports browsing sites as they were, a source of considerable fascination.

However, there may also be a broader need to recover larger amounts of information, even entire sites. In such cases, it’s necessary to be able to download a copy of the site that can be browsed offline. Whereas the Wayback Machine’s web interface provides a straightforward means (via the browser) to download a single page or file (when using a specially crafted URL), no such option is available for a site as a whole. Downloading an entire site requires additional effort, through custom software development and/or the use of a third-party tool or service. By design, the Web pages on web.archive.org are based on snapshots, gathered at various times, delivered within a calendar framework, configured to maintain as much online navigation as possible.
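
For example, a single page can be fetched in its original, unmodified form by inserting the id_ modifier after the timestamp in the URL, which instructs the Wayback Machine to serve the capture without its navigation framework:

$ curl -o index.html "https://web.archive.org/web/19961219005900id_/http://www.mhs.ox.ac.uk/"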

Mirroring such content, at least in our context, should retrieve the original files behind those snapshots and process them so that they work offline as well as online. Using a general crawler such as Wget will download (or scrape) elements of that framework, along with some snapshots and not others. Without further processing, the result will generally not be navigable offline and will not be fit for purpose. However, it remains an option pending further investigation of available solutions. This list, compiled by the ArchiveTeam, has a particularly interesting page title, restoration, implying a motivation such as recovery from accidental data loss or the desire to revisit or renew a project. As this enters the realms of website (re-)creation, particularly from a static source, it naturally falls within the remit of MakeStaticSite.

According to Wayback’s documentation, proper retrieval requires using the Wayback APIs in conjunction with the Content Index (CDX) server, whose usage is helpfully summarised on Wikipedia. Accordingly, support was introduced in version 0.27, initially leveraging Wayback Machine Downloader, a tool written in Ruby. It remains to be determined to what extent querying the CDX server provides a solution; in particular, how complete is its coverage?
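
To give a flavour, the CDX server can be queried directly over HTTP; for example, to list the first few captures recorded for the museum’s host (discussed below):

$ curl "https://web.archive.org/cdx/search/cdx?url=www.mhs.ox.ac.uk&limit=5"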

Usage

The simplest way to create a mirror of a Wayback Machine archive is to browse the Wayback Machine web interface and copy the URL of a particular archive snapshot. For example, the first snapshot for the Museum of the History of Science (as it was then known) can be viewed at: https://web.archive.org/web/19961219005900/http://www.mhs.ox.ac.uk/.

Then run the MakeStaticSite setup script with this URL as a parameter:

$ ./setup.sh -u https://web.archive.org/web/19961219005900/http://www.mhs.ox.ac.uk/

The -u flag specifies ‘run unattended’, i.e. non-interactively. Various assumptions are made as MakeStaticSite creates and then uses a configuration file to build a mirror, which in this case comprises precisely one file, the index page. It is not necessary to supply a complete timestamp with all 14 numerals; a stem suffices, e.g. 1996, in which case it will look for snapshots whose timestamps begin with 1996.
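
For example, substituting the stem for the full timestamp in the command above:

$ ./setup.sh -u https://web.archive.org/web/1996/http://www.mhs.ox.ac.uk/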

At the other end of the scale, an attempt to download all files without date restrictions can result in a network socket error or else yield tens of thousands of files. This can be checked by using the Wayback Machine Downloader directly:

$ wayback_machine_downloader -l -s http://www.mhs.ox.ac.uk > wayback_cdx_mhs.txt

(The -l flag specifies listing only; the -s flag requests all snapshots/timestamps.)

Date Ranges

A balance may be struck in how much is downloaded by specifying a date range. This can be set in constants.sh through the values of wayback_date_from and wayback_date_to respectively.
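
For example (a sketch, using the same timestamp-stem format as Wayback URLs):

# constants.sh
wayback_date_from=2009
wayback_date_to=201207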

For ease of use, a date range may also be specified directly in MakeStaticSite (respected by both setup.sh and makestaticsite.sh). This uses a custom URL that extends Wayback URLs by including two timestamps separated by a hyphen: https://web.archive.org/web/timestamp_from-timestamp_to/http://www.example.com/.

For example, to confine results for the Museum of the History of Science between 2009 and July 2012, use:

$ ./setup.sh -u https://web.archive.org/web/2009-201207/http://www.mhs.ox.ac.uk/

If in doubt, cast the net wider.

Issues

For sites of more than a few pages, browsing the output will typically be less satisfactory than browsing the original. This may be especially noticeable in navigation, owing to the design of the Wayback Machine: its crawler usually captures only particular snapshots at a time, which its server subsequently pieces together to maintain online navigation. At present, it is not known whether the custom download tools offer functionality analogous to Wget’s for fetching page requisites, adjusting links, and so on.

Whilst MakeStaticSite offers a few post-crawl refinements, such as the conversion of some internal links to maintain navigation and the tidying of pages with HTML Tidy, it lacks much of the mirroring functionality that is core to Wget, such as the handling of non-HTML file extensions with corresponding link conversion and the downloading of page requisites. This motivates extending MakeStaticSite to incorporate these features.
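
For reference, the Wget capabilities in question correspond to options such as the following (as used on an ordinary, non-Wayback site):

$ wget --mirror --page-requisites --convert-links --adjust-extension https://www.example.com/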

This page was published on 16 June 2023 and last updated on 18 July 2023.