Digital Preservation


MakeStaticSite is primarily about supporting the ongoing development of live sites in a secure form. However, it can also be used to create archives and thus act as a tool for digital preservation, which we reflect on here, focusing on site layout and web page content.

MakeStaticSite also offers some support for generating offline versions of websites captured by the Internet Archive, which we discuss in a separate page on the Wayback Machine.

Layout and Content

A static snapshot of what is (usually) a dynamic source needs to be faithful to the information and the intended user experience, but the code underlying the generated web pages and their components does not need to be identical to the source. In fact, such a snapshot can take many forms. MakeStaticSite’s output depends on the underlying mirroring tool, Wget, but varying the configuration options can produce significant differences in both the directory layout and the content.

The snapshot is shaped by the way MakeStaticSite orchestrates various tools and by the additional functionality provided through custom shell scripting. Most modifications to web pages are made by HTML Tidy, but additional changes help, inter alia, to enable support for newsfeeds and to ensure that canonical URLs correspond to those specified in the sitemap file.

Custom cut directories

We illustrate what this means for digital preservation with the use case of archiving just part of a website, that is, capturing a URL that includes a path. Here we choose Example 2 from the Getting Started section.

The URL is https://www.w3.org/WAI/roles/, which we capture with the default Wget options. For convenience, we set mss_cut_dirs=yes to avoid drilling down through many directories to reach the top-level index page.

MakeStaticSite then generates the following output:

$ tree -L 3
.
└── www.w3.org
    ├── assets20230615_104224
    │   ├── analytics
    │   ├── assets20230615_104224
    │   └── WAI
    ├── designers
    │   └── index.html
    ├── developers
    │   └── index.html
    ├── index.html
    ├── managers
    │   └── index.html
    ├── new
    │   └── index.html
    ├── policy-makers
    │   └── index.html
    ├── robots.txt
    ├── sitemap.xml
    ├── testers
    │   └── index.html
    ├── trainers
    │   └── index.html
    ├── users
    │   └── index.html
    └── writers
        └── index.html

14 directories, 12 files

All the web pages are saved with the file name index.html. The layout of the section, an index page with various subsections devoted to particular topics, is evident in the folder hierarchy. Supplementary files have been moved from their original location and are stored under the assets20230615_104224/ folder (the timestamp is added because an existing assets/ folder was detected). The resulting layout is clean and easy to interpret.

The original folder path is preserved in sitemap.xml. It is currently retained to provide that provenance, even though it would be invalid if these folders were deployed to the root of a website.

The full tree allows one to view all the components that make up the page:

$ tree
.
└── www.w3.org
    ├── assets20230615_104224
    │   ├── analytics
    │   │   └── piwik
    │   │       └── piwik.php?idsite=328&rec=1
    │   ├── assets20230615_104224
    │   └── WAI
    │       ├── assets
    │       │   ├── ableplayer
    │       │   │   ├── button-icons
    │       │   │   │   └── fonts
    │       │   │   │       ├── able.eot?dqripi
    │       │   │   │       ├── able.svg?dqripi
    │       │   │   │       ├── able.ttf?dqripi
    │       │   │   │       └── able.woff?dqripi
    │       │   │   └── images
    │       │   │       └── wingrip.png
    │       │   ├── css
    │       │   │   └── style.css?1686182872756421274.css
    │       │   ├── fonts
    │       │   │   ├── notonaskh
    │       │   │   │   ├── bold-minimal.woff
    │       │   │   │   ├── bold-minimal.woff2
    │       │   │   │   ├── bold.woff
    │       │   │   │   ├── bold.woff2
    │       │   │   │   ├── regular-minimal.woff
    │       │   │   │   ├── regular-minimal.woff2
    │       │   │   │   ├── regular.woff
    │       │   │   │   └── regular.woff2
    │       │   │   ├── notosans
    │       │   │   │   ├── notosans-bolditalic.woff
    │       │   │   │   ├── notosans-bolditalic.woff2
    │       │   │   │   ├── notosans-bold.woff
    │       │   │   │   ├── notosans-bold.woff2
    │       │   │   │   ├── notosans-italic.woff
    │       │   │   │   ├── notosans-italic.woff2
    │       │   │   │   ├── notosans-regular.woff
    │       │   │   │   └── notosans-regular.woff2
    │       │   │   └── notosansmono
    │       │   │       ├── notosansmono-bold.woff
    │       │   │       ├── notosansmono-bold.woff2
    │       │   │       ├── notosansmono-regular.woff
    │       │   │       └── notosansmono-regular.woff2
    │       │   ├── images
    │       │   │   ├── checkbox.svg
    │       │   │   └── social-sharing-default.jpg
    │       │   └── scripts
    │       │       ├── details4everybody.js?1686182872756421274
    │       │       └── svg4everybody.js?1686182872756421274
    │       └── roles
    ├── designers
    │   └── index.html
    ├── developers
    │   └── index.html
    ├── index.html
    ├── managers
    │   └── index.html
    ├── new
    │   └── index.html
    ├── policy-makers
    │   └── index.html
    ├── robots.txt
    ├── sitemap.xml
    ├── testers
    │   └── index.html
    ├── trainers
    │   └── index.html
    ├── users
    │   └── index.html
    └── writers
        └── index.html

28 directories, 43 files

The directory structure of the assets is retained and the original layout can be deduced. For example, supporting files that were originally located under https://www.w3.org/WAI/ are stored in assets20230615_104224/WAI and the folder structure underneath remains intact. Thus, the integrity of the original information architecture is retained; there is no loss of information.
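The deduction described above can be sketched in a few lines of shell. This is only an illustration: original_url is a hypothetical helper, not part of MakeStaticSite, and it assumes that the source site was served over https and that everything below the assets folder mirrors the original URL path unchanged.

```shell
# original_url HOST ASSETS_DIR LOCAL_PATH
# Hypothetical helper (not part of MakeStaticSite): rebuild the source URL of
# a file stored under the assets folder. Assumes https and that the path below
# ASSETS_DIR is identical to the original URL path.
original_url() (
  host=$1
  assets_dir=$2
  local_path=$3
  case $local_path in
    "$assets_dir"/*)
      rel=${local_path#"$assets_dir"/}   # strip the assets folder prefix
      printf 'https://%s/%s\n' "$host" "$rel"
      ;;
    *)
      return 1                           # not an asset path; nothing to deduce
      ;;
  esac
)

# Example using a file from the full tree listing
original_url www.w3.org assets20230615_104224 \
  assets20230615_104224/WAI/assets/images/checkbox.svg
```

For the example path, the helper prints https://www.w3.org/WAI/assets/images/checkbox.svg. Paths outside the assets folder, such as designers/index.html, are deliberately rejected, because their original location depends on the directories that were cut.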

Wget --cut-dirs

Alternatively, Wget comes with its own option, --cut-dirs, which can be used in MakeStaticSite (provided mss_cut_dirs is set to no or off). A caveat is that it doesn’t sit well with some of MakeStaticSite’s custom functionality, so the latter gets deactivated. Furthermore, we argue below that use of --cut-dirs should be dropped.

By way of comparison, we have created a snapshot of the same WAI URL with just two changes: in constants.sh, we set mss_cut_dirs=off and in wget_extra_options, we inserted --cut-dirs=2. In terms of user experience, the output is almost identical: visually, it’s the same as above.

However, running the tree command, as above, shows noticeable differences in output:

$ tree -L 3
.
└── www.w3.org
    ├── ableplayer
    │   ├── button-icons
    │   └── images
    ├── css
    │   └── style.css?1686182872756421274.css
    ├── designers
    │   └── index.html
    ├── developers
    │   └── index.html
    ├── fonts
    │   ├── notonaskh
    │   ├── notosans
    │   └── notosansmono
    ├── images
    │   ├── checkbox.svg
    │   └── social-sharing-default.jpg
    ├── index.html
    ├── managers
    │   └── index.html
    ├── new
    │   └── index.html
    ├── piwik.php?idsite=328&rec=1
    ├── policy-makers
    │   └── index.html
    ├── robots.txt
    ├── scripts
    │   ├── details4everybody.js?1686182872756421274
    │   └── svg4everybody.js?1686182872756421274
    ├── sitemap.xml
    ├── testers
    │   └── index.html
    ├── trainers
    │   └── index.html
    ├── users
    │   └── index.html
    └── writers
        └── index.html

20 directories, 18 files

Compared with MakeStaticSite’s custom ‘cut directories’ code, there are more folders, including five new folders at the root (ableplayer, css, fonts, images, and scripts) plus one stray file, piwik.php?idsite=328&rec=1, whose presence is hard to explain from this listing alone. The layout is generally more complex, and scanning the web page hierarchy is more difficult. The confusion arises largely because --cut-dirs truncates the leading folders of every path, so we have lost the parentage of those five folders. For example, with reference to the full tree above, we see that ableplayer/ is a supporting component that originally came from WAI/assets/ableplayer/.
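To make the truncation concrete, here is a minimal sketch of the mapping that --cut-dirs=N appears to apply to each URL path, judging from the two listings above. cut_dirs_path is a hypothetical helper written for illustration; it is not Wget's code, and it assumes that up to N leading directory components are dropped while the file name (including any query string) is kept.

```shell
# cut_dirs_path URL_PATH N
# Hypothetical illustration (not Wget's implementation): drop up to N leading
# directory components from a URL path, keeping the file name as it is.
cut_dirs_path() (
  path=${1#/}            # URL path without the leading slash
  n=$2                   # number of leading directories to cut
  file=${path##*/}       # final component, query string included
  case $path in
    */*) dirs=${path%/*} ;;   # directory part of the path
    *)   dirs= ;;
  esac
  while [ "$n" -gt 0 ] && [ -n "$dirs" ]; do
    case $dirs in
      */*) dirs=${dirs#*/} ;; # remove one leading directory
      *)   dirs= ;;           # last directory removed
    esac
    n=$((n - 1))
  done
  if [ -n "$dirs" ]; then
    printf '%s/%s\n' "$dirs" "$file"
  else
    printf '%s\n' "$file"
  fi
)

# Paths taken from the full tree listing, cut by two directories
cut_dirs_path /WAI/roles/designers/index.html 2
cut_dirs_path /WAI/assets/ableplayer/images/wingrip.png 2
cut_dirs_path '/analytics/piwik/piwik.php?idsite=328&rec=1' 2
```

Under these assumptions the three calls print designers/index.html, ableplayer/images/wingrip.png and piwik.php?idsite=328&rec=1. Pages under WAI/roles/ land where expected, but assets under WAI/assets/ lose a different pair of folders, and the analytics file, which sat only two folders deep, ends up at the root.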

From a digital preservation perspective, this loss of information is significant. Folder names for web pages are typically entered manually to reflect a particular context; that is, they have semantic value. A simple way to glean metadata about a website is to crawl it and extract folder names from paths. Whilst this metadata might not be needed now, it may be valuable for archival purposes in the long term.
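As a rough sketch of that extraction step, the following shell function reads file paths, one per line, and lists the distinct folder names they contain. folder_names is a hypothetical name, and the sample paths are taken from the listings above; in practice the input might come from a command such as find www.w3.org -type f run over a mirror.

```shell
# folder_names: read file paths on standard input (one per line) and print
# each distinct folder name once, sorted. Hypothetical helper for illustration.
# Input is expected to be file paths: the final component of each line is
# treated as a file name and discarded.
folder_names() {
  sed -e 's#[^/]*$##' |          # drop the file name, keep the folder part
    tr '/' '\n' |                # one path component per line
    grep -v -e '^$' -e '^\.$' |  # discard empty components and "."
    LC_ALL=C sort -u             # distinct names in a stable order
}

# Example with paths from the MakeStaticSite custom layout
printf '%s\n' \
  './www.w3.org/designers/index.html' \
  './www.w3.org/assets20230615_104224/WAI/assets/fonts/notosans/notosans-bold.woff' |
  folder_names
```

Run against the custom layout, the output still includes WAI and assets alongside fonts and notosans; run against the --cut-dirs layout, those two parent names would be absent, which is the loss of context discussed above.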

This page was published on 16 June 2023 and last updated on 20 November 2024.