How MakeStaticSite can help the beleaguered Internet Archive

Note: This post’s journalistic ‘headline news’ style with its use of dramatic language is to draw attention to the importance of the services provided to the public by the Internet Archive, which (in the author’s view) merit support from all of us.

<><><>

The Internet Archive’s Wayback Machine contains a substantial history of the Web, and hence of human knowledge, dating back to 1996. It may be incomplete, but its billions of web pages provide a uniquely detailed and extensive record that is essential to journalists, historians, lawyers and anyone else who is interested in the world’s history, as recorded on the Web. Unfortunately, it has been under mounting pressure.

As if the ongoing financial burden and legal challenges from publishers were not enough to digest, the service has also been having to deal with distributed denial of service (DDoS) attacks, in which malevolent actors attempt to overwhelm the service with a huge number of requests in a short time span. The main effect is simply to disable access, but such attacks can be coordinated with other kinds of attack that target weaknesses in the underlying systems.

A few days ago, reports emerged, initially on Bleeping Computer and then echoed widely (see, e.g., The Register and Forbes), that the Internet Archive’s databases had been breached (or ‘exfiltrated’) with data stolen, though at least the passwords had been hashed and salted. Furthermore, the web interface had been defaced. It’s not known whether this is a mere coincidence or connected with the DDoS actions.

Brewster Kahle and his team confirmed the breach of security.

The data is safe.

Services are offline as we examine and strengthen them. Sorry, but needed. @internetarchive staff is working hard.

Estimated Timeline: days, not weeks.

Thank you for the offers of pizza (we are set).
— Brewster Kahle (@brewster_kahle) October 11, 2024

The impact of such a security compromise can be far-reaching – when one service succumbs, it affects others either directly or indirectly, often because resources have to be diverted. It certainly affected a plethora of services at the Internet Archive, including Archive-It, a hosted service for public institutions to archive sites of their choosing:

Following the security breach, Internet Archive hosted this maintenance message with links to accounts on Twitter/X, Bluesky and Mastodon for latest info.

Commendably, the Internet Archive has been pulling out all the stops to restore services, managing to re-enable read access to the Wayback Machine within days (and we hope this holds firm). However, despite reassurances from Kahle that the data is safe, there are surely more question marks over the service’s long-term viability.

The Internet Archive is not the only organisation whose Wayback Machine has been knocked out of action. In October 2023, the British Library suffered a major cyber attack, resulting in many services becoming unavailable, including EThOS (the British Library database of UK theses) and its own Wayback Machine. A year on, all that is readily available to the public is metadata, and there is no indication of when theses or website archives will be restored. Most of the related discussion appears to be focused on approaches to capturing new sites going forward, for example, capturing social media posts using Browsertrix.

But what about the content that has been archived already? Whilst the archives continue to be stored and augmented with ongoing crawls, during the past year or so I’ve noticed (as a previous donor) more appeals from the Internet Archive landing in my inbox, hinting at an increasingly uncertain future. Whilst there may be backups, hopefully distributed in various secure locations, practically speaking they are of little value if there is no convenient public access, which basically means access through a web browser.

How about using mirrors? Or is the Internet Archive’s Wayback Machine too closely coupled with other services? Perhaps there are legal constraints? In fact, attempts have been made by Archive Team, a group of volunteers who carry out a lot of Internet archival, though they are unrelated to the Internet Archive. With the goal of making full backups, they developed a distributed method of cloning and the IA.BAK project was born. However, the initiative was subsequently closed, with the footnote, “The Internet Archive continues to explore methods and code to decentralize the collection, to have a mirror running in various ways … ”

So, the situation appears to be already well understood, and the emails I have been receiving strongly suggest that the bottom line is limited financial resources. Nevertheless, it seems a more robust solution requires both organisational and architectural restructuring. Might the ‘big tech’ companies donate more of their expertise? Perhaps Cloudflare can help and return the support that the Internet Archive has been providing as a fallback for live sites when they become unavailable.

An illustration: H.M. Waterguard

H.M. Waterguard website home page: a screenshot featuring its flag in the title and a composite image of various historical photographs.

As it stands, if web access is the assumed norm, then in many cases the Internet Archive’s Wayback Machine is in effect a single point of failure. To illustrate this, my father was formerly a member of H.M. Waterguard, H.M. Customs & Excise, a service that is now obsolete, with its functions carried out by the UK Border Agency. One of his colleagues, Trevor Tomasin, created a website with photos, documentation, and so on, particularly of interest to former staff and specialist historians. The ‘External Links’ section at the bottom of the relevant Wikipedia page includes the URL of his site, http://www.hm-waterguard.org.uk/, but it has been years since it was operative (the domain was sold).

My father knows that Mr Tomasin “donated” his website to the British Library, which presumably included the entire contents. Indeed, on that Wikipedia page there is a link with a specific archive ID to the British Library’s Web Archive service. However, that page has been unavailable ever since the cyber attack about a year ago and currently it redirects to the service’s home page.

No other Wayback Machine service listed on the Memento time travel service, https://timetravel.mementoweb.org/, holds a substantial archive of the site. That leaves the Internet Archive, to my knowledge the only other web-accessible archive: https://web.archive.org/web/20240000000000*/http://www.hm-waterguard.org.uk/. However, whenever it becomes unavailable, Web access to the information is practically non-existent except …

MakeStaticSite and the Wayback Machine

Except that before it went down, I ran MakeStaticSite to capture a sample of pages. It’s not complete, but the basic navigation is in place and it contains enough content to at least be of interest to my father (and note that most of the PDFs were already missing from the Internet Archive).

The Internet Archive has generally assumed that access to the Wayback Machine leverages a CDX server that distributes snapshots as bundles. For bots that are only interested in the underlying data, that is fine, but it is inconvenient for those who wish to relive the original navigation experience. For that, crawling the Wayback Machine as a browser would is much more natural and avoids the complicated work of reconstruction. The main challenge is to handle the hopping between snapshots and – for flexible usage – to tweak the navigation so that it works offline. Whilst it is still a work in progress, MakeStaticSite can basically do this already, and the outputs (standard Web pages, packaged as a zip file) are ready for sharing and/or hosting securely on a public-facing web server. There’s no need for a heavyweight playback service.
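To give a feel for what tweaking the navigation involves, here is a minimal Python sketch. It is an illustration only and not MakeStaticSite’s own code. It assumes that snapshot links take the familiar form /web/<timestamp>/<original URL>, optionally with a two-letter modifier such as im_ after the timestamp. The function names, the flat output layout and the page names (History.htm, docs/ranks.htm) are all hypothetical.

```python
import posixpath
import re
from urllib.parse import urlsplit

# Assumed shape of a Wayback Machine snapshot link:
#   [https://web.archive.org]/web/<timestamp>[<modifier>_]/<original URL>
WAYBACK_RE = re.compile(
    r"^(?:https?://web\.archive\.org)?/web/"
    r"(?P<ts>\d{1,14})(?P<mod>[a-z]{2}_)?/(?P<orig>.+)$"
)


def split_snapshot_url(url):
    """Return (timestamp, original_url), or None if url is not a snapshot link."""
    m = WAYBACK_RE.match(url)
    if m is None:
        return None
    return m.group("ts"), m.group("orig")


def local_path(original_url):
    """Map an original URL to a file path in the static copy (hypothetical layout)."""
    path = urlsplit(original_url).path or "/"
    if path.endswith("/"):
        path += "index.html"  # directory-style URLs become index pages
    return path.lstrip("/")


def offline_href(href, current_page_url, site_host):
    """Rewrite a snapshot link found on current_page_url as a relative offline link.

    current_page_url is assumed to be a snapshot URL itself. Links that are not
    snapshot links, or that point to another host, are returned unchanged.
    """
    snap = split_snapshot_url(href)
    if snap is None or urlsplit(snap[1]).hostname != site_host:
        return href
    target = local_path(snap[1])
    here = local_path(split_snapshot_url(current_page_url)[1])
    # Anchor both paths at "/" so the result does not depend on the working directory.
    return posixpath.relpath("/" + target, start="/" + posixpath.dirname(here))


if __name__ == "__main__":
    page = ("https://web.archive.org/web/20160305012345/"
            "http://www.hm-waterguard.org.uk/History.htm")
    link = "/web/20150101000000/http://www.hm-waterguard.org.uk/docs/ranks.htm"
    print(offline_href(link, page, "www.hm-waterguard.org.uk"))
```

Note that the timestamp is simply discarded when the link is rewritten, so pages captured in different snapshots end up side by side in one folder. That is one simple way to absorb the hopping between snapshots; a real tool also has to decide which snapshot to keep when several exist for the same page.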

Further details are available on how MakeStaticSite works with the Wayback Machine.