Online and offline

Whilst headline bandwidth continues to grow, Internet availability remains an uncertainty, whether due to technical outages, energy shortages or policy decisions. It is against this backdrop that MakeStaticSite has been designed to combine the usefulness of a live site, amenable to search engines, and the resilience of archival, whereby a complete copy is available on your own computer.

Thus, MakeStaticSite as a package is cross-platform, requires few components, and is intended primarily to be run on a local machine. And when you execute makestaticsite.sh, a standard build comprises two parts: a site to be deployed and navigated online and a copy of the site, distributed as a zip file, to be browsed offline. Furthermore, for WordPress users, the search experience can be replicated.

The workflow comprises various processes that tweak the output to reflect these design goals. We give attention here to facets that are essential for search engine optimisation (SEO).

Robots and sitemaps

Search engines have certain expectations about how websites are mapped, how each page declares itself, and how standards-compliant the markup is. This has a major impact on how they crawl and index sites, and fixing issues in a piecemeal manner can make indexing a very lengthy process. Whilst Google Search Console (the replacement for Webmaster Tools) can pinpoint issues and provide guidance on remedies, machine assistance for even a few URLs can take days.

There are two key files that search engines expect on the server – if they’re not there, the site might not get indexed.

robots.txt, a file stored in the root, is a (voluntary) standard that informs crawlers (user agents) what they may and may not crawl on the server – specified through the Allow and Disallow directives.
In addition, the robots file usually specifies the URL of a sitemap, which is an XML document.
(The Kinsta blog provides a detailed description, particularly for WordPress users.)
The sitemap (file) provides an index to the main URLs on a site. It may itself list the URLs (of document type text/html) or specify a hierarchical list of sub-sitemaps, each of which is an XML document, with the leaf files providing the URL lists.

These two kinds of files may be generated dynamically by a Web CMS, with certain assumptions that may be too restrictive or irrelevant. They might not exist as physical files on the server and, furthermore, Wget does not parse XML files.
For these reasons, it’s preferable that MakeStaticSite generates these files afresh from the static output, reconstructing the sitemap and its components. With regard to the URLs that are included in sitemaps, it helps SEO to include only canonical URLs, which we discuss next.
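As an illustration of regenerating both files from static output, here is a minimal sketch in Python. It is not MakeStaticSite's own code: the function names, the example domain and the choice to map index.html to a trailing slash are assumptions made for the example. Only the sitemap namespace and the robots.txt directives come from the respective standards.

```python
import posixpath
from xml.etree import ElementTree as ET

# Namespace defined by the sitemaps.org protocol.
SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"


def canonical_path(path):
    """Map 'about/index.html' to 'about/' and leave other paths alone."""
    path = path.lstrip("/")
    head, tail = posixpath.split(path)
    if tail == "index.html":
        return (head + "/") if head else ""
    return path


def build_sitemap(base_url, html_paths):
    """Return a single-file sitemap listing one <url><loc> per page."""
    urlset = ET.Element("urlset", xmlns=SITEMAP_NS)
    for path in sorted(html_paths):
        url = ET.SubElement(urlset, "url")
        loc = ET.SubElement(url, "loc")
        loc.text = base_url.rstrip("/") + "/" + canonical_path(path)
    body = ET.tostring(urlset, encoding="unicode")
    return '<?xml version="1.0" encoding="UTF-8"?>\n' + body


def build_robots_txt(sitemap_url, disallow=()):
    """Return a robots.txt that applies to all crawlers and names the sitemap."""
    lines = ["User-agent: *"]
    if disallow:
        lines += ["Disallow: " + path for path in disallow]
    else:
        lines.append("Allow: /")
    lines += ["", "Sitemap: " + sitemap_url]
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    pages = ["index.html", "about/index.html", "docs/page.html"]
    print(build_robots_txt("https://example.org/sitemap.xml"))
    print(build_sitemap("https://example.org", pages))
```

A larger site would typically split the URL list across sub-sitemaps referenced from an index file; the sketch covers only the single-file case.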

Canonical URLs

A canonical URL is the single preferred URL out of a set of URLs that display exactly the same content. For example, https://makestaticsite.sh/ and https://makestaticsite.sh/index.html are treated as distinct URLs, but only one of them can be the canonical URL.

Canonical URLs are important for SEO, as search engines will favour these over other similar URLs that have the same content, as described by Google. Where there are duplicate pages with distinct URLs, search engines need to determine which to pick. For a given Web document, the standard means to specify a canonical URL is to include a canonical link element. All this is discussed in detail, specifically with respect to WordPress, by Yoast. We therefore need canonical link elements for deployment, whilst noting that they aren’t useful for offline browsing and so aren’t really necessary for the zip file distribution.

So, we must first consider what rule to use for the format of the canonical URL. Whilst, technically, any valid URL might be a candidate, in practice certain formats are preferred over others. In particular, it is considered good practice to end canonical URLs in a trailing slash, /, as that is regarded as appropriate for cross-platform usage. This may be illustrated by consideration of directory indexes, where an ASP application might have an index file index.asp, whereas a later iteration might be in PHP, hence index.php (or vice versa). To future-proof the application, the canonical URL would be stated as / and the server would return the relevant script.

Whatever the definition of the canonical URL, for practical benefit, it has to be referenced by websites, i.e. used in anchors. Here, it should be noted that Wget outputs are designed primarily for archival and hence Web pages generally get saved with a .html extension. However, canonical URLs ending in index.html might be deprecated by search engines.

Accordingly, MakeStaticSite has two settings in constants.sh for the tail of the canonical URL: link_href_tail is the tail of the canonical link element, and a_href_tail is the tail for internal links. Running makestaticsite.sh initially generates sites with links to URLs ending in .html, following usage of Wget, but then, immediately before deployment, modifies the tail of URL anchors according to the a_href_tail option. The default value for both these options is empty, which translates to URLs ending in /.
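The effect of the two tail settings can be sketched in Python as follows. This is an illustration only, not how makestaticsite.sh implements the step: the helper names and the regular expression are assumptions, and the sketch treats any link without a scheme (no "://") as internal, on the basis that Wget has already converted internal links to relative form.

```python
import re

# Matches the path part of an href inside an <a> tag, stopping at the
# closing quote, a fragment (#) or a query string (?).
A_HREF = re.compile(r'(<a\b[^>]*?\shref=")([^"#?]*)(["#?])', re.IGNORECASE)


def apply_tail(url, tail=""):
    """Replace a trailing 'index.html' with the given tail.

    An empty tail (the default) leaves the URL ending in '/'.
    """
    if url == "index.html" or url.endswith("/index.html"):
        result = url[: -len("index.html")] + tail
        return result or "./"  # avoid an empty href for a top-level index
    return url


def rewrite_anchors(html, a_href_tail=""):
    """Apply a_href_tail to internal (scheme-less) anchors in a page."""

    def replace(match):
        url = match.group(2)
        if "://" not in url:  # assumption: internal links are relative
            url = apply_tail(url, a_href_tail)
        return match.group(1) + url + match.group(3)

    return A_HREF.sub(replace, html)


def canonical_link(page_url, link_href_tail=""):
    """Build a canonical link element for a page's absolute URL."""
    href = apply_tail(page_url, link_href_tail)
    return '<link rel="canonical" href="' + href + '">'


if __name__ == "__main__":
    page = '<a href="about/index.html#team">About</a>'
    print(rewrite_anchors(page))
    print(canonical_link("https://example.org/about/index.html"))
```

Passing "index.html" as the tail leaves the links as Wget wrote them, which corresponds to the form needed for offline browsing.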

Currently, sitemaps are also generated with URLs ending in /.

Content Syndication
(RSS Newsfeeds)

WordPress and other blog platforms provide, as a matter of course, content syndication in the form of RSS newsfeeds. Whilst their utility offline is reduced, they are retained in MakeStaticSite together with references to them, with some slight modification.

The feeds themselves are stored as XML files. A CMS such as WordPress will name these files index.xml, but links referring to them will end in /. A couple of issues arise with this on serverless web hosting: the directory indexing might serve up the wrong file or, even if it is the right one, it may be served as content type text/html, in which case the browser won’t handle it as intended. Then, when browsing offline, such links will lead to a ‘file not found’ error.

The situation is exacerbated by Wget, which typically saves these files with a .html extension and updates anchors accordingly. So, to ensure that feeds continue to work, MakeStaticSite renames the feed files (by default) with XML extensions and then updates the links accordingly. Furthermore, the links are relative, so as to continue working offline. This is not normally a problem online, as the base URL can be inferred from the site’s domain. (See the options for feed_html and feed_xml respectively.)
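The renaming and relinking described above might look roughly like this in Python. It is a sketch under stated assumptions rather than MakeStaticSite's actual routine: we assume that feed_html holds the file name Wget produced (index.html) and feed_xml the name to restore (index.xml), that feeds live in directories called feed, and that a feed can be recognised by an XML declaration followed by an rss or feed element. The function names are hypothetical.

```python
from pathlib import Path


def looks_like_feed(text):
    """Heuristic: an XML declaration followed shortly by <rss or <feed."""
    head = text.lstrip()[:500].lower()
    return head.startswith("<?xml") and ("<rss" in head or "<feed" in head)


def rename_feeds(site_root, feed_html="index.html", feed_xml="index.xml"):
    """Rename feed files that were saved with an HTML name; return new paths."""
    renamed = []
    for path in sorted(Path(site_root).rglob(feed_html)):
        text = path.read_text(encoding="utf-8", errors="replace")
        if looks_like_feed(text):
            target = path.with_name(feed_xml)
            path.rename(target)
            renamed.append(target)
    return renamed


def update_feed_link(href, feed_dir="feed",
                     feed_html="index.html", feed_xml="index.xml"):
    """Point a relative link at the renamed feed file.

    Handles both 'feed/' and 'feed/index.html'; other links are unchanged.
    """
    for suffix in (feed_dir + "/" + feed_html, feed_dir + "/"):
        if href.endswith(suffix):
            prefix = href[: len(href) - len(suffix)]
            # Only match a whole directory name, not e.g. 'newsfeed/'.
            if prefix == "" or prefix.endswith("/"):
                return prefix + feed_dir + "/" + feed_xml
    return href


if __name__ == "__main__":
    print(update_feed_link("../feed/"))
    print(update_feed_link("blog/feed/index.html"))
```

Because the rewritten links name the file explicitly and stay relative, they do not rely on a server's directory index and so resolve in the same way from the zip distribution.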

Site Search

An original impetus for this project was the distribution of a research website, with an offline static search facility. This harks back to the days when websites were produced or adapted for distribution on CD ROM or memory sticks (or even floppy disks). This means no web server, just any web browser that provides reasonable support for W3C Web standards, able to load and display web content, follow internal hyperlinks, etc.

Such sites are very portable and MakeStaticSite generally follows this approach. The technology that was commonly used was Java, but it has fallen out of use in this context, generally replaced (if at all) by JavaScript, which, once primitive, is now far more advanced. As Web browsers became more complex and effectively evolved into their own operating systems, with JavaScript a core language that ‘runs’ on this OS, practices have changed and much more happens in the browser than just accessing a page to return text and images.

However, the increased constraints imposed by browser security models have generally driven solutions back to requiring a web server again: even on a trusted network, scripting often requires HTTP for ‘same origin’ communications, particularly to invoke other scripts. Hence, it’s more common to speak of a serverless static search, which still requires a web server, but not any server-side scripting framework, whether that’s CGI, PHP or ASP.NET.

How MakeStaticSite supports these two main approaches to static search is discussed next.

Offline Static Search

MakeStaticSite supports a plugin solution for offline static search in WordPress, an adaptation of an existing plugin, WP Static Search, which is based on Lunr.js. Once installed and configured (basically just add a search page with a shortcode), you can edit the site as normal and then press a button to generate a search index. Then run MakeStaticSite with the requisite options and the site will be built with search included.

For a CMS like WordPress, an integrated solution offers certain conveniences. WP Static Search is designed for the ‘serverless’ scenario, but has been modified to work offline, by writing the updated index to a JavaScript file that incorporates the index rather than writing to a separate index file. Whilst this hack has limitations, particularly on the size of the search index (it’s not suitable for large sites), the plugin is in use on this site and elsewhere: you can download this site, put it on a memory stick, disconnect from the Internet, stop any local web servers, and continue carrying out search queries.
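The idea of incorporating the index in a JavaScript file can be sketched as follows, here in Python for illustration. Browsers typically refuse to fetch a separate data file from a file:// page, whereas a script loaded through a script element still runs, so wrapping the index in a variable assignment lets it load without a server. The variable name, the document fields and the function name are hypothetical; this is not the plugin's own code.

```python
import json


def index_as_javascript(documents, variable="searchIndexData"):
    """Serialise the search documents as a JavaScript assignment.

    The output can be saved as a .js file and loaded with a <script>
    element, making the data available as a global variable.
    """
    payload = json.dumps(documents)  # JSON text is also valid JavaScript
    return "var " + variable + " = " + payload + ";\n"


if __name__ == "__main__":
    docs = [
        {"url": "about/index.html", "title": "About", "body": "Who we are"},
        {"url": "docs/index.html", "title": "Docs", "body": "How to build"},
    ]
    print(index_as_javascript(docs))
```

The size limitation noted above follows from this design: the whole index is parsed by the browser on every page that loads the script.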

This solution also works when hosted on a web server. Similar to the discussion above on canonical URLs, a distinction is made between online and offline modes; in the former case, search results are returned with links ending in a trailing /, whereas for offline usage, these URLs end in index.html.

With development of the original plugin stalled, a fork has been created on GitHub focusing on offline usage. It is still in development and should be re-established as a new project, but in the meantime it is available from this site as a contribution (zip).

‘Serverless’ Static Search

For sites other than WordPress or when you don’t have any control over the source, MakeStaticSite incorporates a ‘serverless’ solution. An alternative is to make use of a hosted solution such as Algolia.

With the growth of static site generators, there have emerged solutions that generate a search index and user interface from an existing static site. MakeStaticSite supports Pagefind, which aims to perform well on large sites whilst being economical on bandwidth. Its first application in conjunction with MakeStaticSite was to Nomads in Oman.

To make use of Pagefind, first ensure that you have a copy on your machine (various installation options are available, including static binaries). Then enable it by setting pagefind=y (preferably in the relevant .cfg file). Next, consider in turn what pages are to be indexed and where the search boxes should be located. MakeStaticSite provides config options only for the basic runtime settings:

pagefind_options_glob: This defines the scope of pages to be indexed. The default is all files ending .HTM or .HTML, case insensitive. Fine-tuning is possible, according to wax patterns.
pagefind_home_page: A single page where the search box will go.
pagefind_pages: Where the search box will go: just the home page, as defined above, all pages, or a selection (comma-separated list).

The two options underneath determine the placement of a code snippet to load the search box.
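To give a feel for what is involved, the sketch below (in Python) assembles a Pagefind command line and inserts a search box snippet into a page. The --site and --glob flags and the snippet follow Pagefind's documented CLI and default UI, though the /pagefind/ path depends on the Pagefind version in use. The function names, the glob value shown and the choice to place the snippet just before the closing body tag are assumptions for illustration, not a description of how MakeStaticSite does it.

```python
# Pagefind's default UI snippet (asset paths as in Pagefind 1.x).
PAGEFIND_UI_SNIPPET = """<link href="/pagefind/pagefind-ui.css" rel="stylesheet">
<script src="/pagefind/pagefind-ui.js"></script>
<div id="search"></div>
<script>
    window.addEventListener('DOMContentLoaded', (event) => {
        new PagefindUI({ element: "#search" });
    });
</script>"""


def pagefind_command(site_dir, glob="**/*.{html}"):
    """Return the argument list for indexing a built site with Pagefind."""
    return ["pagefind", "--site", site_dir, "--glob", glob]


def add_search_box(html, snippet=PAGEFIND_UI_SNIPPET):
    """Insert the snippet before the last </body>, or append if none exists."""
    position = html.lower().rfind("</body>")
    if position == -1:
        return html + "\n" + snippet
    return html[:position] + snippet + "\n" + html[position:]


if __name__ == "__main__":
    # The list could be passed to subprocess.run() once Pagefind is installed.
    print(" ".join(pagefind_command("mirror/example.org")))
    print(add_search_box("<html><body><h1>Home</h1></body></html>"))
```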

For convenience, you can also set webserver_preview=y to launch a local web server and run test searches; the command for this is set by the constant webserver_preview_cmd.