Whilst headline bandwidth continues to grow, Internet availability remains uncertain, whether due to technical outages, energy shortages or policy decisions. It is against this backdrop that MakeStaticSite has been designed to combine the usefulness of a live site, amenable to search engines, with the resilience of an archive, whereby a complete copy is available on your own computer.
Thus, MakeStaticSite as a package is cross-platform, requires few components, and is intended primarily to be run on a local machine. When you execute makestaticsite.sh, a standard build comprises two parts: a site to be deployed and navigated online, and a copy of the site, distributed as a zip file, to be browsed offline. Furthermore, for WordPress users, the search experience can be replicated.
The workflow comprises various processes that tweak the output to reflect these design goals. We give attention here to facets that are essential for search engine optimisation (SEO).
Robots and sitemaps
Search engines have certain expectations about how websites are mapped, how each page declares itself, and the standards-compliance of markup. This has a major impact on how they crawl and index sites, and fixing issues in a piecemeal manner can make indexing a very lengthy process. Whilst Google Search Console (the replacement for Webmaster Tools) can pinpoint issues and provide guidance on remedies, even with machine assistance, reprocessing a few URLs can take days.
There are two key files that search engines expect to find on the server – if they’re not there, the site might not get indexed:
- robots.txt, a file stored in the site root, is a (voluntary) standard that informs crawlers (user agents) what they may and may not crawl on the server – specified through Allow and Disallow directives. In addition, the robots file usually specifies the URL of a sitemap, which is an XML document.
(The Kinsta blog provides a detailed description, particularly for WordPress users.)
- The sitemap (file) provides an index to the main URLs on a site. It may itself list the URLs (of document type text/html) or specify a hierarchical list of sub-sitemaps, each of which is an XML document, with the leaf files providing the URL lists.
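For illustration (example.com and the paths here are placeholders, not taken from any particular site), a minimal robots.txt pointing at a sitemap might read:

```
User-agent: *
Disallow: /wp-admin/

Sitemap: https://example.com/sitemap.xml
```

and a correspondingly minimal sitemap, listing the canonical URLs directly rather than sub-sitemaps:

```xml
<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>https://example.com/</loc></url>
  <url><loc>https://example.com/about/</loc></url>
</urlset>
```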
These two kinds of files may be generated dynamically by a Web CMS, with certain assumptions that may be too restrictive or irrelevant. They might not exist as physical files on the server and, furthermore, Wget does not parse them. For these reasons, it’s preferable that MakeStaticSite generates these files afresh based on the static output generated, reconstructing the sitemap and its components. As regards the URLs included in sitemaps, it helps SEO to include only canonical URLs, which we discuss next.
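The regeneration idea can be sketched in shell: walk the static output, emit one URL (ending in /) per index.html found. This is a minimal illustration under assumed names (site_root, base_url), not MakeStaticSite’s actual code.

```shell
#!/bin/sh
# Sketch: rebuild a sitemap from static output (not MakeStaticSite's code).
# site_root and base_url are illustrative placeholders.
site_root="/tmp/ms_sitemap_demo"
base_url="https://example.com"

# Set up a tiny demo site tree
mkdir -p "$site_root/about"
: > "$site_root/index.html"
: > "$site_root/about/index.html"

{
  printf '<?xml version="1.0" encoding="UTF-8"?>\n'
  printf '<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">\n'
  find "$site_root" -name index.html | sort | while read -r f; do
    # Strip the local root and the file name, leaving a path ending in /
    path=${f#"$site_root"}
    printf '  <url><loc>%s%s</loc></url>\n' "$base_url" "${path%index.html}"
  done
  printf '</urlset>\n'
} > "$site_root/sitemap.xml"
```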
Canonical URLs

A canonical URL is a unique URL out of a set of URLs that display exactly the same content. For example, both https://makestaticsite.sh/ and https://makestaticsite.sh/index.html are treated as distinct URLs, but only one, not both, can be canonical.
Canonical URLs are important for SEO, as search engines will favour these over other similar URLs that have the same content, as described by Google. Where there are duplicate pages with distinct URLs, search engines need to determine which to pick. For a given Web document, the standard means to specify a canonical URL is to include a canonical link element. All this is discussed in detail, specifically with respect to WordPress, by Yoast. So canonical link elements are needed for deployment, whilst noting that they aren’t useful for offline browsing, and so aren’t really necessary for the zip file distribution.
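As a concrete illustration (the URL is a placeholder), a canonical link element sits in the page head and looks like:

```html
<link rel="canonical" href="https://example.com/about/">
```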
So, we must first consider what rule to use for the format of the canonical URL. Whilst technically any valid URL might be a candidate, in practice certain formats are preferred over others. In particular, it is considered good practice to end canonical URLs in a trailing slash, /, as that is regarded as appropriate for cross-platform usage. This may be illustrated by consideration of directory indexes, where an ASP application might have an index file index.asp, whereas a later iteration might be in PHP, hence index.php (or vice versa). To future-proof the application, the canonical URL would be stated as / and the server would return the relevant script.
Whatever the definition of the canonical URL, for practical benefit it has to be referenced by websites, i.e. used in anchors. Here, it should be noted that Wget’s output is designed primarily for archival, and hence Web pages generally get saved with a .html extension. However, canonical URLs ending in index.html may be treated unfavourably by search engines.
Accordingly, MakeStaticSite has two settings in constants.sh for the tail of the canonical URL: link_href_tail is the tail of the canonical link element, and a_href_tail is the tail for internal links. Running makestaticsite.sh initially generates sites with links to URLs ending in .html, following Wget’s usage, but then, immediately before deployment, modifies the tail of URL anchors according to the a_href_tail option. The default value for both options is empty, which translates to URLs ending in /.
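The post-Wget rewriting step can be illustrated with a small sed sketch. This shows only the idea behind a_href_tail’s default (trailing slash); the file name and sed pattern are invented for the example, not taken from the script.

```shell
#!/bin/sh
# Sketch: rewrite root-relative anchors ending in index.html to end in /
# (illustrative only; not MakeStaticSite's actual code).
page="/tmp/ms_tail_demo.html"
cat > "$page" <<'EOF'
<a href="/about/index.html">About</a>
<a href="https://elsewhere.example/page.html">External</a>
EOF

# Only links beginning with / are rewritten; external links are untouched.
sed -i.bak 's|href="\(/[^"]*\)index\.html"|href="\1"|g' "$page"
```

External URLs deliberately fall outside the pattern, since their .html tails are not ours to change.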
Currently, sitemaps are also generated with URLs ending in /.
Newsfeeds

WordPress and other blog platforms provide, as a matter of course, content syndication in the form of RSS newsfeeds. Whilst their utility offline is reduced, they are retained in MakeStaticSite together with references to them, with some slight modification.
The feeds themselves are stored as XML files. A CMS such as WordPress will name these files index.xml, but links referring to them will end in /. A couple of issues arise with this on serverless web hosting: the directory indexing might serve up the wrong file, or, even if it is the right one, it may be served with content type text/html, in which case the browser won’t handle it as intended. And when browsing offline, such links will lead to a ‘file not found’ error.
The situation is exacerbated by Wget, which typically saves these files with a .html extension and updates anchors accordingly. So, to ensure that feeds continue to work, MakeStaticSite renames the feed files (by default) with XML extensions and then updates the links accordingly. Furthermore, the links are relative, so as to continue working offline. This is not normally a problem online, as the base URL can be inferred from the site’s domain. (See the options for feed_html and feed_xml respectively.)
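The fix-up idea can be sketched as follows (paths, file names and the sed pattern are invented for illustration; this is not MakeStaticSite’s actual code): rename a feed that Wget saved under an .html name back to an .xml extension, then point the relative link at the renamed file so it also works offline.

```shell
#!/bin/sh
# Sketch: restore a feed's XML extension and update a relative link to it
# (illustrative only; not MakeStaticSite's actual code).
feeddir="/tmp/ms_feed_demo/feed"
mkdir -p "$feeddir"
printf '<rss version="2.0"></rss>\n' > "$feeddir/index.html"
printf '<a href="feed/index.html">RSS</a>\n' > /tmp/ms_feed_demo/page.html

# Rename the feed file, then rewrite the relative link to match
mv "$feeddir/index.html" "$feeddir/index.xml"
sed -i.bak 's|feed/index\.html|feed/index.xml|g' /tmp/ms_feed_demo/page.html
```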
Offline search

An original impetus for this project was the distribution of a research website with an offline search facility. This goes beyond the serverless, for which there already exist solutions such as Algolia. As no ready-made solution was found, an existing WordPress plugin, WP Static Search, which is based on Lunr.js, was adapted. Although also designed for the serverless scenario, it has been modified to work offline. Whilst it has limitations, particularly on the size of the search index (it’s not suitable for large sites), the plugin is in use on this site and elsewhere: you can download this site, put it on a memory stick, disconnect from the Internet and continue carrying out search queries.
Similar to the discussion above on canonical URLs, a distinction is made between online and offline modes; in the former case, search results are returned with links ending in a trailing /, whereas for offline usage, these URLs end in index.html.
With development of the original plugin stalled, a fork has been created on GitHub focusing on offline usage. It is still in development and should be re-established as a new project, but in the meantime it is available from this site as a contribution (zip).