Case study: MakeStaticSite and Pagefind


A perhaps surprising and counter-intuitive application of MakeStaticSite has been to enable site search for a dynamic, database-driven site (not a static one, like this one) by leveraging Pagefind, a search tool likewise designed for use in Jamstack scenarios.

A site search facility was needed for Connect2Dialogue, a self-hosted WordPress site using PHP and MySQL to respond to Web requests, all fairly standard. However, it is not a blogging platform, but has been set up as a data-intensive content management system. Furthermore, the data model is deeply nested with many tables (resulting especially from numerous custom fields using ACF Pro repeater fields). Hence, the usual SQL-based approach to site search, whether out-of-the-box or available in plugins, was fiddly to set up and resulted in complex queries involving a large number of clauses. The net effect was poor performance.

As an alternative, the problem can be approached from the front-end perspective, in terms of the web output rather than the back-end data model. This shifts the indexing domain from SQL data to text, for which there are many possible solutions, such as Elasticsearch and Apache Solr, which are generally more efficient. Furthermore, the evolution of these and others towards ‘serverless’ solutions, popularised by the likes of Algolia, has removed the burden of server management. However, these solutions usually create dependencies, as most rely on third-party services, with implications for data protection. And there is still a technical gap to be bridged: the data has to be mapped to a format that can be processed to generate appropriate results, which requires transformation of data guided by semantic knowledge.

Given this site uses Lunr.js via a fork of the WP Static Search plugin, what about that? Unfortunately, the plugin only indexes standard pages and posts, which for Connect2Dialogue amount to around 50 altogether, when there are actually around 1400 distinct web pages for individuals and organisations. Also, the plugin is only really suitable for small sites.

Enter MakeStaticSite and Pagefind

Whether or not a site requires server-side scripting, the general practice for URL paths (the web page hierarchy and naming scheme) has increasingly become technology-agnostic to support sustainability, tending towards ‘directory’-based URLs, with file extensions removed. This is true of WordPress; permalinks seldom have a .php extension and are generally not plain (using a query string with post ID), but comprise solely nested directory names that are meaningful. Such naming conventions are amenable to the construction of a search index for a ‘serverless’ setup, provided the web server is configured to support directory indexes.

This opens up the use of tools such as Wget (and hence MakeStaticSite) to crawl the sites and generate mirrors that preserve the directory structure. The output can then be used by search tools to generate an index that directly feeds, without amendment, a JavaScript-powered search interface. And Pagefind fits the bill.
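
To make the correspondence concrete, here is a minimal sketch of how a directory-style permalink might map to a file in a Wget mirror. The function name permalink_to_mirror_path and the example path are hypothetical, and the sketch assumes Wget's default layout (a host-named directory, with index.html saved for directory URLs); MakeStaticSite's own output layout may differ.

```shell
#!/usr/bin/env bash
# Hypothetical helper: map a directory-style permalink to the file a Wget
# mirror would typically hold for it. Assumes the default host directory
# and index.html as the directory index; not intended for asset URLs.
permalink_to_mirror_path() {
  local url="$1"
  local path="${url#*://}"   # strip the scheme, e.g. https://
  path="${path%%\?*}"        # drop any query string
  path="${path%/}"           # normalise away a trailing slash
  printf '%s/index.html\n' "$path"
}

# Example path is invented for illustration.
permalink_to_mirror_path "https://connect2dialogue.org/organisations/example/"
```

It is this one-to-one mapping between permalink and index file that lets a search tool index the mirror and return result URLs that are valid on the live site.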

Configuring MakeStaticSite

There’s little additional configuration required to make MakeStaticSite work with Pagefind. For this example, the built-in support was not used as there’s more processing required beforehand. The main consideration is efficiency: the mainstay of content (individual and organisation records) needs to be indexed just once under canonical permalink URLs, not as WordPress-specific plain URLs nor as URLs with version numbers appended.

This kind of WordPress preparation has already been discussed, but additional measures are needed because the site makes extensive use of faceted search where the corresponding URLs (involving compound query strings) can readily be crawled. Such ‘navigation’ URLs need to be excluded also and this can be done during the crawl process with appropriate configuration.

.cfg file

A basic configuration file, call it c2d.cfg, is generated in the usual way by running setup.sh on the target site (in this case connect2dialogue.org), but without going on to crawl the site immediately, as a few tweaks are needed to define the constraints mentioned above.

For limiting the assets that are downloaded by Wget, we can use the option --reject-regex, which takes a regular expression (here a series of alternatives separated by |) against which URLs are compared before being downloaded; if a URL matches, then the resource is not requested (in contrast to the --reject option, which matches on file extensions and can download a page and then delete it retrospectively).

The expressions themselves directly reflect query strings. Hence, the following constant is defined:

wget_reject_regex=".*\\{.*|.*\\data\..|.pplocation.|.alphabet.|.keyword.|.topic%5B%5D.|.counties%5B%5D.|.religions%5B%5D.|.religion%5B%5D.|.dvcountry%5B%5D.|.dvtags%5B%5D.|.dvlibrary%5B%5D.|.iclanguage%5B%5D.|.iclocation%5B%5D.|./feed/.|.\.tmp\.html|.index\.html\?."
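
Before committing to a long crawl, it can be worth checking a pattern against a few sample URLs. The following is a rough sketch only: it uses a shortened subset of the alternatives above, the sample URLs and the is_rejected helper are hypothetical, and grep -E is used as an approximation of Wget's POSIX regex matching, so edge cases may behave differently in Wget itself.

```shell
#!/usr/bin/env bash
# Shortened, illustrative subset of the reject pattern (not the full setting).
reject_regex='alphabet|keyword|topic%5B%5D|/feed/|index\.html\?'

# Return 0 (true) if the URL matches the pattern, i.e. would be skipped.
is_rejected() {
  printf '%s\n' "$1" | grep -Eq -- "$reject_regex"
}

# Hypothetical sample URLs: one record page, one faceted search, one feed.
for url in \
  "https://connect2dialogue.org/organisations/example/" \
  "https://connect2dialogue.org/organisations/?topic%5B%5D=12" \
  "https://connect2dialogue.org/feed/"; do
  if is_rejected "$url"; then
    echo "reject: $url"
  else
    echo "keep:   $url"
  fi
done
```

Only the first URL should be reported as kept; the faceted search and feed URLs match an alternative and would not be requested.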

In this case only one pass of Wget mirroring is needed. Hence the setting:

wget_extra_urls_depth=0

And that’s basically all that’s specifically needed for the crawl stage for the purposes of generating output for Pagefind to index. We don’t need the presentation to be perfect; no need to ensure that external assets are included.

Configuring Pagefind

The other main task is to configure the search tool in a way that leads to relevant search results.

Index Preparation

Preparing the index involves indicating to Pagefind what it can and should ignore in MakeStaticSite’s output (by default it indexes everything within the <body> tag) and what to prioritise for inclusion. A trade-off of not using SQL-based search, where specific fields are added to the index, is that effort needs to be put into inserting tags and attributes to tell Pagefind what should and shouldn’t be indexed and, further, to add weightings that reflect relative importance. Even with these pointers, there’s likely to be a slight loss of pinpoint accuracy, but it may be practically minor.

There are two basic ways this can be achieved. One is by modifying WordPress templates, which is generally simpler and more efficient, but there’s a risk of the modifications becoming obsolete as Pagefind (currently released as version 0.4) is under active development and major changes are in the offing. The alternative, described below, is to script recursive search and replace operations, which is not quite as simple, but is non-invasive; we only need to concern ourselves with the scripting environment, in which Pagefind itself is integrated.

Wrapper Script

The wrapper’s code is structured in a similar way to MakeStaticSite itself, with a main() function coordinating calls to various component stages:

main() {
  initialize
  whichos
  read_config "$@"
  if [ "$working_mirror_dir" = "" ]; then
    spider_site
  fi
  build_search
  retention_refresh
  conclude
}

Here, initialize() includes a couple of libraries and defines a number of constants outside of MakeStaticSite’s functioning, i.e. to configure the parameters for Pagefind to help create a more suitable index; spider_site() calls makestaticsite.sh with the option -i c2d.cfg; and build_search() then processes the output, tailoring it ahead of indexing by Pagefind. The script concludes with a clean-up of old directories and finally deployment of the search index, copying the pagefind/ directory into the web document root (after backing up the existing one, if it exists).
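
The final deployment step might look something like the sketch below. The function name deploy_search_index, its arguments and the backup naming scheme are all assumptions made for illustration; the actual wrapper script is not shown here and may well handle this differently.

```shell
#!/usr/bin/env bash
# Hypothetical deployment helper: back up any existing pagefind/ directory in
# the document root, then copy in the newly built one.
deploy_search_index() {
  local built_dir="$1"   # e.g. "$working_mirror_dir/pagefind"
  local doc_root="$2"    # web document root (site-specific path)
  local stamp
  stamp="$(date +%Y%m%d%H%M%S)"

  # Refuse to touch the live site if there is no freshly built index.
  [ -d "$built_dir" ] || { echo "No index found at $built_dir" >&2; return 1; }

  # Move the current index aside under a timestamped name (assumed scheme).
  if [ -d "$doc_root/pagefind" ]; then
    mv "$doc_root/pagefind" "$doc_root/pagefind.bak.$stamp" || return 1
  fi

  # The destination no longer exists, so this creates it as a copy.
  cp -R "$built_dir" "$doc_root/pagefind"
}
```

It would be called as, say, deploy_search_index "$working_mirror_dir/pagefind" "$doc_root". Note that in this sketch the backup stays inside the document root, so it may be preferable to move it elsewhere or leave its removal to the clean-up stage.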

Adjustments to MakeStaticSite output

The first set of adjustments is to remove superfluous files; in this case, page navigation was followed to gain access to ‘record’ pages, with numerous listings within a page/ directory. These can be discarded:

find "$working_mirror_dir" -type d -name page -prune -exec rm -rf {} \;

Then there are chunks of HTML to exclude from consideration, such as login boxes, as they don’t include useful content:

for pagefind_target in "${pagefind_targets[@]}"; do
  pagefind_attribute_appended="$pagefind_target $pagefind_attribute_ignore"
  for file_ext in "${file_exts[@]}"; do
    sed_subs=('s|\('"$pagefind_target"'\)|'"$pagefind_attribute_appended"'|g')
    find "$working_mirror_dir" -type f -name "index.$file_ext" -print0 | xargs "${xargs_options[@]}" sed "${sed_options[@]}" -e "${sed_subs[@]}"
  done
done
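
The loop relies on several arrays and constants set up in initialize(), which are not shown. As a sketch of what they might contain, the following defines plausible values and tries the substitution on a throwaway file. The specific values (the login-box class, the single html extension, the sed and xargs options) are assumptions for illustration, not the site's actual settings.

```shell
#!/usr/bin/env bash
# Assumed definitions (illustrative values only).
pagefind_attribute_ignore='data-pagefind-ignore'
pagefind_targets=('<div class="login-box"')   # hypothetical opening-tag fragment
file_exts=(html)
xargs_options=(-0)                            # pairs with find -print0

# In-place editing differs between GNU sed (-i) and BSD/macOS sed (-i '').
if sed --version >/dev/null 2>&1; then
  sed_options=(-i)
else
  sed_options=(-i '')
fi

# Work on a temporary directory rather than a real mirror.
working_mirror_dir="$(mktemp -d)"
printf '%s\n' '<body><div class="login-box">Log in</div><p>Record</p></body>' \
  > "$working_mirror_dir/index.html"

# Same loop as above: append the ignore attribute to each target fragment.
for pagefind_target in "${pagefind_targets[@]}"; do
  pagefind_attribute_appended="$pagefind_target $pagefind_attribute_ignore"
  for file_ext in "${file_exts[@]}"; do
    sed_subs=('s|\('"$pagefind_target"'\)|'"$pagefind_attribute_appended"'|g')
    find "$working_mirror_dir" -type f -name "index.$file_ext" -print0 |
      xargs "${xargs_options[@]}" sed "${sed_options[@]}" -e "${sed_subs[@]}"
  done
done

cat "$working_mirror_dir/index.html"
```

Because each target is an opening-tag fragment without its closing bracket, the attribute lands inside the tag. Each target is also interpreted as a basic regular expression, so any special characters in it would need escaping.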

Next, weighting is added to main information or narrative sections:

for discover_dir in "${discover_dir_list[@]}"; do
  for file_ext in "${file_exts[@]}"; do
    sed_subs=('s|'"<div class=\"whp-top\">"'|'"<div class=\"whp-top\" data-pagefind-weight=\"10.0\">"'|g')
    find "$working_mirror_dir/$discover_dir" -type f -name "index.$file_ext" -print0 | xargs "${xargs_options[@]}" sed "${sed_options[@]}" -e "${sed_subs[@]}"
  done
done

Further along, substitutions are made to hide sections and fields before adding markup to indicate thumbnails.
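
As an indication of what the thumbnail markup could involve, the sketch below tags an image so that Pagefind records its src as image metadata, using the data-pagefind-meta attribute. The class name record-photo and the sample markup are invented for illustration; the real templates will use different selectors.

```shell
#!/usr/bin/env bash
# Hypothetical example: mark a record's main image so that its src attribute
# is stored as "image" metadata for use as a thumbnail in search results.
if sed --version >/dev/null 2>&1; then
  sed_options=(-i)        # GNU sed
else
  sed_options=(-i '')     # BSD/macOS sed
fi

demo_dir="$(mktemp -d)"
printf '%s\n' '<img class="record-photo" src="/uploads/a.jpg" alt="A">' \
  > "$demo_dir/index.html"

# Insert the metadata attribute directly after the (assumed) class attribute.
sed "${sed_options[@]}" \
  -e 's|<img class="record-photo"|<img class="record-photo" data-pagefind-meta="image[src]"|g' \
  "$demo_dir/index.html"

cat "$demo_dir/index.html"
```

As with the earlier substitutions, this is not idempotent: running it twice on the same files would add the attribute twice, so it is best applied to a freshly generated mirror.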

Building and deploying

Having completed the configuration, the search index can be built according to the environment available, e.g. by running a pagefind binary directly or via npm with npx -y pagefind, and then deployed. Along the way, older files can be tidied up (removed or archived).
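
One way of choosing between the two is sketched below. The function name run_pagefind is hypothetical, and the --site option is an assumption based on Pagefind 1.x; older 0.x releases are understood to use a different option name (--source), so it is worth checking pagefind --help for the installed version.

```shell
#!/usr/bin/env bash
# Hypothetical helper: prefer a pagefind binary on the PATH, otherwise fall
# back to npx. The --site flag is assumed (Pagefind 1.x).
run_pagefind() {
  local site_dir="$1"
  if command -v pagefind >/dev/null 2>&1; then
    pagefind --site "$site_dir"
  elif command -v npx >/dev/null 2>&1; then
    npx -y pagefind --site "$site_dir"
  else
    echo "pagefind not found: install the binary or Node.js with npm" >&2
    return 127
  fi
}
```

Within the wrapper this would be invoked as run_pagefind "$working_mirror_dir", after which the generated pagefind/ directory is ready for deployment.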

For automated updates, a cron job has been set up to run the script on a regular basis. In this case, a single run takes around an hour, most of the time taken by the crawl, which is carried out with a fairly modest rate limit of 500KB/s.

Postscript: Embedding in the WordPress Ecosystem

The above combination of MakeStaticSite and Pagefind works from ‘outside’, without any specific knowledge of the WordPress ecosystem. However, the same approach could be applied internally through plugins, which should benefit from having access to the underlying structure.

A new plugin could be written to perform the role of the wrapper script. It would first call a plugin such as Simply Static to generate the static output. Then it would customise and invoke Pagefind, either incorporated in the same plugin or in its own plugin. After tidying up the output, it would deploy the result, installing the requisite code in the templates for a search box or modal.

Everything would be managed from the WordPress dashboard.


This page was published on 13 January 2026 and last updated on 1 February 2026.