Limitations


Whilst MakeStaticSite is functional, it grew from the particular needs of an individual and has many limitations.

  • This is prototype software: it is provided as-is and has been tested on only a few sites, in the hope that it will prove useful and become community-supported.
  • This is a static HTML crawler: it retrieves web content as served, without executing any JavaScript for client-side rendering. It is not a dynamic crawler that processes a page’s JavaScript and captures the rendered output, though the workflow architecture might support processing of web page outputs in this way.
  • The system is designed for the original GNU Wget, whereas most development effort is now on GNU Wget2 (a quick check for which is installed is sketched after this list).
  • It has been developed to support static snapshots of individual sites. Even though MakeStaticSite can capture assets from multiple domains (sketched in Wget terms after this list), it is not a general-purpose web crawler designed to index huge swathes of the Internet.
  • Static site generation is generally not a good fit for collections databases with a large inventory.
  • The script can only provide a snapshot of comments, discussions, surveys and so on; the interactivity of such components, along with the persistence of user-contributed data, is generally lost. (In the long run, a project might isolate these components in a hybrid setup.)
  • Whilst Wget is a mature product that embodies a deep understanding of Internet protocols and the networking environment, it has no intimate knowledge of any particular CMS and so might not retrieve everything. Orphan pages are a case in point: a WordPress plugin, for example, might be able to access them, but for Wget they need to be added explicitly as extra input (see the sketch after this list).
  • Performance: MakeStaticSite is not compiled code; it runs in a command-line interpreter, and the scripts have not yet been optimised much for speed. It typically takes up to a few minutes to build a small- to medium-sized site, which, depending on the usage scenario, may or may not be a significant duration. A substantial part of this is due to the reliance on Wget to re-crawl many pages on each run, with further overhead for Wayback Machine sites, where additional routines are needed to limit Wget requests properly.
    For phase 3 (fetching additional assets), the wget_threads option can reduce the time taken to download from the Web (an illustrative pattern is sketched after this list).
  • Links generated dynamically by JavaScript are not included.
  • For WordPress sites, using WP-CLI remotely over ssh may not be fully supported by hosting providers that run jailed shells on shared hosting. In that case, WordPress updates need to be carried out manually (see the final sketch below).
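
A quick way to confirm which generation of Wget is installed is to query the binaries directly. Both commands below are standard invocations; the version strings in the comments are only examples.

    wget --version | head -n 1     # original GNU Wget reports, e.g., "GNU Wget 1.21.3 ..."
    wget2 --version | head -n 1    # GNU Wget2, where installed, reports, e.g., "GNU Wget2 2.1.0 ..."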
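
On multi-domain capture: MakeStaticSite drives Wget internally, and in plain Wget terms cross-domain asset retrieval looks something like the following sketch. All flags are standard Wget options; the URL and domain list are placeholders, and the exact options MakeStaticSite passes may differ.

    # Mirror one site while also fetching page requisites (images, CSS, JS)
    # hosted on the named third-party domains.
    wget --mirror --page-requisites --convert-links --adjust-extension \
         --span-hosts --domains=example.com,cdn.example.net \
         https://example.com/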
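
On orphan pages: the usual remedy is to list their URLs in a file and feed that file to the crawl as extra input. Wget’s standard --input-file option supports this; the file name and URLs below are placeholders, and this is a plain-Wget sketch rather than MakeStaticSite’s own interface.

    # orphans.txt contains one URL per line, e.g.
    #   https://example.com/hidden-page/
    #   https://example.com/2019/draft-announcement/
    wget --page-requisites --convert-links --adjust-extension \
         --input-file=orphans.txt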
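
On performance: the wget_threads option speeds up phase 3 by fetching assets concurrently. Purely as an illustration (not MakeStaticSite’s actual implementation), a common shell pattern runs several Wget processes in parallel over a list of asset URLs; conversely, for Wayback Machine hosts, standard Wget flags such as --wait and --limit-rate throttle requests.

    # Fetch a list of asset URLs with up to 4 parallel Wget processes.
    # Here -P 4 is xargs's process count; --directory-prefix is Wget's output directory.
    xargs -P 4 -n 1 wget --quiet --directory-prefix=assets/ < asset_urls.txt

    # Polite crawling of an archived site: pause between requests and cap bandwidth.
    # The Wayback URL is a placeholder.
    wget --mirror --wait=2 --random-wait --limit-rate=200k \
         "https://web.archive.org/web/20240101000000/https://example.com/"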
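
On WP-CLI over ssh: WP-CLI’s global --ssh parameter runs commands on a remote host, provided that host offers a working shell with the wp binary installed; under a jailed shell this often fails, leaving manual updates through the WordPress dashboard as the fallback. The host and path below are placeholders.

    # Query and update a remote WordPress installation over ssh.
    wp --ssh=user@example.com/var/www/html core version
    wp --ssh=user@example.com/var/www/html core update
    wp --ssh=user@example.com/var/www/html plugin update --all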

To overcome at least some of these limitations, there are alternatives that may be explored.
