A few tips to begin with. If something doesn’t work as expected, the situation can quite often be addressed by adjusting the default settings in lib/constants.sh.
Incomplete mirror
If the output generated by MakeStaticSite is missing pages and/or assets, try the following:
(A) Site checks
- Check the URL in the config file: if the URL has a path, then parent pages will not be retrieved, nor will other pages beneath those parents (such as siblings of the starting page).
- Check the MakeStaticSite log, particularly for network connectivity; if you lose Internet access during a run, then the download of files may be skipped.
- Check the robots exclusion (robots.txt) file on the server from which you are retrieving files; Wget respects this by default and recommends not changing this behaviour, for good reason.
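The first and third checks above can be sketched in shell. This is only an illustration: url_path and has_subpath are hypothetical helper names, not part of MakeStaticSite, and www.example.com is a placeholder host.

```shell
# Print the path component of a URL (empty or "/" means the site root).
url_path() {
  # Strip the scheme and host, keeping whatever follows them.
  printf '%s\n' "$1" | sed -E 's#^[a-zA-Z][a-zA-Z0-9+.-]*://[^/]*##'
}

# Succeed if the URL points below the site root.
has_subpath() {
  case "$(url_path "$1")" in
    ''|'/') return 1 ;;   # root: the whole site is in scope
    *)      return 0 ;;   # sub-path: parent pages are out of scope
  esac
}

# To read the server's robots.txt by hand (requires network access):
#   wget -q -O - https://www.example.com/robots.txt
```

For example, has_subpath 'https://www.example.com/blog/' succeeds, which is a hint that pages above /blog/ will not be part of the mirror.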
(B) Configuration
- Try increasing the value of wget_extra_urls_depth.
- If you are trying to download assets that have an extension that’s not recognised, then add it to the list for the constants asset_extensions and/or asset_extensions_external.
- If you are trying to download assets that don’t have an extension, then set asset_extensions / asset_extensions_external to the empty string.
- If the site contains punctuation characters (apart from ‘-’, ‘_’ and ‘.’) in filenames, then ideally these characters should be removed or replaced and any links updated. If that’s not possible, one or two characters, particularly round brackets (), might be removed from url_grep_search_pattern provided they are not used as URL boundaries in the web pages.
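Before changing any of these constants, it can help to see how they are currently defined. The following is a minimal sketch; show_mirror_constants is a hypothetical helper, and it assumes that each constant is written as a plain shell assignment at the start of a line in lib/constants.sh, which may not hold for every release.

```shell
# List the current definitions (with line numbers) of the constants
# mentioned above. The default file path is an assumption: run this
# from the MakeStaticSite directory or pass the path explicitly.
show_mirror_constants() {
  grep -nE '^(wget_extra_urls_depth|asset_extensions|asset_extensions_external|url_grep_search_pattern)=' \
    "${1:-lib/constants.sh}"
}
```

Running show_mirror_constants lib/constants.sh prints each matching line, so that the existing value and its format can be inspected before it is edited.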
If none of the above tips resolves the situation, then for any page with missing content open the browser’s web console and look for the errors reported on the page. (One way to access these is to right-click on the relevant area of a page and select ‘Inspect’.)
Some omissions may be rooted in errors arising from web security policy as variously determined by the server / browser / markup. There is no general remedy; solutions are devised on a case by case basis, but as case history is gradually built up, MakeStaticSite can encode a successively more thorough solution. We illustrate a particular case below.
Cross-Origin Resource Sharing
A relatively common issue arises when scripts (usually in JavaScript) make requests for assets from another origin (usually another domain). Restrictions on such requests have been gradually introduced and expanded. Cross-Origin Resource Sharing (CORS) is a W3C standard that defines how to allow some cross-origin requests, relaxing the policy, while rejecting others.
The scope presently includes:
- Invocations of fetch() or XMLHttpRequest.
- Web Fonts (for cross-domain font usage in @font-face within CSS), so that servers can deploy TrueType fonts that can only be loaded cross-origin and used by websites that are permitted to do so.
- WebGL textures.
- Images/video frames drawn to a canvas using drawImage().
- CSS Shapes from images.
(For further details, see Mozilla developer docs).
Particularly with respect to offline use of static sites, the following error might be reported when trying to read JavaScript files that were originally from another origin:
Cross-Origin Request Blocked: The Same Origin Policy disallows reading the remote resource at … (Reason: CORS request not http).
Here, the ‘same origin policy’ means that, whereas these files work when served from the original server, once they have been downloaded and are accessed from the file system (i.e., via file:// rather than over HTTP), they are considered to come from an opaque origin, i.e. they are not trusted and hence are blocked.
Example (for illustration):
<script> source URI is not allowed in this document: “file:///home/paul/Downloads/websites/www_example_com20240215_145332/www.example.com/imports/assets.example.com/universal/scripts-compressed/extract-css-runtime-39e87d4f1d6ff921db43-min.en-US.js”. index.html:36:160
Cross-Origin Request Blocked: The Same Origin Policy disallows reading the remote resource at file:///home/paul/Downloads/websites/www_example_com20240215_145332/www.example.com/imports/assets.example.com/universal/scripts-compressed/extract-css-moment-js-vendor-675f9459672cf966ca51-min.en-US.js. (Reason: CORS request not http).
This restriction has been in place since a security vulnerability was identified in 2019, reflecting a general trend towards increasing strictness.
The suggested remedy is (frustratingly):
Developers who need to perform local testing should now set up a local server. As all files are served from the same scheme and domain (localhost) they all have the same origin, and do not trigger cross-origin errors.
Source: MDN Web Docs, Reason: CORS request not HTTP.
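For reference, the local server approach quoted above can be tried with Python's built-in http.server module. This is a sketch only: serve_mirror is a hypothetical helper, and it assumes that python3 (3.7 or later, for the --directory option) is installed.

```shell
# Serve a mirror directory on localhost so that every file shares one
# http origin. Stop the server with Ctrl-C.
serve_mirror() {
  dir="${1:?usage: serve_mirror DIR [PORT]}"
  port="${2:-8000}"
  # Refuse to start if the mirror directory does not exist.
  [ -d "$dir" ] || { echo "not a directory: $dir" >&2; return 1; }
  python3 -m http.server "$port" --bind 127.0.0.1 --directory "$dir"
}
```

With the server running, the mirror is browsed at http://127.0.0.1:8000/ rather than through a file:// URL.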
Fortunately, at least in some cases, there is a straightforward solution in two steps suitable for offline usage. The first step is to simply fetch assets directly outside of JavaScript (e.g., using Wget).
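As a rough illustration of the first step, the script URLs can be listed from a page and then handed to Wget. This is a sketch and not MakeStaticSite's own code: list_script_urls is a hypothetical helper, it only recognises double-quoted src attributes holding absolute http(s) URLs, and it does not check whether a URL is really cross-origin.

```shell
# Read HTML on stdin and print the absolute http(s) URLs found in the
# src attribute of <script> tags, one per line, without duplicates.
list_script_urls() {
  grep -oE '<script[^>]*[[:space:]]src="https?://[^"]+"' \
    | sed -E 's/.*src="([^"]+)"$/\1/' \
    | sort -u
}

# Possible use (requires network access); --force-directories recreates
# host/path directories beneath the imports/ prefix:
#   list_script_urls < index.html \
#     | wget --force-directories --directory-prefix=imports --input-file=-
```

The imports/ prefix here simply mirrors the layout visible in the example error messages in this section.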
For the second step, inspect the HTML source. If you see lines such as:
<script crossorigin="anonymous" src="..."></script>
then simply delete occurrences of crossorigin="anonymous". The rationale is that you are mirroring a website you have designed or, at least, trust. If it needs to retrieve and work with assets from particular locations, then that should still apply offline.
Accordingly, MakeStaticSite uses Wget in phase 3 to fetch these additional assets and, when the constant cors_enable is set to yes, removes all occurrences of the crossorigin attribute. There may be other attributes that are not required in the specific offline context.
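To apply the same idea by hand to a single file, the attribute can be removed with sed. This is a minimal sketch rather than the way MakeStaticSite implements it: strip_crossorigin is a hypothetical helper, and being a plain text substitution it would also alter any matching text that appears outside tags.

```shell
# Read HTML on stdin and write it to stdout with the crossorigin
# attribute removed, whether bare or with a quoted value.
strip_crossorigin() {
  sed -E "s/[[:space:]]+crossorigin(=(\"[^\"]*\"|'[^']*'))?//g"
}

# Possible use on one page of the mirror:
#   strip_crossorigin < index.html > index.html.new && mv index.html.new index.html
```

It is worth keeping a copy of the original file until the page has been checked in the browser.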
Otherwise, a possible workaround might be to incorporate the content of such files into an existing file that is accepted as same origin.