Static 404 pages are wasted opportunities. You know something about the page the user intended to reach, based on the page he or she actually reached. In all likelihood the requested URL is similar to the intended URL. Maybe the user intended to copy and paste the URL, but only did a partial copy. Maybe he or she typed it in to the address bar and accidentally misspelled one of the words. Regardless, we should be able to help the user find their way seamlessly, without the user ever realizing they hit a 404
Our current 404 errors use edit distance to correct URLs, similar to mod_speling but without the apache module. The basic functionality is really simple. Start with the page URL ($url = $_SERVER['REQUEST_URI']), iterate through a list of possible pages and compare them to the requested URL. If a close match is found, we can be reasonably sure it is the one the user intended, and we can redirect them to it. We hide the 404 from the user.
First we create the helper function getAllPages() which returns a prioritized list of pages we want to compare the requested URL against.
This basic implementation of getAllPages() returns a list of all pages and subpages where each page is given uniform weight. The traversal starts at the current directory, '.', (line 5). You can over ride this default behavior by supplying a path argument when you make the function call. The traversed directory structure is stored in the variable $pages (line 4). If you have directories or pages you wished skip you can add them to the $toskip array variable (line 3).
(At data·yze we use a highly customized getAllPages() function based on site usage patterns, designed to emphasized some content over others.)
This implementation of getAllPages yields the following on our domain:
Next we need to preform the spell corrections with makeSpellCorrection():
The requested URL is split into query and path (line 3, 5 and 16.) This allows us to only preform the spell check against the URL path, and to pass the query to the page we redirect the user to. The best score is the one with the fewest edits. Line 12 finds the best score, and line 13 gets a page that corresponds with the best score. Note: there may be more than one, but there is always at least one provided your getAllPages() found at least one page, which is why we don't bother checking the size of the array in line 13. Line 15 is a sanity check.
Edit distance is computed using the built in levenshtein function (line 9) which computes the minimum number of insertions, replacements and deletions needed to transform one string into another. Levenshtein is an O(n2) operation where 'n' is the length of the string. If you have a large list of URLs to compare against, 'k', the total time will be O(k*n2). In our experience n and k are small enough that the performance hit is minimal, especially considering this code will only run when an error is encountered and isn't the expected case. If you have a large enough n you can always write a custom levenshtein function with early stopping criterion.
⚠ A word of caution: It may be tempting to combine the functions getAllPages and makeSpellCorrection, especially if your getAllPages is similar to the one we proposed. If you do, make sure you are preforming the edit distance function against the path in it's entirety rather than piecewise. Otherwise your edit distances may be disproportionately large.
Consider the following example:
- /howto/add-spell-checking-to-404s.php
- /h/owto/add-spell-checking-to-404s.php
The levenshtein edit distance between the two full paths is 1 (there's a extra slash in the second string that needs to be deleted.) Preforming piecewise edit distance would give you an edit distance of 10: 4 insertions to go from the directory "/h" in the second string to "/howto" and 6 deletions to remove the second directory, "owto/".