The first incarnation of a self-correcting 404 error page involved using PHP to convert the requested URL into a searchable query. The assumption was that the requested URL was likely similar enough to the intended URL that a search engine should be able to find the intended page. While an improvement over a static error page, this approach still requires a step from the user: scanning through the search results to find the right one. We knew we could do better.
Our current 404 errors use edit distance to correct URLs, similar to mod_speling but without the Apache module. The basic functionality is really simple: start with the requested URL ($url = $_SERVER['REQUEST_URI']), then iterate through a list of possible pages, comparing each to the requested URL. If a close match is found, we can be reasonably sure it is the one the user intended, and we can redirect them to it. The user never sees the 404 error.
First we create the helper function getAllPages(), which returns a prioritized list of pages to compare the requested URL against.
This basic implementation of getAllPages() returns a list of all pages and subpages, with each page given uniform weight. The traversal starts at the current directory, '.' (line 5). You can override this default behavior by supplying a path argument when you call the function. The traversed directory structure is stored in the variable $pages (line 4). If you have directories or pages you wish to skip, you can add them to the $toskip array variable (line 3).
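A minimal sketch of what such a getAllPages() might look like (this is our reconstruction, not the post's exact listing; the contents of $toskip, the .php filter, and returning raw file paths rather than site URLs are all illustrative assumptions):

```php
<?php
// Sketch of a getAllPages() helper: recursively collect candidate pages,
// giving each page uniform weight. Mapping file paths to public URLs is
// left out for brevity.
function getAllPages($path = '.') {
    // Entries we never want to suggest as corrections. The '.' and '..'
    // entries must be skipped to avoid infinite recursion; '404.php' is
    // an illustrative example of a page to exclude.
    $toskip = array('.', '..', '404.php');
    $pages = array();
    foreach (scandir($path) as $entry) {
        if (in_array($entry, $toskip)) {
            continue;
        }
        $full = $path . '/' . $entry;
        if (is_dir($full)) {
            // Recurse so subpages are collected too.
            $pages = array_merge($pages, getAllPages($full));
        } elseif (substr($entry, -4) === '.php') {
            $pages[] = $full;
        }
    }
    return $pages;
}
```

In practice you would prioritize or weight this list; as noted below, a production version can be heavily customized around site usage patterns.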
(At data·yze we use a highly customized getAllPages() function based on site usage patterns, designed to emphasize some content over others.)
This implementation of getAllPages() yields the following on our domain:
Next we need to perform the spell corrections with makeSpellCorrection:
The requested URL is split into query and path (lines 3, 5 and 16). This allows us to perform the spell check against only the URL path, and to pass the query along to the page we redirect the user to. Edit distance is computed using PHP's built-in levenshtein() function (line 9), which computes the minimum number of insertions, replacements and deletions needed to transform one string into another. Levenshtein is an O(n²) operation, where n is the length of the string. If you have a large list of k URLs to compare against, the total time will be O(k·n²). In our experience n and k are small enough that the performance hit is minimal, especially considering this code only runs when an error is encountered, which isn't the expected case. If your n is large enough, you can always write a custom levenshtein function with an early-stopping criterion. The best score is the one with the fewest edits. Line 12 finds the best score, and line 13 retrieves a page that corresponds to the best score. Note: there may be more than one such page, but there is always at least one, which is why we don't bother checking the size of the array. Line 15 is a sanity check.
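The logic just described can be sketched as follows (the line numbers cited above refer to the post's original listing, not this sketch; the $maxEdits threshold used for the sanity check is our assumption):

```php
<?php
// Sketch of a makeSpellCorrection() helper: find the candidate page with
// the smallest edit distance to the requested path, and redirect only
// when the match is plausibly a typo.
function makeSpellCorrection($requestedUrl, $pages, $maxEdits = 3) {
    // Split the requested URL so we spell-check only the path and pass
    // the query string through to the corrected page.
    $path  = parse_url($requestedUrl, PHP_URL_PATH);
    $query = parse_url($requestedUrl, PHP_URL_QUERY);

    $scores = array();
    foreach ($pages as $page) {
        // Minimum insertions, replacements and deletions between strings.
        $scores[$page] = levenshtein($path, $page);
    }

    $best = min($scores);                      // fewest edits wins
    $candidates = array_keys($scores, $best);  // may hold several pages,
    $page = $candidates[0];                    // but never zero

    // Sanity check: too many edits means no plausible match.
    if ($best > $maxEdits) {
        return null;
    }
    return $page . ($query !== null && $query !== '' ? '?' . $query : '');
}
```

A caller would redirect with header('Location: ...') when this returns a page, and fall through to the normal 404 content when it returns null.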
⚠ A word of caution: It may be tempting to combine the functions getAllPages and makeSpellCorrection, especially if your getAllPages is similar to the one we proposed. If you do, make sure you are performing the edit distance computation against the path in its entirety rather than piecewise. Otherwise your edit distances may be disproportionately large.
Consider the following example:
The levenshtein edit distance between the two full paths is 1 (there's an extra slash in the second string that needs to be deleted). Performing piecewise edit distance instead would give you an edit distance of 10: 4 insertions to go from the directory "/h" in the second string to "/howto", and 6 deletions to remove the second directory.
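The same pitfall can be demonstrated with a pair of hypothetical paths (chosen by us for illustration, so the exact counts differ from the example above): a single stray slash costs 1 edit on the full path, but inflates badly when each directory segment is compared separately.

```php
<?php
// Hypothetical paths: the requested URL has a stray slash after "/h".
$intended  = '/howto';
$requested = '/h/owto';

// Full-path comparison: one deletion removes the stray slash.
$full = levenshtein($intended, $requested);   // 1

// Piecewise comparison, segment by segment, inflates the distance:
// "howto" vs "h" costs 4, and the extra segment "owto" costs 4 more.
$a = explode('/', trim($intended, '/'));      // ["howto"]
$b = explode('/', trim($requested, '/'));     // ["h", "owto"]
$piecewise = 0;
$n = max(count($a), count($b));
for ($i = 0; $i < $n; $i++) {
    $piecewise += levenshtein(
        isset($a[$i]) ? $a[$i] : '',
        isset($b[$i]) ? $b[$i] : ''
    );
}
// $full is 1, while $piecewise is 8.
```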