
What is Minification?

Minification is the process of removing all unnecessary characters from code while leaving its core functionality intact. The result is a new file that is smaller than the original yet identical from a machine's perspective. The core benefit of smaller files is that they require less bandwidth and are faster for the client to download. Although it is not the intent, the minification process can make code more difficult for humans to read, which is why minification can also be seen as lightweight obfuscation.

Here at data·yze we use PHP to push files from our development environment to production, and we automate the minification process during this push step. This approach gives us all of the benefits of minification (smaller files which require less bandwidth for the users) without the drawbacks (forcing our developers to work with giant blobs of difficult-to-read code).

A word of caution: when minifying, always test that the final product still behaves as expected!
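One lightweight way to act on this caution is a smoke test that compares the visible text of a page before and after minification. The following is only a sketch: visibleText() is a hypothetical helper, not part of our push script, and it assumes that matching text is a good enough signal for your pages. It will not catch layout or script regressions.

```php
<?php
// Hypothetical helper: reduce markup to its visible text with white space normalized.
function visibleText(string $html): string {
    $text = strip_tags($html);
    return trim(preg_replace('/\s+/', ' ', $text));
}

// Sample input standing in for a real page and its minified counterpart.
$original = "<p>\n    Hello,\n    <b>world</b>\n</p>";
$minified = "<p> Hello, <b>world</b> </p>";

if (visibleText($original) !== visibleText($minified)) {
    throw new RuntimeException('Minified page no longer matches the original text.');
}
echo "Visible text unchanged\n";
```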

Challenges in Minimizing HTML (without Embedded PHP)

When we talk about minimizing HTML we're often referring to minimizing white space. By default, web browsers collapse multiple white spaces into a single space, yet nicely formatted HTML source code often contains long blocks of white space to make the source code more human readable.

For example this:

<html>
    <head>
        <title> ... </title>
    </head>
    <body>
        <table>
            <tr>
                <td> ... </td>
            </tr>
        </table>
    </body>
</html>

is a much longer, albeit easier-to-read, equivalent of:

<html><head><title> ... </title></head><body><table><tr><td> ... </td></tr></table></body></html>
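To get a feel for the savings, here is a minimal sketch that collapses runs of white space in a small, made-up fragment and prints the size before and after. The sample markup is ours for illustration only.

```php
<?php
// Nicely formatted source, as a developer would write it.
$formatted = "<table>\n    <tr>\n        <td>cell</td>\n    </tr>\n</table>";

// Collapse every run of white space into a single space.
$collapsed = preg_replace('/\s+/', ' ', $formatted);

printf("%d bytes before, %d bytes after\n", strlen($formatted), strlen($collapsed));
```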

It is worth noting that any white space inside a <pre> element is preserved, as is any white space in an element styled with white-space:pre or white-space:pre-wrap. To keep our solution general, and to avoid the need for a full CSS interpreter, we're going to ignore elements whose white-space CSS property has been set. We'll show you how you can modify the script to account for specific such elements should you need to. Our solution is not general enough to analyze CSS and determine on the fly, by class name, which elements need to have their white space preserved.
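The <pre> rule can be illustrated in isolation. The sketch below uses a hypothetical helper, collapseOutsidePre(), which is not the function we build later in this article. It assumes <pre> elements are not nested and contain no embedded PHP.

```php
<?php
// Hypothetical helper: collapse white space everywhere except inside <pre>...</pre>.
function collapseOutsidePre(string $html): string {
    // With PREG_SPLIT_DELIM_CAPTURE and one capture group, the captured
    // <pre> blocks land at the odd indexes of the resulting array.
    $parts = preg_split('/(<pre\b[^>]*>.*?<\/pre>)/is', $html, -1, PREG_SPLIT_DELIM_CAPTURE);
    foreach ($parts as $i => $part) {
        if ($i % 2 === 0) {
            // Ordinary HTML: safe to collapse.
            $parts[$i] = preg_replace('/\s+/', ' ', $part);
        }
    }
    return implode('', $parts);
}

echo collapseOutsidePre("<p>a   b</p>\n<pre>x\n   y</pre>\n<p>c</p>");
```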

Challenges in Minimizing HTML (with Embedded PHP)

Different sections of the HTML page need to be handled separately. White space affects CSS and JavaScript differently than it affects HTML, and minifying them as though they were HTML could introduce bugs into the code.

The primary challenge with PHP is that, when executed, it will likely insert some information into the HTML document being sent to the user. Usually this is text, but PHP can also be used for control flow, changing which JavaScript, CSS and even HTML elements are output in the final HTML document sent to the user.

Consider the following obviously contrived case:

<style>
body{
background-color:<?php echo 'white;}</style>'; ?>
<body>

In the above example, the closing </style> tag is written to the output by the executed PHP code. To know that line 4 of the example (the <body> tag) should be treated as HTML, our minifying script would need to interpret the PHP code, which would increase its complexity dramatically.

To make the problem tractable, we make the following assumptions about our PHP code:

We do allow PHP to be used to output JavaScript, HTML, and CSS, as long as only one language is written to the output buffer in each block of PHP code. This allows us to pass variables and information easily from PHP to JavaScript (e.g. var foo = <?php echo $bar; ?>;).
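As a sketch of the variable-passing case, the snippet below builds the JavaScript line in plain PHP. Using json_encode() is our suggestion for this illustration rather than something the assumptions above require, and $bar holds made-up data. In a template, the same idea would read var foo = <?php echo json_encode($bar); ?>;.

```php
<?php
// Hypothetical value computed on the server.
$bar = ['user' => 'Ada', 'visits' => 3];

// json_encode() produces a valid JavaScript literal, so this block emits
// JavaScript only, in line with the one-language-per-block assumption.
$line = 'var foo = ' . json_encode($bar) . ';';
echo $line, "\n";
```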

Since PHP is executed server side rather than client side, non-minified PHP code does not affect the size of the HTML sent to the client or the bandwidth it uses, so the PHP itself does not need to be minified. And since the amount of code we expect the PHP to output is small, we don't worry about minifying that output either.

Minimizing HTML

Our approach to minimizing HTML is going to be similar to our approach to minimizing JavaScript: shallow parse the HTML looking for CSS/JavaScript/PHP. A shallow parse is one that has only superficial understanding of structure. A shallow parse is sufficient for this use case, and has a much smaller footprint than a deep parse.
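To make the idea of a shallow parse concrete, here is a minimal sketch that only looks for the earliest of a few markers and knows nothing else about the document. nextMarker() is a hypothetical helper written for this illustration. The function we actually use appears later in the article.

```php
<?php
// Hypothetical helper: report the earliest marker in $source as [offset, marker],
// or null when none of the markers occurs.
function nextMarker(string $source, array $markers): ?array {
    $best = null;
    foreach ($markers as $marker) {
        $pos = stripos($source, $marker);   // case-insensitive search
        if ($pos !== false && ($best === null || $pos < $best[0])) {
            $best = [$pos, $marker];
        }
    }
    return $best;
}

$page = '<p>Hi</p><style>p{color:red}</style><?php echo $x; ?>';
var_dump(nextMarker($page, ['<?', '<style', '<script']));
```

Everything before the reported offset can be treated as plain HTML. What follows it is handed to whichever minifier matches the marker.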

When we encounter CSS or JavaScript during our shallow parse, we minimize as appropriate. We're going to make one modification to our previous JavaScript minification.

Modified function minifyJavascript()

function minifyJavascript($javascript, $inQuote = false){
    $buffer = '';
    if ($inQuote != false){
        $idx_end = getNonEscapedQuoteIndex($javascript, $inQuote)+1;
        if ($idx_end == 0){
            return array($javascript, $inQuote);
        }
        $quote = substr($javascript, 0, $idx_end);
        $quote = str_replace("\\\n",' ',$quote);
        $quote = preg_replace("/\s+/",' ',$quote);
        $buffer = $quote;
        $javascript = substr($javascript, $idx_end);
        $inQuote = false;
    }
    while (list($idx_start, $keyElement) = getNextKeyElement($javascript)){
        switch ($keyElement){
            case '//':
                $idx_end = strpos($javascript, PHP_EOL, $idx_start);
                if ($idx_end !== false){
                    $javascript = substr($javascript, 0, $idx_start) . substr($javascript,$idx_end);
                } else {
                    $javascript = substr($javascript, 0, $idx_start);
                }
                break;
            case '/*':
                $idx_end = strpos($javascript, '*/', $idx_start)+2;
                $javascript = substr($javascript, 0, $idx_start) . substr($javascript,$idx_end);
                break;
            default:
                // string case
                if ($keyElement == '\'' || $keyElement == '"'){
                    $idx_end = getNonEscapedQuoteIndex($javascript, $keyElement, $idx_start+1)+1;
                } else {
                    $idx_end = $idx_start + strlen($keyElement);
                }
                // php is embedded in string in javascript
                if ($idx_end == 0){
                    $idx_end = strlen($javascript);
                    $inQuote = $keyElement;
                }
                $buffer .= minifyJavascriptCode(substr($javascript, 0, $idx_start));
                $quote = substr($javascript, $idx_start, ($idx_end-$idx_start));
                $quote = str_replace("\\\n",' ',$quote);
                $quote = preg_replace("/\s+/",' ',$quote);
                $buffer .= $quote;
                $javascript = substr($javascript, $idx_end);
        }
    }
    if ($inQuote){
        return array($buffer, $inQuote);
    }
    $buffer .= minifyJavascriptCode($javascript);
    return $buffer;
}

The primary difference between our updated minifyJavascript() function and the one written in the Minifying JavaScript article is the additional optional parameter, $inQuote. This parameter indicates that the code we've been handed starts inside a quoted string, as in our variable passing example, and should be treated as such.
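The function above relies on getNonEscapedQuoteIndex(), which belongs to the Minifying JavaScript article and is not reproduced here. The sketch below is a hypothetical stand-in, findUnescapedQuote(), written only to show how an unterminated string can be detected. We assume it returns -1 when no closing quote is found, which is what the +1 followed by the == 0 check above suggests. The real helper may differ.

```php
<?php
// Hypothetical stand-in for getNonEscapedQuoteIndex(); the real helper may differ.
// Returns the index of the first unescaped $quote at or after $offset, or -1 if none.
function findUnescapedQuote(string $code, string $quote, int $offset = 0): int {
    $length = strlen($code);
    for ($i = $offset; $i < $length; $i++) {
        if ($code[$i] === '\\') {
            $i++;            // skip the character that the backslash escapes
            continue;
        }
        if ($code[$i] === $quote) {
            return $i;
        }
    }
    return -1;
}

// The JavaScript chunk that precedes an embedded PHP block: the string is still open.
$chunk = 'var greeting = "Hello ';
$end = findUnescapedQuote($chunk, '"', strpos($chunk, '"') + 1);
var_dump($end === -1);   // an open string is the signal to carry $inQuote forward
```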

function getHTMLKeyControlElements()

Next we need a function to guide our shallow parser and indicate which control element we encounter next: <style> for CSS, <script> for JavaScript, <? for PHP code, etc. The return value is a two-element array: the first value is the offset of the first occurring control sequence, and the second is a key that indicates which control sequence was encountered.

function getHTMLKeyControlElements($php){
    $elements = array();
    // strpos() yields false when there is no PHP tag; filtered out below.
    $elements['<?'] = strpos($php, '<?');
    // A successful preg_match() always fills $matches[0], so the offset can be
    // stored directly (this also covers a tag found at offset 0).
    if (preg_match("/<\s*script(?:\s+type=\"text\/javascript\")?\s*>/i", $php, $matches, PREG_OFFSET_CAPTURE)){
        $elements['<script>'] = $matches[0][1];
    }
    if (preg_match("/<\s*style(?:\s+type=\"text\/css\")?\s*>/i", $php, $matches, PREG_OFFSET_CAPTURE)){
        $elements['<style>'] = $matches[0][1];
    }
    // Special case: a div with class phpcode, whose white space must be preserved.
    if (preg_match("/<\s*div\s+class\s*=\s*\"phpcode\"\s*>/i", $php, $matches, PREG_OFFSET_CAPTURE)){
        $elements['<div>'] = $matches[0][1];
    }
    if (preg_match("/<\s*pre\s*>/i", $php, $matches, PREG_OFFSET_CAPTURE)){
        $elements['<pre>'] = $matches[0][1];
    }
    $elements = array_filter($elements, function($k){return $k !== false;});
    if (count($elements) == 0){return false;}
    // The smallest offset is the control sequence that occurs first.
    $min = min($elements);
    return array($min, array_keys($elements, $min)[0]);
}

The key stored in the elements array is an indicator of control flow, indicating which language we're switching to. Tags may appear in slightly different forms, such as <script type="text/javascript"> or <SCRIPT>. By storing a canonical version of each tag type, we can more easily handle each case with a simple switch statement.

The value stored in the elements array is the offset at which the control sequence was first encountered.
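The canonical-key idea can be tried on its own. The sketch below uses made-up input and a looser pattern than ours (it accepts any attributes on the tag, which is an assumption made for brevity) to show how PREG_OFFSET_CAPTURE reports where a tag was found, whatever its spelling.

```php
<?php
$source = 'text <SCRIPT type="text/javascript">var a;</SCRIPT> <style>p{}</style>';

$offsets = [];
// PREG_OFFSET_CAPTURE turns each match into a [text, byte offset] pair.
if (preg_match('/<\s*script\b[^>]*>/i', $source, $m, PREG_OFFSET_CAPTURE)) {
    $offsets['<script>'] = $m[0][1];   // canonical key, whatever the tag's spelling
}
if (preg_match('/<\s*style\b[^>]*>/i', $source, $m, PREG_OFFSET_CAPTURE)) {
    $offsets['<style>'] = $m[0][1];
}

// The smallest offset identifies the tag that comes first in the source.
$first = min($offsets);
$key = array_search($first, $offsets, true);
echo "$key at offset $first\n";
```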

Note the check that searches for a div with class phpcode (and no additional attributes). This class is a special case at data·yze where white-space has been set to pre-wrap, and we're guaranteed to have no nested div elements. We have chosen to leave it in the code as an example of how you could handle white-space:pre elements, or classes that set white-space to pre or pre-wrap, should you have them on your site.

function handlePHP()

The next function, handlePHP(), performs the specific minification needed for each code block.

function handlePHP($file){
    $php = file_get_contents($file);
    $buffer = '';
    while (list($start_idx, $key) = getHTMLKeyControlElements($php)){
        switch ($key){
            case '<?':
                // Copy the PHP block through untouched.
                $end_idx = strpos($php, '?>', $start_idx+1);
                $buffer .= minifyHTML(substr($php, 0, $start_idx)) . substr($php, $start_idx, $end_idx+2-$start_idx);
                $php = substr($php, $end_idx+2);
                break;
            case '<style>':
                $buffer .= minifyHTML(substr($php, 0, $start_idx)).'<style type="text/css">';
                $php = substr($php, strpos($php, '>', $start_idx+1)+1);
                // stripos() so that </STYLE> is found as well.
                $end_idx = stripos($php, '</style>');
                // Guard against strpos() returning false once no PHP block is left.
                while (($tmp_idx = strpos($php, '<?')) !== false && $tmp_idx < $end_idx){
                    $tmp_end_idx = strpos($php, '?>', $tmp_idx) + 2;
                    $buffer .= minifyCSS(substr($php, 0, $tmp_idx)) . substr($php, $tmp_idx, $tmp_end_idx-$tmp_idx);
                    $php = substr($php, $tmp_end_idx);
                    $end_idx = stripos($php, '</style>');
                }
                $buffer .= minifyCSS(substr($php, 0, $end_idx)). '</style>';
                $php = substr($php, $end_idx+8);
                break;
            case '<script>':
                $buffer .= minifyHTML(substr($php, 0, $start_idx)).'<script type="text/javascript">';
                $php = substr($php, strpos($php, '>', $start_idx+1)+1);
                $inQuote = false;
                $end_idx = stripos($php, '</script>');
                while (($tmp_idx = strpos($php, '<?')) !== false && $tmp_idx < $end_idx){
                    $tmp_end_idx = strpos($php, '?>', $tmp_idx) + 2;
                    $result = minifyJavascript(substr($php, 0, $tmp_idx), $inQuote);
                    if (is_array($result)){
                        // Still inside a quoted string: carry the quote character forward.
                        $buffer .= $result[0];
                        $inQuote = $result[1];
                    } else {
                        $buffer .= $result;
                        $inQuote = false;
                    }
                    $buffer .= substr($php, $tmp_idx, $tmp_end_idx-$tmp_idx);
                    $php = substr($php, $tmp_end_idx);
                    $end_idx = stripos($php, '</script>');
                }
                $result = minifyJavascript(substr($php, 0, $end_idx), $inQuote);
                // An unterminated string yields an array; keep only the code.
                $buffer .= (is_array($result) ? $result[0] : $result) . '</script>';
                $php = substr($php, $end_idx + 9);
                break;
            case '<div>':
                // white-space:pre-wrap special case: copy the div verbatim.
                $end_idx = stripos($php, '</div>', $start_idx+1);
                $buffer .= minifyHTML(substr($php, 0, $start_idx)) . substr($php, $start_idx, $end_idx+6-$start_idx);
                $php = substr($php, $end_idx+6);
                break;
            case '<pre>':
                // Without this case a <pre> tag would never be consumed and the
                // loop would not terminate. Copy the element verbatim.
                $end_idx = stripos($php, '</pre>', $start_idx+1);
                $buffer .= minifyHTML(substr($php, 0, $start_idx)) . substr($php, $start_idx, $end_idx+6-$start_idx);
                $php = substr($php, $end_idx+6);
                break;
        }
    }
    $buffer .= minifyHTML($php);
    return $buffer;
}

This code works as follows. We use a switch statement on the next control element returned by getHTMLKeyControlElements(). In each case, we minify the block of code from the current offset up to the next control element as HTML. We then handle the block from the control element to its matching closing element as appropriate. Inside <style> and <script> tags we also need to look for embedded PHP.

The '<div>' case corresponds to our specific white-space:pre-wrap class. Note that no minification is performed within the div, as the white space needs to be preserved.

function minifyHTML()

function minifyHTML($html){ return preg_replace('/\s+/',' ', $html); }

As stated above, the minifying of HTML is just a straightforward collapsing of white space. There's still room for improvement. After all, browsers also collapse white space across adjacent tags when no non-white-space characters separate them. For example, " <i> <b> " is functionally equivalent to " <i><b>". Nevertheless this is a good first pass.
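One possible second pass, sketched here as a hypothetical minifyHTMLAggressive() that is not part of our push script, removes white space sitting directly between two tags. It assumes the markup is block-level, such as tables and page structure. Between inline elements, as in </b> <i>, the space is visible to the reader, and removing it would join the two words.

```php
<?php
// Hypothetical second pass: drop white space that sits directly between two tags.
function minifyHTMLAggressive(string $html): string {
    $html = preg_replace('/\s+/', ' ', $html);   // same first pass as minifyHTML()
    return preg_replace('/>\s+</', '><', $html); // then remove tag-to-tag gaps
}

echo minifyHTMLAggressive("<table>\n  <tr>\n    <td> a </td>\n  </tr>\n</table>");
```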

The Complete Minification Article Series: