What is Minification?

Minification refers to the process of removing all unnecessary characters from a file while leaving the core functionality of the code intact. The end result is a new file which is smaller than the original, yet identical from a machine perspective. The core benefit of smaller files is that they require less bandwidth and are faster for the client to download. Although not intended, the minification process can make code more difficult for humans to read, which is why minification can also be seen as lightweight obfuscation.

Want to skip the technical discussion in this article? No problem, you can try the latest version of our Minifier.

How data·yze does Minification

Before we begin the technical discussion we should explain where and how we use minification. We use PHP to transfer files from our development environment to production, and we automate the minification process during this publication step. This approach gives us all of the benefits of minification (smaller files which require less bandwidth for the users) without the drawbacks (forcing our developers to work with giant blobs of difficult-to-read code). Since minification is performed when a page is published, and not each time a page is accessed, we're less concerned with the speed of the minifier itself. Thus we opt for readability and ease of debugging over performance.
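A publish-time step of this kind might look like the sketch below. It is only an illustration: publishFile(), its arguments, and the idea of passing the minifier in as a callable are assumptions, not data·yze's actual deployment code.

```php
<?php
// Hypothetical publish step: publishFile() and its arguments are
// illustrative and not taken from the article's deployment code.
function publishFile(string $source, string $target, callable $minify): int {
    $original = file_get_contents($source);
    if ($original === false) {
        throw new RuntimeException("Cannot read $source");
    }
    // Minify once, at publication time, rather than on every request.
    $minified = $minify($original);
    if (file_put_contents($target, $minified) === false) {
        throw new RuntimeException("Cannot write $target");
    }
    return strlen($original) - strlen($minified);   // bytes saved
}
```

With the minifyPHP() function developed later in this article, a call could look like publishFile('dev/index.php', 'prod/index.php', 'minifyPHP'), where both paths are made up for the example.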

A word of caution: when minifying, always test that the final product still behaves as expected!

Challenges in Minimizing HTML (without Embedded PHP)

When we talk about minimizing HTML we're often referring to minimizing white space. By default, web browsers collapse multiple white space characters into a single space, yet HTML source code formatted according to style guidelines often contains long blocks of white space to make the source code more human readable.

For example this:

    <html>
        <head>
            <title> ... </title>
        </head>
        <body>
            <table>
                <tr>
                    <td> ... </td>
                </tr>
            </table>
        </body>
    </html>

is a much longer, albeit easier-to-read, equivalent of:

<html><head><title> ... </title></head><body><table><tr><td> ... </td></tr></table></body></html>
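As a minimal sketch of the white space collapse itself, the snippet below applies the same three patterns that minifyPHP() uses later in this article. The helper name collapseWhitespace() is hypothetical. Note that these patterns keep a single space between tags rather than removing it entirely.

```php
<?php
// Hypothetical helper (not part of the article's code): collapse runs of
// white space, then drop white space just inside the angle brackets.
function collapseWhitespace(string $html): string {
    return preg_replace(
        array('/\s+/', '/<\s+/', '/\s+>/'),
        array(' ', '<', '>'),
        $html
    );
}

$formatted = "<table>\n    <tr>\n        <td> ... </td>\n    </tr>\n</table>";
echo collapseWhitespace($formatted), "\n";
// prints: <table> <tr> <td> ... </td> </tr> </table>
```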

Different sections of the HTML page need to be handled differently. Specifically, white space affects the execution of embedded CSS and JavaScript differently than it affects HTML, and minimizing them as though they were HTML could introduce bugs into the code.

Additionally, it is worth noting that any white space inside a <pre> element needs to be preserved, as does any in an element styled with white-space:pre or white-space:pre-wrap. To keep our solution general and avoid the need for a full CSS interpreter, we do not handle elements whose white-space style is set through CSS. We'll show you how you can modify the script to account for specific such elements should you need to. Our solution is not general enough to analyze CSS and determine on the fly which elements need to have their white space preserved by class name.
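One way to protect specific elements is to swap them for placeholders before collapsing white space and to restore them afterwards. The sketch below is self-contained and only illustrates the idea: protectElements(), the @@KEEPn@@ placeholder format, and the choice of textarea as the extra tag are all assumptions, not the article's own placeholder mechanism.

```php
<?php
// Sketch only: protectElements() and its placeholder format are hypothetical.
// Each listed element is replaced by a unique key and stored for later.
function protectElements(string $html, array $tags, array &$store): string {
    foreach ($tags as $tag) {
        $name    = preg_quote($tag, '/');
        $pattern = '/<' . $name . '\b[^>]*>.*?<\/' . $name . '\s*>/si';
        $html = preg_replace_callback($pattern, function ($m) use (&$store) {
            $key = '@@KEEP' . count($store) . '@@';
            $store[$key] = $m[0];          // remember the untouched element
            return $key;
        }, $html);
    }
    return $html;
}

$store = array();
$html  = "<div>\n  <pre>a\n   b</pre>\n  <textarea> x  y </textarea>\n</div>";
$html  = protectElements($html, array('pre', 'textarea'), $store);
$html  = preg_replace('/\s+/', ' ', $html);   // collapse everything else
$html  = strtr($html, $store);                // restore protected blocks
echo $html, "\n";
```

The white space inside the pre and textarea elements survives, while the white space around them is collapsed to single spaces.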

Challenges in Minimizing HTML (with Embedded PHP)

The primary challenge with PHP is that, when executed, it will likely insert some information into the HTML document being sent to the user. Usually this is text, but PHP can also be used for control flow, changing which JavaScript, CSS and even HTML elements are output in the final HTML document sent to the user.

Consider the following obviously contrived case:

    <style>
    body{
    background-color:<?php echo 'white;}</style>'; ?>
    <body>

In the above example, the closing style tag is written to the output buffer by the executed PHP code. To know that the trailing <body> tag should be treated as HTML, our minifying script would need to interpret the PHP code, which would increase the complexity of our minifying code dramatically.

To make the problem tractable, we make the following assumptions about our PHP code:

We do allow PHP to be used to output JavaScript, HTML, and CSS as long as only one language is written to the output buffer in each block of PHP code. This allows us to pass variables and information easily from PHP to JavaScript (e.g. var foo = <?php echo $bar; ?>;).

Since PHP is executed server side rather than client side, non-minified PHP code does not affect HTML file size or bandwidth and does not need to be minified.

Minimizing HTML

Our approach to minimizing HTML is going to be similar to our approach to minimizing JavaScript: shallow parse the HTML looking for CSS/JavaScript/PHP. A shallow parse is one that has only a superficial understanding of structure. A shallow parse is sufficient for this use case, and has a much smaller footprint than a deep parse. When we encounter CSS or JavaScript during our shallow parse, we minimize it as appropriate.
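To make the idea of a shallow parse concrete, here is a small self-contained sketch that splits a page into script blocks, style blocks, and everything else without building a document tree. The function shallowSegments() is hypothetical and is not how minifyPHP() below is implemented; it only illustrates the level of understanding a shallow parse needs.

```php
<?php
// Hypothetical illustration of a shallow parse: split the page into
// <script> blocks, <style> blocks, and everything else, without a DOM.
function shallowSegments(string $html): array {
    $parts = preg_split(
        '/(<script\b[^>]*>.*?<\/script\s*>|<style\b[^>]*>.*?<\/style\s*>)/si',
        $html,
        -1,
        PREG_SPLIT_DELIM_CAPTURE | PREG_SPLIT_NO_EMPTY
    );
    $segments = array();
    foreach ($parts as $part) {
        if (stripos($part, '<script') === 0) {
            $type = 'javascript';
        } elseif (stripos($part, '<style') === 0) {
            $type = 'css';
        } else {
            $type = 'html';
        }
        $segments[] = array($type, $part);   // pair of (type, raw text)
    }
    return $segments;
}

$segments = shallowSegments('<p>a</p><style>b{}</style><script>c()</script>');
```

Each segment can then be handed to the minifier that matches its type.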

Important Detail: Don't forget to copy the code from parts 1 and 2 (Minifying CSS with PHP and Minifying JavaScript with PHP) of this three-part Minification series, or your minifyPHP() function will not work!

As before, the first thing we want to do is preserve important strings. This time we want to be sure we preserve all embedded PHP. To do this we create the function preserveEmeddedPHP():

    function preserveEmeddedPHP($string){
        global $minificationStore, $singleQuoteSequenceFinder, $doubleQuoteSequenceFinder;

        $start_idx = strpos($string, '<?'); // matches both <? and <?php
        if (strlen($string)==0){return $string;}

        if ($start_idx !== false){
            // need to find first end terminator not in quote
            $php_len = 2;
            while (true){
                // start looking for the PHP terminator from the PHP start
                $tmp_string = substr($string, $start_idx + $php_len);
                $end_php = strpos($tmp_string, '?>');
                $end_php = ($end_php !== false ? $end_php+2 : strlen($tmp_string));

                // find the closest string
                $quote_start = false;
                $singleQuoteSequenceFinder->findFirstValue($tmp_string);
                $doubleQuoteSequenceFinder->findFirstValue($tmp_string);
                if ($singleQuoteSequenceFinder->isValid() && (!$doubleQuoteSequenceFinder->isValid() || $singleQuoteSequenceFinder->start_idx < $doubleQuoteSequenceFinder->start_idx)){
                    $quote_start = $singleQuoteSequenceFinder->start_idx;
                    $quote_end = $singleQuoteSequenceFinder->end_idx;
                } else if ($doubleQuoteSequenceFinder->isValid()){
                    $quote_start = $doubleQuoteSequenceFinder->start_idx;
                    $quote_end = $doubleQuoteSequenceFinder->end_idx;
                }

                // check if end terminator before string declared. If not, start search again after the string declared
                if ($quote_start === false || $end_php <= $quote_start){
                    $php_len += $end_php;
                    break;
                } else {
                    $php_len += $quote_end;
                }
            }

            // store the found PHP
            $php_substr = substr($string, $start_idx, $php_len);
            $placeHolder = getNextMinificationPlaceholder();
            $newstring = substr($string, 0, $start_idx).$placeHolder.substr($string, $start_idx+$php_len);
            $minificationStore[$placeHolder] = $php_substr;

            // search for next embedded PHP to preserve
            return preserveEmeddedPHP($newstring);
        }
        return $string;
    }

The function preserveEmeddedPHP() seems simple enough. When a <? is detected in a string, preserveEmeddedPHP() sets $end_php to the position just past the first '?>' that follows it. We can't stop there: that closing '?>' may have occurred in the middle of a string literal, in which case it is not intended to end the PHP code block. The function therefore also searches for the nearest single-quoted and double-quoted strings. If the '?>' comes before the nearest string, the block ends there; otherwise the search starts again after that string.
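The same idea can be shown without the sequence finder classes from the earlier parts. The following sketch scans character by character and skips over quoted strings while looking for the closing tag. It is a simplified stand-in, not the article's implementation: findPhpBlockEnd() is a hypothetical name, and it assumes only single-quoted and double-quoted literals matter (no heredocs, no comments containing quotes).

```php
<?php
// Sketch under simplifying assumptions: only '...' and "..." literals are
// recognised. Returns the index just past the closing tag of the PHP block
// that opens at $start, or the string length if the block is never closed.
function findPhpBlockEnd(string $s, int $start): int {
    $len = strlen($s);
    $i = $start + 2;                            // skip the opening "<?"
    while ($i < $len) {
        $ch = $s[$i];
        if ($ch === "'" || $ch === '"') {
            $i++;
            while ($i < $len && $s[$i] !== $ch) {
                if ($s[$i] === '\\') { $i++; }  // skip an escaped character
                $i++;
            }
            $i++;                               // step past the closing quote
        } elseif ($ch === '?' && $i + 1 < $len && $s[$i + 1] === '>') {
            return $i + 2;                      // found a real closing tag
        } else {
            $i++;
        }
    }
    return $len;
}

$src   = "x<?php echo 'a ?> b'; ?>y";
$start = strpos($src, '<?');
$block = substr($src, $start, findPhpBlockEnd($src, $start) - $start);
echo $block, "\n";
// prints: <?php echo 'a ?> b'; ?>
```

The closing tag inside the quoted string is skipped, so the whole block is captured rather than being cut off after 'a.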

Next we're going to extend our MinificationSequenceFinder to create RegexSequenceFinder, a sequence finder capable of searching with regular expressions:

    class RegexSequenceFinder extends MinificationSequenceFinder {
        protected $regex;
        public $sub_match;
        public $sub_start_idx;
        public $start_idx = false;
        public $full_match;
        public $end_idx;

        function __construct($type, $regex) {
            $this->type = $type;
            $this->regex = $regex;
        }

        public function findFirstValue($string){
            preg_match($this->regex, $string, $matches, PREG_OFFSET_CAPTURE);
            if (count($matches) == 0){
                $this->start_idx = false;
                return false;
            }

            // full match
            $this->full_match = $matches[0][0];
            $this->start_idx = $matches[0][1];

            if (count($matches) > 1){
                // substart
                $this->sub_match = $matches[1][0];
                $this->sub_start_idx = $matches[1][1];
            }
            $this->end_idx = $this->start_idx + strlen($this->full_match);
        }
    }

RegexSequenceFinder searches for the first occurrence of the regex in the sample string. Once it's found, the start index (start_idx) and the end index (end_idx) of the entire match are stored. If a submatch is also found, its start index (sub_start_idx) and the matching substring (sub_match) are stored as well.
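To see where those values come from, it helps to look at the array that preg_match() fills in when PREG_OFFSET_CAPTURE is set: every entry is a pair of matched text and byte offset. The snippet below is a standalone illustration with made-up sample markup; the variable names simply mirror the properties of RegexSequenceFinder.

```php
<?php
// Illustration of the array shape preg_match() produces with
// PREG_OFFSET_CAPTURE; the sample markup is invented for this example.
$html = '<p>x</p><style type="text/css">b{}</style>';
preg_match('/<\s*style(?:[^>]*)>(.*?)<\s*\/style\s*>/si', $html, $m, PREG_OFFSET_CAPTURE);

$full_match    = $m[0][0];   // the whole style element
$start_idx     = $m[0][1];   // byte offset of the full match: 8
$sub_match     = $m[1][0];   // the captured CSS only: b{}
$sub_start_idx = $m[1][1];   // byte offset of the CSS: 31
$end_idx       = $start_idx + strlen($full_match);   // 42, the end of $html
```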

Now we're ready to minify. To do this we create a function minifyPHP(). We use RegexSequenceFinder to create a finder for JavaScript declarations, where the submatch is the JavaScript; a finder for CSS, where the submatch is the CSS; and a finder for <pre> HTML blocks. getNextSpecialSequence() returns the first occurring special sequence, and the switch statement handles each special sequence according to its type: CSS, JavaScript, or pre (the default case).

    function minifyPHP($html){
        global $minificationStore;

        $html_special_chars = array(
            new RegexSequenceFinder('javascript', "/<\s*script(?:[^>]*)>(.*?)<\s*\/script\s*>/si"), // javascript, can have type attribute
            new RegexSequenceFinder('css', "/<\s*style(?:[^>]*)>(.*?)<\s*\/style\s*>/si"), // css, can have type/media attribute
            new RegexSequenceFinder('pre', "/<\s*pre(?:[^>]*)>(.*?)<\s*\/pre\s*>/si") // pre
        );

        $html = preserveEmeddedPHP($html);

        // pull out everything that needs to be pulled out and saved
        while ($sequence = getNextSpecialSequence($html, $html_special_chars)){
            $placeholder = getNextMinificationPlaceholder();
            $quote = substr($html, $sequence->start_idx, $sequence->end_idx - $sequence->start_idx);

            // subsequence (css/javascript/pre) needs special handling, tags can still be minimized using minifyPHP
            $sub_start = $sequence->sub_start_idx - $sequence->start_idx;
            $sub_end = $sub_start + strlen($sequence->sub_match);

            switch ($sequence->type) {
                case 'javascript':
                    $quote = minifyPHP(substr($quote,0,$sub_start)).minifyJavascript($sequence->sub_match).minifyPHP(substr($quote, $sub_end));
                    break;
                case 'css':
                    $quote = minifyPHP(substr($quote,0,$sub_start)).minifyCSS($sequence->sub_match).minifyPHP(substr($quote, $sub_end));
                    break;
                default: // strings that need to be preserved, e.g. between <pre> tags
                    $quote = minifyPHP(substr($quote,0,$sub_start)).$sequence->sub_match.minifyPHP(substr($quote, $sub_end));
            }

            $minificationStore[$placeholder] = $quote;
            $html = substr($html, 0, $sequence->start_idx).$placeholder.substr($html, $sequence->end_idx);
        }

        // condense white space
        $html = preg_replace(
            array('/\s+/','/<\s+/', '/\s+>/'),
            array(' ', '<', '>'),
            $html);

        // remove comments
        $html = preg_replace('/<!--([^-](?!(->)))*-->/', '', $html);

        // put back the preserved strings
        foreach($minificationStore as $placeholder => $original){
            $html = str_replace($placeholder, $original, $html);
        }
        return trim($html);
    }

As stated above, our minification of HTML is just a straightforward collapsing of white space. There's still room for improvement: white space between adjacent tags, with no other characters in between, is only collapsed to a single space rather than removed. For example, " <i> <b> " is functionally equivalent to " <i><b>". Nevertheless, this is a good first pass.

Evaluating the Output

To see how minifyPHP does, consider the following input.

<html><style> /* this should be removed */body{background-color:white;}</style>/* this should NOT be removed */ <!-- REMOVE ME! --></html>

The above example contains some CSS that needs to be minified as CSS, not HTML, and an HTML comment. Minified, we get:

<html><style>body{background-color:white}</style>/* this should NOT be removed */ </html>
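One limitation worth testing for: the comment pattern used in minifyPHP() only matches comments whose text contains no hyphens, so a comment such as <!-- well-known note --> is left in place. If that matters for your pages, a lazy match is one possible alternative. The sketch below is an assumption on our part rather than part of the series' code; stripHtmlComments() is a hypothetical name, and the lazy pattern would also remove conditional comments, which may or may not be what you want.

```php
<?php
// Hypothetical alternative, not the article's pattern: a lazy match that
// also removes comments whose text contains hyphens.
function stripHtmlComments(string $html): string {
    return preg_replace('/<!--.*?-->/s', '', $html);
}

$sample = "<p>a</p><!-- well-known note --><p>b</p>";

// The pattern used in minifyPHP() leaves this comment untouched...
$article = preg_replace('/<!--([^-](?!(->)))*-->/', '', $sample);

// ...while the lazy pattern removes it.
$lazy = stripHtmlComments($sample);
echo $lazy, "\n";
// prints: <p>a</p><p>b</p>
```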


Want to give it a try? Use our Minifier.

Code License

Although code shared on data·yze is source-available, it is still proprietary and data·yze maintains its intellectual property rights. In particular, data·yze restricts redistribution of the code. Code displayed above may be copied, modified, displayed or adapted for use on other websites (commercial or otherwise) only under certain conditions and may not be repackaged or redistributed. See Terms for details.