What is Minification?

In web development it's often advantageous to minify embedded files. Minification is the process of removing all unnecessary characters while leaving the core functionality intact. The end result is a new file that is smaller than the original, yet identical from a machine's perspective. The core benefit of smaller files is that they require less bandwidth and are faster for the client to download. Although it is not the purpose, minification can make code more difficult for humans to read, which is why it can also be seen as lightweight obfuscation.

Here at data·yze we use PHP to push files from our development environment to production. We opt to automate the minification process during this push step. This gives us all of the benefits of minification without the drawbacks. Our users get the smaller, faster-to-download version of the code. Our developers in the testing environment work with the larger, more human-readable version of the code.
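A push step along these lines could look roughly like the sketch below. The function name pushToProduction, the flat directory layout, and the $minify callback are all assumptions made for illustration; this is not the actual deployment script.

```php
<?php
// Hypothetical push step (not the actual deployment script):
// copies each file in $devDir to $prodDir and runs .js files
// through the supplied $minify callback on the way.
function pushToProduction($devDir, $prodDir, callable $minify){
    if (!is_dir($prodDir)){
        mkdir($prodDir, 0755, true);
    }
    foreach (scandir($devDir) as $name){
        $source = $devDir . '/' . $name;
        if (!is_file($source)){
            continue; // skips '.', '..' and subdirectories in this sketch
        }
        $contents = file_get_contents($source);
        if (strtolower(pathinfo($name, PATHINFO_EXTENSION)) === 'js'){
            $contents = $minify($contents);
        }
        file_put_contents($prodDir . '/' . $name, $contents);
    }
}
```

Any callable that takes a JavaScript string and returns a string can be passed as $minify, so the development copies are never modified.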

At data·yze, minifying JavaScript reduces the size of our files by 24% on average. Here is the code we use:

Easy JavaScript Minification

JavaScript can be a little nerve-racking to minify. Statements can be terminated by a semicolon or a newline. Statements can also span multiple lines, and semicolons can exist within statements. If you're not careful you can break your code. Fortunately, you can get large gains with simple tricks. The following function, easyMinify, achieves an 11% reduction by only condensing runs of whitespace, and it still keeps the code pretty readable.

function easyMinify($javascript){
    return preg_replace(array("/\s+\n/", "/\n\s+/", "/ +/"), array("\n","\n "," "), $javascript);
}
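To see what easyMinify does to a small input, here is a minimal sketch. The helper percentSaved and the sample source are made up for this example, and the saving on a tiny snippet will not match the 11% average quoted above.

```php
<?php
// Copy of easyMinify from above so the snippet runs on its own.
function easyMinify($javascript){
    return preg_replace(array("/\s+\n/", "/\n\s+/", "/ +/"), array("\n","\n "," "), $javascript);
}

// Hypothetical helper: percentage of bytes saved by minification.
function percentSaved($before, $after){
    if (strlen($before) === 0){
        return 0.0;
    }
    return round(100 * (1 - strlen($after) / strlen($before)), 1);
}

$source = "function add(a, b) {   \n    return a   +   b;\n}\n";
$minified = easyMinify($source);

echo $minified;
echo percentSaved($source, $minified) . "% smaller\n";
```

Trailing whitespace is dropped, indentation shrinks to a single space, and runs of spaces inside a line collapse to one, so the line structure stays readable.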

Being data people, we're never satisfied with a partial result.

More Complete JavaScript Minification

We opted to perform a shallow parse of our JavaScript in order to minify it. A shallow parse is one that has only a superficial understanding of structure. Structure is deduced by identifying key elements. In contrast, a deep parse is one that attempts to understand the structure of the script in its entirety. A deep parse would require something akin to a JavaScript compiler. Fortunately, the extra understanding that comes with a deep parse is not necessary for most minifiers. Thus we stuck with the shallow parse, which is simpler and has a smaller code footprint. This makes it easier to test, easier to debug, and easier to read.

Our minification script assumes the JavaScript it is minifying compiles and is correct! Garbage in, garbage out.

To make the workflow a little easier to follow, I'm going to start with the helper functions and work our way up to the main function call. Let's start with minifyJavascriptCode. The function minifyJavascriptCode minifies just a code snippet, assuming comments are already stripped and that the code snippet contains no strings. Note: the code snippet need not be a complete statement!

function minifyJavascriptCode($javascript){
    $blocks = array('for', 'while', 'if', 'else');
    // changeDate(+$('#weeks').val()*7 + +$('#days').val()); // pregnancy week by week
    $javascript = preg_replace('/([-\+])\s+\+([^\s;]*)/', '$1 (+$2)', $javascript);
    // new line between || in if statements
    $javascript = preg_replace('/\s+\|\|\s+/', ' || ', $javascript);
    $javascript = preg_replace('/\s+\&\&\s+/', ' && ', $javascript);
    $javascript = preg_replace('/\s*([=+-\/\*:?])\s*/', '$1 ', $javascript);
    // handle missing brackets {}
    foreach ($blocks as $block){
        $javascript = preg_replace('/(\s*\b'.$block.'\b[^{\n]*)\n([^{\n]+)\n/i', '$1{$2}', $javascript);
    }
    // handle spaces
    $javascript = preg_replace(array("/\s*\n\s*/", "/\h+/"), array("\n"," "), $javascript); // \h+ horizontal white space
    $javascript = preg_replace(array('/([^a-z0-9\_])\h+/i','/\h+([^a-z0-9\$\_])/i'),'$1',$javascript);
    $javascript = preg_replace('/\n?([[;{(\.+-\/\*:?&|])\n?/', '$1', $javascript);
    $javascript = preg_replace('/\n?([})\]])/', '$1', $javascript);
    $javascript = str_replace("\nelse","else",$javascript);
    $javascript = preg_replace("/([^}])\n/","$1;",$javascript);
    $javascript = preg_replace("/;?\n/",";",$javascript);
    return $javascript;
}

The first few replacements wrap a unary plus that follows another + or - in parentheses and normalize the spacing around ||, &&, and other operators. The foreach loop handles single-statement clauses that are not surrounded by braces. The replacement under the "handle spaces" comment normalizes whitespace: runs of whitespace containing a newline are reduced to a single newline, while runs without a newline are reduced to a single space. We need to preserve the newlines because we do not yet know which ones terminate statements and which ones don't. The next replacement strips horizontal whitespace that sits next to any character that cannot be part of a variable name or number. The two replacements after that strip newlines around punctuation. Again, we do not yet know which newlines terminate statements, so for closing brackets we remove only the newline before the bracket and leave the one after it. The str_replace call rolls up 'else' clauses. At this point the only newlines left must terminate statements, so we replace them with semicolons. Since a newline and a semicolon take up the same amount of space, we could remove all semicolons in favor of newlines if we preferred. Either way, the code will have the same footprint.
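The whitespace-normalizing step is easiest to see in isolation. The sketch below lifts that one preg_replace call out of minifyJavascriptCode; the wrapper name normalizeWhitespace and the sample input are made up for this example.

```php
<?php
// The whitespace-normalizing step from minifyJavascriptCode, on its own.
// normalizeWhitespace is a hypothetical name used only in this example.
function normalizeWhitespace($javascript){
    return preg_replace(
        array("/\s*\n\s*/", "/\h+/"), // newline-containing runs, then horizontal runs
        array("\n", " "),
        $javascript
    );
}

$sample = "var a = 1;   \n\n   var  b\t= 2;";
echo normalizeWhitespace($sample);
// prints:
// var a = 1;
// var b = 2;
```

The blank line and the indentation collapse into one newline, while the double space and the tab each become a single space.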

While performing our shallow parse we scan the code for the next key marker that indicates the start of a string, comment, or regular expression. We perform this task in getNextKeyElement.

function getNextKeyElement($javascript){
    $elements = array();
    $keyMarkers = array('\'', '"', '//', '/*');
    foreach ($keyMarkers as $marker){
        $elements[$marker] = strpos($javascript, $marker);
    }

    //regex to detect all regex
    $regex = "/[\k(](\/[\k\S]+\/)/";
    preg_match($regex, $javascript, $matches, PREG_OFFSET_CAPTURE, 1);
    if (count($matches) > 0){
        $elements[$matches[1][0]] = $matches[1][1];
    }

    $elements = array_filter($elements, function($k){return $k !== false;});
    if (count($elements) == 0){return false;}

    $min = min($elements);
    return array($min, array_keys($elements, $min)[0]);
}

The foreach loop in getNextKeyElement looks for the first instance of the start of a comment, either with // or /* notation, or the start of a string, either single or double quoted. The preg_match call looks for regular expressions, which need to be treated like strings despite not being quoted in JavaScript! The regex "/[\k(](\/[\k\S]+\/)/" is a bit meta: a regex designed to detect regexes. When we detect a regex we store the whole regex rather than just the key marker. Why we do that will become more apparent below.

The next step is to find the minimum index in our $elements array. Note that strpos returns false if the specified key marker isn't found in the string, and zero is equivalent to false under loose comparison. Thus the array_filter call removes the false entries from our array. We need a custom callback with strict comparison so we don't accidentally filter out a key marker at offset zero.
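The difference between the default filter and the strict callback shows up as soon as a marker sits at the very start of the input. A small sketch with made-up offsets:

```php
<?php
// Offsets as strpos might report them: a single quote at index 0,
// no double quote found, and a line comment at index 12.
$elements = array('\'' => 0, '"' => false, '//' => 12);

// Default array_filter drops every falsy value, including the valid offset 0.
$loose = array_filter($elements);

// A strict callback drops only the entries that are exactly false.
$strict = array_filter($elements, function($k){ return $k !== false; });

var_dump(array_keys($loose));  // only '//' survives
var_dump(array_keys($strict)); // both the single quote and '//' survive
```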

We use the min function to find the minimum index, and array_keys to find the marker that corresponds to that minimum index. How we process the characters that come after that key marker depends on which key marker it is, so we return the tuple (offset, key marker).
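Here is that lookup on its own, with illustrative offsets standing in for the filtered $elements array:

```php
<?php
// Offsets that survived the filtering step (values are illustrative).
$elements = array('"' => 27, '//' => 12, '/*' => 40);

$min = min($elements);                     // smallest offset
$marker = array_keys($elements, $min)[0];  // first key holding that offset

$next = array($min, $marker);              // the (offset, key marker) tuple
var_dump($next);
```

With these values the line comment comes first, so the tuple holds 12 and '//'.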

Now that we have our key marker, which indicates how the next section of the JavaScript should be processed (comment or string), we need to scan for the ending marker, after which the text is code again. It's possible for the end marker to be escaped and embedded in the comment/string. Thus the next helper function is getNonEscapedQuoteIndex, which simply scans a string for the next unescaped character $char starting at offset $start.

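function getNonEscapedQuoteIndex($string, $char, $start = 0){
    // Group 1: any backslashes directly before $char. Group 2: $char itself.
    // Passing '/' to preg_quote also escapes the pattern delimiter,
    // which matters when $char is a whole regex literal.
    if (preg_match('/(\\\\*)('.preg_quote($char, '/').')/', $string, $match, PREG_OFFSET_CAPTURE, $start)){
        if (!isset($match[1][0]) || strlen($match[1][0]) % 2 == 0){
            // even number of backslashes: $char is not escaped
            return $match[2][1];
        } else {
            // odd number of backslashes: $char is escaped, keep scanning
            return getNonEscapedQuoteIndex($string, $char, $match[2][1] + 1);
        }
    }
    return -1;
}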

The function uses preg_match to find the next instance of $char from the $start offset. The second capture group captures the character we're searching for. The first, (\\\\*), captures the backslashes immediately before it. If there are no backslashes, or the number of backslashes is even, then the character is not escaped and we've found the end delimiter. Otherwise, it is escaped and is just an instance of the character embedded in the string/comment.
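The even/odd rule can be checked with a plain loop instead of the regex. The helper isEscapedAt below is hypothetical and is not part of the minifier; it only illustrates the parity test.

```php
<?php
// Hypothetical helper: true when the character at $index is escaped,
// i.e. preceded by an odd number of consecutive backslashes.
function isEscapedAt($string, $index){
    $backslashes = 0;
    for ($i = $index - 1; $i >= 0 && $string[$i] === '\\'; $i--){
        $backslashes++;
    }
    return $backslashes % 2 === 1;
}

var_dump(isEscapedAt('x"y', 1));      // no backslash: bool(false)
var_dump(isEscapedAt('x\\"y', 2));    // one backslash: bool(true)
var_dump(isEscapedAt('x\\\\"y', 3));  // two backslashes: bool(false)
```

Two backslashes in a row form an escaped backslash, which is why the quote that follows them still ends the string.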

Phew. Almost done. All that's left is the function that runs over the entire script, minifyJavascript:

function minifyJavascript($javascript){
    $buffer = '';
    while (list($idx_start, $keyElement) = getNextKeyElement($javascript)){
        switch ($keyElement){
            case '//':
                $idx_start = strpos($javascript, '//');
                $idx_end = strpos($javascript, "\n", $idx_start);
                if ($idx_end !== false){
                    $javascript = substr($javascript, 0, $idx_start) . substr($javascript,$idx_end);
                } else {
                    $javascript = substr($javascript, 0, $idx_start);
                }
                break;
            case '/*':
                $idx_start = strpos($javascript, '/*');
                $idx_end = strpos($javascript, '*/', $idx_start)+2;
                $javascript = substr($javascript, 0, $idx_start) . substr($javascript,$idx_end);
                break;
            default: // string case
                $idx_start = getNonEscapedQuoteIndex($javascript, $keyElement);
                if (strlen($keyElement) == 1){
                    if (substr($javascript, $idx_start, 1) == '\''){
                        $idx_end = getNonEscapedQuoteIndex($javascript, '\'', $idx_start+1)+1;
                    } else {
                        $idx_end = getNonEscapedQuoteIndex($javascript, '"', $idx_start+1) +1;
                    }
                } else {
                    $idx_end = $idx_start + strlen($keyElement);
                }
                $buffer .= minifyJavascriptCode(substr($javascript, 0, $idx_start));
                $quote = substr($javascript, $idx_start, ($idx_end-$idx_start));
                $quote = str_replace("\\\n",' ',$quote);
                $buffer .= $quote;
                $javascript = substr($javascript, $idx_end);
        }
    }
    $buffer .= minifyJavascriptCode($javascript);
    return $buffer;
}

The function minifyJavascript loops over the code, using our getNextKeyElement function in the while condition to find each key element.

The first two cases handle the comments. The workflow is nearly identical for both. Once you have the start of the comment, scan through to the end of the comment. Note that it doesn't matter if the comment contains a quotation mark; we ignore all characters except the end-of-comment delimiter. Once we have the comment end, we concatenate the JavaScript before the comment with the JavaScript after the comment. This effectively removes the comment.
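The splice itself is only a couple of strpos and substr calls. The sketch below removes a single block comment; the function name removeFirstBlockComment and its handling of an unterminated comment are assumptions for this example, not behavior of the minifier.

```php
<?php
// Hypothetical helper that removes only the first /* ... */ comment,
// using the same strpos/substr splice as the '/*' case.
function removeFirstBlockComment($javascript){
    $start = strpos($javascript, '/*');
    if ($start === false){
        return $javascript; // no block comment present
    }
    $end = strpos($javascript, '*/', $start);
    if ($end === false){
        // Assumption for this sketch: an unterminated comment runs to the end.
        return substr($javascript, 0, $start);
    }
    // Keep everything before the comment and everything after the closing */
    return substr($javascript, 0, $start) . substr($javascript, $end + 2);
}

// The quotation marks inside the comment are ignored entirely.
echo removeFirstBlockComment('var a = 1; /* a "quoted" note */ var b = 2;');
```

The spaces on either side of the comment are left behind here; in the minifier they are cleaned up later, when the surrounding code goes through minifyJavascriptCode.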

The final case handles the "strings" that must be left intact: single quoted strings, double quoted strings, and regexes. Quoted strings are easy to detect because the key element is just a single character long. For this situation we scan for the unescaped end quote to terminate the string. The code leading up to the "string" is minified using our minifyJavascriptCode and stored in the buffer. Next we store the unmodified "string" in the buffer. The variable $javascript is then modified to contain only the JavaScript not yet scanned.

We continue this process until the entirety of the javascript string $javascript is processed.
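The buffer pattern can be tried out with a much smaller stand-in. The sketch below is a simplification, not the minifier above: it handles only double-quoted strings, finds them with a single commonly used regex rather than the helper functions, and merely collapses horizontal whitespace in the code between them. The name collapseSpacesOutsideStrings is made up for this example.

```php
<?php
// Simplified illustration of the buffer loop: code before each string is
// processed, the string literal is copied through untouched, and the loop
// continues with whatever has not been scanned yet.
function collapseSpacesOutsideStrings($javascript){
    $buffer = '';
    // A double-quoted literal: any run of non-quote, non-backslash
    // characters or backslash-escaped characters between two quotes.
    $pattern = '/"(?:[^"\\\\]|\\\\.)*"/s';
    while (preg_match($pattern, $javascript, $m, PREG_OFFSET_CAPTURE)){
        $literal = $m[0][0];
        $offset  = $m[0][1];
        // "Minify" the code that precedes the string.
        $buffer .= preg_replace('/\h+/', ' ', substr($javascript, 0, $offset));
        // Store the string itself unmodified.
        $buffer .= $literal;
        // Keep only the part that has not been scanned yet.
        $javascript = (string) substr($javascript, $offset + strlen($literal));
    }
    return $buffer . preg_replace('/\h+/', ' ', $javascript);
}

echo collapseSpacesOutsideStrings('var  s  =  "a   b";   var  t = 1;');
```

The spaces inside "a   b" survive because the literal never passes through the whitespace replacement.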

The Complete Minification Article Series: