What is Minification?

Minification refers to the process of removing all unnecessary characters while leaving the core functionality of the code intact. The end result is a new file which is smaller than the original, yet identical from a machine perspective. The core benefit of smaller files is that they require less bandwidth and are faster for the client to download. Although not intended, the minification process can make code more difficult for humans to read, which is why minification can also be seen as lightweight obfuscation.

Here at data·yze we use PHP to push files from our development environment to production. We opt to automate the minification process during this push step. This approach gives us all of the benefits of minification (smaller files which require less bandwidth for the users) without the drawbacks (forcing our developers to work with giant blobs of difficult-to-read code).

At data·yze, minifying JavaScript reduces the size of our files by 24% on average. Here is the code we use:

Easy Minifying JavaScript

JavaScript can be a little nerve-racking to minify. Statements can be terminated by a semicolon or a new line. Yet statements can also span multiple lines and contain semicolons. If you're not careful when you minify, you can easily break your code. Fortunately, you can get large gains with simple tricks. The following function, easyMinify(), achieves an 11% reduction by only condensing runs of whitespace and still keeps the code pretty readable.

function easyMinify($javascript){
    return preg_replace(array("/\s+\n/", "/\n\s+/", "/ +/"), array("\n","\n "," "), $javascript);
}
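As a quick illustration of what easyMinify() does, here is a small usage sketch. The sample input is made up for this example, and the function body is repeated only so the snippet runs on its own.

```php
<?php
// easyMinify() as defined in the article, repeated so this snippet is self-contained.
function easyMinify($javascript){
    return preg_replace(
        array("/\s+\n/", "/\n\s+/", "/ +/"),
        array("\n", "\n ", " "),
        $javascript
    );
}

// Hypothetical sample input: trailing spaces, deep indentation and a double space.
$source = "function add(a, b) {\n    var  sum = a + b;   \n    return sum;\n}\n";
$minified = easyMinify($source);

echo $minified;
// function add(a, b) {
//  var sum = a + b;
//  return sum;
// }
```

Line breaks survive, so statement boundaries are untouched; only redundant whitespace disappears, which is why this approach is safe but modest in its savings.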

Being data people, we're never satisfied with a partial result.

More Complete JavaScript Minification

Our minification process works by performing a shallow parse of our JavaScript. A shallow parse is one that has only a superficial understanding of structure. Structure is deduced by identifying key elements such as the start of a comment or quoted string. In contrast, a deep parse is one that attempts to understand the structure of the script in its entirety. A deep parse would require something akin to a JavaScript compiler. Fortunately, the extra understanding that comes with a deep parse is not necessary for most minifiers. Thus we stuck with the shallow parse, which is simpler and has a smaller code footprint. This makes it easier to read, easier to test, and easier to debug.

Our minification script assumes the JavaScript it is minifying compiles and is correct! Garbage in, garbage out.

To make the workflow a little easier to follow, we're going to start with the helper functions and work our way up to the main function call. Let's start with minifyJavascriptCode().

function minifyJavascriptCode()

The function minifyJavascriptCode() minifies just a code snippet, assuming comments are already stripped and that the code snippet contains no strings. Note, the code snippet need not be a complete statement!

function minifyJavascriptCode($javascript){
    $blocks = array('for', 'while', 'if', 'else');
    // wrap a unary plus in parentheses so "a + +b" cannot collapse into "a++b"
    $javascript = preg_replace('/([-\+])\s+\+([^\s;]*)/', '$1 (+$2)', $javascript);
    // remove new line in statements
    $javascript = preg_replace('/\s+\|\|\s+/', ' || ', $javascript);
    $javascript = preg_replace('/\s+\&\&\s+/', ' && ', $javascript);
    $javascript = preg_replace('/\s*([=+-\/\*:?])\s*/', '$1 ', $javascript);
    // handle missing brackets {}
    foreach ($blocks as $block){
        $javascript = preg_replace('/(\s*\b'.$block.'\b[^{\n]*)\n([^{\n]+)\n/i', '$1{$2}', $javascript);
    }
    // handle spaces
    $javascript = preg_replace(array("/\s*\n\s*/", "/\h+/"), array("\n"," "), $javascript); // \h+ horizontal white space
    $javascript = preg_replace(array('/([^a-z0-9\_])\h+/i','/\h+([^a-z0-9\$\_])/i'),'$1',$javascript);
    $javascript = preg_replace('/\n?([[;{(\.+-\/\*:?&|])\n?/', '$1', $javascript);
    $javascript = preg_replace('/\n?([})\]])/', '$1', $javascript);
    $javascript = str_replace("\nelse","else",$javascript);
    // the remaining newlines terminate statements, so turn them into semicolons
    $javascript = preg_replace("/([^}])\n/","$1;",$javascript);
    $javascript = preg_replace("/;?\n/",";",$javascript);
    return $javascript;
}

The first preg_replace() call prevents us from accidentally condensing two plus signs into an increment operator, e.g. in a + +b, by placing the expression +b in parentheses.
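To see why this guard matters, the sketch below compares naive whitespace removal with the substitution used in minifyJavascriptCode(). The input string is a made-up example.

```php
<?php
// Hypothetical input: a binary plus followed by a unary plus.
$source = 'a + +b;';

// Naively deleting whitespace fuses the two operators into "++".
$naive = str_replace(' ', '', $source);

// The article's substitution parenthesizes the unary operand first.
$guarded = preg_replace('/([-\+])\s+\+([^\s;]*)/', '$1 (+$2)', $source);

echo $naive, "\n";   // a++b;
echo $guarded, "\n"; // a + (+b);
```

Once the operand is parenthesized, later whitespace stripping can produce a+(+b), which still means the same thing as the original expression.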

The three calls under the "remove new line in statements" comment roll up most multi-line statements, where the newline comes just before or after an operator, into single-line statements.

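The foreach loop under the "handle missing brackets" comment handles unbracketed statements: an if (foo) followed by bar; on the next line becomes if (foo){bar;}. Any indentation left inside the braces is stripped by the later whitespace passes.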
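The bracket insertion can be tried in isolation. This is a minimal sketch that applies the same pattern for the 'if' keyword only, on a made-up input; inside minifyJavascriptCode() the pattern runs after the operator passes, so intermediate strings there will look slightly different.

```php
<?php
// Hypothetical input: an if statement whose body sits on the next line without braces.
$source = "if (foo)\n  bar();\nbaz();";

// Same pattern the article builds inside its foreach loop, shown here for 'if' only.
$block = 'if';
$result = preg_replace('/(\s*\b'.$block.'\b[^{\n]*)\n([^{\n]+)\n/i', '$1{$2}', $source);

echo $result; // if (foo){  bar();}baz();
```

Both newlines are consumed by the match, which is why the braces are needed: without them, joining the lines would leave no way to tell where the body of the if ends.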

The calls under the "handle spaces" comment deal with the remaining whitespace. The first two collapse runs of whitespace and strip horizontal whitespace wherever it does not separate two identifiers or numbers. The next two strip newlines around punctuation. Because we do not yet know which newlines terminate statements, we remove newlines before closing brackets but not after them; the str_replace() call then joins an else to whatever precedes it. At this point the only newlines left must terminate statements, so the last two preg_replace() calls replace them with semicolons. Since a newline and a semicolon each take one byte, we could instead remove all semicolons at the end of a line in favor of newlines if we preferred. Either way, the code will have the same footprint.

function getNextKeyElement()

While performing our shallow parse we're going to scan the code for the next key marker that indicates the start of a comment, string, or regular expression. We perform this task in getNextKeyElement().

function getNextKeyElement($javascript){
    $elements = array();
    $keyMarkers = array('\'', '"', '//', '/*');
    foreach ($keyMarkers as $marker){
        $elements[$marker] = strpos($javascript, $marker);
    }
    //regex to detect all regex
    $regex = "/[\k(](\/[\k\S]+\/)/";
    preg_match($regex, $javascript, $matches, PREG_OFFSET_CAPTURE, 1);
    if (count($matches) > 0){
        $elements[$matches[1][0]] = $matches[1][1];
    }
    $elements = array_filter($elements, function($k){return $k !== false;});
    if (count($elements) == 0){return false;}
    $min = min($elements);
    return array($min, array_keys($elements, $min)[0]);
}

The foreach loop in getNextKeyElement() looks for the first instance of the start of a comment, either with // or /* notation, or the start of a string, either single or double quoted. The preg_match() call looks for regular expressions, e.g. /fo+/, which need to be treated as strings despite not being quoted in JavaScript! The regex /[\k(](\/[\k\S]+\/)/ is a bit meta: a regex designed to detect regexes. When we detect a regex we store the whole regex rather than just the key marker.

The next step is to find the minimum index in our $elements array, which corresponds to the first occurring key element in the block of JavaScript code. Note that strpos() returns false if the specified key marker isn't found in the string, and zero is equivalent to false under loose comparison. Thus the array_filter() call removes the false entries from our array. We need a custom callback with strict comparison so we don't accidentally filter out a key marker at offset zero.
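The difference between the default filter and the strict callback is easy to demonstrate. In this sketch the sample string and the variable names are illustrative assumptions, not part of the original script.

```php
<?php
// Hypothetical input that starts with a quote, so one marker sits at offset 0.
$source = "'a' // c";

$offsets = array(
    "'"  => strpos($source, "'"),   // int(0): found at the very start
    '"'  => strpos($source, '"'),   // false: not present
    '//' => strpos($source, '//'),  // int(4)
);

// Default array_filter() drops every falsy value, including the valid offset 0.
$loose = array_filter($offsets);

// A strict callback drops only the "not found" entries.
$strict = array_filter($offsets, function($offset){ return $offset !== false; });

var_dump(array_keys($loose));  // only '//'
var_dump(array_keys($strict)); // "'" and '//'
```

With the loose filter, the quoted string at the start of the input would be missed entirely and later treated as ordinary code.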

We use min() to find the minimum index, and array_keys() to find the marker that corresponds to that minimum index. How we process the characters that come after that key marker will depend on which key marker it is, so we return the 2-element array [offset, key marker].
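The min()/array_keys() lookup can be sketched on its own. The offsets below are made-up values standing in for what strpos() and preg_match() would report.

```php
<?php
// Hypothetical marker => offset map, already filtered of "not found" entries.
$elements = array("'" => 9, '//' => 4, '/*' => 20);

// Smallest offset, i.e. the key element that occurs first in the source.
$min = min($elements);

// array_keys() with a search value returns every key holding that value.
$marker = array_keys($elements, $min)[0];

$result = array($min, $marker);
print_r($result); // offset 4, marker '//'
```

Two different markers cannot start at the same offset, so taking the first key returned is sufficient here.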

function getNonEscapedQuoteIndex()

Now that we have our key marker which indicates how the next section of the JavaScript should be processed (comment or string), we need to scan for the ending marker that indicates the next bit of string corresponds to code. It's possible for the end marker to be escaped and embedded in the comment/string, e.g. 'this isn\'t a very nice test.'. Thus the next helper function is getNonEscapedQuoteIndex() which simply scans a string for the next unescaped char $char starting at offset $start.

function getNonEscapedQuoteIndex($string, $char, $start = 0){
    // '/' is the pattern delimiter, so preg_quote() must escape it when $char contains one (e.g. a regex literal)
    if (preg_match('/(\\\\*)('.preg_quote($char, '/').')/', $string, $match, PREG_OFFSET_CAPTURE, $start)){
        if (!isset($match[1][0]) || strlen($match[1][0]) % 2 == 0){
            return $match[2][1];
        } else {
            return getNonEscapedQuoteIndex($string, $char, $match[2][1] + 1);
        }
    }
    return -1;
}

The function uses preg_match() to find the next instance of $char from the $start offset. The second capture group captures the character we're searching for. The first, (\\\\*), captures the backslashes immediately before it. If there are no backslashes, or the number of backslashes is even, then the character is not escaped and we've found the end delimiter. Otherwise, it is escaped and just an instance of the character embedded in the string/comment.
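The even/odd backslash rule can be checked against the example string from earlier. This sketch repeats the helper so it runs on its own; passing '/' to preg_quote() is an assumption made here so that a $char containing the pattern delimiter is escaped safely.

```php
<?php
// Helper repeated from the article so the snippet is self-contained.
function getNonEscapedQuoteIndex($string, $char, $start = 0){
    $pattern = '/(\\\\*)(' . preg_quote($char, '/') . ')/';
    if (preg_match($pattern, $string, $match, PREG_OFFSET_CAPTURE, $start)){
        // Zero or an even number of backslashes: the character is not escaped.
        if (strlen($match[1][0]) % 2 == 0){
            return $match[2][1];
        }
        // Odd number of backslashes: escaped, keep looking after it.
        return getNonEscapedQuoteIndex($string, $char, $match[2][1] + 1);
    }
    return -1;
}

// JavaScript source text: 'this isn\'t a very nice test.'
$js = "'this isn\\'t a very nice test.'";

// Start at 1 to skip the opening quote; the escaped quote at offset 10 is ignored.
$end = getNonEscapedQuoteIndex($js, "'", 1);

// Two backslashes escape each other, so the quote that follows does end the string.
$endAfterPair = getNonEscapedQuoteIndex("'a\\\\'", "'", 1);

var_dump($end === strlen($js) - 1); // bool(true)
var_dump($endAfterPair);            // int(4)
```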

Phew. Almost done. All that's left is the function to run over the entire script, minifyJavascript().

function minifyJavascript()

function minifyJavascript($javascript){
    $buffer = '';
    while (list($idx_start, $keyElement) = getNextKeyElement($javascript)){
        switch ($keyElement){
            case '//':
                $idx_start = strpos($javascript, '//');
                $idx_end = strpos($javascript, "\n", $idx_start);
                if ($idx_end !== false){
                    $javascript = substr($javascript, 0, $idx_start) . substr($javascript,$idx_end);
                } else {
                    $javascript = substr($javascript, 0, $idx_start);
                }
                break;
            case '/*':
                $idx_start = strpos($javascript, '/*');
                $idx_end = strpos($javascript, '*/', $idx_start)+2;
                $javascript = substr($javascript, 0, $idx_start) . substr($javascript,$idx_end);
                break;
            default: // must be handled like the string case
                $idx_start = getNonEscapedQuoteIndex($javascript, $keyElement);
                if (strlen($keyElement) == 1){ // quote! Either ' or "
                    if (substr($javascript, $idx_start, 1) == '\''){
                        $idx_end = getNonEscapedQuoteIndex($javascript, '\'', $idx_start+1)+1;
                    } else {
                        $idx_end = getNonEscapedQuoteIndex($javascript, '"', $idx_start+1) +1;
                    }
                } else { // regex!
                    $idx_end = $idx_start + strlen($keyElement);
                }
                $buffer .= minifyJavascriptCode(substr($javascript, 0, $idx_start));
                $quote = substr($javascript, $idx_start, ($idx_end-$idx_start));
                $quote = str_replace("\\\n",' ',$quote);
                $buffer .= $quote;
                $javascript = substr($javascript, $idx_end);
        }
    }
    $buffer .= minifyJavascriptCode($javascript);
    return $buffer;
}

The function minifyJavascript() loops over the code, calling getNextKeyElement() in the while condition to find each key element in turn.

The first two cases handle the comments. The workflow is nearly identical for both. Once you have the start of the comment, scan through to the end of the comment. Note that it doesn't matter if the comment contains a quotation mark; we're ignoring all characters except the end comment delimiter. Once we have the comment end, we concatenate the JavaScript before the comment with the JavaScript after the comment. This effectively removes the comment.
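The splice used for both comment styles can be sketched with strpos() and substr() alone. The two input strings are made-up examples, and like the script itself the sketch assumes every block comment is properly closed.

```php
<?php
// Line comment: cut from "//" up to, but not including, the newline.
$line = "var a = 1; // set a\nvar b = 2;";
$start = strpos($line, '//');
$end = strpos($line, "\n", $start);
$lineStripped = substr($line, 0, $start) . substr($line, $end);

// Block comment: cut from "/*" through the closing "*/" (2 characters long).
$block = "a = 1; /* it's a note */ b = 2;";
$start = strpos($block, '/*');
$end = strpos($block, '*/', $start) + 2;
$blockStripped = substr($block, 0, $start) . substr($block, $end);

echo $lineStripped, "\n";  // "var a = 1; " then a newline, then "var b = 2;"
echo $blockStripped, "\n"; // a = 1;  b = 2;
```

The apostrophe inside the block comment is never inspected, because the search jumps straight from the opening delimiter to the closing one. The whitespace left behind is cleaned up later by minifyJavascriptCode().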

The final case handles the "strings" that must be left intact: single quoted strings, double quoted strings, and regexes. Quotes are easy to detect, as the key element is just a single character long. For this situation we scan for the unescaped end quote to terminate the string. The code leading up to the "string" is minified using our minifyJavascriptCode() function and stored in the buffer. Next we store the unmodified "string" in the buffer. The variable $javascript is then modified to contain only the JavaScript not yet scanned.

We continue this process until the entirety of the JavaScript string $javascript is processed.
