Bloated HTML, the best and the worst

Frederic Filloux
Published in Monday Note
Aug 15, 2016 · 4 min read

One page of text requires between 6 and 55 pages of HTML. More than ever, code inflation rules the web. But there is hope as innovation comes to the rescue.

Google’s new HTML compression facility

When you read this 800-word Guardian story, about half a page of text, your web browser loads the equivalent of 55 pages of HTML code, almost half a million characters. To be precise, an article of 757 words (4,667 characters, spaces included) requires 485,527 characters of code.

Put another way, “useful” text (the human-readable article) weighs less than one percent (0.96%) of the underlying browser code. The rest consists of links (more than 600) and scripts of all types (120 references), related to trackers, advertising objects, analytics, etc.
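The percentage is simple arithmetic on the two character counts quoted above. A minimal Python sketch, for readers who want to check it (the function name `usefulness_ratio` is made up for this illustration):

```python
def usefulness_ratio(text_chars: int, html_chars: int) -> float:
    """Share of a page's characters that are human-readable text."""
    if html_chars <= 0:
        raise ValueError("html_chars must be positive")
    return text_chars / html_chars


# Character counts quoted in this column for the Guardian story.
guardian = usefulness_ratio(4667, 485_527)
print(f"{guardian:.2%}")  # 0.96%
```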

As I wrote in this previous Monday Note, the Chartbeat analytics tracker alone requires 29,000 characters of code!

It would be useful to know how the amount of code correlates with the Guardian’s abysmal financial losses. (Sad humor, I’m a big fan of The Guardian.)

The Guardian is something of an extreme case when it comes to bloated HTML. In all fairness, this cataract of code loads very fast on a normal connection. The Guardian’s technical team was also the first to devise a solid implementation of Google’s new Accelerated Mobile Pages (AMP) format. In doing so, it eliminated more than 80% of the original code, making the page blazingly fast on a mobile device.

As an admittedly biased reference point, I took one of the first web texts, World Wide Web Summary, written in HTML by its inventor Tim Berners-Lee. Published in 1991, it is probably one of the purest, most barebones forms of hypertext markup language: fewer than 4,200 characters of readable text for fewer than 4,600 characters of code. That’s a 90% usefulness rate, as shown in the table below (you can also refer to my original Google Sheet here for precise numbers, story URLs and formulae).
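How the "readable" characters are counted affects these ratios. The Python sketch below shows one possible way to measure a page using only the standard library. It is an assumption about method, not necessarily how the numbers in the table were produced: it treats everything outside `<script>` and `<style>` tags as readable text and compares its length with the length of the raw HTML. The names `TextMeter` and `measure` and the sample markup are invented for illustration.

```python
from html.parser import HTMLParser


class TextMeter(HTMLParser):
    """Collects visible text and counts links and scripts (hypothetical helper)."""

    SKIPPED = {"script", "style"}  # their content is never shown to the reader

    def __init__(self) -> None:
        super().__init__()
        self.chunks: list[str] = []
        self.links = 0
        self.scripts = 0
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self.links += 1
        if tag == "script":
            self.scripts += 1
        if tag in self.SKIPPED:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        if self._skip_depth == 0:
            self.chunks.append(data)

    def text(self) -> str:
        # Approximation: join the pieces and collapse whitespace so that
        # indentation in the markup is not counted as readable text.
        return " ".join(" ".join(self.chunks).split())


def measure(html: str) -> dict:
    """Return character counts, the text/HTML ratio, and link/script counts."""
    meter = TextMeter()
    meter.feed(html)
    meter.close()
    readable = meter.text()
    return {
        "text_chars": len(readable),
        "html_chars": len(html),
        "ratio": len(readable) / len(html) if html else 0.0,
        "links": meter.links,
        "scripts": meter.scripts,
    }


# Tiny invented page: one tracker-style script, one link, two words of text.
sample = (
    "<html><head><script>var tracker = 1;</script></head>"
    "<body><p>Hello <a href='/x'>world</a></p></body></html>"
)
print(measure(sample))
```

To try it on a real article, save the page source from the browser and pass the file's contents to `measure`. Results will differ depending on what one decides to count as readable (navigation labels, captions, comments), so treat the output as a rough estimate.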

This selection is arbitrary but nonetheless interesting to look at. Aside from the original Berners-Lee text, it includes, on line #2, a Washington Post article coded in the experimental Progressive Web App format (more on this in a moment), a classic HTML Politico story, a piece from the official AMP Blog, hopefully coded in AMP, a past Monday Note column published on Medium (which tweaked the W3C standards), a NYT piece, the Guardian piece coded in AMP, a short MailOnline piece embedded in a 40-scroll web page (but remarkably optimized), a WaPo story and the original Guardian one.

Two observations:

  1. If we cut out the extreme cases, the “pure” Berners-Lee text and the heavy Guardian one, we see a ratio of readable text to underlying HTML ranging from 4.55% to 9%. The least optimized is the Washington Post, which doesn’t mind a large HTML file for desktop reading since it offers alternative formats. The lightest (relatively speaking) is Politico, which maintains a simple page structure.
  2. The big surprise (at least for me) comes from the Progressive Web App implemented by the Washington Post. The PWA version offers roughly the same content as the plain HTML page, but with a huge reduction in HTML size.

PWA was created by Google about a year ago. It’s a hybrid sitting between mobile apps and mobile sites. According to its official developers blog, it features the following:

Google is just starting to promote PWA on a large scale, and the tools are already available. While it has already been implemented by the giant Indian retailer Flipkart, the Post is the first news publisher to experiment with it (the pages are still a little buggy and don’t support ads yet). Because it supports push notifications and other features until now reserved for native apps, PWA has great potential for publishers, as long as it doesn’t end up in Google’s graveyard of innovations lost in some murky internal quagmire…

Regardless of promising innovations, the war to reduce HTML’s bloat remains to be won. Many forces — advertising technologies, user profiling, endless analytics, trackers — might conspire to eat all the benefits promised by the proposed new standard.

frederic.filloux@mondaynote.com
