Hugh Winkler holding forth on computing and the Web

Monday, March 03, 2008

Rule of Least Power: Bah!

The Rule of Least Power, a W3C TAG finding, posits: "Powerful languages inhibit information reuse." The TAG's observation is that it's easy to scrape documents written declaratively in HTML. The problem with using more powerful languages like JavaScript, they say, is that "you typically cannot determine what a program in a Turing-complete language will do without actually running it."

So? As long as the output is a DOM, just run the program and inspect the DOM.

You already have to use a good HTML parser, right? Now, just run all the script elements on the page too -- obviously, in a restricted environment.
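Here is a rough sketch of what "run the scripts, then inspect the DOM" could look like, assuming Node.js and its built-in vm module. The makeElement, runScripts, and collectText helpers are hypothetical names invented for this illustration, and the "DOM" is a toy stand-in with just enough surface for the demo script; a real scraper would use a full DOM implementation or a headless browser. Note also that Node's documentation says the vm module is not a security mechanism, so a genuinely restricted environment would need stronger isolation (a separate process, a container, resource limits). The timeout here only guards against runaway synchronous loops.

```javascript
const vm = require("vm");

// Toy element (an assumption for this sketch, not a real DOM node).
function makeElement(tagName) {
  return {
    tagName: tagName.toUpperCase(),
    textContent: "",
    children: [],
    appendChild(child) {
      this.children.push(child);
      return child;
    },
  };
}

// Run each script's source against a fresh toy document and return
// the document so the caller can inspect what the scripts built.
function runScripts(scripts, timeoutMs = 100) {
  const document = { body: makeElement("body"), createElement: makeElement };
  const sandbox = vm.createContext({ document });
  for (const source of scripts) {
    try {
      // The timeout interrupts scripts that never finish.
      vm.runInContext(source, sandbox, { timeout: timeoutMs });
    } catch (err) {
      // Broken or runaway script: skip it and keep whatever DOM we have.
    }
  }
  return document;
}

// Walk the resulting tree and gather its text, as a scraper would.
function collectText(node) {
  return [node.textContent, ...node.children.map((c) => collectText(c))]
    .filter(Boolean)
    .join(" ");
}

const page = runScripts([
  'const p = document.createElement("p");' +
    'p.textContent = "Price: " + (6 * 7);' +
    "document.body.appendChild(p);",
  "while (true) {}", // never terminates on its own; stopped by the timeout
]);
console.log(collectText(page.body)); // prints "Price: 42"
```

The point of the sketch is that the scraper never has to reason about what the script will do. It runs the script, caps how long it may run, and reads the result out of the tree afterwards.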

I'm sure Google and friends must do this. They're not going to leave valuable information on the table.