NYTimes.com Article: In Searching the Web, Google Finds Riches
jamesr at best.com
Wed Apr 16 17:30:19 PDT 2003
Russell Turpin wrote:
> I'm not sure people realize why Google is so
> neat. Google succeeded because it killed
> search spam. When you search for "high
> dividend stocks," you DON'T get back pages
> on porn, MLMs, and debt consolidation, ranked
> artificially high by a variety of tricks that
> search spammers once used.
Ahhh, the good old days.
Search engine spamming "back in the day" was an arms race that I got
involved in largely as a theoretical challenge. Before Google and its peers,
all search engines had a weakness that made them vulnerable to a general
mathematical exploit. Early on, search engine spamming was a hit-or-miss
proposition with shaky heuristics that changed regularly. As a project in
the mid-90s (egged on by one of my friends who had a use for it), I wrote
some software that was designed to do a generalized mathematical attack
against the search engine ranking systems.
The exploit is simple in theory: every ranked result list implicitly encodes
the ranking algorithm that ordered it, per information theory. If you
have the source data (the page content in those days) and the ordered
results (from the search engine), the ranking algorithm can be computed.
The more ordered samples you have, the more accurate a model of the ranking
algorithm you can build, and the search engines were more than willing to
give you an enormous data set.
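To make the idea concrete, here is a minimal sketch, not the software described in this post. It assumes the engine scores each page as a weighted sum of a few numeric page features, which is an assumption for illustration only; the feature names, the function names, and the pairwise perceptron used to fit the weights are all hypothetical choices. Given pages in the order the engine returned them, it looks for weights that reproduce that order.

```python
# Sketch: recover a linear ranking model from an observed result ordering.
# Assumption (not from the post): score = weighted sum of page features.
# All names here are hypothetical and exist only for illustration.

def score(weights, features):
    """Weighted sum of a page's features under the current model."""
    return sum(w * x for w, x in zip(weights, features))


def fit_ranking_model(ranked_pages, epochs=1000):
    """Fit weights with a pairwise perceptron.

    ranked_pages: feature vectors in observed order, best result first.
    Every (higher, lower) pair is a constraint: the higher-ranked page
    should get a strictly larger score than the lower-ranked one.
    """
    n = len(ranked_pages[0])
    weights = [0.0] * n
    pairs = [
        (hi, lo)
        for i, hi in enumerate(ranked_pages)
        for lo in ranked_pages[i + 1:]
    ]
    for _ in range(epochs):
        mistakes = 0
        for hi, lo in pairs:
            if score(weights, hi) <= score(weights, lo):
                # Model disagrees with the observed order: nudge the
                # weights toward the features of the higher-ranked page.
                for k in range(n):
                    weights[k] += hi[k] - lo[k]
                mistakes += 1
        if mistakes == 0:
            break  # every observed pair is now ordered correctly
    return weights


def pairwise_agreement(weights, ranked_pages):
    """Fraction of observed (higher, lower) pairs the model reproduces."""
    correct = total = 0
    for i, hi in enumerate(ranked_pages):
        for lo in ranked_pages[i + 1:]:
            total += 1
            if score(weights, hi) > score(weights, lo):
                correct += 1
    return correct / total


if __name__ == "__main__":
    # Hypothetical features per page: (title hits, body hits, meta hits).
    # The list order stands in for what a search engine returned.
    observed = [(3, 2, 1), (2, 3, 0), (1, 1, 4), (0, 3, 1), (2, 0, 0)]
    learned = fit_ranking_model(observed)
    print("learned weights:", learned)
    print("agreement:", pairwise_agreement(learned, observed))
```

The learned weights are only determined up to whatever the observed orderings constrain, so more ranked samples narrow the model, which matches the point above. A real engine's scoring need not be linear; in that case a sketch like this would only approximate it.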
So I wrote software that did exactly this. It automatically ran a bunch of
high-volume queries (like for porn terms) against search engines, did an
enormous amount of esoteric numerical churn on the pages returned, and
generated a report that not only described the general ranking algorithms
used in detail, but also could generate the requirements to get an arbitrary
page ranked for a specific query and could generate diffs against existing
web pages. In practice, it worked like a charm. Other search engine
spammers called up my friend offering to pay him to rank their stuff,
because he was reliably bumping them off the search engines; it was a
relatively small community. In fact, the whole business of "page ranking"
for money that became popular emerged from this. I never made a dime from
it or used it to spam search engines; it was mostly a theoretical challenge.
Search engines that can effectively obscure the source data used in the
ranking algorithms, like Google, are substantially less vulnerable to the
type of attack I used. Page ordering using data that is not trivially
collectable is an effective defense against this, though not infallible. I
don't think the payoff from Google spamming would justify the expense and
technical effort required to effectively pull it off.
jamesr at best.com