DEBRIS.COM
good for a laugh, or possibly an aneurysm

Tuesday, March 15th, 2005

Syndicated search results from A9.com

The A9 team at Amazon.com launched a new extension to their search engine, just in time for ETECH. Jeff Bezos announced it during this morning’s “High-Order Bits” presentations.

By default, A9.com’s search results contain three columns: web results, image results, and a button column that gives one-click access to other types of results, such as Movies and Books. These buttons provide functionality similar to the text links at the top of both Google’s and Yahoo’s results pages, which offer access to images, USENET, news, Yahoo’s directory, Froogle, etc.

A9 allows users to customize their search-result columns. Users can personalize their results pages to show the types of documents they’re most likely to be searching for. This is a cool thing, and will soon be a feature of every major search engine.

But the really cool announcement of the day is that developers can create their own A9 search columns. This allows what Bezos called “domain experts” to syndicate “vertical search” results. For example, I could build an A9 search-results column for debris.com, if in fact I’d written enough about a particular topic that merited a syndicated search feed to a major search engine.

There are already dozens of custom syndicated vertical-search result “columns” available, including the NYTimes, Flickr, PubMed, NASA, etc. I predict the number will grow very quickly, as site owners realize the value of putting their content in front of the eyeballs of A9 users.

In a sense, A9’s OpenSearch technology is like Apple’s Sherlock, which is a search technology that accepts plug-ins to provide vertical search results via syndication. Two key differences are that OpenSearch is built on open standards like RSS, and A9 is a website (available to 100% of web users) whereas Sherlock is a proprietary software product (available to ~5% of computer users).

See also Cory Doctorow’s notes: Bezos on vertical search and A9


Tags:
posted to channel: Web
updated: 2005-03-16 02:03:17

Thursday, March 3rd, 2005

gadgets

Bim forwarded a link to Mobile PC Magazine’s Top 100 Gadgets of All Time. I was immediately enveloped in warm waves of nostalgia: Rubik’s Cube (#89)! Pong (#70)! Lite-Brite (#77)!

I make a pretty lousy technophile though. I’ve only owned about five of the top 100 (Pez (#98) and Etch-a-Sketch (#50) are the other two). I’ve owned more-recent versions of the Leatherman (#67) and Cuisinart (#69). And I used to sit in the window of Radio Shack to hack on the TRS-80, which ought to count for something.

The Commodore 64 should have been on this list. It was “the” Christmas gift, whatever year that was, just as Pong had been a few years before. But maybe I’m just embarrassed that I’m generally too suspicious and/or oblivious to be an early gadget adopter, and I’m grubbing for points.

One of the item descriptions in the list contains a reference to the Hitchhiker’s Guide to the Galaxy… did you spot it?


Tags:
posted to channel: Web
updated: 2005-03-04 06:14:34

Friday, February 18th, 2005

more on REFERER spamming

Prior to implementing the HTTP_REFERER blacklist described previously, I investigated the source of the faked HTTP requests. If they were all coming from the same place, I could simply block access from that address.

But the attacks are distributed: they come from many IP addresses on many networks. Here’s an example, showing the request count and source address for all hits to this site in the past week containing the word poker:

nsa /var/log/httpd : cat debris_access_log | 
grep poker | awk '{print $1}' | sort | 
uniq -c | sort -rn | head
     91 65.165.84.11
     27 68.22.118.212
     20 12.172.137.13
     14 195.30.153.194
     13 38.223.231.8
     13 212.211.130.248
     12 203.199.92.158
     11 65.88.84.205
     10 168.11.16.22
      9 82.148.70.171

Just to confirm that the methodology above isn’t whacked, here are the faked REFERERs from the top IP address:

nsa /var/log/httpd : grep 65.165.84.11 debris_access_log | 
awk '{print $11}' | sort | uniq -c | sort -rn | head
      8 "http://www.nutzu.com/poker-hands.html"
      8 "http://www.nutzu.com/free-texas-hold-em.html"
      7 "http://www.nutzu.com/internet-poker.html"
      7 "http://www.nutzu.com/free-online-poker.html"
      6 "http://www.nutzu.com/world-series-of-poker.html"
      6 "http://www.nutzu.com/strip-poker.html"
      5 "http://www.nutzu.com/poker-tournament.html"
      4 "http://www.nutzu.com/texas-holdem-poker.html"
      4 "http://www.nutzu.com/rules-of-poker.html"
      4 "http://www.nutzu.com/poker-tables.html"

How could the referer spammers be operating from so many different networks? Here's my best guess: all those IP addresses represent Wintel machines that have been hijacked by viruses and trojan horses, and they're running distributed REFERER attacks without the knowledge of their owners. The machines are probably sending tons of spam email, too.

So when I previously said "this is all Google's fault," what I really meant is "this is all Microsoft's fault."

(In Microsoft's defense, they've only been working on making Windows more secure for two years... I'm sure they'll have some meaningful progress to report RSN.)


Tags:
posted to channel: Web
updated: 2005-02-19 07:30:28

Thursday, February 17th, 2005

HTTP_REFERER spamming: the mob found my website

Like most webmasters, I keep track of the websites that link to this one. In the jargon of my people this is called “referer logging,” which is short for “HTTP_REFERER logging,” which I include for the benefit of GoogleBot.

Starting a few months ago, my referer logs became worthless; they were filled with sites that couldn’t possibly be linking to mine: paris-hilton-video.blogspot.com, www.texas-holdem-poker-downloads-4u.info, viagra.hosting4u.gb.com. In other words, even though those sites did not contain links to debris.com, my logs looked as if hundreds of people per day were clicking through from there to here.

Why would anyone bother to fake clickthroughs? Because some websites automatically display the URLs other readers have clicked through from. The gambling and porn site owners are hoping debris.com will automatically display, and link to, their URLs.

It’s all Google’s fault. Google’s PageRank system counts inbound links as relevance votes: the more sites link to website X, the more relevant website X must be. So, if a million weblogs link to paris-hilton-viagra-holdem-poker.org, then paris-hilton-viagra-holdem-poker.org will show up high in Google’s search results for any search on related terms.
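
To make the “links as votes” idea concrete, here’s a rough sketch that tallies inbound links from a list of linking-site/linked-site pairs. It is a deliberate simplification of the description above, not Google’s actual PageRank algorithm. The helper name count_inbound_links and the weblog-*.example domains are made up for illustration.

```shell
#!/bin/sh
# Simplified "links as votes" tally; NOT the real PageRank algorithm.
# Input on stdin: one link per line, as "linking-site linked-site".
# count_inbound_links is a hypothetical helper name.
count_inbound_links() {
    awk '{ votes[$2]++ } END { for (site in votes) print votes[site], site }' |
        sort -k1,1nr -k2,2
}

# Made-up sample data: three weblogs link to the spam site, one to debris.com.
printf '%s\n' \
    'weblog-a.example paris-hilton-viagra-holdem-poker.org' \
    'weblog-b.example paris-hilton-viagra-holdem-poker.org' \
    'weblog-c.example paris-hilton-viagra-holdem-poker.org' \
    'weblog-a.example debris.com' |
    count_inbound_links
```

The site with the most inbound links sorts to the top, which is exactly the effect the spammers are trying to buy.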

So, some unknown fuckwit, or collective of fuckwits, operates software that hammers on my site (and countless others, I’m sure), with the page requests faked to make it look as if readers are clicking through from various gambling and porn and pharmaceutical sites, in a lame attempt to raise their PageRank scores.

There are numerous problems with this strategy:

  1. My site doesn’t display referers, so no benefit has ever been realized by the spammers.
  2. 90% of the spamvertised URLs get shut down within a day anyway, e.g. last night’s variation, http://www.nutzu.com/internet-poker.html, so even if my site did automatically display referers, the referers would have been shut down before Google’s spiders would have counted the links as valuable PageRank votes.

The fact that the strategy is a failure doesn’t make it any less of a hassle for me. My ISP recently began charging me surplus-bandwidth fees, because all the sites I host are serving more data than I projected or paid for. Yet a measurable percentage of the bytes served by this site were not actually being seen by humans. I’m paying for the traffic generated by the referer-spammers’ software robots.

Preventing this abuse requires daily maintenance, because the spamvertised URLs change frequently. A few general keywords like poker, holdem, and viagra trap most new attacks; these are trapped hourly by a scheduled script that scans recent logs and updates the blacklist with matching domains. Every second or third day, I manually examine the logs in search of new attacks that don’t happen to match any of the keywords I’ve already defined.
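
The hourly scan amounts to something like the sketch below. This is not the actual script: the function names and the blacklist file format (one hostname per line) are placeholders, and it assumes a log format in which the quoted referer is the 11th whitespace-separated field.

```shell
#!/bin/sh
# Sketch only: names and file layout here are assumptions, not the real script.
# Assumes the quoted referer is field 11 of each access-log line.

# Print the unique referer hostnames in log $1 whose referer URL matches any
# of the keywords given as the remaining arguments.
spam_referer_domains() {
    log=$1; shift
    pattern=$(printf '%s|' "$@")   # e.g. "poker|holdem|viagra|"
    pattern=${pattern%|}           # drop the trailing "|"
    awk '{ print $11 }' "$log" | tr -d '"' |
        grep -Ei -- "$pattern" |
        awk -F/ '{ print tolower($3) }' |
        sort -u
}

# Merge newly found domains into the blacklist file, keeping entries unique.
# Usage: update_blacklist BLACKLIST LOG KEYWORD...
update_blacklist() {
    blacklist=$1; shift
    touch "$blacklist"
    tmp=$(mktemp) || return 1
    { cat "$blacklist"; spam_referer_domains "$@"; } | sort -u > "$tmp"
    mv "$tmp" "$blacklist"
}
```

A scheduled job would then call something like update_blacklist blacklist.txt debris_access_log poker holdem viagra. Note that a keyword match on a legitimate referer would blacklist that domain too, which is one reason the manual review still matters.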

So now when these robot scripts pound on my site, instead of serving up 15-20k of glorious debris.com content, the software engine that generates these pages returns a brief error message.
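
In outline, the check looks something like this. It’s a sketch written as a CGI-style shell script, not the real page engine; the function names, the one-hostname-per-line blacklist file, and the choice of a 403 response are all assumptions for illustration.

```shell
#!/bin/sh
# Sketch only: the real page engine isn't shown here. Function names, the
# blacklist format (one hostname per line), and the 403 status are assumptions.

# Extract the hostname from a referer URL, lowercased.
referer_host() {
    printf '%s\n' "$1" | awk -F/ '{ print tolower($3) }'
}

# Succeed if the hostname of referer $1 appears in blacklist file $2.
is_blacklisted() {
    host=$(referer_host "$1")
    [ -n "$host" ] && [ -f "$2" ] && grep -Fqx -- "$host" "$2"
}

# Emit a brief error for blacklisted referers, otherwise the normal page.
# $1 = referer (HTTP_REFERER under CGI), $2 = blacklist file
respond() {
    if is_blacklisted "$1" "$2"; then
        printf 'Status: 403 Forbidden\nContent-Type: text/plain\n\n'
        printf 'Forbidden.\n'
    else
        printf 'Content-Type: text/html\n\n'
        printf '<p>glorious debris.com content</p>\n'
    fi
}
```

The error response is a few dozen bytes instead of 15-20k, which is the whole point.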

Frankly, the bandwidth savings are minuscule compared to the amount consumed by people abusing the MP3s and graphics. But they’re next in line.


Tags:
posted to channel: Colophon
updated: 2005-02-18 23:43:16

Wednesday, January 26th, 2005

extreme close-up

Eye of Science: Life in a Microcosmic World


Tags:
posted to channel: Web
updated: 2005-01-27 23:26:43



Carbon neutral for 2007.