DEBRIS.COM
good for a laugh, or possibly an aneurysm

Thursday, October 6th, 2005

blogosphere stats from Dave Sifry

Continuing Web 2.0 coverage…

I’m fascinated by Technorati, the original blog search engine, for a couple of reasons:

Technorati’s archive (a retrospective search architecture, btw) is tiered into three layers based on recency. They’ve assumed that the most recent hits for a search are more interesting than older hits, and of course in many cases that’s true. But the data I/O requirements of simultaneously pushing new data into the front of each archive while expiring old data out the other side (and from there into the next-most-recent archive) would seem to be unmanageable.
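To make the shuffling concrete, here is a minimal sketch of a recency-tiered archive. It is purely illustrative: the class name `TieredArchive`, the age cutoffs, and the in-memory deques are my assumptions, not anything Technorati has described. It also assumes posts arrive in timestamp order. The point is that every post gets written once on arrival and then again each time it ages into the next tier.

```python
from collections import deque


class TieredArchive:
    """Hypothetical recency-tiered archive; not Technorati's actual design."""

    def __init__(self, max_ages):
        # max_ages: age cutoff in seconds for each tier, newest tier first.
        # The last tier is never expired, so its cutoff is ignored.
        self.max_ages = list(max_ages)
        self.tiers = [deque() for _ in self.max_ages]
        self.moves = 0  # extra writes caused by expiry between tiers

    def add(self, timestamp, doc):
        # New data is pushed into the front of the most recent tier.
        self.tiers[0].appendleft((timestamp, doc))

    def expire(self, now):
        # Pop aged-out entries off the back of each tier and push them
        # onto the front of the next-most-recent tier.
        for i, max_age in enumerate(self.max_ages[:-1]):
            tier = self.tiers[i]
            while tier and now - tier[-1][0] > max_age:
                self.tiers[i + 1].appendleft(tier.pop())
                self.moves += 1

    def search(self, predicate, limit):
        # Walk tiers newest-first, so recent hits come back first.
        hits = []
        for tier in self.tiers:
            for timestamp, doc in tier:
                if predicate(doc):
                    hits.append((timestamp, doc))
                    if len(hits) == limit:
                        return hits
        return hits


archive = TieredArchive([10, 100, None])
archive.add(0, "a")
archive.add(5, "b")
archive.add(50, "c")
archive.expire(now=55)
print(archive.search(lambda doc: True, limit=2))
print(archive.moves)
```

In this toy version the expiry is cheap, but on disk every one of those moves is a delete from one index plus an insert into another, on top of the original write.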

I learned this week that the write-to-read ratio within Technorati’s database is 5:1. That presents a horrible scaling problem, because writes are slow and can’t be easily parallelized. Compare Tribe.net’s write-to-read ratio: 1:99. Tribe’s is probably an extreme example (in fact I suspect that figure needs to be qualified), but in any case, if your write rate is bigger than your read rate, you’ll be hurting under load. E.g., searches will take 30-60 seconds.
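For a sense of scale, here is the arithmetic behind those two ratios, as a quick sketch. The helper name `write_share` is made up for illustration; it just converts a write:read ratio into the fraction of all database operations that are writes.

```python
from fractions import Fraction


def write_share(writes, reads):
    """Fraction of all operations that are writes, given a write:read ratio."""
    return Fraction(writes, writes + reads)


technorati = write_share(5, 1)   # 5:1 write-to-read
tribe = write_share(1, 99)       # 1:99 write-to-read

print(f"Technorati: {float(technorati):.0%} of operations are writes")
print(f"Tribe.net:  {float(tribe):.0%} of operations are writes")
```

Put that way, roughly five of every six operations at Technorati are writes, versus one in a hundred at Tribe.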

Part of Technorati’s problem is volume. Check out these “state of the blogosphere” stats from Dave Sifry:

It’s a big pile of data to read, store, and index.


posted to channel: Web
updated: 2005-10-06 20:55:36
