Thursday, October 6th, 2005
Continuing Web 2.0 coverage…
I’m fascinated by Technorati, the original blog search engine, for a couple of reasons:
- The technical challenge they’ve taken on is astounding — they’re facing a Google-class scaling problem. (See statistics below.)
- And, perhaps because of the first point, the system’s performance has been poor in my experience: searches take 30-60 seconds. I’ve never used the site regularly because I don’t have the patience for it.
- Even so, the recent public abandonment/shaming of Technorati by brand-name bloggers strikes me as somewhat unfair.
- New entries into Technorati’s blog-search market, e.g. Google’s blogsearch site, introduce a huge challenge to Technorati’s long-term success, especially if it’s true that Technorati is not executing well.
Technorati’s archive (a retrospective search architecture, btw) is tiered into 3 layers based on recency. They’ve made the assumption that the most recent hits for a search are more interesting than older hits, and of course in many cases that’s true. But the data I/O requirements of simultaneously pushing new data into the front of each archive while expiring it out the other side (and from there into the next-most-recent archive) would seem to be unmanageable.
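Here is a minimal sketch of that cascade, just to make the I/O cost concrete. Everything in it is an assumption for illustration: the class name, the in-memory deques, and the age cutoffs are hypothetical, and none of it reflects how Technorati actually stores data.

```python
from collections import deque


class TieredArchive:
    """Toy recency-tiered archive: newest posts in tier 0, oldest in the last tier."""

    def __init__(self, max_age_per_tier):
        # Maximum age (in seconds) for every tier except the last one,
        # which keeps everything. The cutoffs are assumptions.
        self.max_age = list(max_age_per_tier)
        self.tiers = [deque() for _ in range(len(self.max_age) + 1)]
        self.writes = 0  # counts every insert, including tier-to-tier moves

    def add(self, timestamp, post):
        # New data always enters the front (most recent) tier.
        self.tiers[0].append((timestamp, post))
        self.writes += 1

    def expire(self, now):
        # Move entries that have aged out of tier i into tier i + 1.
        for i, limit in enumerate(self.max_age):
            tier = self.tiers[i]
            while tier and now - tier[0][0] > limit:
                self.tiers[i + 1].append(tier.popleft())
                self.writes += 1

    def search(self, term):
        # Scan the newest tier first so recent hits come back first.
        for tier in self.tiers:
            for timestamp, post in reversed(tier):
                if term in post:
                    yield timestamp, post


archive = TieredArchive([10, 150])
archive.add(0, "old post about blogs")
archive.add(100, "middle-aged post about blogs")
archive.add(195, "fresh post about blogs")
archive.expire(now=200)
```

In this toy run three posts cost six writes, because a post that ends up in the oldest tier is written once per tier it passes through. That multiplier is the part that looks painful at scale.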
I learned this week that the write-to-read ratio within Technorati’s database is 5:1. That presents a horrible scaling problem, because writes are slow and can’t be easily parallelized. Compare to Tribe.net’s write-to-read ratio: 1:99. Tribe’s is probably an extreme example (in fact I suspect that figure needs to be qualified) but in any case if your write rate is bigger than your read rate, you’ll be hurting under load. E.g., searches will take 30-60 seconds.
Part of Technorati’s problem is volume. Check out these “state of the blogosphere” stats from Dave Sifry:
- There are now 18.9 million blogs
- 70,000 new blogs appear every day, about one per second
- 55% are still active 3 months later
- 13% of blogs are updated weekly or more
It’s a big pile of data to read, store, and index.
Tags:
posted to channel: Web
updated: 2005-10-06 20:55:36
Wednesday, October 5th, 2005

October 5, 2005 - SAN FRANCISCO - Today, Google announced the launch of Google Ice, an innovative new product in the beverage-cooling market. Google Ice is available exclusively at the Web 2.0 Conference in San Francisco, but will soon be rolled out within after-hours tech conference geek lounges nationwide, except perhaps in Redmond.
“Google has already made tremendous strides in making access to information on the web a reality for users across the globe, but we’re still in the Internet’s early innings,” said Google CEO Eric Schmidt. “But at least now you can see what you’re drinking.”
“We hope this product makes quite a… splash,” added Google founder Sergey Brin. “Heh, heh, heh. Heh. Ahem.”
Added Stewart Butterfield, founder of Flickr and employee of Google rival Yahoo, “5000 Ph.D.’s and they come up with this?!”
Tags:
posted to channel: Web
updated: 2005-10-06 06:56:14
90 minutes, 13 companies, six minutes apiece. (Six minutes… sort of like the elevator pitch at Sears Tower.)
- SocialText, Ross Mayfield
SocialText is the first wiki company. 30% of email is group communications… but email is not a great fit for that application. Wikis are superior (collaboration, searchability, revision, durability). SocialText is open-source. See also: wikiwyg.net (rich-text formatting for textarea fields) and SynchroEdit (“a browser-based simultaneous multiuser editor, a form of same-time, different-place groupware”).
Update: In a followup conversation, Ross Mayfield told me the 30% figure is apparently from a Gartner Group study. The term they used for this class of email is “occupational spam.”
- Rollyo, Dave Pell
Roll your own search engine: enter URLs to relevant sites. Presumably this works by filtering Yahoo’s search results based on the user’s “searchroll” domain list. “Yahoo provides the engine. we provide the steering wheel.”
- Joyent, Inc., David Young
Some kind of groupware appliance. They’re launching a hosted version soon.
- Bunchball [no link because it locks up my browser], Rajat Paharia
A development platform that provides the “plumbing” (infrastructure) for web-based interactive applications and games. The appeal is mostly to developers, and the friends of developers who come in to use the apps. Flash-based for now (additional UI toolkits in the works?).
- RealTravel, Ken Leeder
Travel information site incorporating blogs, photo albums, Google Maps… goal is to provide real travel information written by real people. Share recommendations (or warnings) with the community. Looks like a well-integrated solution. The homepage links to good examples (“must-see travel blogs”).
- Zimbra, Satish Dharmaraj
I thought this would be yet-another open-source dev platform (not that that’s a bad thing) but it turned into the best-received, most-applauded presentation. This happened during a demo of an email client built on the Zimbra suite. The message text was apparently scanned for recognizable structured data types:
- mousing over a URL pops up a thumbnail preview of the named website
- mousing over a street address pops open a Google Map of the address
- mousing over a date pops open the user’s calendar (another Zimbra app)
- mousing over a phone number pops open a Skype window
- mousing over a Fedex tracking number pops open a tracking application
This was incredibly cool. I want this in my email client right now.
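I don’t know how Zimbra does the scanning, but the simplest version I can imagine is a handful of regular expressions run over the message body. A rough sketch, assuming deliberately simplified patterns (the pattern table and the `find_entities` name are mine, and real-world phone, date, and URL formats vary far more than this; street addresses are harder and left out):

```python
import re

# Simplified, assumed patterns for illustration only.
PATTERNS = {
    "url": re.compile(r"https?://[^\s<>\"]+"),
    "phone": re.compile(r"\b\d{3}[-. ]\d{3}[-. ]\d{4}\b"),
    "date": re.compile(r"\b\d{4}-\d{2}-\d{2}\b"),
}


def find_entities(text):
    """Return (kind, matched_text, start, end) tuples sorted by position."""
    found = []
    for kind, pattern in PATTERNS.items():
        for match in pattern.finditer(text):
            found.append((kind, match.group(), match.start(), match.end()))
    return sorted(found, key=lambda item: item[2])


message = "Call 415-555-1212 on 2005-10-06, details at http://example.com/demo today"
entities = find_entities(message)
```

Each hit carries its kind and its character offsets, so a mail client could wrap that span in a hover target and pick the popup (map, calendar, dialer) based on the kind.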
- Zvents, Ethan Stock
I thought they said that Zvents has 3x as many events as their nearest competitor, but that doesn’t follow from the site’s Bay Area focus. Anyway, there is some neat functionality promised, like smart calendars and whizzy iframe integration of live event feeds into 3rd-party blogs (e.g., show dynamic listings of local music events on your music blog).
- KnowNow, Ron Rasmussen
KnowNow is an unobtrusive notification system, a browser toolbar that pops up alerts based on your subscriptions. Basically it’s an RSS aggregator that delivers new data via instant messaging. But: IE/Win only. Bah.
- Orb, Ian McCarthy
Access your digital content from anywhere. Run the client on your home PC, and stream audio/video to your office PC… or wifi MP3 player… or phone… There’s a file browser, too. Seems like a really cool thing, except that it’s also Windows only. Bah again.
- Wink, Michael Tanne
Wink is a new search solution, based on leveraging human input or (dare I say?) the wisdom of crowds. One of their innovations is a ranking system they call TagRank (a play on Google’s PageRank), which rates the quality of search results by tag popularity, or something along those lines. They’ll also allow users to manually rank search results.
- Allpeers, Matthew Gertner
“transforming firefox into a web 2.0 development platform…” Share media files with friends and family via peer-to-peer networking. It’s free and cross-platform. I have no idea where the revenue stream is.
- Flock, Bart Decrem
Flock is a new web browser with social interaction built in. It’s based on the Mozilla platform, which is a lot less insane than writing a browser from scratch. The alpha release will be out in a couple of weeks, offering some neat features: the favorites button both makes a local bookmark and submits the URL to del.icio.us; it has a built-in RSS reader; it has a built-in blog editing tool with a drag-and-drop Flickr interface, i.e., drag a photo from your Flickr pool directly into a blog posting. For bloggers, this could be pretty neat.
- PubSub, Bob Wyman
PubSub is another search-engine play, but it works in reverse. Most search engines are “retrospective,” meaning, they create an index of data and run queries against it at the time the query is submitted by the user. PubSub calls this “searching the past.” It’s an interesting distinction, and absolutely true; even in the best of cases, the newest data in Google’s index is a couple days old (in my experience). PubSub’s search is “prospective,” in that they maintain no index or archive. Rather, users write and save queries. There are no immediate results, in most cases, because there’s no index to search. Instead, PubSub monitors thousands of inbound links and performs matches in realtime. So, write a query today, and check back tomorrow for fresh results.
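To make the “prospective” idea concrete, here is a toy version: queries are stored up front, and each new document is matched against them as it arrives. This is only a sketch under my own assumptions (queries are plain AND-of-words, each query is indexed under one of its terms, text is split on whitespace with no punctuation handling). The `ProspectiveMatcher` class is hypothetical and says nothing about PubSub’s real matching engine.

```python
from collections import defaultdict


class ProspectiveMatcher:
    """Toy prospective search: store queries, match each new document on arrival."""

    def __init__(self):
        self.queries = {}                 # query_id -> set of required terms
        self.by_term = defaultdict(set)   # term -> ids of queries keyed on it
        self.results = defaultdict(list)  # query_id -> matched documents

    def subscribe(self, query_id, text):
        terms = set(text.lower().split())
        if not terms:
            raise ValueError("query must contain at least one term")
        self.queries[query_id] = terms
        # Index each query under a single term, so a document only has to
        # examine queries that share at least that term with it.
        self.by_term[min(terms)].add(query_id)

    def publish(self, document):
        words = set(document.lower().split())
        matched = []
        for word in words:
            for query_id in self.by_term.get(word, ()):
                # Candidate found; confirm every required term is present.
                if self.queries[query_id] <= words:
                    self.results[query_id].append(document)
                    matched.append(query_id)
        return sorted(matched)


matcher = ProspectiveMatcher()
matcher.subscribe("q1", "technorati scaling")
matcher.subscribe("q2", "flickr tags")
first = matcher.publish("Technorati faces a scaling problem")
second = matcher.publish("flickr photos")
```

Note that a freshly saved query has an empty result list until something matching is published, which is exactly the “no immediate results” behavior described above.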
The amazing thing about PubSub, the thing that merits a 3rd paragraph, is the processing speed of the system. They currently have about 700,000 saved queries, and they monitor millions of websites. For every new website update, they match it against the 700k saved queries. It comes out to over a trillion matches per day. And they run it on one dual-cpu Xeon box. That’s just astounding to me. They claim they can scale to 1.5M saved queries per box. Figure the typical user has 5 saved queries… that’s 300,000 users per box. Seems like cheap scaling to me.
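A quick back-of-the-envelope check on those numbers. The one assumption I’m adding is that every update is tested against every saved query exactly once, and that “over a trillion” can be treated as a lower bound of exactly one trillion; the function names are just for illustration.

```python
SAVED_QUERIES = 700_000
MATCHES_PER_DAY = 1_000_000_000_000  # "over a trillion", treated as a lower bound
MAX_QUERIES_PER_BOX = 1_500_000
SECONDS_PER_DAY = 24 * 60 * 60


def implied_updates_per_day(matches_per_day, saved_queries):
    # If each update is compared with every saved query, then
    # matches = updates * queries, so updates = matches / queries.
    return matches_per_day / saved_queries


def users_per_box(max_queries, queries_per_user):
    return max_queries // queries_per_user


updates = implied_updates_per_day(MATCHES_PER_DAY, SAVED_QUERIES)
print(f"{updates:,.0f} updates/day, about {updates / SECONDS_PER_DAY:.1f} per second")
print(f"{users_per_box(MAX_QUERIES_PER_BOX, 5):,} users per box at 5 queries each")
```

Under that assumption the box is handling roughly 1.4 million updates a day, or about 17 per second. The users-per-box figure is sensitive to the queries-per-user guess: at 3 queries each, the same 1.5M ceiling would cover 500,000 users.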
Update: In a followup conversation, Bob Wyman told me that the figure of “2-3 trillion” matches per box per day is actually an “effective” rate — there is some duplication in the query list. The number of unique matches per day is smaller.
Most users, he said, save about 3 queries. “Their name, their blog…” I offered. “And their employer,” said Wyman. Me, me, and me.
I pointed out that PubSub faces a nontrivial usability challenge, in that users submit a query and get 0 results. Whether it’s a superior experience in the future or not, in the moment it feels like a failure. Wyman acknowledged this and said they’re a switch-throw away from using their own retrospective search engine to show the most recent 32 matches. They have the data, but they’ve chosen not to show it.
Tags:
posted to channel: Personal
updated: 2005-10-06 19:41:43
Continuing Web 2.0 coverage…
I snuck into the Tagging workshop when the guard’s back was turned. A couple great ideas were presented even in just the final 15 minutes.
- “the first derivative of tag popularity is interestingness.” Tag recency is a hugely valuable piece of metadata. Check out Flickr’s “hot tags” feature, broken down into categories (last 24 hrs, last 7 days). If you store timestamps with tags, generating these lists is pretty easy.
- community tagging is one of the next big things — what tags are new/popular in my community? For example, look at a geographic area as a community, e.g. “what tags are most popular in Berlin?”
- a neat new feature idea with no obvious implementation: expiration of interest. Let the user tag something in a way that expires after a preset period, e.g. “I’m tagging things about Greece now, but after my trip next month I don’t care any more.” I need to flesh this out more.
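A sketch of the first and last ideas, assuming tags are stored as (tag, timestamp) pairs. “First derivative” is approximated here as the count in the latest window minus the count in the window before it, and “expiration of interest” as an optional expiry time stored with each of a user’s tags. The function names and the data layout are my own guesses, not anything presented at the workshop.

```python
from collections import Counter


def hot_tags(events, now, window, top=5):
    """Rank tags by growth: uses in the latest window minus uses in the one before.

    events: iterable of (tag, timestamp) pairs; timestamps and window in seconds.
    """
    recent, previous = Counter(), Counter()
    for tag, ts in events:
        age = now - ts
        if 0 <= age < window:
            recent[tag] += 1
        elif window <= age < 2 * window:
            previous[tag] += 1
    # Only tags used in the latest window can be "hot".
    growth = {tag: recent[tag] - previous[tag] for tag in recent}
    ranked = sorted(growth.items(), key=lambda kv: (-kv[1], kv[0]))
    return ranked[:top]


def active_interests(interests, now):
    """Keep interests whose expiry is unset (None) or still in the future."""
    return [tag for tag, expires_at in interests
            if expires_at is None or now < expires_at]


events = [("web2.0", 95), ("web2.0", 90), ("web2.0", 85), ("web2.0", 40),
          ("greece", 92), ("flickr", 30), ("flickr", 20)]
trending = hot_tags(events, now=100, window=50)
mine = active_interests([("greece", 200), ("berlin", None), ("oldtrip", 50)], now=100)
```

In the sample data “flickr” was popular in the earlier window but unused in the latest one, so it drops off the list, while “oldtrip” disappears from the user’s interests once its expiry has passed.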
Also, I made this important technical discovery: when 200 people crowd into a room built for 150, it’s at least 5°F cooler on the floor.
Tags:
posted to channel: Personal
updated: 2005-10-05 20:34:37
6:45 am depart home
7:05 am arrive highway
7:06 am traffic jam. crawl about 3 miles in 40 minutes.
7:45 am brief respite from traffic, as inexplicable as the jam itself. no sign of accident or lane closure. drive 75 mph for 5 minutes. celebrate, briefly.
7:50 am another traffic jam. crawl about 3 miles in 40 minutes.
8:15 am seethe
8:30 am Web 2.0 conference starts without me. But! traffic opens up (inexplicably).
8:42 am arrive GGB toll plaza in just under 2 hours, avg speed = 35 mph
8:43 am hit traffic 1 mile past the bridge. WTF?
9:00 am seethe
9:10 am arrive Argent hotel. Elapsed time = 2 hrs 25 minutes for a 65 mile drive.
I rushed upstairs to the first workshop. The doors were closed. The guard said even the standing room was taken.
Later I learned that at the first Web 2.0 conference, only half the registered attendees showed up for the morning workshops. The organizers planned for 50% turnout this year. But apparently just about everyone showed up. All the workshops are overflowing into the hallways. People sit hip-to-hip or stand three deep in the corners. Between sessions, corridors are completely gridlocked.
The organizers are coping as well as they can, but as the morning rolls on only more people arrive.
Tags:
posted to channel: Personal
updated: 2005-10-05 20:30:28