Cleaning our feeds index

When we started, we collected feeds indiscriminately: blogs, splogs, news feeds, bloggers' comment feeds, small aggregators' feeds, search engine result feeds, and link farms that had discovered RSS as a convenient way to multiply their distribution.

Recently we decided to focus on bloggers only (as a blog search engine should anyway), and I can tell you it is very hard to pick out bloggers from the huge piles of feeds scattered across the blogosphere. As part of this effort we have already downsized our index, which was already clean of spam blogs, from 1 million feeds to 750K (we have several million spam blogs blacklisted from our index). We hope that once our index consists mostly of real bloggers, real unique voices will emerge in search results instead of highly ranked news sites.

I know not all blog search engines see this as a problem; some even take advantage of the fact that they index both news and blogs and position themselves accordingly. I feel that although top bloggers do rise in search results thanks to link patterns equivalent to what news sites receive, other voices that are less prominent but no less important still do not get their share of exposure.

Anyway, we will see :)