external searching bugzilla database, robots.txt

Tue Feb 22 03:24:07 UTC 2005

Jason Remillard wrote:

> Now that I have a clue, I see that there is a bug (81920) against not 
> letting in search engines. It is marked as “will not fix”. I am not sure 
> if it was marked as do not fix because letting in search engines was 
> considered a bad idea, because the solution suggested in the bug was 
> bad, or because it was not going to be done for the current release (2.16).

The main reason it's been WONTFIXed in the past is the load it would 
put on the server if we allowed spiders to index it.  Google themselves 
are usually pretty good about this (they hit pretty infrequently to 
space out the traffic they generate), but some other spiders aren't.  
Some of them ignore the useragent restriction in the robots.txt ("if 
Google's allowed to get it then we will too") but don't space their 
requests out, and that kills the server.
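
To illustrate the useragent restriction, here is a minimal sketch of how a cooperative spider would read such a robots.txt, using Python's standard urllib.robotparser. The robots.txt content, the host bugs.example.org, the bot name SomeOtherBot, and the 10-second delay are all made up for the example and are not the rules any real Bugzilla site uses. Crawl-delay is a non-standard directive that not every crawler honors.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: only Googlebot may fetch bug pages, and it is
# asked to pause between requests. Paths and the delay are illustrative.
ROBOTS_TXT = """\
User-agent: Googlebot
Allow: /show_bug.cgi
Crawl-delay: 10

User-agent: *
Disallow: /
"""


def build_parser(text: str) -> RobotFileParser:
    """Parse robots.txt text without fetching anything over the network."""
    parser = RobotFileParser()
    parser.parse(text.splitlines())
    return parser


parser = build_parser(ROBOTS_TXT)
url = "https://bugs.example.org/show_bug.cgi?id=1"  # hypothetical host

for agent in ("Googlebot", "SomeOtherBot"):
    allowed = parser.can_fetch(agent, url)   # may this agent fetch the URL?
    delay = parser.crawl_delay(agent)        # requested pause, or None
    print(agent, allowed, delay)
```

Note that this only models a spider that chooses to obey the file. Nothing here stops a spider that ignores robots.txt, which is exactly the problem described above.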

Another reason (which Gerv pointed out) is that Google only scrapes a 
given site once a month, and Google's index could be hopelessly out of 
date within days at the rate some things happen on bugs.

> Q: Would a patch to allow indexing by search engines not get accepted 
> because of a policy decision against external indexing? Or is it simply 
> a matter of getting an implementation that would work well?

Each site, of course, can change robots.txt as they see fit.  If you're 
not already handling thousands of users, your site probably won't take 
much additional hurt by having a few spiders hitting it.

> The best way I see this working is to add a link somewhere that would 
> bring up a series of "browse" pages. The browse pages would form a 
> hierarchy (perhaps off of the advanced search page).
> 
> Product List or Products -> Bugs opened by year-week number -> Links to 
> the bug numbers
> 
> This would look like a real browse to the search engines, and hopefully 
> not kill the server because it would be a set of nested pages, allowing 
> the queries to be broken up into lots of small chunks. I think this 
> would be pretty easy to put together, but I don’t want to start on it if 
> everybody is dead against the idea of indexing.

Yeah, this seems like the best idea.  I think there's already a bug on 
having a way to browse bugs (not even related to search engine 
indexing), and if done well, that could allow indexing to be done 
without creating a lot of load.
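
A rough sketch of the proposed hierarchy (product -> year-week -> bug links), in Python, just to show how the pages could be broken into small chunks. The bug records, the ISO year-week key format, and the function names are assumptions made up for the example. A real implementation would query the Bugzilla database and render templates rather than print text.

```python
from collections import defaultdict
from datetime import date

# Hypothetical bug records: (bug id, product, date opened).
BUGS = [
    (101, "Bugzilla", date(2005, 1, 3)),
    (102, "Bugzilla", date(2005, 1, 5)),
    (103, "Bugzilla", date(2005, 2, 21)),
    (104, "Firefox", date(2005, 2, 22)),
]


def week_key(opened: date) -> str:
    """Return an ISO year-week label such as '2005-W08'."""
    iso = opened.isocalendar()
    return f"{iso[0]}-W{iso[1]:02d}"


def build_browse_tree(bugs):
    """Group bug ids by product, then by the week they were opened."""
    tree = defaultdict(lambda: defaultdict(list))
    for bug_id, product, opened in bugs:
        tree[product][week_key(opened)].append(bug_id)
    return tree


tree = build_browse_tree(BUGS)

# Each product and each week would be its own small page; here the
# nesting is simply printed as indented text.
for product in sorted(tree):
    print(product)
    for week in sorted(tree[product]):
        links = ", ".join(f"show_bug.cgi?id={b}" for b in tree[product][week])
        print(f"  {week}: {links}")
```

Because each leaf page lists only one product-week of bugs, a spider walking the tree triggers many small lookups instead of one huge query.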

-- 
Dave Miller                                   http://www.justdave.net/
System Administrator, Mozilla Foundation       http://www.mozilla.org/
Project Leader, Bugzilla Bug Tracking System  http://www.bugzilla.org/