Search.pm SQL queries

David Marshall dmarshal at yahoo-inc.com
Thu Feb 26 20:24:18 UTC 2009




On 2/25/09 6:56 PM, "Max Kanat-Alexander" <mkanat at bugzilla.org> wrote:

> On Wed, 25 Feb 2009 14:40:49 -0800 David Marshall
> <dmarshal at yahoo-inc.com> wrote:
>> (I am working semi-furiously on
>> catching us up and will expose whatever code I can as soon as
>> possible.)
> 
> Cool. We refactored Search.pm, so if you gave us a patch right
> now it wouldn't apply, but the actual internal logic of all the search
> operators hasn't really changed much, so it shouldn't be too hard to
> port.

The greater likelihood for this summer is that we'll be able to throw some
code out as inspiration for others to make patches, but barring that, I'll
eventually produce patches myself.  OTOH, every time I have thought, "Cool,
I can starting pushing stuff upstream in a few weeks," some new important
thing to do shows up and upstream patches move farther back on my calendar.

> 
>> I anecdotally suspect that for at least MySQL 4.1, when the number of
>> possible query plans exceeds some number (around 24), MySQL gives up
>> and just goes with some arbitrary join order, usually the order as
>> written in the query.
> 
> BTW, this makes me wonder--have you tried an experimental
> import into PostgreSQL to see if it gives you better query plans/times?
> I know that Yahoo has a big investment in MySQL and so wouldn't move
> off it, but I'd be interested to know if there really is a significant
> difference. I saw a fair difference in some of my testing a few years
> back, but I didn't have a DB like you guys do.

Too much secret MySQL sauce for this to be a realistic possibility.  Oracle
would be more likely, but that's not much more realistic.

> 
>> WHERE bugs.bug_id IN (list)
> 
> Isn't that IN slow, though? I thought MySQL stopped using
> indexes on IN clauses over a certain size, or something.

It could be, but it's probably not going to be any slower than the
alternative.  I admit that it's probably not the best answer, but it
addresses the problem we have now with the technology we have.

> 
>> Finally, what has made the biggest difference is based on MySQL 4.1's
>> inability to use multiple indexes at once.
> 
> Supposedly fixed in 5.1 or 5.0, though. I'm not sure I've seen
> any query plans where it actually does that though.
> 
>> Problems with Search.pm have less to do with the transformation of
>> some chart tuple to SQL and more to do with the philosophy on
>> constructing one big query and hoping for the best.  Making this more
>> complicated is that for some searches, it is better to just leave
>> things alone and let the database figure it out!
> 
> Yeah, although I'd like to think in an ideal world that the
> database should always be the best at figuring these things out...or at
> least that query optimizers will improve in the future.

Agreed 100%.  The ultimate answer isn't fiddling with the SQL statement -
it's having a query optimizer that can do the same thing automagically.  I
have all kinds of clever ideas (for instance, a Bayesian SQL rewriter that
could choose the best SQL for your particular search), but some of those are
best reserved for if I ever forget why I don't want to get a PhD.

> 
>> I used to say that Search.pm should only return a list of bug IDs
>> that match the search criteria, and then buglist.cgi should get the
>> data needed for display.  I'm wavering on that now, because as long
>> as Search.pm has gone to the trouble of figuring everything out, why
>> not just get the data?
> 
> Yeah, I think eventually Search.pm will just return Bug
> objects.

Generally speaking, this is a good idea.  That would add a little bit of
cost, but I think it is worth it.

> 
>> The future direction of Search.pm for us is to improve the tree
>> mechanism to produce faster queries.
> 
> And perhaps to upgrade to MySQL 5.x and see if that improves
> things?

We're moving some of our shadow databases to MySQL 5.1 very soon to see what
happens.  My only concern is whether "improving" queries for 4.1 will
actually be counterproductive in 5.1.  I see a lot of slow-query log
browsing in my future.

> 
>>  This will probably involve locking all the tables involved
> 
> Probably unnecessary, since 3.2 uses InnoDB? Or you could do
> READ UNCOMMITTED and see if that makes a difference?
> 
>> and doing possibly multiple
>> queries to determine the list of bugs followed by one grand query to
>> get all the data for those bugs,
> 
> Yeah, that might be reasonable. I'd want to see what the actual
> pathological queries are, though, if we're going to adjust things to
> work better upstream. I'm sure you guys have a lot of data that would
> be valuable to us in terms of optimization.
> 
> I'd be opposed to any optimization before we finish refactoring
> Search.pm architecturally, though, so that should happen first.

I absolutely agree but wanted to provide some information about what we've
learned about Search.pm for folks who are thinking about it now.  Effort is
better spent, in my opinion, on the refactoring to make it more easily
understood and maintained than on having it emit better SQL, at least for
those without large databases.

> 
> -Max




More information about the developers mailing list