Search.pm SQL queries

Wed Feb 25 22:40:49 UTC 2009

There are several areas for improvement with Search.pm.  I have fixed some
of them for Yahoo!, but because we are so far behind HEAD, I don't have
patches to submit yet.  (I am working semi-furiously on catching us up and
will expose whatever code I can as soon as possible.)

This is a lengthy discussion aimed only at Search.pm geeks.  You have been
warned!

Data: We use MySQL 4.1 and have about 2.5M rows in table bugs, about 11M
rows in table longdescs.

I anecdotally suspect that for at least MySQL 4.1, when the number of
possible query plans exceeds some number (around 24), MySQL gives up and
just goes with some arbitrary join order, usually the order as written in
the query.

This means that if Search.pm produces a query such as

SELECT (many columns)
FROM (many joined tables)
WHERE (nasty predicate)

the executed query plan will probably suck.

For some queries we now do

SELECT bugs.bug_id
FROM (minimum joined tables)
WHERE (nasty predicate) LIMIT 2X

followed by

SELECT (several columns)
FROM (many joined tables)
WHERE bugs.bug_id IN (list) LIMIT X

We don't use a subquery because if the number of potential bugs is very
large, it may cost a lot more to do it this way.

Another problem is how confusing the predicate can be - there can be lots of
AND/OR/NOT with various paretheses involved.  For the biggest improvement we
use, that's bad.  We convert the charts to a tree of AND/OR nodes,
transforming subtrees for negation as needed, combining siblings where
possible, and so on.  What results is a least complicated predicate,
including changing (for instance) "NOT foo IN (1,2,3)" to "foo NOT IN
(1,2,3)".  That may be a poor example, because the database probably sees
those as equivalents.

Finally, what has made the biggest difference is based on MySQL 4.1's
inability to use multiple indexes at once.  Our prime example was to search
for bugs where user 123 is either the owner or the QA contact:

SELECT (columns)
FROM (tables)
WHERE (blah) AND (bugs.assignee = 123 OR bugs.qacontact = 123)

Disaster!  Such queries almost always uses bug_status as an index, which
doesn't work very well for us.

We rewrite this as (essentially)

SELECT (columns)
FROM (tables)
WHERE (blah) AND bugs.assignee = 'foo'
UNION
SELECT (columns)
FROM (tables)
WHERE (blah) AND bugs.qacontact = 'foo'

Actually, we do the first query (and assuming it returns some bugs) do the
next query as

SELECT bugs.bug_id
FROM (tables)
WHERE (blah) AND bugs.qacontact = 'foo' AND bugs.bug_id NOT IN (list)

so that if (blah) is expensive we don't have to pay it for rows we already
know are in the result set.  Of course, we lock all the tables for these
queries to get a consistent view of data.

Summary and future direction:

Problems with Search.pm have less to do with the transformation of some
chart tuple to SQL and more to do with the philosophy on constructing one
big query and hoping for the best.  Making this more complicated is that for
some searches, it is better to just leave things alone and let the database
figure it out!

I used to say that Search.pm should only return a list of bug IDs that match
the search criteria, and then buglist.cgi should get the data needed for
display.  I'm wavering on that now, because as long as Search.pm has gone to
the trouble of figuring everything out, why not just get the data?

The future direction of Search.pm for us is to improve the tree mechanism to
produce faster queries.  This is important for us because our database is
growing very quickly.  This will probably involve locking all the tables
involved and doing possibly multiple queries to determine the list of bugs
followed by one grand query to get all the data for those bugs, possibly
including stuff in memcached or Sphinx, for instance.

On 2/25/09 2:00 PM, "Michael Leupold" <lemma at confuego.org> wrote:

> On Wednesday 25 February 2009 14:28:49 Joel Peshkin wrote:
>>> I've recently been scouting around Search.pm, trying to do something
>>> about https://bugzilla.mozilla.org/show_bug.cgi?id=476722 (no promises
>>> something will come out).
>>> 
>>> I stumbled upon some queries where I'm not entirely sure why they are
>>> implemented like that, eg. _commenter:
>>> ---------------
>>>     $$f = "login_name";
>>>     $$ff = "profiles.login_name";
>>>     $$funcsbykey{",$$t"}($self, %func_args);
>>>     push(@$supptables, "LEFT JOIN longdescs AS $table " .
>>>                        "ON $table.bug_id = bugs.bug_id $extra " .
>>>                        "AND $table.who IN" .
>>>                        "(SELECT userid FROM profiles WHERE $$term)"
>>>                        );
>>>     $$term = "$table.who IS NOT NULL";
>>> ---------------
>>> 
>>> I don't really understand why there's a LEFT JOIN followed by a "who IS
>>> NOT NULL" which seems to be the same as an INNER JOIN. Not being a
>>> database guru I'm also not sure if the subquery is supposed to perform
>>> faster than another join.
>>   Boolean charts surround entire WHERE clauses with operators including
>> NOT( ).  Add the debug option to some queries where you look for a bug
>> assigned to someone who has never commented and look at the resulting SQL.
>>   That said, the logic goes back to MySQL 3 days.   There may be
>> opportunities to optimize more with subqueries.
> 
> Thanks for the pointer. I really didn't consider negation at all.
> 
> That said it seems the subquery is pretty slow for me. The following seems to
> run considerably faster while providing the same functionality:
> LEFT JOIN (longdescs
> INNER JOIN profiles
> ON longdescs.who = profiles.userid
> AND $$term)
> ON longdescs.bug_id = bugs.bug_id
> WHERE longdescs.who IS NOT NULL
> 
> I only tried this on MySQL 5.1 as I don't have a postgres bugzilla database
> around currently.
> Unfortunately like this it's still not possible to move the actual query
> condition $term into a WHERE clause which is what I tried to achieve in the
> first place.
> 
> Regards,
> Michael
>