Query Improvement (was Re: Bugzilla Contribution Process (was RE: New language discussion?))

David Marshall dmarshal at yahoo-inc.com
Fri Nov 2 23:06:22 UTC 2007

On Nov 2, 2007, at 3:07 PM, Benton, Kevin wrote:

>>> So are we for the purpose of general use, however, we need the
>>> denormalization to keep large volume queries operating relatively
>>> quickly.
>> 	I wonder if they could be done just as quickly in the way Yahoo
>> does it, with UNIONed selects with simple WHERE clauses.
> How would that work?  Do tell! :-)  The more we can get away from
> denormalization, the easier it becomes to maintain code.  I'd be  
> glad to
> have an off-line with you and post the results back to the group.  I
> haven't used UNION much at all myself.  I bet I'm not alone...

The technique is pretty simple, but the implementation is slightly  
less so...

The source of the idea is Yahoo!'s own Jeremy Zawodny (plus probably  
many other inventors of the same technique):


Consider this statement:

AND (assigned_to = 53922 OR reporter = 53922)

With the exception of very small tables, MySQL cannot use either the  
index on assigned_to or the index on reporter.  For our database,  
with about 1.5M rows in table bugs, the query optimizer uses the  
index on bug_status index.  Yuck!  Long, slow, query.  It takes 2.55  
on my development database.

If we are writing queries by hand, this query can be trivially  
improved to:

AND assigned_to = 53922
AND reporter = 53922

Same results, but it takes 0.03 seconds on my development database.   
Big win!

To use this in Bugzilla, the trick is to know when to transform  
statements with ORs into statements with UNIONs and then how to do so.

To do it, I modified Search.pm to create what I call a "predicate  
tree."  It's a tree of ANDs or ORs.  I then transform that tree into  
a list of equivalent trees, run the query for each resulting  
predicate, and then combine the results.  I don't actually create  
statements with UNION in them, but the idea is the same.

Unfortunately, there are a lot of traps in this approach.  It's easy  
to transform so-so queries into really bad queries.  We have a very  
simplistic heuristic for deciding when to transform a query or to  
just let it go.

For us, the end result is that we add about 30 milliseconds to every  
query on average, but the "worst" queries finish in a few seconds  
rather than several minutes.

Furthermore, the main problem that we are solving this way manifests  
itself really only when replicating to a shadow database.  If a very  
slow query is processing when a replication statement shows up, the  
replication must wait for the slow query.  Then, any new queries must  
wait for the replication to finish.  This can lead to a horrible logjam.


The largest Bugzilla database, particularly when replicated to a  
shadow database, can exhibit poor performance resulting from advanced  
searches.  The archetypal slow query is searching for all bugs that  
have been assigned to, reported by, or CCed by a particular user.

Yahoo! has worked around the biggest bottlenecks by transforming some  
of these slow queries into a series of faster queries.

However, this is really just an extreme example of the overall  
scalability problems that pervade Bugzilla (at least up to 2.22.0).   
Many sections of code interact with the database in ways that become  
very expensive as the database grows large.  (I don't intend that as  
a criticism of Bugzilla in any way; I can only hope that our code  
would perform as well if someone uses it on a database one hundred  
times larger than ours.)


If you are trying to squeeze more performance from your Bugzilla,  
there are probably other ways to do it.  This approach is crafted for  
a very specific set of circumstances.

What's Next:

First, I need to generalize the "predicate tree" notion into a CPAN  
module.  (I will be stuck in Virginia and Pittsburgh in December/ 
January, so I am hopeful that this will happen no later than that time.)

Second, we need to get our Bugzilla a little bit closer to current  
Bugzilla so that we can make meaningful patches.

Then we can start to roll out our changes to Search.pm.  Because most  
Bugzilla databases are pretty small, it's probably not going to have  
any noticeable effect on the performance of anyone's Bugzilla, unless  
you have at least 500K rows in your bugs table.

Note: If you *do* have 500K rows in your bugs table and would like to  
discuss previewing some of our changes, please contact me directly.

More information about the developers mailing list