Query Improvement (was Re: Bugzilla Contribution Process (was RE: New language discussion?))
David Marshall
dmarshal at yahoo-inc.com
Fri Nov 2 23:06:22 UTC 2007
On Nov 2, 2007, at 3:07 PM, Benton, Kevin wrote:
>>
>
>>> So are we for the purpose of general use, however, we need the
>>> denormalization to keep large volume queries operating relatively
>>> quickly.
>>
>> I wonder if they could be done just as quickly in the way Yahoo
>> does it, with UNIONed selects with simple WHERE clauses.
>
> How would that work? Do tell! :-) The more we can get away from
> denormalization, the easier it becomes to maintain code. I'd be
> glad to
> have an off-line with you and post the results back to the group. I
> haven't used UNION much at all myself. I bet I'm not alone...
>
The technique is pretty simple, but the implementation is slightly
less so...
The source of the idea is Yahoo!'s own Jeremy Zawodny (plus probably
many other inventors of the same technique):
http://www.oreilly.com/catalog/hpmysql/index.html
Consider this statement:
SELECT * FROM bugs WHERE bug_status IN ('NEW, 'ACCEPTED', 'REOPENED')
AND (assigned_to = 53922 OR reporter = 53922)
With the exception of very small tables, MySQL cannot use either the
index on assigned_to or the index on reporter. For our database,
with about 1.5M rows in table bugs, the query optimizer uses the
index on bug_status index. Yuck! Long, slow, query. It takes 2.55
on my development database.
If we are writing queries by hand, this query can be trivially
improved to:
SELECT * FROM bugs WHERE bug_status IN ('NEW', 'ACCEPTED', 'REOPENED')
AND assigned_to = 53922
UNION
SELECT * FROM bugs WHERE bug_status IN ('NEW', 'ACCEPTED', 'REOPENED')
AND reporter = 53922
Same results, but it takes 0.03 seconds on my development database.
Big win!
To use this in Bugzilla, the trick is to know when to transform
statements with ORs into statements with UNIONs and then how to do so.
To do it, I modified Search.pm to create what I call a "predicate
tree." It's a tree of ANDs or ORs. I then transform that tree into
a list of equivalent trees, run the query for each resulting
predicate, and then combine the results. I don't actually create
statements with UNION in them, but the idea is the same.
Unfortunately, there are a lot of traps in this approach. It's easy
to transform so-so queries into really bad queries. We have a very
simplistic heuristic for deciding when to transform a query or to
just let it go.
For us, the end result is that we add about 30 milliseconds to every
query on average, but the "worst" queries finish in a few seconds
rather than several minutes.
Furthermore, the main problem that we are solving this way manifests
itself really only when replicating to a shadow database. If a very
slow query is processing when a replication statement shows up, the
replication must wait for the slow query. Then, any new queries must
wait for the replication to finish. This can lead to a horrible logjam.
Summary:
The largest Bugzilla database, particularly when replicated to a
shadow database, can exhibit poor performance resulting from advanced
searches. The archetypal slow query is searching for all bugs that
have been assigned to, reported by, or CCed by a particular user.
Yahoo! has worked around the biggest bottlenecks by transforming some
of these slow queries into a series of faster queries.
However, this is really just an extreme example of the overall
scalability problems that pervade Bugzilla (at least up to 2.22.0).
Many sections of code interact with the database in ways that become
very expensive as the database grows large. (I don't intend that as
a criticism of Bugzilla in any way; I can only hope that our code
would perform as well if someone uses it on a database one hundred
times larger than ours.)
Caveat:
If you are trying to squeeze more performance from your Bugzilla,
there are probably other ways to do it. This approach is crafted for
a very specific set of circumstances.
What's Next:
First, I need to generalize the "predicate tree" notion into a CPAN
module. (I will be stuck in Virginia and Pittsburgh in December/
January, so I am hopeful that this will happen no later than that time.)
Second, we need to get our Bugzilla a little bit closer to current
Bugzilla so that we can make meaningful patches.
Then we can start to roll out our changes to Search.pm. Because most
Bugzilla databases are pretty small, it's probably not going to have
any noticeable effect on the performance of anyone's Bugzilla, unless
you have at least 500K rows in your bugs table.
Note: If you *do* have 500K rows in your bugs table and would like to
discuss previewing some of our changes, please contact me directly.
More information about the developers
mailing list