Custom fields schema

Myk Melez myk at mozilla.org
Thu Jan 27 13:54:20 UTC 2005


Sean McAfee wrote:

>Myk:
>  
>
>>*sigh*, ok, I'll do some testing next week when I get back from 
>>vacation, although I really think the burden of proof should be on you, 
>>given that you're the one suggesting we overturn thirty years of RDBMS 
>>and relational database theory, design, and practical usage,
>>    
>>
>
>OK, I take exception to this.  Am I really such a maverick?  Most of the
>people who have stated an opinion on the subject have expressed approval
>of my design.  Can we all be as naive as you say?
>  
>
I didn't say you or your supporters were naive, nor do I think you are.  
I only said that your suggestion is contrary to proven general and 
Bugzilla-specific database design principles, and thus the burden is on 
you to prove its superiority.

And I disagree that most people have approved of your design.  I count 
only Joel Peshkin, Maxwell Kanat-Alexander, Shane H. W. Travis, and 
Christopher Hicks as having expressed a clear opinion in support of your 
solution in this thread, while Gervase Markham and Vlad Dascalu both 
seem to oppose it, and John Fisher, Kevin Benton, Bradley Baetz, and 
Nick Barnes have not expressed a clear position either way.

But even if all of those people supported your position, the sample size 
is too small to prove that your solution is preferable in the face of 
credible contrary evidence from the field.  We're a small group, and we 
could all be incorrect.

>Custom fields are a completely new kind of beast.  There is no precedent for
>them in Bugzilla's development history.  (Not in the core distribution,
>anyway; some partial solutions are attached to bug 91037.  I don't think
>that's what you're talking about, though.)  There's no a priori reason to
>apply past Bugzilla development techniques to them.
>  
>
Custom fields aren't very different from standard fields, and a number 
of our standard fields will become custom fields or reuse custom fields 
code once it's done.  But even if they weren't similar, there is plenty 
of precedent for them, as installations have been adding custom fields 
for years, often with real columns.

>What is this "column metaphor"?  Your design treats custom fields very
>differently than standard fields, applying more of a "table metaphor".
>  
>
My design uses real columns to represent fields, accounting for variation 
in sparsity by putting dense columns into the bugs table (like the standard 
"severity" field, which lives in the bugs.bug_severity column) and 
sparse columns into their own tables (like the standard "duplicate of 
bug #" field, which lives in the duplicates.dupe_of column).  In both 
cases, however, each custom field is represented by its own unique 
database column, so the term "column metaphor" is apropos.

>>They're modifiable via SQL statements just as 
>>easily as the data within them is.  And while Bugzilla doesn't modify 
>>its schema very much today, there's nothing inherently more dangerous 
>>about it doing so.
>>    
>>
>
>But much less elegant.  To paraphrase Einstein, I think the schema ought to
>be as simple as possible, but no simpler.  Transmeta's Bugzilla installation
>has 187 custom fields.  A schema with in excess of 250 tables is not simple.
>Imagine trying to manage a schema of that size with a visual tool!
>  
>
Elegance, in this case, truly seems to be in the eye of the beholder.  
To my mind, FAC is more elegant because it uses the database system as 
it was designed to be used, and that makes it simpler (although not too 
simple to be useful).  And while it might be difficult to manage a 
schema with 250 tables using certain visual tools (f.e. an ER 
diagrammer), it wouldn't be with others (f.e. a list representation of 
tables).  Plus, at least it's possible to visualize FAC with standard 
visual tools, while it's not possible for FAD.  And visual overload in 
ER diagrammers can be overcome by limiting the scope of tables visualized.

>Later, [Myk] responding to Christopher Hicks:
>
>  
>
>>That doesn't mean we should store all meta-data as data.  We should use 
>>the right tool for the job, as we have already done with standard 
>>fields, for which we rightly use columns.
>>    
>>
>
>Yes, that is right, because all bugs share the same standard fields.  That
>condition is violated by custom fields.
>
Actually some standard fields are used only by certain products on some 
installations, and some installations never use certain standard 
fields.  That hasn't prevented those fields from serving Bugzilla well, 
so it is no reason to throw that storage model out (although it's worth 
tweaking it to store sparse fields in separate tables and converting 
frequently unused fields to custom fields).

>And, again, your proposal
>implements custom fields very differently from the way standard fields are
>implemented, anyway.
>  
>
Actually my proposal implements custom fields as columns within the bugs 
table or within their own table, just as Bugzilla does today with 
standard fields.

>Querying against N custom fields results in joins against N tables in your
>scheme.  In mine, it results in joins against T(N), the number of distinct
>datatypes among those N fields.  The total number of joins among those
>T(N) tables is N including repeated joins, but I suspect that it is still
>cheaper to access fewer tables.
>  
>
My tests, which include the queries you designed to showcase the 
performance of your approach, indicate otherwise.

>Consider also simply retrieving custom field data.  For a bug with twenty
>custom short string fields, your design would require SELECTs against
>twenty different tables; mine requires only one.
>  
>
This isn't strictly accurate, since under my proposal only sparse fields 
will live in separate tables.  But even in a worst-case scenario, 
retrieving data for a single bug is insignificantly expensive under both 
designs.
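For contrast, here is a toy sketch of the two retrieval patterns in 
Python/sqlite3.  All table and field names here (cf_shortstring, cf_board, 
cf_firmware) are hypothetical, not actual schema from either proposal:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()

# EAV/FAD style: all short-string custom fields share one values table,
# keyed by (bug, field name).
cur.execute("CREATE TABLE cf_shortstring (bug_id INTEGER, field TEXT, value TEXT)")
cur.executemany("INSERT INTO cf_shortstring VALUES (?, ?, ?)",
                [(1, "cf_board", "rev_a"), (1, "cf_firmware", "1.2")])

# FAC style (sparse fields): each custom field gets its own table.
cur.execute("CREATE TABLE cf_board (bug_id INTEGER PRIMARY KEY, value TEXT)")
cur.execute("CREATE TABLE cf_firmware (bug_id INTEGER PRIMARY KEY, value TEXT)")
cur.execute("INSERT INTO cf_board VALUES (1, 'rev_a')")
cur.execute("INSERT INTO cf_firmware VALUES (1, '1.2')")

# One SELECT fetches every EAV value for a bug...
eav = dict(cur.execute(
    "SELECT field, value FROM cf_shortstring WHERE bug_id = 1"))

# ...whereas FAC touches one table per sparse field (or joins them all
# into a single query, as in the LEFT JOIN pattern above).
fac = {t: cur.execute(f"SELECT value FROM {t} WHERE bug_id = 1").fetchone()[0]
       for t in ("cf_board", "cf_firmware")}

print(eav == fac)  # True: same data, different access patterns
```

Both patterns return identical data for a single bug; the interesting 
differences only show up at query-planning and indexing time.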

>I haven't analyzed your tests in detail yet--it's been a hassle getting
>MySQL 4 to peacefully coexist with my previous MySQL 3 system.  (By the way,
>if my design ignores thirty years of database theory, as you assert, why
>does yours require a recent version of MySQL to best it?)
>  
>
First of all, I said your design contradicts thirty years of database 
design theory, not performance optimization principles.  As I said from 
the beginning, performance isn't my only consideration when choosing a 
design.  Nevertheless, I expect a design conformant with standard 
database design principles to perform better, too, and my tests show 
that it does, even on MySQL 3.x (see below).

Second of all, my design does not require a recent version of MySQL.  I 
suggested MySQL 4.x+ because 3.x contains an inefficient fulltext 
indexer, so the fulltext indexes take too long to create on tables of 
the size in the test.  But you can run my tests on MySQL 3.x without 
creating fulltext indexes, and the results are the same: FAC wins in 
most cases, and with bigger margins.

(I've attached new versions of construct-tables.pl and 
test-performance.pl which don't create/use fulltext indexes by default 
(specify "--fulltext" on the command line to turn them back on) and work 
with MySQL 3.x, along with some test results from the machine "myk".)

Third of all, MySQL 3.x test results are relatively unimportant, 
because MySQL AB no longer recommends 3.x except in special cases (they 
now recommend 4.1, two generations newer than 3.x), Bugzilla will 
require MySQL 4.x in the near future (probably before a custom fields 
implementation lands), and 4.x is already preferable for Bugzilla today 
due to its stability and fulltext performance.

>Can you describe exactly what was wrong with my test that it went from being
>3-4 times better to being nearly an order of magnitude worse?  I frankly
>find that hard to believe.
>  
>
I cannot.  I took the exact queries you ran, fixed a number of syntax 
errors in them that prevented them from running at all on my machines, 
plugged them into an automated script that runs each query six times 
(discarding the first result and averaging the rest), and flagged them 
with SQL_NO_CACHE to prevent the query cache from skewing the results 
(though this would not have affected comparisons between your tests and 
mine, since MySQL version 3.x doesn't have a query cache).
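The timing methodology is roughly the following (a simplified sketch 
using sqlite3 as a stand-in for MySQL, since SQL_NO_CACHE is a 
MySQL-specific hint; the table and query are illustrative):

```python
import sqlite3
import time

def time_query(cur, sql, runs=6):
    """Run a query `runs` times; discard the first (cold) timing and
    average the rest, per the methodology described in the text."""
    timings = []
    for _ in range(runs):
        start = time.perf_counter()
        cur.execute(sql).fetchall()
        timings.append(time.perf_counter() - start)
    warm = timings[1:]           # drop the cold first run
    return sum(warm) / len(warm)

conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute("CREATE TABLE t (n INTEGER)")
cur.executemany("INSERT INTO t VALUES (?)", [(i,) for i in range(1000)])

avg = time_query(cur, "SELECT COUNT(*) FROM t WHERE n % 7 = 0")
print(f"average warm runtime: {avg:.6f}s")
```

Discarding the first run keeps cold-cache effects (disk reads, table 
opens) out of the average, and averaging the remaining runs smooths out 
scheduler noise.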

I ran the script on two different machines (holepunch and megalon) and 
reported their results.  I subsequently ran the script on a third 
machine (myk) and got similar results.  I have since run the tests 
against MySQL version 3.x on myk (skipping the fulltext index tests) and 
got similar results (attached).

So I cannot explain the variance, but I'm confident in the reliability 
of my testing code and results; they are fully automated (and thus 
reproducible) and have been run on more machines, so I think it's your 
results that are in error.  In any case, I'm happy to provide all comers 
with the scripts for duplicating my tests, and I look forward to any 
additional results (and tests, and refinements of tests) anyone else can 
provide.

>I posted to comp.databases yesterday seeking advice regarding the merits of
>our two designs.  The subject of the thread is "Best database schema for
>object fields".  To date, the only poster to offer substantial criticism has
>stated "They both stink", but he did provide the useful information that the
>model I implemented has a name, EAV, or "Entity-Attribute-Value".
>
http://groups-beta.google.com/group/comp.databases/browse_frm/thread/aa5eca674b5a2073/a38a196ace5ef6b5#a38a196ace5ef6b5

Actually Celko said that both models are EAV, but he bases that claim on 
an error in your explanation of FAC which I've corrected in a followup.  
In reality, FAC is a standard ER modeling approach.

>Armed
>with that knowledge, I was able to Google several articles on the subject.
>A few were highly critical of EAV, but most were more balanced, listing
>advantages and disadvantages, and describing in what situations it's
>appropriate.
>
I did a similar search and found that EAV modeling has applicability in 
some specialized scientific and medical problem domains with extremely 
numerous, highly variable, and very sparse fields.  It is, however, 
rarely used otherwise, while FAC, which models dense attributes as bugs 
table columns and sparse attributes as columns in separate tables, 
employs the standard ER modeling approach widely used across relational 
database applications, including ones like Bugzilla.

>In all cases, though, the data model EAV was compared against
>was the classic all-columns-in-one-table approach; I could find no example
>of your one-table-per-field design.  (The crotchety poster in comp.databases
>described both of our designs as variants of EAV, but I can't really see how
>that's true.)  So, it's hard to accept your assertion that
>one-field-per-table is "what columns are for".  Can you refer me to any
>systems that use your design?
>  
>
Bugzilla, Bonsai, and Despot, to name only some Mozilla apps.  But of 
course now I'm talking about the ER modeling approach outlined above, 
not some one-table-per-field approach which I have never advocated.

>If I'm not mistaken, the long-term plan is to migrate standard fields to
>custom fields, so short-term discrepancies are not really relevant.
>  
>
Actually the long-term plan is to migrate some but not all standard 
fields to custom fields (some fields cannot be custom because their 
processing is too specialized, while others are common to [virtually] 
all bug tracking and so should be standard), and which fields are 
standard/custom can and will change over time (in both directions), so 
discrepancies between the two will remain relevant in both the short and 
long term.

>>     Per my tests and standard database design theory, real columns are
>>     much faster than data columns.
>>    
>>
>
>Again, I find this hard to believe.  I suspect either some flaw in your test
>program, or some unfair advantage in the limited nature of the tests.
>  
>
Perhaps.  I doubt the flaw is in my test program, since it uses the same 
function to run and time all queries, whether FAC or FAD.  But the tests 
it runs are indeed pretty limited (consisting only of your tests and a 
few of my own), so it's quite possible for them to be unfairly 
advantageous--in either direction.  I welcome any improvements to them, 
but I still think the onus should be on you to prove EAV's superiority, 
not on me to disprove it, given its nonstandard approach, and especially 
given the documentation on the web about its niche value for certain 
extreme applications.

-myk

-------------- next part --------------
A non-text attachment was scrubbed...
Name: construct-tables.pl
Type: application/x-perl
Size: 5595 bytes
Desc: not available
URL: <http://lists.bugzilla.org/pipermail/developers/attachments/20050127/7085d09a/attachment.pl>
-------------- next part --------------
A non-text attachment was scrubbed...
Name: test-performance.pl
Type: application/x-perl
Size: 17870 bytes
Desc: not available
URL: <http://lists.bugzilla.org/pipermail/developers/attachments/20050127/7085d09a/attachment-0001.pl>

