Custom fields schema
Myk Melez
myk at mozilla.org
Thu Jan 27 13:54:20 UTC 2005
Sean McAfee wrote:
>Myk:
>
>
>>*sigh*, ok, I'll do some testing next week when I get back from
>>vacation, although I really think the burden of proof should be on you,
>>given that you're the one suggesting we overturn thirty years of RDBMS
>>and relational database theory, design, and practical usage,
>>
>>
>
>OK, I take exception to this. Am I really such a maverick? Most of the
>people who have stated an opinion on the subject have expressed approval
>of my design. Can we all be as naive as you say?
>
>
I didn't say you or your supporters were naive, nor do I think you are.
I only said that your suggestion is contrary to proven general and
Bugzilla-specific database design principles, and thus the burden is on
you to prove its superiority.
And I disagree that most people have approved of your design. I count
only Joel Peshkin, Maxwell Kanat-Alexander, Shane H. W. Travis,
Christopher Hicks as having expressed a clear opinion in support of your
solution in this thread, while Gervase Markham and Vlad Dascalu both
seem to oppose it, and John Fisher, Kevin Benton, Bradley Baetz, and
Nick Barnes have not expressed a clear position either way.
But even if all of those people supported your position, the sample size
is too small for it to prove that your solution is preferable in the
face of credible contrary evidence from the field. We're a small group,
and we could all be incorrect.
>Custom fields are a completely new kind of beast. There is no precedent for
>them in Bugzilla's development history. (Not in the core distribution,
>anyway; some partial solutions are attached to bug 91037. I don't think
>that's what you're talking about, though.) There's no a priori reason to
>apply past Bugzilla development techniques to them.
>
>
Custom fields aren't very different from standard fields, and a number
of our standard fields will become custom fields or reuse custom fields
code once it's done. But even if they weren't similar, there is plenty
of precedent for them, as installations have been adding custom fields
for years, often with real columns.
>What is this "column metaphor"? Your design treats custom fields very
>differently than standard fields, applying more of a "table metaphor".
>
>
My design uses real columns to represent fields, accounting for sparsity
variance by putting dense columns into the bugs table (like the standard
"severity" field, which lives in the bugs.bug_severity column) and
sparse columns into their own tables (like the standard "duplicate of
bug #" field, which lives in the duplicates.dupe_of column). In both
cases, however, each custom field is represented by its own unique
database column, so the term "column metaphor" is apropo.
>>They're modifiable via SQL statements just as
>>easily as the data within them is. And while Bugzilla doesn't modify
>>its schema very much today, there's nothing inherently more dangerous
>>about it doing so.
>>
>>
>
>But much less elegant. To paraphrase Einstein, I think the schema ought to
>be as simple as possible, but no simpler. Transmeta's Bugzilla installation
>has 187 custom fields. A schema with in excess of 250 tables is not simple.
>Imagine trying to manage a schema of that size with a visual tool!
>
>
Elegance, in this case, truly seems to be in the eye of the beholder.
To my mind, FAC is more elegant because it uses the database system as
it was designed to be used, and that makes it simpler (although not too
simple to be useful). And while it might be difficult to manage a
schema with 250 tables using certain visual tools (f.e. an ER
diagrammer), it wouldn't be with others (f.e. a list representation of
tables). Plus, at least it's possible to visualize FAC with standard
visual tools, while it's not possible for FAD. And visual overload in
ER diagrammers can be overcome by limiting the scope of tables visualized.
>Later, [Myk] responding to Christopher Hicks:
>
>
>
>>That doesn't mean we should store all meta-data as data. We should use
>>the right tool for the job, as we have already done with standard
>>fields, for which we rightly use columns.
>>
>>
>
>Yes, that is right, because all bugs share the same standard fields. That
>condition is violated by custom fields.
>
Actually some standard fields are used only by certain products on some
installations, and some installations never use certain standard
fields. That hasn't prevented those fields from serving Bugzilla well,
so it is no reason to throw that storage model out (although it's worth
tweaking it to store sparse fields in separate tables and converting
frequently unused fields to custom fields).
>And, again, your proposal
>implements custom fields very differently from the way standard fields are
>implemented, anyway.
>
>
Actually my proposal implements custom fields as columns within the bugs
table or within their own table, just as Bugzilla does today with
standard fields.
>Querying against N custom fields results in joins against N tables in your
>scheme. In mine, it results in joins against T(N), the number of distinct
>datatypes among those N fields. The total number of joins among those
>T(N) tables is N including repeated joins, but I suspect that it is still
>cheaper to access fewer tables.
>
>
My tests, which include the queries you designed to showcase the
performance of your approach, indicate otherwise.
>Consider also simply retrieving custom field data. For a bug with twenty
>custom short string fields, your design would require SELECTs against
>twenty different tables; mine requires only one.
>
>
This isn't strictly accurate, since under my proposal only sparse fields
will live in separate tables. But even in a worst-case scenario,
retrieving data for a single bug is insignificantly expensive under both
designs.
>I haven't analyzed your tests in detail yet--it's been a hassle getting
>MySQL 4 to peacefully coexist with my previous MySQL 3 system. (By the way,
>if my design ignores thirty years of database theory, as you assert, why
>does yours require a recent version of MySQL to best it?)
>
>
First of all, I said your design contradicts thirty years of database
design theory, not performance optimization principles. As I said from
the beginning, performance isn't my only consideration when choosing a
design. Nevertheless, I expect a design conformant with standard
database design principles to perform better, too, and my tests show
that it does, even on MySQL 3.x (see below).
Second of all, my design does not require a recent version of MySQL. I
suggested MySQL 4.x+ because 3.x contains an inefficient fulltext
indexer, so the fulltext indexes take too long to create on tables of
the size in the test. But you can run my tests on MySQL 3.x without
creating fulltext indexes, and the results are the same: FAC wins in
most cases, and with bigger margins.
(I've attached new versions of construct-tables.pl and
test-performance.pl which don't create/use fulltext indexes by default
(specify "--fulltext" on the command line to turn them back on) and work
with MySQL 3.x, along with some test results from the machine "myk".)
Third of all, MySQL 3.x search results are relatively unimportant,
because MySQL AB no longer recommends 3.x except in special cases (they
now recommend 4.1, two generations newer than 3.x), Bugzilla will
require MySQL 4.x in the near future (probably before a custom fields
implementation lands), and 4.x is already preferable for Bugzilla today
due to stability and fulltext performance.
>Can you describe exactly what was wrong with my test that it went from being
>3-4 times better to being nearly an order of magnitude worse? I frankly
>find that hard to believe.
>
>
I cannot. I took the exact queries you ran, fixed a number of syntax
errors in them that prevented them from running at all on my machines,
plugged them into an automated script that runs each query six times
(discarding the first result and averaging the rest), and flagged them
SQL_NO_CACHE to prevent the cache from skewing the results (but this
would not have affected comparisons between your tests and mine, since
MySQL version 3.x doesn't have a query cache).
I ran the script on two different machines (holepunch and megalon) and
reported their results. I subsequently ran the script on a third
machine (myk) and got similar results. I have since run the tests
against MySQL version 3.x on myk (skipping the fulltext index tests) and
got similar results (attached).
So I cannot explain the variance, but I'm confident in the reliability
of my testing code and results, and they're more apparently automated
and have been tested on more machines, so I think it's your results
which are errant. In any case, I'm happy to provide all comers with the
scripts for duplicating my tests, and I look forward to any additional
results (and tests, and refinements of tests) anyone else can provide.
>I posted to comp.databases yesterday seeking advice regarding the merits of
>our two designs. The subject of the thread is "Best database schema for
>object fields". To date, the only poster to offer substantial criticism has
>stated "They both stink", but he did provide the useful information that the
>model I implemented has a name, EAV, or "Entity-Attribute-Value".
>
http://groups-beta.google.com/group/comp.databases/browse_frm/thread/aa5eca674b5a2073/a38a196ace5ef6b5#a38a196ace5ef6b5
Actually Celko said that both models are EAV, but he bases that claim on
an error in your explanation of FAC which I've corrected in a followup.
In reality, FAC is a standard ER modeling approach.
>Armed
>with that knowledge, I was able to Google several articles on the subject.
>A few were highly critical of EAV, but most were more balanced, listing
>advantages and disadvantages, and describing in what situations it's
>appropriate.
>
I did a similar search and found that EAV modeling has applicability in
some specialized scientific and medical problem domains which experience
extremely numerous, highly variable, and very sparse fields. It is,
however, rarely used otherwise, while FAC, which models dense attributes
as bug table columns and sparse attributes as columns in separate
tables, employs the standard ER modeling approach which is widely used
by the majority of relational databases, especially those like Bugzilla.
>In all cases, though, the data model EAV was compared against
>was the classic all-columns-in-one-table approach; I could find no example
>of your one-table-per-field design. (The crotchety poster in comp.databases
>described both of our designs as variants of EAV, but I can't really see how
>that's true.) So, it's hard to accept your assertion that
>one-field-per-table is "what columns are for". Can you refer me to any
>systems that use your design?
>
>
Bugzilla, Bonsai, and Despot, to name only some Mozilla apps. But of
course now I'm talking about the ER modeling approach outlined above,
not some one-table-per-field approach which I have never advocated.
>If I'm not mistaken, the long-term plan is to migrate standard fields to
>custom fields, so short-term discrepancies are not really relevant.
>
>
Actually the long-term plan is to migrate some but not all standard
fields to custom fields (some fields cannot be custom because their
processing is too specialized, while others are common to [virtually]
all bug tracking and so should be standard), and which fields are
standard/custom can and will change over time (in both directions), so
discrepancies between the two will remain relevant in all terms.
>> Per my tests and standard database design theory, real columns are
>> much faster than data columns.
>>
>>
>
>Again, I find this hard to believe. I suspect either some flaw in your test
>program, or some unfair advantage in the limited nature of the tests.
>
>
Perhaps. I doubt the flaw is in my test program, since it uses the same
function to run and time all queries, whether FAC or FAD. But the tests
it runs are indeed pretty limited (consisting only of your tests and a
few of my own), so it's quite possible for them to be unfairly
advantageous--in either direction. I welcome any improvement in them,
but I still think the onus should be on you to prove EAV superiority,
not on me to disprove it, given its nonstandard approach, and especially
given documentation on the web about its niche value for certain extreme
applications.
-myk
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.bugzilla.org/pipermail/developers/attachments/20050127/7085d09a/attachment.html>
-------------- next part --------------
A non-text attachment was scrubbed...
Name: construct-tables.pl
Type: application/x-perl
Size: 5595 bytes
Desc: not available
URL: <http://lists.bugzilla.org/pipermail/developers/attachments/20050127/7085d09a/attachment.pl>
-------------- next part --------------
A non-text attachment was scrubbed...
Name: test-performance.pl
Type: application/x-perl
Size: 17870 bytes
Desc: not available
URL: <http://lists.bugzilla.org/pipermail/developers/attachments/20050127/7085d09a/attachment-0001.pl>
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.bugzilla.org/pipermail/developers/attachments/20050127/7085d09a/attachment-0001.html>
More information about the developers
mailing list