<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN">
<html>
<head>
<meta content="text/html;charset=ISO-8859-1" http-equiv="Content-Type">
<title></title>
</head>
<body bgcolor="#ffffff" text="#000000">
Sean McAfee wrote:
<blockquote cite="mid20050126183723.05ECDBC77@mail.schwag.org"
type="cite">
<pre wrap="">Myk:
</pre>
<blockquote type="cite">
<pre wrap="">*sigh*, ok, I'll do some testing next week when I get back from
vacation, although I really think the burden of proof should be on you,
given that you're the one suggesting we overturn thirty years of RDBMS
and relational database theory, design, and practical usage,
</pre>
</blockquote>
<pre wrap=""><!---->
OK, I take exception to this. Am I really such a maverick? Most of the
people who have stated an opinion on the subject have expressed approval
of my design. Can we all be as naive as you say?
</pre>
</blockquote>
I didn't say you or your supporters were naive, nor do I think you
are. I only said that your suggestion is contrary to proven general
and Bugzilla-specific database design principles, and thus the burden
is on you to prove its superiority.<br>
<br>
And I disagree that most people have approved of your design. I count
only Joel Peshkin, Maxwell Kanat-Alexander, Shane H. W. Travis, and
Christopher Hicks as having expressed a clear opinion in support of
your solution in this thread, while Gervase Markham and Vlad Dascalu
both seem to oppose it, and John Fisher, Kevin Benton, Bradley Baetz,
and Nick Barnes have not expressed a clear position either way.<br>
<br>
But even if all of those people supported your position, the sample
size is too small to prove that your solution is preferable in
the face of credible contrary evidence from the field. We're a small
group, and we could all be incorrect.<br>
<br>
<blockquote cite="mid20050126183723.05ECDBC77@mail.schwag.org"
type="cite">
<pre wrap="">Custom fields are a completely new kind of beast. There is no precedent for
them in Bugzilla's development history. (Not in the core distribution,
anyway; some partial solutions are attached to bug 91037. I don't think
that's what you're talking about, though.) There's no a priori reason to
apply past Bugzilla development techniques to them.
</pre>
</blockquote>
Custom fields aren't very different from standard fields, and a number
of our standard fields will become custom fields or reuse the custom
fields code once it exists. But even if they weren't similar, there is plenty
of precedent for them, as installations have been adding custom fields
for years, often with real columns.<br>
<br>
<blockquote cite="mid20050126183723.05ECDBC77@mail.schwag.org"
type="cite">
<pre wrap="">What is this "column metaphor"? Your design treats custom fields very
differently than standard fields, applying more of a "table metaphor".
</pre>
</blockquote>
My design uses real columns to represent fields, accounting for
variance in sparsity by putting dense fields into the bugs table
(like the standard "severity" field, which lives in the
bugs.bug_severity column) and sparse fields into their own tables
(like the standard "duplicate of bug #" field, which lives in the
duplicates.dupe_of column). In both cases, however, each custom
field is represented by its own database column, so the term "column
metaphor" is apropos.<br>
<br>
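To make that concrete, here's a minimal SQL sketch of the two FAC
storage strategies; the cf_* names are hypothetical fields invented
for illustration:<br>
<pre>
-- Dense custom field: a real column on the bugs table itself,
-- just like the standard bugs.bug_severity column.
ALTER TABLE bugs ADD COLUMN cf_target_hw VARCHAR(64);

-- Sparse custom field: its own table keyed on bug_id, like the
-- standard duplicates.dupe_of column; rows exist only for bugs
-- that actually set the field.
CREATE TABLE cf_regression_range (
    bug_id MEDIUMINT    NOT NULL PRIMARY KEY,
    value  VARCHAR(255) NOT NULL
);
</pre>
<br>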
<blockquote cite="mid20050126183723.05ECDBC77@mail.schwag.org"
type="cite">
<blockquote type="cite">
<pre wrap="">They're modifiable via SQL statements just as
easily as the data within them is. And while Bugzilla doesn't modify
its schema very much today, there's nothing inherently more dangerous
about it doing so.
</pre>
</blockquote>
<pre wrap=""><!---->
But much less elegant. To paraphrase Einstein, I think the schema ought to
be as simple as possible, but no simpler. Transmeta's Bugzilla installation
has 187 custom fields. A schema with in excess of 250 tables is not simple.
Imagine trying to manage a schema of that size with a visual tool!
</pre>
</blockquote>
Elegance, in this case, truly seems to be in the eye of the beholder.
To my mind, FAC is more elegant because it uses the database system as
it was designed to be used, and that makes it simpler (although not too
simple to be useful). And while it might be difficult to manage a
schema with 250 tables using certain visual tools (e.g. an ER
diagrammer), it wouldn't be with others (e.g. a list representation of
tables). Plus, at least it's possible to visualize FAC with standard
visual tools, while it's not possible for FAD. And visual overload in
ER diagrammers can be overcome by limiting the scope of tables
visualized.<br>
<br>
<blockquote cite="mid20050126183723.05ECDBC77@mail.schwag.org"
type="cite">
<pre wrap="">Later, [Myk] responding to Christopher Hicks:
</pre>
<blockquote type="cite">
<pre wrap="">That doesn't mean we should store all meta-data as data. We should use
the right tool for the job, as we have already done with standard
fields, for which we rightly use columns.
</pre>
</blockquote>
<pre wrap=""><!---->
Yes, that is right, because all bugs share the same standard fields. That
condition is violated by custom fields.</pre>
</blockquote>
Actually, some standard fields are used only by certain products on some
installations, and some installations never use certain standard
fields. That hasn't prevented those fields from serving Bugzilla well,
so it is no reason to throw that storage model out (although it's worth
tweaking it to store sparse fields in separate tables and converting
frequently unused fields to custom fields).<br>
<br>
<blockquote cite="mid20050126183723.05ECDBC77@mail.schwag.org"
type="cite">
<pre wrap="">And, again, your proposal
implements custom fields very differently from the way standard fields are
implemented, anyway.
</pre>
</blockquote>
Actually, my proposal implements custom fields as columns within the
bugs table or within their own tables, just as Bugzilla does today with
standard fields.<br>
<br>
<blockquote cite="mid20050126183723.05ECDBC77@mail.schwag.org"
type="cite">
<pre wrap="">Querying against N custom fields results in joins against N tables in your
scheme. In mine, it results in joins against T(N), the number of distinct
datatypes among those N fields. The total number of joins among those
T(N) tables is N including repeated joins, but I suspect that it is still
cheaper to access fewer tables.
</pre>
</blockquote>
My tests, which include the queries you designed to showcase the
performance of your approach, indicate otherwise.<br>
<br>
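To spell out the join arithmetic (hypothetical table and field names
again), here is what querying two custom string fields looks like
under each design:<br>
<pre>
-- FAC: each of the N fields queried contributes one join against
-- its own table.
SELECT b.bug_id
  FROM bugs b
  JOIN cf_regression_range r ON r.bug_id = b.bug_id
  JOIN cf_qa_whiteboard    q ON q.bug_id = b.bug_id
 WHERE r.value = '1.7'
   AND q.value = 'verified';

-- FAD/EAV: both fields live in one per-datatype table, but that
-- table still gets joined once per field, so the join count is N
-- either way.
SELECT b.bug_id
  FROM bugs b
  JOIN custom_string_values v1
       ON v1.bug_id = b.bug_id AND v1.field_id = 1
  JOIN custom_string_values v2
       ON v2.bug_id = b.bug_id AND v2.field_id = 2
 WHERE v1.value = '1.7'
   AND v2.value = 'verified';
</pre>
<br>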
<blockquote cite="mid20050126183723.05ECDBC77@mail.schwag.org"
type="cite">
<pre wrap="">Consider also simply retrieving custom field data. For a bug with twenty
custom short string fields, your design would require SELECTs against
twenty different tables; mine requires only one.
</pre>
</blockquote>
This isn't strictly accurate, since under my proposal only sparse
fields will live in separate tables. But even in a worst-case
scenario, the cost of retrieving the data for a single bug is
negligible under both designs.<br>
<br>
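For the curious, a sketch of single-bug retrieval under FAC (same
hypothetical field tables): dense fields arrive with the bugs row
itself, and sparse ones can be fetched in the same statement:<br>
<pre>
SELECT b.*,
       r.value AS cf_regression_range,
       q.value AS cf_qa_whiteboard
  FROM bugs b
  LEFT JOIN cf_regression_range r ON r.bug_id = b.bug_id
  LEFT JOIN cf_qa_whiteboard    q ON q.bug_id = b.bug_id
 WHERE b.bug_id = 12345;
</pre>
<br>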
<blockquote cite="mid20050126183723.05ECDBC77@mail.schwag.org"
type="cite">
<pre wrap="">I haven't analyzed your tests in detail yet--it's been a hassle getting
MySQL 4 to peacefully coexist with my previous MySQL 3 system. (By the way,
if my design ignores thirty years of database theory, as you assert, why
does yours require a recent version of MySQL to best it?)
</pre>
</blockquote>
First of all, I said your design contradicts thirty years of database
design theory, not performance optimization principles. As I said from
the beginning, performance isn't my only consideration when choosing a
design. Nevertheless, I expect a design conformant with standard
database design principles to perform better, too, and my tests show
that it does, even on MySQL 3.x (see below).<br>
<br>
Second of all, my design does not require a recent version of MySQL. I
suggested MySQL 4.x+ because 3.x contains an inefficient fulltext
indexer, so fulltext indexes take too long to create on tables as
large as those in the tests. But you can run my tests on MySQL 3.x without
creating fulltext indexes, and the results are the same: FAC wins in
most cases, and with bigger margins.<br>
<br>
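For reference, this is the kind of index the scripts skip on 3.x
(using the real bugs.short_desc column as a stand-in for whatever
column gets indexed):<br>
<pre>
-- Building a MySQL fulltext index; on 3.x this step makes table
-- construction prohibitively slow at the test's table sizes.
ALTER TABLE bugs ADD FULLTEXT (short_desc);

-- The kind of query such an index serves:
SELECT bug_id
  FROM bugs
 WHERE MATCH(short_desc) AGAINST('crash');
</pre>
<br>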
(I've attached new versions of construct-tables.pl and
test-performance.pl which don't create/use fulltext indexes by default
(specify "--fulltext" on the command line to turn them back on) and
work with MySQL 3.x, along with some test results from the machine
"myk".)<br>
<br>
Third of all, MySQL 3.x search results are relatively unimportant,
because MySQL AB no longer recommends 3.x except in special cases (they
now recommend 4.1, two generations newer than 3.x), Bugzilla will
require MySQL 4.x in the near future (probably before a custom fields
implementation lands), and 4.x is already preferable for Bugzilla today
due to stability and fulltext performance.<br>
<br>
<blockquote cite="mid20050126183723.05ECDBC77@mail.schwag.org"
type="cite">
<pre wrap="">Can you describe exactly what was wrong with my test that it went from being
3-4 times better to being nearly an order of magnitude worse? I frankly
find that hard to believe.
</pre>
</blockquote>
I cannot. I took the exact queries you ran, fixed a number of syntax
errors in them that prevented them from running at all on my machines,
plugged them into an automated script that runs each query six times
(discarding the first result and averaging the rest), and flagged them
with SQL_NO_CACHE to prevent the query cache from skewing the results (but this
would not have affected comparisons between your tests and mine, since
MySQL version 3.x doesn't have a query cache).<br>
<br>
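For anyone reproducing the tests, the flag goes right into the
statement; a sketch against one of the hypothetical field tables
from above:<br>
<pre>
-- SQL_NO_CACHE tells MySQL 4.x not to store or serve this result
-- via the query cache, so each timed run measures real execution.
SELECT SQL_NO_CACHE b.bug_id
  FROM bugs b
  JOIN cf_regression_range r ON r.bug_id = b.bug_id
 WHERE r.value = '1.7';
</pre>
<br>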
I ran the script on two different machines (holepunch and megalon) and
reported their results. I subsequently ran the script on a third
machine (myk) and got similar results. I have since run the tests
against MySQL version 3.x on myk (skipping the fulltext index tests)
and got similar results (attached).<br>
<br>
So I cannot explain the variance, but I'm confident in the reliability
of my testing code and results; they're transparently automated and
have been verified on more machines, so I think it's your results
that are errant. In any case, I'm happy to provide all comers with
the scripts for duplicating my tests, and I look forward to any
additional results (and tests, and refinements of tests) anyone else
can provide.<br>
<br>
<blockquote cite="mid20050126183723.05ECDBC77@mail.schwag.org"
type="cite">
<pre wrap="">I posted to comp.databases yesterday seeking advice regarding the merits of
our two designs. The subject of the thread is "Best database schema for
object fields". To date, the only poster to offer substantial criticism has
stated "They both stink", but he did provide the useful information that the
model I implemented has a name, EAV, or "Entity-Attribute-Value".</pre>
</blockquote>
<a class="moz-txt-link-freetext" href="http://groups-beta.google.com/group/comp.databases/browse_frm/thread/aa5eca674b5a2073/a38a196ace5ef6b5#a38a196ace5ef6b5">http://groups-beta.google.com/group/comp.databases/browse_frm/thread/aa5eca674b5a2073/a38a196ace5ef6b5#a38a196ace5ef6b5</a><br>
<br>
Actually, Celko said that both models are EAV, but he bases that claim
on an error in your explanation of FAC which I've corrected in a
followup. In reality, FAC is a standard ER modeling approach.<br>
<br>
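For readers who haven't met the term: EAV makes the attribute itself
data rather than schema. A minimal sketch with hypothetical names:<br>
<pre>
-- One generic table per datatype holds
-- (entity, attribute, value) triples for every custom field.
CREATE TABLE custom_string_values (
    bug_id   MEDIUMINT    NOT NULL,
    field_id SMALLINT     NOT NULL,  -- which custom field
    value    VARCHAR(255) NOT NULL,
    PRIMARY KEY (bug_id, field_id)
);
</pre>
<br>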
<blockquote cite="mid20050126183723.05ECDBC77@mail.schwag.org"
type="cite">
<pre wrap="">Armed
with that knowledge, I was able to Google several articles on the subject.
A few were highly critical of EAV, but most were more balanced, listing
advantages and disadvantages, and describing in what situations it's
appropriate.</pre>
</blockquote>
I did a similar search and found that EAV modeling has applicability
in some specialized scientific and medical problem domains that
involve extremely numerous, highly variable, and very sparse fields.
It is rarely used otherwise, while FAC, which models dense attributes
as bugs table columns and sparse attributes as columns in separate
tables, is the standard ER modeling approach used by the vast
majority of relational database applications, including ones like
Bugzilla.<br>
<br>
<blockquote cite="mid20050126183723.05ECDBC77@mail.schwag.org"
type="cite">
<pre wrap="">In all cases, though, the data model EAV was compared against
was the classic all-columns-in-one-table approach; I could find no example
of your one-table-per-field design. (The crotchety poster in comp.databases
described both of our designs as variants of EAV, but I can't really see how
that's true.) So, it's hard to accept your assertion that
one-field-per-table is "what columns are for". Can you refer me to any
systems that use your design?
</pre>
</blockquote>
Bugzilla, Bonsai, and Despot, to name only some Mozilla apps. But of
course now I'm talking about the ER modeling approach outlined above,
not some one-table-per-field approach which I have never advocated.<br>
<br>
<blockquote cite="mid20050126183723.05ECDBC77@mail.schwag.org"
type="cite">
<pre wrap="">If I'm not mistaken, the long-term plan is to migrate standard fields to
custom fields, so short-term discrepancies are not really relevant.
</pre>
</blockquote>
Actually, the long-term plan is to migrate some but not all standard
fields to custom fields (some fields cannot be custom because their
processing is too specialized, while others are common to [virtually]
all bug tracking and so should be standard), and which fields are
standard or custom can and will change over time (in both
directions), so discrepancies between the two will remain relevant in
both the short and the long term.<br>
<br>
<blockquote cite="mid20050126183723.05ECDBC77@mail.schwag.org"
type="cite">
<blockquote type="cite">
<pre wrap=""> Per my tests and standard database design theory, real columns are
much faster than data columns.
</pre>
</blockquote>
<pre wrap=""><!---->
Again, I find this hard to believe. I suspect either some flaw in your test
program, or some unfair advantage in the limited nature of the tests.
</pre>
</blockquote>
Perhaps. I doubt the flaw is in my test program, since it uses the
same function to run and time all queries, whether FAC or FAD. But the
tests it runs are indeed pretty limited (consisting only of your tests
and a few of my own), so it's quite possible for them to be unfairly
advantageous--in either direction. I welcome any improvement in them,
but I still think the onus should be on you to prove EAV superiority,
not on me to disprove it, given its nonstandard approach, and
especially given the documentation on the web describing its niche
value for certain extreme applications.<br>
<br>
-myk<br>
<br>
</body>
</html>