Custom fields schema

Fri Jan 28 19:56:37 UTC 2005

Myk Melez <myk at mozilla.org> wrote:
>And I disagree that most people have approved of your design.  I count 
>only Joel Peshkin, Maxwell Kanat-Alexander, Shane H. W. Travis, 
>Christopher Hicks as having expressed a clear opinion in support of your 
>solution in this thread, while Gervase Markham and Vlad Dascalu both 
>seem to oppose it, and John Fisher, Kevin Benton, Bradley Baetz, and 
>Nick Barnes have not expressed a clear position either way.

As I said, "most of the people who have stated an opinion".

>Custom fields aren't very different from standard fields, and a number 
>of our standard fields will become custom fields or reuse custom fields 
>code once it's done.  But even if they weren't similar, there is plenty 
>of precedent for them, as installations have been adding custom fields 
>for years, often with real columns.

Sure, but that's just because it's the quickest way to hack on additional
fields.  No one who only wanted a few more fields would seriously consider
going to the lengths I have to design and implement a new field
architecture.  This precedent is in no way an argument for FAC over FAD.

>My design uses real columns to represent fields, accounting for sparsity 
>variance by putting dense columns into the bugs table (like the standard 
>"severity" field, which lives in the bugs.bug_severity column) and 
>sparse columns into their own tables (like the standard "duplicate of 
>bug #"  field, which  lives in the duplicates.dupe_of column).

First of all, for installations at all similar to Transmeta's, sparse fields
are the norm, dense fields the exception:

mysql> create temporary table bugcount
    -> select product_id, count(*) as num_bugs
    -> from bugs
    -> group by product_id;
Query OK, 9 rows affected (0.01 sec)
Records: 9  Duplicates: 0  Warnings: 0

mysql> create temporary table fieldcount
    -> select product_id, count(*) as num_fields
    -> from cf_membership
    -> group by product_id;
Query OK, 9 rows affected (0.00 sec)
Records: 9  Duplicates: 0  Warnings: 0

mysql> select a.product_id, a.num_bugs, b.num_fields
    -> from bugcount as a, fieldcount as b
    -> where a.product_id = b.product_id;
+------------+----------+------------+
| product_id | num_bugs | num_fields |
+------------+----------+------------+
|          2 |     1892 |         60 |
|          4 |      129 |          2 |
|          5 |        3 |          3 |
|         12 |      198 |         42 |
|         13 |      578 |         28 |
|         14 |      220 |         14 |
|         15 |       52 |         28 |
|         16 |       34 |          6 |
|         17 |        6 |          2 |
+------------+----------+------------+
9 rows in set (0.00 sec)

There is minimal sharing of fields between products:

mysql> select field_id, count(*) as num_products
    -> from cf_membership
    -> group by field_id
    -> having count(*) >= 2;
+----------+--------------+
| field_id | num_products |
+----------+--------------+
|       64 |            2 |
|      310 |            2 |
+----------+--------------+
2 rows in set (0.00 sec)

A few more fields could conceivably have been shared, but it was more
convenient to import all products independently from our old Teamshare
database.  We use *none* of the standard fields version, rep_platform,
priority, resolution, severity, file_loc, or op_sys.

So, our most heavily-populated product has only 61% of the bugs and 32% of
the fields.  How should an administrator categorize this field or that as
"sparse" or "dense"?  What if he didn't know in advance the relative
populations of the products?

Furthermore, the classification of a field as sparse or dense can change
over time.  Consider the case of an installation that starts off with three
products, all of which share all custom fields.  Per your design, the
administrator makes them "dense" fields, stored as columns in BUGS.  Over
time seven more products are added, none of which require any of the
original fields.  If all of the products have comparable numbers of bugs and
fields, the original fields' density shrinks from 100% to 30%.  What then?

My design handles all such scenarios equally well, and requires neither
prescience nor hundreds of judgment calls on the part of the administrator.

>>And, again, your proposal
>>implements custom fields very differently from the way standard fields are
>>implemented, anyway.

>Actually my proposal implements custom fields as columns within the bugs 
>table or within their own table, just as Bugzilla does today with 
>standard fields.

What current standard fields live in their own tables?  I just wrote a quick
script to list all two-column tables for which the name of one column is
"bug_id", and found only BUG_GROUP_MAP, CC, and KEYWORDS.  All of these are
multi-valued "fields" that couldn't be easily made to live within BUGS
anyway.

>>Can you refer me to any
>>systems that use your design?

>Bugzilla, Bonsai, and Despot, to name only some Mozilla apps.  But of 
>course now I'm talking about the ER modeling approach outlined above, 
>not some one-table-per-field approach which I have never advocated.

You have, as a possibility:

]Of course, we don't have to put all of these columns into the bugs
]table.  We can do that, but we can also put each one into its own table
](so that bugs->custom field is a 1->0|1 relationship, and the database
]uses no more storage for the custom field than necessary), all of them
]into one other table, or some combination of these three, depending on
]what makes the most sense.

Also, your test program employed this type of schema, so I hope you can
understand my mistake.

--Sean