Custom fields schema

Sun Dec 19 17:02:32 UTC 2004

Sean McAfee wrote:

>Myk Melez <myk at mozilla.org> wrote:
>
>  
>
>>MySQL also provides us tools for querying and modifying the structure of 
>>tables in the database so that we can build tools to manage the columns, 
>>and its search and other algorithms are designed with the assumption 
>>that databases are built using its architecture for structuring and 
>>relating data.
>>    
>>
>
>What are these tools?
>  
>

They're the MySQL data definition and utility statements.

>  
>
>>So I'm loathe to replicate this architecture.  I think we're better off 
>>making custom fields be real columns in the database, just like standard 
>>fields, and creating independent value list and mapping tables for 
>>select and multi-select custom fields, just as we do for standard fields 
>>of those types.
>>    
>>
>
>The standard fields of those types are of the MySQL-specific ENUM type,
>aren't they?  I had thought that greater platform-independence was an
>ongoing goal.  Or has this changed in recent releases?  My information may
>be out of date.
>  
>
Nope, you're not out of date, we are indeed replacing enums with tables 
of values.  I'm referring to fields like product, component, and 
target_milestone as well as the fields currently using enums which are 
being converted to work like the aforementioned fields.

>>Besides making custom fields work like standard fields, simplifying 
>>their integration with existing code and the migration of fields from 
>>standard to custom (and vice versa), this approach minimizes the 
>>side-effect risk associated with building independent field management 
>>on top of MySQL's existing infrastructure and the development cost of 
>>maintaining two ways of accessing and manipulating fields.
>>    
>>
>
>There are significant efficiency concerns with a fields-as-columns
>approach.
>
>Suppose you have a largeish Bugzilla installation with, oh, say 100,000
>bugs and 200 custom fields across twenty products.  Suppose further that you
>want to run a simple query against one particular field: find all bugs where
>the short string field "foo_field" contains the substring "meta".  In my
>scheme, you run these two queries:
>
>SELECT field_id
>FROM custom_fields
>WHERE field_name = 'foo_field';
>
>Then, using the returned field_id (say, 99):
>
>SELECT bug_id
>FROM cf_shortstring
>WHERE field_id = 99
>  AND INSTR(value, 'meta');
>
>The first query is very fast, thanks to the index on FIELD_NAME, and would
>be fast anyway with only 200 rows to check.  The second one quickly targets
>only those bugs in products that have foo_field as a member, thanks to the
>index on FIELD_ID.  (I suppose the queries could be combined into one.)
>
>If custom fields were "real" columns, you'd do something like this:
>
>SELECT bug_id
>FROM bugs
>WHERE INSTR(cf_foo_field, 'meta');
>
>This query has to scan *every bug in the installation*; there's no way to
>restrict it only to bugs that actually have a foo_field.  BUGS is a huge,
>sparse table, with 90% of its columns NULL in any given row, on average.
>Say goodbye to casual "SELECT * FROM BUGS WHERE..." queries at the mysql
>client command line, too.
>  
>
Notwithstanding that we should use fulltext indexes for these kinds of 
searches, this works pretty well for an existing standard field on the 
production b.m.o server:
------------------------------------------------------------------------
mysql> select count(*) from bugs;
+----------+
| count(*) |
+----------+
|   274656 |
+----------+
1 row in set (0.00 sec)

mysql> select count(*) from bugs where status_whiteboard != '';
+----------+
| count(*) |
+----------+
|    32931 |
+----------+
1 row in set (0.75 sec)

mysql> select bug_id from bugs where INSTR(status_whiteboard,'meta');
+--------+
| bug_id |
+--------+
|   4553 |
|   9410 |
|  12309 |
|  16029 |
|  23583 |
|  35154 |
|  38761 |
|  45349 |
|  51279 |
|  52577 |
|  56812 |
|  63871 |
|  63872 |
|  69533 |
|  80613 |
|  89424 |
|  90669 |
|  92763 |
|  93969 |
| 127034 |
| 127147 |
| 139820 |
| 150783 |
| 161891 |
| 190406 |
| 190545 |
+--------+
27 rows in set (0.79 sec)
------------------------------------------------------------------------
(one bug ID removed from the result set above because it's a 
confidential bug)

>Speaking of NULL, you'd have to give up the capability to have NULL custom
>field values, since you couldn't do this:
>
>SELECT bug_id
>FROM bugs
>WHERE cf_foo_field IS NULL;
>  
>
Why not?

Of course, we don't have to put all of these columns into the bugs 
table.  We can do that, but we can also put each one into its own table 
(so that bugs->custom field is a 1->0|1 relationship, and the database 
uses no more storage for the custom field than necessary), all of them 
into one other table, or some combination of these three, depending on 
what makes the most sense.

>My company's prior bug tracking system, TeamTrack, stored field data in huge
>sparse tables, and its speed was/is atrocious.  I didn't have to think very
>hard in the early days to know I didn't want to go the same route.
>  
>
Bugzilla is also somewhat sparse, but it's not a major performance 
problem, as far as I know.  In any case, relationship DB models are 
designed to deal with sparseness (in fact, that's part of the reason 
they were developed) by moving sparse data to a separate table, as we do 
already for some standard fields.

We can do that for sparse custom fields (and even sparse standard ones 
still in the bugs table) with the fields-as-columns approach while 
keeping non-sparse fields in the bugs table, improving performance for 
those fields (because we don't have to do joins or multiple queries), 
unlike with the fields-as-data approach, which always requires joins or 
multiple queries.

>As an example of another, less important efficiency issue, suppose you want
>get all of a single bug's custom field data.  You'd either do this:
>
>SELECT *
>FROM bugs
>WHERE bug_id = 99;
>
>This would fetch all column data, most of which would be NULL and irrelevant
>in the hypothetical situation described above.  The alternative would be to
>construct a list of known-to-be-relevant column names, and do this:
>
>SELECT field_1, field_2, field_3, field_4, field_5, field_6,
>       field_7, field_8, field_9, field_10, field_11, field_12,
>       field_13, field_14, field_15, field_16, field_17, field_18,
>       field_19, field_20
>FROM bugs
>WHERE bug_id = 99;
>
>Yuck!  Both solutions are inelegant, which experience has taught me is
>usually an indication of bad design.
>  
>
But we'd have a list of custom fields (probably generated from a simple 
query or a sub-query), so the latter solution would be more like:

SELECT $list_of_fields
FROM bugs
WHERE bug_id = 99;

... which is both simple and elegant.

>Fields-as-columns may make life easier for the programmer, but it's the
>needs of the users that should be coming first.
>  
>
Agreed, and that's why I'm suggesting it, not to make our lives easier, 
especially considering I think fields-as-columns is harder to develop.  
Users would be better served by an app living on top of a database which 
uses the optimum architecture for custom fields.  For my money, that's 
the relational model developed over the last several decades for the 
purpose, and which we already use for standard fields.

-myk

-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.bugzilla.org/pipermail/developers/attachments/20041219/0258f85d/attachment.html>