Custom fields schema

Thu Dec 16 02:50:41 UTC 2004

Myk Melez <myk at mozilla.org> wrote:
>Sean McAfee wrote:
>>To kick things off (I hope), here are some tables from my custom fields
>>schema, shorn of some of the niftier additions that have accrued over time.

>The basic issue with this approach is that it replicates infrastructure 
>that MySQL and other databases already provide for us and that we use 
>for non-custom fields.

>MySQL also provides us tools for querying and modifying the structure of 
>tables in the database so that we can build tools to manage the columns, 
>and its search and other algorithms are designed with the assumption 
>that databases are built using its architecture for structuring and 
>relating data.

What are these tools?

>So I'm loathe to replicate this architecture.  I think we're better off 
>making custom fields be real columns in the database, just like standard 
>fields, and creating independent value list and mapping tables for 
>select and multi-select custom fields, just as we do for standard fields 
>of those types.

The standard fields of those types are of the MySQL-specific ENUM type,
aren't they?  I had thought that greater platform-independence was an
ongoing goal.  Or has this changed in recent releases?  My information may
be out of date.

>Besides making custom fields work like standard fields, simplifying 
>their integration with existing code and the migration of fields from 
>standard to custom (and vice versa), this approach minimizes the 
>side-effect risk associated with building independent field management 
>on top of MySQL's existing infrastructure and the development cost of 
>maintaining two ways of accessing and manipulating fields.

There are significant efficiency concerns with a fields-as-columns
approach.

Suppose you have a largeish Bugzilla installation with, oh, say 100,000
bugs and 200 custom fields across twenty products.  Suppose further that you
want to run a simple query against one particular field: find all bugs where
the short string field "foo_field" contains the substring "meta".  In my
scheme, you run these two queries:

SELECT field_id
FROM custom_fields
WHERE field_name = 'foo_field';

Then, using the returned field_id (say, 99):

SELECT bug_id
FROM cf_shortstring
WHERE field_id = 99
  AND INSTR(value, 'meta');

The first query is very fast, thanks to the index on FIELD_NAME, and would
be fast anyway with only 200 rows to check.  The second one quickly targets
only those bugs in products that have foo_field as a member, thanks to the
index on FIELD_ID.  (I suppose the queries could be combined into one.)

If custom fields were "real" columns, you'd do something like this:

SELECT bug_id
FROM bugs
WHERE INSTR(cf_foo_field, 'meta');

This query has to scan *every bug in the installation*; there's no way to
restrict it only to bugs that actually have a foo_field.  BUGS is a huge,
sparse table, with 90% of its columns NULL in any given row, on average.
Say goodbye to casual "SELECT * FROM BUGS WHERE..." queries at the mysql
client command line, too.

Speaking of NULL, you'd have to give up the capability to have NULL custom
field values, since you couldn't do this:

SELECT bug_id
FROM bugs
WHERE cf_foo_field IS NULL;

My company's prior bug tracking system, TeamTrack, stored field data in huge
sparse tables, and its speed was/is atrocious.  I didn't have to think very
hard in the early days to know I didn't want to go the same route.

As an example of another, less important efficiency issue, suppose you want
get all of a single bug's custom field data.  You'd either do this:

SELECT *
FROM bugs
WHERE bug_id = 99;

This would fetch all column data, most of which would be NULL and irrelevant
in the hypothetical situation described above.  The alternative would be to
construct a list of known-to-be-relevant column names, and do this:

SELECT field_1, field_2, field_3, field_4, field_5, field_6,
       field_7, field_8, field_9, field_10, field_11, field_12,
       field_13, field_14, field_15, field_16, field_17, field_18,
       field_19, field_20
FROM bugs
WHERE bug_id = 99;

Yuck!  Both solutions are inelegant, which experience has taught me is
usually an indication of bad design.

Fields-as-columns may make life easier for the programmer, but it's the
needs of the users that should be coming first.

--Sean