MySQL UTF-8 Conversion in Bugzilla

Fergus Sullivan fergus at yahoo-inc.com
Tue Nov 28 02:01:28 UTC 2006


Hi guys.

Long time listener here, first time caller...

Our Bugzilla usage here at Yahoo may help illuminate the need for a  
broader approach to this problem.  We've found that a simple  
conversion to UTF-8 may not be the answer.  A more complex conversion  
process may be needed to avoid data loss.

As background, our Bugzilla instance is based on version 2.20 and has  
never had a default charset.  We have bugs logged by internal users in  
many locales, including multi-byte locales.  The text of these bugs  
is often in an East Asian language.  These display fine because  
MySQL doesn't really care what charset its bytes are in, and the  
browser default ensures a consistent charset is used within each locale.

Among the existing bugs, we have bugs that may fall into any of the  
following categories:

== The simple bugs ==
   - bug is entirely in ASCII
   - bug is entirely in UTF-8
   - bug is entirely in Shift-JIS, EUC-JP, Big5, GB2312 or any other  
'native character set'

All of these can be converted from charset to charset relatively  
easily, especially if we can work out what the source charset is.   
There are libraries for this.
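
To illustrate the simple case, here is a minimal Python sketch (not Bugzilla code) that tries a list of candidate charsets in order and re-encodes the first one that decodes cleanly.  The function name and the candidate list are my own assumptions.  Order matters, because many byte strings decode without error under more than one legacy charset, so this is a guess and not a reliable detector.

```python
from typing import Optional, Sequence, Tuple

# Hypothetical candidate list; the first charset that decodes cleanly wins.
CANDIDATES = ("ascii", "utf-8", "shift_jis", "euc_jp", "big5", "gb2312")


def to_utf8(raw: bytes,
            candidates: Sequence[str] = CANDIDATES) -> Tuple[Optional[str], bytes]:
    """Return (guessed_source_charset, utf8_bytes).

    If no candidate decodes the input, return (None, raw) so the caller
    can keep the original bytes instead of losing them.
    """
    for name in candidates:
        try:
            text = raw.decode(name)
        except UnicodeDecodeError:
            continue
        return name, text.encode("utf-8")
    return None, raw
```

A bug whose bytes match none of the candidates comes back unchanged with a charset of None, which is the signal to leave it alone or flag it for a human.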

== The tricky bugs ==
   - bug is mostly in ASCII or UTF-8 or any other single charset, but  
includes characters from another valid charset.  Why?  A bug could be  
logged in English but include pasted Japanese text.
   - bug is mostly in ASCII or UTF-8 or any other single charset, but  
includes garbage characters.  Why?  A bug report might refer to a  
Chinese product and include pasted characters that show some kind of  
corruption took place prior to display time.  Here the garbage is the  
evidence, so it has to survive conversion.
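
As a rough sketch of how such bugs could at least be found before conversion, the following Python splits a byte string into runs that are valid UTF-8 and runs that are not.  The function name is hypothetical, and treating UTF-8 as the "main" charset is an assumption.  Nothing is dropped: joining the segments gives back the original bytes, so the odd runs can be stored or flagged for review.

```python
from typing import List, Tuple


def utf8_segments(raw: bytes) -> List[Tuple[bool, bytes]]:
    """Split raw into (is_valid_utf8, bytes) runs without losing any byte."""
    segments: List[Tuple[bool, bytes]] = []
    pos = 0
    while pos < len(raw):
        rest = raw[pos:]
        try:
            rest.decode("utf-8")
        except UnicodeDecodeError as err:
            # err.start / err.end are offsets into `rest`.
            if err.start > 0:
                segments.append((True, rest[:err.start]))
            bad = rest[err.start:err.end]
            if segments and not segments[-1][0]:
                # Merge with the previous invalid run.
                segments[-1] = (False, segments[-1][1] + bad)
            else:
                segments.append((False, bad))
            pos += err.end
        else:
            segments.append((True, rest))
            break
    return segments
```

A bug with any invalid run falls into the "tricky" group and probably should not be converted automatically.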

== The Problem ==
We'd very much like to have our existing database start defaulting to  
UTF-8.  The problem: how can we convert existing reports to UTF-8  
without losing the information in existing bugs that demonstrates  
charset-related problems?

== Problematic Solution ==
We considered the following.
   - have Bugzilla default to UTF-8 for all NEW bugs.
   - set a Unicode bit on all new bugs.
   - when show_bug.cgi displays a bug with the Unicode bit set, send  
UTF-8 as the charset in the page header.

Flaw:
   - What about contexts where we need to display both a Unicode bug  
and a non-Unicode bug in the same page?  A page can only declare one  
charset.
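
One possible way around that flaw, sketched in Python under heavy assumptions: transcode every row to UTF-8 at display time so the page can declare a single charset, while the stored bytes stay untouched.  The BugRow type, the is_unicode flag and the per-bug legacy_charset guess are all hypothetical, and this only works as well as the guess does.

```python
from dataclasses import dataclass
from typing import Iterable


@dataclass
class BugRow:
    summary: bytes                    # bytes exactly as stored in the database
    is_unicode: bool                  # the proposed "Unicode bit"
    legacy_charset: str = "latin-1"   # hypothetical per-bug guess for old bugs


def render_summary(row: BugRow) -> str:
    charset = "utf-8" if row.is_unicode else row.legacy_charset
    # errors="replace" is lossy, but only for display; the database
    # still holds the original bytes.
    return row.summary.decode(charset, errors="replace")


def render_page(rows: Iterable[BugRow]) -> bytes:
    """Return one UTF-8 body, regardless of how each row was stored."""
    return "\n".join(render_summary(r) for r in rows).encode("utf-8")
```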

== Problems with email ==
We also need to enforce a charset when bugs are submitted via the email  
gateway, and when notifications are sent.  Also, let's not forget  
whinemails.
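
For the outgoing side, a minimal sketch of what "enforcing a charset" could look like, using Python's standard email package (Bugzilla itself is Perl, so this is illustration only).  The function name and the example.com addresses are made up.

```python
from email.message import EmailMessage


def build_notification(bug_id: int, summary: str, comment: str) -> EmailMessage:
    """Build a notification whose body is explicitly labelled as UTF-8."""
    msg = EmailMessage()
    msg["From"] = "bugzilla-daemon@example.com"   # hypothetical address
    msg["To"] = "reporter@example.com"            # hypothetical address
    # A non-ASCII subject is RFC 2047 encoded when the message is serialized.
    msg["Subject"] = f"[Bug {bug_id}] {summary}"
    # Sets Content-Type: text/plain with an explicit utf-8 charset.
    msg.set_content(comment, charset="utf-8")
    return msg
```

Incoming mail is the harder half: the gateway would have to trust or verify the charset each message declares before storing anything.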

Thoughts?

  /ferg

--
fergus sullivan | yahoo bugzilla admin | fergus at yahoo-inc.com | o. 408.349.6807 |



On Sun 19-Nov-06, at 10:20 a, Frédéric Buclin wrote:

>> 	checksetup.pl will warn you and give you 60 seconds to stop it,
>> before it goes ahead and converts your database.
>
>
> As dataloss is something really critical, wouldn't it be safer to
> request the admin to type some character (such as [Y]es) to confirm
> the DB conversion, and cancel it if the admin didn't do it within 60
> seconds? Of course, checksetup.pl should be able to get this answer
> from the answer file too. I would feel much more confident doing it
> this way.
>
> LpSolit




